groupBy() collects rows that share the same value in a column and applies an aggregate to each group. Grouping by name and calling sum() adds up every numeric column (here only score) per person:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession \
    .builder \
    .master('local') \
    .appName('my_pyspark_app') \
    .getOrCreate()

data = [
    ('kim', 'a', 100),
    ('kim', 'a', 90),
    ('lee', 'a', 80),
    ('lee', 'b', 70),
    ('park', 'b', 60)
]

schema = StructType([
    StructField('name', StringType(), True),
    StructField('class', StringType(), True),
    StructField('score', IntegerType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()

# Group by name and sum the numeric columns: kim -> 190, lee -> 150, park -> 60.
df_groupby = df.groupBy('name').sum()
df_groupby.printSchema()
df_groupby.show()
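The sum() shorthand names the result column sum(score). If you want a different column name, agg() with an explicit aggregate function and an alias works too; a minimal sketch, reusing the df above (total_score is just an illustrative alias):

from pyspark.sql import functions as F

# Same grouping as above, but with an explicit aggregate and a readable alias.
df.groupBy('name') \
    .agg(F.sum('score').alias('total_score')) \
    .show()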
Grouping by class instead aggregates the scores per class, reusing the df created above:

# Group by class: class a sums to 270, class b to 130.
df_groupby = df.groupBy('class').sum()
df_groupby.printSchema()
df_groupby.show()
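groupBy() is not limited to sums. Several aggregates can be computed over the same grouping in one pass; a minimal sketch on the df above (the aliases n_students, avg_score, and max_score are illustrative names):

from pyspark.sql import functions as F

# Count, average, and maximum computed per class in a single aggregation.
df.groupBy('class').agg(
    F.count('*').alias('n_students'),
    F.avg('score').alias('avg_score'),
    F.max('score').alias('max_score')
).show()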
How to group by two or more columns
groupBy() accepts multiple columns; each distinct (name, class) combination becomes one group. Reversing the column order changes only how the grouping columns are displayed, not the groups themselves:

# Each distinct (name, class) pair becomes its own group.
df_groupby = df.groupBy('name', 'class').sum()
df_groupby.printSchema()
df_groupby.show()

# Same groups, with the grouping columns shown in the opposite order.
df_groupby = df.groupBy('class', 'name').sum()
df_groupby.printSchema()
df_groupby.show()
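The row order of a grouped result is not guaranteed. If you want a stable listing, you can sort the result explicitly; a minimal sketch, again assuming the df above:

# Restrict the sum to score and sort for a deterministic display order.
df.groupBy('class', 'name') \
    .sum('score') \
    .orderBy('class', 'name') \
    .show()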