Data Engineering/Spark

[Spark] How to print only the first row of a Spark DataFrame

Code: from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import max, avg, sum, min spark = SparkSession\ .builder\ .appName("1_test_dataframe")\ .getOrCreate() sc = spark.sparkContext data = [Row(name = 'a', age = 12, type = 'A', score = 90, year = 2012), Row(name = 'a', age = 15, type = 'B', score = 80, year = 2013), Row(name = 'b', age = 15, type = 'B', sc..

[Spark] How to rename columns while doing a groupBy() on a Spark DataFrame

Code: from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import max, avg, sum, min spark = SparkSession\ .builder\ .appName("1_test_dataframe")\ .getOrCreate() sc = spark.sparkContext data = [Row(name = 'a', age = 12, type = 'A', score = 90, year = 2012), Row(name = 'a', age = 15, type = 'B', score = 80, year = 2013), Row(name = 'b', age = 15, type = 'B', sc..

[Spark] How to groupBy() a Spark DataFrame using agg()

Code: from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import max, avg, sum, min spark = SparkSession\ .builder\ .appName("1_test_dataframe")\ .getOrCreate() sc = spark.sparkContext data = [Row(name = 'a', age = 12, type = 'A', score = 90, year = 2012), Row(name = 'a', age = 15, type = 'B', score = 80, year = 2013), Row(name = 'b', age = 15, type = 'B', sc..

[Spark] How to use groupBy on a Spark DataFrame

Code: from pyspark.sql import SparkSession from pyspark.sql import Row spark = SparkSession\ .builder\ .appName("1_test_dataframe")\ .getOrCreate() sc = spark.sparkContext data = [Row(name = 'a', age = 12, type = 'A', score = 90, year = 2012), Row(name = 'a', age = 15, type = 'B', score = 80, year = 2013), Row(name = 'b', age = 15, type = 'B', score = 80, year = 2014), Row(name = 'b', age = 21, typ..

[Spark] How to apply a filter twice to a Spark DataFrame

Code: from pyspark.sql import SparkSession from pyspark.sql import Row spark = SparkSession\ .builder\ .appName("1_test_dataframe")\ .getOrCreate() sc = spark.sparkContext data = [Row(name = 'a', age = 12, type = 'A', score = 90, year = 2012), Row(name = 'a', age = 15, type = 'B', score = 80, year = 2013), Row(name = 'b', age = 15, type = 'B', score = 80, year = 2014), Row(name = 'b', age = 21, typ..

[Spark] How to apply a filter to a Spark DataFrame and extract the rows you want

Code: from pyspark.sql import SparkSession from pyspark.sql import Row spark = SparkSession\ .builder\ .appName("1_test_dataframe")\ .getOrCreate() sc = spark.sparkContext data = [Row(name = 'a', age = 12, type = 'A', score = 90, year = 2012), Row(name = 'a', age = 15, type = 'B', score = 80, year = 2013), Row(name = 'b', age = 15, type = 'B', score = 80, year = 2014), Row(name = 'b', age = 21, typ..

[Spark] How to print the columns you want from a Spark DataFrame

Code: from pyspark.sql import SparkSession from pyspark.sql import Row spark = SparkSession\ .builder\ .appName("1_test_dataframe")\ .getOrCreate() sc = spark.sparkContext data = [Row(name = 'a', age = 12, type = 'A', score = 90, year = 2012), Row(name = 'a', age = 15, type = 'B', score = 80, year = 2013), Row(name = 'b', age = 15, type = 'B', score = 80, year = 2014), Row(name = 'b', age = 21, typ..

[Spark] How to print all the data in a Spark DataFrame

Code: from pyspark.sql import SparkSession from pyspark.sql import Row spark = SparkSession\ .builder\ .appName("1_test_dataframe")\ .getOrCreate() sc = spark.sparkContext data = [Row(name = 'a', age = 12, type = 'A', score = 90, year = 2012), Row(name = 'a', age = 15, type = 'B', score = 80, year = 2013), Row(name = 'b', age = 15, type = 'B', score = 80, year = 2014), Row(name = 'b', age = 21, typ..

[Spark] How to convert RDD data to lowercase and uppercase

Code: from pyspark.sql import SparkSession spark = SparkSession\ .builder\ .appName("0_save_file")\ .getOrCreate() sc = spark.sparkContext line_1 = 'i love you' line_2 = 'you are my friend' line_3 = 'my name is park' lines = sc.parallelize([line_1.upper(), line_2.upper(), line_3.upper()]) lines_map = lines.map(lambda x: x.lower().split(' ')) lines_flatmap = lines.flatMap(lambda x: x.lower().split('..

[Spark] The map function vs. the flatMap function

Code: from pyspark.sql import SparkSession spark = SparkSession\ .builder\ .appName("0_save_file")\ .getOrCreate() sc = spark.sparkContext line_1 = 'i love you' line_2 = 'you are my friend' line_3 = 'my name is park' lines = sc.parallelize([line_1, line_2, line_3]) lines_map = lines.map(lambda x: x.split(' ')) lines_flatmap = lines.flatMap(lambda x: x.split(' ')) print(f'lines.collect() : {lines.co..

박경태
List of posts in the 'Data Engineering/Spark' category (Page 6)