Data Engineering

Data Engineering/Spark

[Spark] How to make the data in an RDD lowercase and uppercase

Code:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("0_save_file")\
    .getOrCreate()
sc = spark.sparkContext

line_1 = 'i love you'
line_2 = 'you are my friend'
line_3 = 'my name is park'
lines = sc.parallelize([line_1.upper(), line_2.upper(), line_3.upper()])
lines_map = lines.map(lambda x: x.lower().split(' '))
lines_flatmap = lines.flatMap(lambda x: x.lower().split('..

Data Engineering/Spark

[Spark] The map function vs. the flatMap function

Code:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("0_save_file")\
    .getOrCreate()
sc = spark.sparkContext

line_1 = 'i love you'
line_2 = 'you are my friend'
line_3 = 'my name is park'
lines = sc.parallelize([line_1, line_2, line_3])
lines_map = lines.map(lambda x: x.split(' '))
lines_flatmap = lines.flatMap(lambda x: x.split(' '))
print(f'lines.collect() : {lines.co..

Data Engineering/Spark

[Spark] How to map 10 numbers and reduce them

Code:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("0_save_file")\
    .getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(0, 10))
data_map = data.map(lambda x: x * x)
data_reduce = data.reduce(lambda x, y: x + y)
data_map_reduce = data_map.reduce(lambda x, y: x + y)
print(f'data : {data.collect()}')
print(f'data_map : {data_map.collect()}')
print(f'data_..

Data Engineering/Spark

[Spark] How to convert a pandas DataFrame into a Spark DataFrame

Code:

from pyspark.sql import SparkSession
import numpy as np
import pandas as pd

spark = SparkSession \
    .builder \
    .appName("1_test_dataframe") \
    .getOrCreate()

df_pandas = pd.DataFrame(np.random.rand(100, 3))
df_spark = spark.createDataFrame(df_pandas)
df_re_pandas = df_spark.select("*").toPandas()
print(df_pandas.head(5))
df_spark.show(5)  # show() prints the rows itself and returns None, so it is not wrapped in print()
print(df_re_pandas.head(5))
spark.stop()

Result

Data Engineering/Spark

[Spark] How to read a CSV file into a Spark DataFrame

Code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession\
    .builder\
    .appName("0_save_file")\
    .getOrCreate()

schema = StructType([StructField("first", StringType(), True),
                     StructField("second", StringType(), True),
                     StructField("third", StringType(), True)])
df = spark.read.csv('/home/spark/result/1_test_dataframe.csv', header =..

Data Engineering/Spark

[Spark] How to build a simple DataFrame

Code:

from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("0_save_file")\
        .getOrCreate()
    sc = spark.sparkContext

    data_1 = [("kim", 1), ("park", 2), ("choi", 3)]
    data_2 = [Row(name='kim', age=5, height=80),
              Row(name='park', age=5, height=80),
              Row(name='choi', age=10, height=80)]
    df_1 = sc.parallelize(data_1).toDF()
    df_2..

Data Engineering/Spark

[Spark] How to sum the mapped results of an RDD per key

Code:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("0_save_file")\
        .getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(['i love you', 'you are my friend', 'my name is park'])
    print(f'lines : {lines.collect()}')
    # pairs = lines.map(lambda s: s.split(" "))  # split on whitespace
    # pairs = pairs.filter(lambda x: len(x) > 3)  # sentences with more than 3 words..

Data Engineering/Spark

[Spark] How to split an RDD's sentences into words by whitespace and count them

Code:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("0_save_file")\
        .getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(['i love you', 'you are my friend', 'my name is park'])
    print(f'lines : {lines.collect()}')
    # pairs = lines.map(lambda s: s.split(" "))
    # pairs = pairs.filter(lambda x: len(x) > 3)
    pairs = lines.flatMap(lambda x..

Data Engineering/Spark

[Spark] How to filter an RDD

Code:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("0_save_file")\
        .getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(['i love you', 'you are my friend', 'my name is park'])
    print(lines.collect())
    pairs = lines.map(lambda s: s.split(" "))
    pairs = pairs.filter(lambda x: len(x) > 3)
    print(pairs.collect())
    spark.stop()

Result

Data Engineering/Spark

[Spark] How to split the sentences in an RDD by whitespace

Code:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("0_save_file")\
        .getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(['i love you', 'you are my friend', 'my name is park'])
    print(lines.collect())
    pairs = lines.map(lambda s: s.split(" "))
    print(pairs.collect())
    spark.stop()

Result

박경태
List of posts in the 'Data Engineering' category (page 16)