from pyspark.sql import SparkSession

# Build a SparkSession against a standalone cluster with two masters.
# A comma-separated master URL is used for ZooKeeper-backed HA, so the
# driver can fail over if the active master goes down.
spark = SparkSession.builder \
    .appName('0_test_rdd') \
    .master('spark://spark-master-1:7077,spark-master-2:7077') \
    .config('spark.driver.cores', '2') \
    .config('spark.driver.memory', '2g') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '8') \
    .getOrCreate()
sc = spark.sparkContext

# Three sample sentences, distributed as an RDD of strings.
line_1 = 'i love you'
line_2 = 'you are my friend'
line_3 = 'my name is park'
lines = sc.parallelize([line_1, line_2, line_3])

# flatMap() splits each sentence into words and flattens the per-sentence
# lists into a single RDD of tokens.
lines_flatmap = lines.flatMap(lambda x: x.split(' '))

# filter() keeps only the tokens longer than 3 characters.
lines_filter = lines_flatmap.filter(lambda x: len(x) > 3)

print()
print(lines)
print(lines.collect())
print()
print(lines_flatmap)
print(lines_flatmap.collect())
print()
print(lines_filter)
print(lines_filter.collect())
print()

sc.stop()
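Printing an RDD only shows its lineage (e.g. PythonRDD[1] at RDD at PythonRDD.scala:53), because transformations like flatMap() and filter() are lazy; nothing runs until an action such as collect() is called, which executes the job and returns the elements to the driver: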
ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274
['i love you', 'you are my friend', 'my name is park']
PythonRDD[1] at RDD at PythonRDD.scala:53
['i', 'love', 'you', 'you', 'are', 'my', 'friend', 'my', 'name', 'is', 'park']
PythonRDD[2] at RDD at PythonRDD.scala:53
['love', 'friend', 'name', 'park']
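For contrast, here is a minimal sketch of how map() differs from flatMap() on the same data. This is an assumption-laden illustration, not part of the script above: it uses a local[*] master instead of the HA cluster so it can run standalone.

from pyspark.sql import SparkSession

# Minimal sketch; a local[*] master is assumed here instead of the cluster above.
spark = SparkSession.builder.appName('map_vs_flatmap').master('local[*]').getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(['i love you', 'you are my friend', 'my name is park'])

# map(): one output element per input element, so the word lists stay nested.
print(lines.map(lambda x: x.split(' ')).collect())
# [['i', 'love', 'you'], ['you', 'are', 'my', 'friend'], ['my', 'name', 'is', 'park']]

# flatMap(): each returned list is flattened into the output RDD.
print(lines.flatMap(lambda x: x.split(' ')).collect())
# ['i', 'love', 'you', 'you', 'are', 'my', 'friend', 'my', 'name', 'is', 'park']

sc.stop()

This is why the output above is a single flat list of words rather than one list per sentence, which is what filter() then operates on element by element.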