from pyspark.sql import SparkSession

# Build a session against a standalone cluster; the comma-separated master URL
# targets two masters running in HA mode, and spark.cores.max caps the total
# number of executor cores this application may claim across the cluster.
spark = SparkSession.builder \
    .appName('0_test_rdd') \
    .master('spark://spark-master-1:7077,spark-master-2:7077') \
    .config('spark.driver.cores', '2') \
    .config('spark.driver.memory', '2g') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '8') \
    .getOrCreate()
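# If you want to double-check which settings actually took effect, the
# context exposes them (left commented out so the output below is unchanged):
#   print(spark.sparkContext.getConf().getAll())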
sc = spark.sparkContext

# Three small integer lists, turned into RDDs with parallelize()
# (see the partition-count sketch after the output below)
data_1 = list(range(0, 10))
data_2 = list(range(10, 20))
data_3 = list(range(20, 30))
rdd_1 = sc.parallelize(data_1)
rdd_2 = sc.parallelize(data_2)
rdd_3 = sc.parallelize(data_3)

# union() concatenates the input RDDs; it is a lazy transformation
rdd_union = rdd_1.union(rdd_2).union(rdd_3)
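# An equivalent one-call form exists on the context: sc.union([...]).
# It is left commented out here so the RDD ids printed below stay the same:
#   rdd_union = sc.union([rdd_1, rdd_2, rdd_3])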
# map() applies a function to every element; also lazy until an action runs
rdd_map = rdd_union.map(lambda x: x * 2)
rdd_map_2 = rdd_union.map(lambda x: x * x)
# count() and collect() are actions: each one triggers a job on the cluster,
# while printing the RDD object itself only shows its lineage string
print()
print(rdd_union)
print(rdd_union.count(), rdd_union.collect())
print()
print(rdd_map)
print(rdd_map.count(), rdd_map.collect())
print()
print(rdd_map_2)
print(rdd_map_2.count(), rdd_map_2.collect())
print()
sc.stop()
Output:

UnionRDD[4] at union at NativeMethodAccessorImpl.java:0
30 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
PythonRDD[6] at RDD at PythonRDD.scala:53
30 [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58]
PythonRDD[8] at RDD at PythonRDD.scala:53
30 [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841]
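Because union() simply concatenates the partitions of its inputs, the partition counts are worth inspecting alongside the element counts above. A minimal sketch, assuming it runs in the same session before sc.stop(); the explicit numSlices value of 4 is an illustrative choice, not something the run above set:

# How many partitions each RDD ended up with
print(rdd_1.getNumPartitions())         # default is driven by spark.default.parallelism
rdd_custom = sc.parallelize(data_1, 4)  # parallelize() accepts an explicit numSlices
print(rdd_custom.getNumPartitions())    # -> 4
print(rdd_union.getNumPartitions())     # union() sums its inputs' partition counts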