from pyspark.sql import SparkSession

# Build a SparkSession against a standalone cluster with two masters (HA mode).
spark = SparkSession.builder \
    .appName('0_test_rdd') \
    .master('spark://spark-master-1:7077,spark-master-2:7077') \
    .config('spark.driver.cores', '2') \
    .config('spark.driver.memory', '2g') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '8') \
    .getOrCreate()
sc = spark.sparkContext

# Three disjoint lists of ten integers each.
data_1 = list(range(0, 10))
data_2 = list(range(10, 20))
data_3 = list(range(20, 30))

rdd_1 = sc.parallelize(data_1)
rdd_2 = sc.parallelize(data_2)
rdd_3 = sc.parallelize(data_3)

# union() concatenates RDDs in the order they are chained.
rdd_union = rdd_1.union(rdd_2).union(rdd_3)
# The same three RDDs, unioned in reverse order.
rdd_union_sub = rdd_3.union(rdd_2).union(rdd_1)
print()
print(rdd_1)
print(rdd_1.count(), rdd_1.collect())
print()
print(rdd_2)
print(rdd_2.count(), rdd_2.collect())
print()
print(rdd_3)
print(rdd_3.count(), rdd_3.collect())
print()
print(rdd_union)
print(rdd_union.count(), rdd_union.collect())
print()
print(rdd_union_sub)
print(rdd_union_sub.count(), rdd_union_sub.collect())
print()
sc.stop()
Output:

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274
10 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:274
10 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

ParallelCollectionRDD[2] at readRDDFromFile at PythonRDD.scala:274
10 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]

UnionRDD[4] at union at NativeMethodAccessorImpl.java:0
30 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]

UnionRDD[6] at union at NativeMethodAccessorImpl.java:0
30 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
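
The output confirms that union() preserves chaining order: rdd_union_sub lists 20..29 first because rdd_3 came first in the chain. Unlike SQL's UNION, RDD.union() also keeps duplicates and simply concatenates the inputs' partitions. A minimal sketch of both behaviors, assuming the same SparkSession and sc setup as above (rdd_a and rdd_b are illustrative names, not from the original listing):

# Overlapping ranges: 5..9 appear in both RDDs.
rdd_a = sc.parallelize(range(0, 10))
rdd_b = sc.parallelize(range(5, 15))

rdd_ab = rdd_a.union(rdd_b)
print(rdd_ab.count())             # 20 -- duplicates are kept
print(rdd_ab.distinct().count())  # 15 -- distinct() removes them

# Partition lists are concatenated, not merged.
print(rdd_a.getNumPartitions() + rdd_b.getNumPartitions()
      == rdd_ab.getNumPartitions())  # True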