Data Engineering/Spark

[Spark] A simple way to use StructType in PySpark

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType  # imports the snippet needs

spark = SparkSession.builder.getOrCreate()  # the excerpt assumes an existing SparkSession

data = {'category': 'category_1', 'id': 'id_1'}

# Without an explicit schema, Spark infers the column types.
df = spark.createDataFrame([data])
df.printSchema()
df.show()

# With an explicit StructType schema.
schema = StructType([
    StructField('category', StringType()),
    StructField('id', StringType())
])
df = spark.createDataFrame([data], schema)
df.printSchema()
df.show()
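Worth noting: with the inferred schema the column types depend on the sample data, while the explicit StructType pins them down up front, which is generally the safer choice for anything beyond ad-hoc exploration.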

Data Engineering/Spark

[Spark] Summary of the Apache Spark execution process

1. The user runs Spark.
2. The application is submitted with spark-submit.
3. The Spark driver process runs main().
4. A SparkContext is created.
5. The SparkContext connects to the Spark cluster manager.
6. The driver process requests resources from the cluster manager to launch executors.
7. The SparkContext splits the job into tasks and sends them to the executors.
8. Each executor runs its tasks.
9. The results are saved.
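A minimal sketch of steps 3 through 9 from the driver's point of view; it uses local[*] instead of a real cluster manager for brevity, and the app name and numbers are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('execution_flow_sketch') \
    .master('local[*]') \
    .getOrCreate()
sc = spark.sparkContext                    # steps 3-4: the driver creates the SparkContext

rdd = sc.parallelize(range(100), 4)        # step 7: the job is split into 4 tasks
result = rdd.map(lambda x: x * 2).sum()    # step 8: executors run the tasks
print(result)                              # step 9: the result comes back to the driver
spark.stop()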

Data Engineering/Spark

[spark] Spark-Prometheus-Grafana dashboard notes

Step 1) Create a metrics.properties file in Spark's conf folder with the following contents:

*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus

Step 2) Add the following option when running spark-submit:

--conf spark.ui.prometheus.enabled=tr..
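If the application is launched from code rather than spark-submit, the same flag can be set through the SparkSession builder instead; a minimal sketch, with a hypothetical app name:

from pyspark.sql import SparkSession

# Equivalent of passing --conf spark.ui.prometheus.enabled=true to spark-submit.
# With the PrometheusServlet sink configured above, metrics are served under
# the driver UI (port 4040 by default) at the configured /metrics/... paths.
spark = SparkSession.builder \
    .appName('prometheus_metrics_sketch') \
    .config('spark.ui.prometheus.enabled', 'true') \
    .getOrCreate()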

Data Engineering/Spark

[Spark] Docker, failed: port is already allocated

Creating zookeeper-navigator ... done
Creating spark-master-1 ... done
Creating spark-master-2 ... done
WARNING: The "spark-slave" service specifies a port on the host. If multiple containers for this service are created on a single host, the port will clash.
Creating zeppelin ...
Creating 3_spark-cluster-zookeeper_spark-slave_1 ...
Creating 3_spark-cluster-zookeeper_spark-slave_1 ... error
Crea..
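The warning in the log already points at the cause: the spark-slave service maps a fixed host port, so when docker-compose tries to create more than one spark-slave container on the same host, the second bind fails with "port is already allocated". The usual fixes are to drop the fixed host-side port (letting Docker assign one) or to give each replica its own host port.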

Data Engineering/Spark

[Spark] pyspark RDD parallelize(number) union() map()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('0_test_rdd') \
    .master('spark://spark-master-1:7077,spark-master-2:7077') \
    .config('spark.driver.cores', '2') \
    .config('spark.driver.memory', '2g') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '8') \
    .getOrCreate()
sc = spark.sparkContext
data_1 = list(ra..
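The excerpt is cut off, so here is a minimal sketch of the three operations in the title; it runs on local[*] instead of the post's two-master cluster, and the data values are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('0_test_rdd').master('local[*]').getOrCreate()
sc = spark.sparkContext

rdd_1 = sc.parallelize(range(5), 2)           # parallelize(number): 2 partitions
rdd_2 = sc.parallelize(range(5, 10), 2)
rdd_3 = rdd_1.union(rdd_2)                    # union(): concatenate the two RDDs
print(rdd_3.map(lambda x: x * 10).collect())  # map(): apply a function per element
spark.stop()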

Data Engineering/Spark

[Spark] pyspark RDD parallelize() number and union()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('0_test_rdd') \
    .master('spark://spark-master-1:7077,spark-master-2:7077') \
    .config('spark.driver.cores', '2') \
    .config('spark.driver.memory', '2g') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '8') \
    .getOrCreate()
sc = spark.sparkContext
data_1 = list(ra..
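Again the excerpt is truncated; a minimal sketch of parallelize() with an explicit partition count plus union(), on local[*] with hypothetical data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('0_test_rdd').master('local[*]').getOrCreate()
sc = spark.sparkContext

rdd_1 = sc.parallelize(range(4), 2)
rdd_2 = sc.parallelize(range(4, 8), 3)
rdd_3 = rdd_1.union(rdd_2)
# union() keeps the partitions of both inputs: 2 + 3 = 5
print(rdd_3.getNumPartitions(), rdd_3.collect())
spark.stop()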

Data Engineering/Spark

[Spark] pyspark RDD parallelize() number

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('0_test_rdd') \
    .master('spark://spark-master-1:7077,spark-master-2:7077') \
    .config('spark.driver.cores', '2') \
    .config('spark.driver.memory', '2g') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '8') \
    .getOrCreate()
sc = spark.sparkContext
data_1 = li..
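A minimal sketch of parallelize() with an explicit partition count, on local[*] with hypothetical data; glom() groups the elements by partition so the split is visible.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('0_test_rdd').master('local[*]').getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10), 4)  # split the data into 4 partitions
print(rdd.getNumPartitions())       # 4
print(rdd.glom().collect())         # elements grouped by partition
spark.stop()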

Data Engineering/Spark

[Spark] pyspark RDD count(), collect()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('0_test_rdd') \
    .master('spark://spark-master-1:7077,spark-master-2:7077') \
    .config('spark.driver.cores', '2') \
    .config('spark.driver.memory', '2g') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '8') \
    .getOrCreate()
sc = spark.sparkContext
line_1 = 'i love..
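A minimal sketch of the two actions in the title, assuming the truncated string reads something like 'i love spark'; it runs on local[*] for brevity.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('0_test_rdd').master('local[*]').getOrCreate()
sc = spark.sparkContext

line_1 = 'i love spark'
rdd = sc.parallelize(line_1.split(' '))
print(rdd.count())    # action: number of elements (3)
print(rdd.collect())  # action: pull all elements back to the driver
spark.stop()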

Data Engineering/Spark

[Spark] pyspark RDD parallelize(), flatMap(), filter()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('0_test_rdd') \
    .master('spark://spark-master-1:7077,spark-master-2:7077') \
    .config('spark.driver.cores', '2') \
    .config('spark.driver.memory', '2g') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '8') \
    .getOrCreate()
sc = spark.sparkContext
line_1 = 'i love..
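A minimal sketch of flatMap() and filter(), on local[*]; the sample sentences are hypothetical stand-ins for the post's truncated data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('0_test_rdd').master('local[*]').getOrCreate()
sc = spark.sparkContext

lines = ['i love spark', 'spark is fast']
words = sc.parallelize(lines).flatMap(lambda s: s.split(' '))  # flatten sentences into words
print(words.filter(lambda w: w != 'spark').collect())          # keep only non-'spark' words
spark.stop()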

Data Engineering/Spark

[Spark] pyspark RDD parallelize(), map(), flatMap()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('0_test_rdd') \
    .master('spark://spark-master-1:7077,spark-master-2:7077') \
    .config('spark.driver.cores', '2') \
    .config('spark.driver.memory', '2g') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '8') \
    .getOrCreate()
sc = spark.sparkContext
line_1 = 'i..
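A minimal sketch contrasting map() and flatMap() on the same input, on local[*] with a hypothetical sentence:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('0_test_rdd').master('local[*]').getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(['i love spark'])
print(rdd.map(lambda s: s.split(' ')).collect())      # [['i', 'love', 'spark']] - one list per element
print(rdd.flatMap(lambda s: s.split(' ')).collect())  # ['i', 'love', 'spark'] - flattened
spark.stop()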

박경태
List of posts in the 'Data Engineering/Spark' category (Page 3)