Data Engineering/Spark
2022.05.28
Code:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("0_save_file")\
        .getOrCreate()

    sc = spark.sparkContext

    line_1 = sc.parallelize(['0', '1', '2', '3', '4'])
    line_2 = sc.parallelize(['5', '6', '7', '8', '9'])
    line_3 = sc.parallelize(['10', '11', '12', '13', '14'])

    line_all = line_1.union(line_2).union(line_3)
    line_filter = line_all.filter(l..
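The preview cuts off at the filter call, so the predicate and whatever follows are not shown. A minimal runnable sketch of the same union-then-filter pattern, assuming a lambda predicate and a hypothetical output path (both are guesses, not the original code):

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("0_save_file").getOrCreate()
    sc = spark.sparkContext

    line_1 = sc.parallelize(['0', '1', '2', '3', '4'])
    line_2 = sc.parallelize(['5', '6', '7', '8', '9'])
    line_3 = sc.parallelize(['10', '11', '12', '13', '14'])

    # union() concatenates RDDs without deduplicating.
    line_all = line_1.union(line_2).union(line_3)

    # Hypothetical predicate: keep elements that parse to even integers
    # (the original condition is truncated in the preview).
    line_filter = line_all.filter(lambda l: int(l) % 2 == 0)

    # Hypothetical output path; saveAsTextFile fails if the directory exists.
    line_filter.saveAsTextFile("/tmp/0_save_file")

    spark.stop()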
Data Engineering/Spark
2022.05.28
Code:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("0_save_file")\
        .getOrCreate()

    line_1 = spark.sparkContext.parallelize(['0', '1', '2', '3', '4'])
    line_2 = spark.sparkContext.parallelize(['5', '6', '7', '8', '9'])
    line_3 = spark.sparkContext.parallelize(['10', '11', '12', '13', '14'])

    line_all = line_1.union(line_2).union(line_3)

    print('..
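The print statement is truncated. A sketch of how the union result might be inspected on the driver, assuming a plain collect() (the original statement is not shown):

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("0_save_file").getOrCreate()
    sc = spark.sparkContext

    line_all = sc.parallelize(['0', '1', '2', '3', '4'])\
        .union(sc.parallelize(['5', '6', '7', '8', '9']))\
        .union(sc.parallelize(['10', '11', '12', '13', '14']))

    # collect() pulls all distributed elements back to the driver;
    # fine for a toy RDD like this, risky for large data.
    print(line_all.collect())

    spark.stop()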
Data Engineering/Spark
2022.05.28
Let's look at my code:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("0_save_file")\
        .getOrCreate()

    alphabet_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
                     'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
    alphabet_rdd = spark.sparkContext.parallelize(alphabet_list)
    number_rdd = spark.sparkContext.parallelize(r..
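The number_rdd line is cut off mid-argument. A runnable sketch assuming the truncated argument is a range() (only a guess from the leading "r"), ending with the saveAsTextFile call that the traceback in the next post points at:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("0_save_file").getOrCreate()

    alphabet_list = list('abcdefghijklmnopqrstuvwxyz')
    alphabet_rdd = spark.sparkContext.parallelize(alphabet_list)

    # Assumption: the original argument is not shown; range(10) is a placeholder.
    number_rdd = spark.sparkContext.parallelize(range(10))

    # Path taken from the traceback in the following post.
    alphabet_rdd.saveAsTextFile("/home/spark/result/0_save_file")

    spark.stop()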
Data Engineering/Spark
2022.05.28
The following error occurred while testing Spark:

22/05/28 11:51:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "0_save_file.py", line 24, in <module>
    alphabet_rdd.saveAsTextFile("/home/spark/result/0_save_file")
  File "/usr/local/lib/python3.8/dist-packages/pyspark/rdd.py", line 1828, in saveAsTextFile
    keyed._j..
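The traceback is truncated, so the root cause is not visible here. One common failure mode for saveAsTextFile is that the output directory already exists from a previous run (Hadoop refuses to overwrite it). A hedged guard, assuming the output lives on the local filesystem:

import shutil
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("0_save_file").getOrCreate()

    alphabet_rdd = spark.sparkContext.parallelize(list('abcdefghijklmnopqrstuvwxyz'))

    out_dir = "/home/spark/result/0_save_file"
    # saveAsTextFile will not overwrite: remove any previous run's output first.
    # Assumes a local filesystem path; on HDFS you would delete via the Hadoop FS API instead.
    shutil.rmtree(out_dir, ignore_errors=True)

    alphabet_rdd.saveAsTextFile(out_dir)
    spark.stop()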
Data Engineering/Airflow
2022.05.04
3_bash_operator_echo.py

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

default_args = {
    'owner' : 'ParkGyeongTae'
}

dag = DAG (
    dag_id = '3_bash_operator_echo',
    start_date = datetime(2022, 5, 4),
    schedule_interval = '* * * * *',
    catchup = False,
    tags = ['test'],
    description = 'Bash Operator Sample',
    default_args = default_args
)

ech..
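The preview stops at "ech..", so the task itself is not shown. A sketch of how the echo task presumably continues, with a hypothetical bash_command:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

default_args = {'owner': 'ParkGyeongTae'}

dag = DAG(
    dag_id = '3_bash_operator_echo',
    start_date = datetime(2022, 5, 4),
    schedule_interval = '* * * * *',
    catchup = False,
    tags = ['test'],
    description = 'Bash Operator Sample',
    default_args = default_args
)

# Hypothetical completion: the actual command is truncated in the preview.
echo_1 = BashOperator(
    task_id = 'echo_1',
    bash_command = 'echo "hello airflow"',
    dag = dag
)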
Data Engineering/Airflow
2022.05.04
2_bash_operator.py

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

default_args = {
    'owner' : 'ParkGyeongTae'
}

dag = DAG (
    dag_id = '2_bash_operator',
    start_date = datetime(2022, 5, 4),
    schedule_interval = '* * * * *',
    catchup = False,
    tags = ['test'],
    description = 'Bash Operator Sample',
    default_args = default_args
)

sleep_1 = Bas..
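Only "sleep_1 = Bas.." survives of the task section. A sketch assuming two sleep tasks chained with the >> dependency operator (sleep_2 and both commands are hypothetical; only the name sleep_1 appears in the preview):

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

default_args = {'owner': 'ParkGyeongTae'}

dag = DAG(
    dag_id = '2_bash_operator',
    start_date = datetime(2022, 5, 4),
    schedule_interval = '* * * * *',
    catchup = False,
    tags = ['test'],
    description = 'Bash Operator Sample',
    default_args = default_args
)

# Hypothetical tasks: durations and the second task are assumptions.
sleep_1 = BashOperator(task_id = 'sleep_1', bash_command = 'sleep 5', dag = dag)
sleep_2 = BashOperator(task_id = 'sleep_2', bash_command = 'sleep 10', dag = dag)

# Run sleep_1 before sleep_2.
sleep_1 >> sleep_2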
Data Engineering/Airflow
2022.05.04
1_python_operator.py

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

default_args = {
    'owner' : 'ParkGyeongTae'
}

dag = DAG (
    dag_id = '1_python_operator',
    start_date = datetime(2022, 5, 4),
    schedule_interval = '* * * * *',
    catchup = False,
    tags = ['test'],
    description = 'Python Operator Sample',
    default_args = default_args
)

def..
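The preview ends at "def..", before the callable is defined. A sketch of the usual PythonOperator pattern, with a hypothetical callable and task (names and body are assumptions):

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

default_args = {'owner': 'ParkGyeongTae'}

dag = DAG(
    dag_id = '1_python_operator',
    start_date = datetime(2022, 5, 4),
    schedule_interval = '* * * * *',
    catchup = False,
    tags = ['test'],
    description = 'Python Operator Sample',
    default_args = default_args
)

# Hypothetical callable: the original function is truncated in the preview.
def print_hello():
    print('hello airflow')

hello_task = PythonOperator(
    task_id = 'print_hello',
    python_callable = print_hello,
    dag = dag
)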
Data Engineering/Airflow
2022.04.18
https://github.com/ParkGyeongTae/airflow-pgt/tree/main/0_airflow

sudo -u postgres psql -U postgres -c "\list"
sudo -u postgres psql -U postgres -d airflow -c "\list"
sudo -u postgres psql -U postgres -d airflow -c "\dt"
sudo -u postgres psql -U postgres -d airflow -..
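These commands list the databases and then the tables of Airflow's Postgres metadata database. The same inspection can be scripted; a sketch using psycopg2, assuming the local postgres superuser shown in the commands above (host and credentials are assumptions):

import psycopg2

# Connection parameters mirror the psql commands above; adjust as needed.
conn = psycopg2.connect(dbname='airflow', user='postgres', host='localhost')

with conn.cursor() as cur:
    # Equivalent of "\list": enumerate databases.
    cur.execute("SELECT datname FROM pg_database;")
    print(cur.fetchall())

    # Equivalent of "\dt": enumerate tables in the airflow database.
    cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = 'public';")
    print(cur.fetchall())

conn.close()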
Data Engineering/Airflow
2022.04.18
https://github.com/ParkGyeongTae/airflow-pgt/tree/main/0_airflow

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

dag = DAG (
    dag_id = 'my_bash_dag',
    start_date = datetime(2022, 4, 16),
    schedule_interval ..
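The preview cuts off inside the DAG definition. A sketch of a complete minimal version, assuming a daily schedule and one hypothetical task (neither appears in the preview):

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

dag = DAG(
    dag_id = 'my_bash_dag',
    start_date = datetime(2022, 4, 16),
    # Assumption: the original schedule_interval value is truncated.
    schedule_interval = '@daily',
    catchup = False
)

# Hypothetical task: the original task definitions are not shown.
say_hello = BashOperator(
    task_id = 'say_hello',
    bash_command = 'echo "hello"',
    dag = dag
)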