Creating a PySpark DataFrame raised the following error:
Traceback (most recent call last):
  File "df_schema_null.py", line 23, in <module>
    df = spark.createDataFrame(data = data, schema = schema)
  File "/Users/pgt0409/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pyspark/sql/session.py", line 894, in createDataFrame
    return self._create_dataframe(
  File "/Users/pgt0409/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pyspark/sql/session.py", line 936, in _create_dataframe
    rdd, struct = self._createFromLocal(map(prepare, data), schema)
  File "/Users/pgt0409/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pyspark/sql/session.py", line 628, in _createFromLocal
    data = list(data)
  File "/Users/pgt0409/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pyspark/sql/session.py", line 910, in prepare
    verify_func(obj)
  File "/Users/pgt0409/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pyspark/sql/types.py", line 1722, in verify
    verify_value(obj)
  File "/Users/pgt0409/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pyspark/sql/types.py", line 1700, in verify_struct
    verifier(v)
  File "/Users/pgt0409/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pyspark/sql/types.py", line 1721, in verify
    if not verify_nullability(obj):
  File "/Users/pgt0409/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pyspark/sql/types.py", line 1578, in verify_nullability
    raise ValueError(new_msg("This field is not nullable, but got None"))
ValueError: field score: This field is not nullable, but got None
Let's look at the code.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession \
    .builder \
    .master('local') \
    .appName('my_pyspark_app') \
    .getOrCreate()

data = [
    ('kim', 100),
    ('kim', 90),
    ('lee', 80),
    ('lee', 70),
    ('park', None)  # the score here is None
]

schema = StructType([
    StructField('name', StringType(), True),
    StructField('score', IntegerType(), False)  # nullable=False: null is not allowed
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()
The schema declares the score field as non-nullable (nullable=False), but the data contains a None value, which caused the error.
Let's fix it so that null is allowed:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession \
    .builder \
    .master('local') \
    .appName('my_pyspark_app') \
    .getOrCreate()

data = [
    ('kim', 100),
    ('kim', 90),
    ('lee', 80),
    ('lee', 70),
    ('park', None)
]

schema = StructType([
    StructField('name', StringType(), True),
    StructField('score', IntegerType(), True)  # nullable=True: null is now allowed
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()
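With nullable=True on the score field, the DataFrame is created without an error. The output should look roughly like this (the exact rendering of null may vary slightly across Spark versions):

root
 |-- name: string (nullable = true)
 |-- score: integer (nullable = true)

+----+-----+
|name|score|
+----+-----+
| kim|  100|
| kim|   90|
| lee|   80|
| lee|   70|
|park| null|
+----+-----+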
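As an aside, if the schema really must keep score non-nullable, the data can be cleaned before calling createDataFrame instead. Below is a minimal sketch, reusing the imports and SparkSession from the script above; the default value of 0 is only an illustrative assumption, not a recommendation:

# Replace None scores with a default so the strict (non-nullable) schema is satisfied.
# The default of 0 is only an illustrative choice.
data_filled = [(name, score if score is not None else 0) for name, score in data]

strict_schema = StructType([
    StructField('name', StringType(), True),
    StructField('score', IntegerType(), False)  # still non-nullable
])

df_strict = spark.createDataFrame(data=data_filled, schema=strict_schema)
df_strict.show()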