Spark DDL 架构 JSON 结构
问题
我试图在 pyspark 中定义嵌套 .json 模式,但无法使 ddl_schema 字符串正常工作。
通常在 SQL 中这将是 ROW,我已尝试下面的 STRUCT 但无法获得正确的数据类型,这是错误...
ParseException:
mismatched input '(' expecting {<EOF>, ',', 'COMMENT', NOT}(line 6, pos 15)
== SQL ==
driverId INT,
driverRef STRING,
number STRING,
code STRING,
name STRUCT(forename STRING, surname STRING),
---------------^^^
dob DATE,
nationality STRING,
url STRING
数据示例
+--------+----------+------+----+--------------------+----------+-----------+--------------------+
|driverId| driverRef|number|code| name| dob|nationality| url|
+--------+----------+------+----+--------------------+----------+-----------+--------------------+
| 1| hamilton| 44| HAM| {Lewis, Hamilton}|1985-01-07| British|http://en.wikiped...|
代码示例
mnt = "/mnt/dev/root"
env = "raw"
path = "formula1/drivers"
fileFormat = "json"
inPath = f"{mnt}/{env.upper()}/{path}.{fileFormat}"
options = {'header': 'True'}
ddl_schema = """
driverId INT,
driverRef STRING,
number STRING,
code STRING,
name STRUCT(forename STRING, surname STRING),
dob DATE,
nationality STRING,
url STRING
"""
drivers_df = (spark
.read
.options(**options)
.schema(ddl_schema)
.format(fileFormat)
.load(inPath)
)
Question
I am trying to define a nested .json schema in pyspark, but cannot get the ddl_schema string to work.
Usually in SQL this would be ROW, I have tried STRUCT below but can't get the data type correct this is the error...
ParseException:
mismatched input '(' expecting {<EOF>, ',', 'COMMENT', NOT}(line 6, pos 15)
== SQL ==
driverId INT,
driverRef STRING,
number STRING,
code STRING,
name STRUCT(forename STRING, surname STRING),
---------------^^^
dob DATE,
nationality STRING,
url STRING
Data Sample
+--------+----------+------+----+--------------------+----------+-----------+--------------------+
|driverId| driverRef|number|code| name| dob|nationality| url|
+--------+----------+------+----+--------------------+----------+-----------+--------------------+
| 1| hamilton| 44| HAM| {Lewis, Hamilton}|1985-01-07| British|http://en.wikiped...|
Code Sample
mnt = "/mnt/dev/root"
env = "raw"
path = "formula1/drivers"
fileFormat = "json"
inPath = f"{mnt}/{env.upper()}/{path}.{fileFormat}"
options = {'header': 'True'}
ddl_schema = """
driverId INT,
driverRef STRING,
number STRING,
code STRING,
name STRUCT(forename STRING, surname STRING),
dob DATE,
nationality STRING,
url STRING
"""
drivers_df = (spark
.read
.options(**options)
.schema(ddl_schema)
.format(fileFormat)
.load(inPath)
)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您使用的 STRUCT 语法错误。
这是正确的:
https://spark.apache.org/docs /latest/sql-ref-datatypes.html
(搜索
复杂类型
并选择 SQL 选项卡)You are using the wrong syntax for STRUCT.
Here is the right one:
https://spark.apache.org/docs/latest/sql-ref-datatypes.html
(search for
Complex types
and choose the SQL tab)