Spark DDL 架构 JSON 结构

发布于 2025-01-10 07:00:41 字数 1841 浏览 3 评论 0原文

问题

我试图在 pyspark 中定义嵌套 .json 模式,但无法使 ddl_schema 字符串正常工作。

通常在 SQL 中这将是 ROW,我已尝试下面的 STRUCT 但无法获得正确的数据类型,这是错误...

ParseException: 
mismatched input '(' expecting {<EOF>, ',', 'COMMENT', NOT}(line 6, pos 15)

== SQL ==

    driverId INT,
    driverRef STRING,
    number STRING,
    code STRING,
    name STRUCT(forename STRING, surname STRING),
---------------^^^
    dob DATE,
    nationality STRING,
    url STRING

数据示例

            +--------+----------+------+----+--------------------+----------+-----------+--------------------+
            |driverId| driverRef|number|code|                name|       dob|nationality|                 url|
            +--------+----------+------+----+--------------------+----------+-----------+--------------------+
            |       1|  hamilton|    44| HAM|   {Lewis, Hamilton}|1985-01-07|    British|http://en.wikiped...|

代码示例

        mnt = "/mnt/dev/root"
        env = "raw"
        path = "formula1/drivers"
        fileFormat = "json"
        
        inPath = f"{mnt}/{env.upper()}/{path}.{fileFormat}"
        
        
        options = {'header': 'True'}
        
        ddl_schema = """
            driverId INT,
            driverRef STRING,
            number STRING,
            code STRING,
            name STRUCT(forename STRING, surname STRING),
            dob DATE,
            nationality STRING,
            url STRING
        """
        
        drivers_df = (spark
                       .read
                       .options(**options)
                       .schema(ddl_schema)
                       .format(fileFormat)
                       .load(inPath)
                     )

Question

I am trying to define a nested .json schema in pyspark, but cannot get the ddl_schema string to work.

Usually in SQL this would be ROW, I have tried STRUCT below but can't get the data type correct this is the error...

ParseException: 
mismatched input '(' expecting {<EOF>, ',', 'COMMENT', NOT}(line 6, pos 15)

== SQL ==

    driverId INT,
    driverRef STRING,
    number STRING,
    code STRING,
    name STRUCT(forename STRING, surname STRING),
---------------^^^
    dob DATE,
    nationality STRING,
    url STRING

Data Sample

            +--------+----------+------+----+--------------------+----------+-----------+--------------------+
            |driverId| driverRef|number|code|                name|       dob|nationality|                 url|
            +--------+----------+------+----+--------------------+----------+-----------+--------------------+
            |       1|  hamilton|    44| HAM|   {Lewis, Hamilton}|1985-01-07|    British|http://en.wikiped...|

Code Sample

        mnt = "/mnt/dev/root"
        env = "raw"
        path = "formula1/drivers"
        fileFormat = "json"
        
        inPath = f"{mnt}/{env.upper()}/{path}.{fileFormat}"
        
        
        options = {'header': 'True'}
        
        ddl_schema = """
            driverId INT,
            driverRef STRING,
            number STRING,
            code STRING,
            name STRUCT(forename STRING, surname STRING),
            dob DATE,
            nationality STRING,
            url STRING
        """
        
        drivers_df = (spark
                       .read
                       .options(**options)
                       .schema(ddl_schema)
                       .format(fileFormat)
                       .load(inPath)
                     )

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

蘑菇王子 2025-01-17 07:00:41

您使用的 STRUCT 语法错误。
这是正确的:

name STRUCT<forename:STRING,surname:STRING>

https://spark.apache.org/docs /latest/sql-ref-datatypes.html
(搜索复杂类型并选择 SQL 选项卡)

数据类型SQL 名称
BooleanTypeBOOLEAN
ByteTypeBYTE、TINYINT
ShortTypeSHORT、SMALLINT
IntegerTypeINT、INTEGER
LongTypeLONG、BIGINT
FloatTypeFLOAT、REAL
DoubleTypeDOUBLE
DateTypeDATE
TimestampTypeTIMESTAMP
StringTypeSTRING
BinaryTypeBINARY
DecimalTypeDECIMAL, DEC、NUMERIC
YearMonthIntervalTypeINTERVAL YEAR、INTERVAL 年到月、INTERVAL MONTH
DayTimeIntervalTypeINTERVAL DAY、INTERVAL DAY 到 HOUR、INTERVAL DAY 到 MINUTE、INTERVAL DAY 到 SECOND、INTERVAL 小时、INTERVAL 小时到分钟、INTERVAL 小时到秒、INTERVAL MINUTE、INTERVAL分钟到第二,间隔第二
ArrayTypeARRAY
StructTypeSTRUCT注意:“:”是可选的。
MapTypeMAP

You are using the wrong syntax for STRUCT.
Here is the right one:

name STRUCT<forename:STRING,surname:STRING>

https://spark.apache.org/docs/latest/sql-ref-datatypes.html
(search for Complex types and choose the SQL tab)

Data typeSQL name
BooleanTypeBOOLEAN
ByteTypeBYTE, TINYINT
ShortTypeSHORT, SMALLINT
IntegerTypeINT, INTEGER
LongTypeLONG, BIGINT
FloatTypeFLOAT, REAL
DoubleTypeDOUBLE
DateTypeDATE
TimestampTypeTIMESTAMP
StringTypeSTRING
BinaryTypeBINARY
DecimalTypeDECIMAL, DEC, NUMERIC
YearMonthIntervalTypeINTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
DayTimeIntervalTypeINTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
ArrayTypeARRAY<element_type>
StructTypeSTRUCT<field1_name: field1_type, field2_name: field2_type, …> Note: ‘:’ is optional.
MapTypeMAP<key_type, value_type>
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文