pyspark from_json fails with error: Cannot parse the schema in JSON format: Unrecognized token 'array': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')

Posted 2025-02-05 06:59:34 · 1910 words · 2 views · 0 comments

parquet_path = '/tmp/test-parquet'

The content of t2.json is:

{
   "id": "OK_good2", 
   "some-array": [
      {"array-field-1":"f1a","array-field-2":"f2a"},
      {"array-field-1":"f1b","array-field-2":"f2b"}
   ]
}

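Creating a DataFrame from t2.json:

from pyspark.sql.functions import col, from_json

df = spark.read.json('t2.json')
# Cast the array-of-structs column to a plain string before writing
df = df.withColumn('some-array', col('some-array').cast('string'))
df.write.mode("overwrite").parquet(parquet_path)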

The schema string comes from df.dtypes:

schema = dict(df.dtypes)['some-array']  # output: array<struct<array-field-1:string,array-field-2:string>>

Reading from parquet_path:

final_df = spark.read.parquet(parquet_path)
final_df.select('some-array').show(3, False)
+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+

When I try to parse this column back into the same schema using from_json, it fails. I'm not able to figure out why. Please provide some help.

final_df.select(from_json(col('some-array'), 'array<struct<array-field-1:string,array-field-2:string>>', {'allowUnquotedFieldNames':True}).
                alias('json1')).show(2, False)

This throws the error:

AnalysisException: Cannot parse the schema in JSON format: Unrecognized token 'array': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"array<struct<array-field-1:string,array-field-2:string>>"; line: 1, column: 6]
Failed fallback parsing: Cannot parse the data type: 

In case anyone is interested, I'm trying to follow this post.

Comments (1)

百善笑为先 2025-02-12 06:59:34

There are a few issues in your code.

  1. You are losing the field names (keys) from the JSON on this line:
df = df.withColumn('some-array', col('some-array').cast('string'))

So when you read back the saved parquet file, you only see this:

+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+

and not this, which is what you expected:

+-------------------------------------------------------+
|some-array                                             |
+-------------------------------------------------------+
|[{"array-field-1": "f1a", "array-field-2": "f2a"}, ...]|
+-------------------------------------------------------+

The proper way to convert the array of structs to a JSON string is to use to_json:

# requires: from pyspark.sql.functions import to_json
df = df.withColumn('some-array', to_json('some-array'))

Check df.show() or df.take(1) to see the difference.
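
The difference between the two string forms can be checked without Spark. The sketch below uses only Python's standard json module. The first string is copied from the show() output above. The second is a hand-written assumption of the kind of text to_json produces, not captured output. The helper is_valid_json is a hypothetical name introduced for illustration.

```python
import json

# String form produced by col('some-array').cast('string'),
# copied from the show() output in the question.
cast_output = "[{f1a, f2a}, {f1b, f2b}]"

# Assumption: hand-written JSON text mirroring what to_json would emit
# for the same rows (field names preserved and quoted).
json_output = (
    '[{"array-field-1":"f1a","array-field-2":"f2a"},'
    '{"array-field-1":"f1b","array-field-2":"f2b"}]'
)


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


cast_is_json = is_valid_json(cast_output)
to_json_is_json = is_valid_json(json_output)

# The field names survive only in the JSON form.
recovered_keys = sorted(json.loads(json_output)[0])

print("cast output is JSON:", cast_is_json)
print("to_json-style output is JSON:", to_json_is_json)
print("recovered keys:", recovered_keys)
```

Since the cast output has no quoted keys, no JSON parser can map its values back to array-field-1 and array-field-2, whatever schema is passed to from_json.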

  2. Your DDL string needs backticks (`) around the column names, since they contain hyphens. (ref: https://vincent.doba.fr/posts/20211004_spark_data_description_language_for_defining_spark_schema/)
'array<struct<`array-field-1`:string,`array-field-2`:string>>'
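
If the field names come from elsewhere, the quoted DDL string can be built rather than typed by hand. This is a minimal sketch in plain Python. The helpers quote_identifier and array_of_struct_ddl are hypothetical names introduced here and are not part of PySpark. Doubling an embedded backtick is assumed to follow Spark SQL's quoted-identifier convention.

```python
def quote_identifier(name):
    """Wrap a field name in backticks.

    Assumption: an embedded backtick is escaped by doubling it,
    as in Spark SQL quoted identifiers.
    """
    return "`" + name.replace("`", "``") + "`"


def array_of_struct_ddl(fields):
    """Build an array<struct<...>> DDL type string from (name, type) pairs."""
    body = ",".join(
        f"{quote_identifier(name)}:{dtype}" for name, dtype in fields
    )
    return f"array<struct<{body}>>"


ddl = array_of_struct_ddl(
    [("array-field-1", "string"), ("array-field-2", "string")]
)
print(ddl)
```

The resulting string could then be passed as the schema argument to from_json in place of the unquoted one from the question.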