
apache-spark pyspark parquet

PySpark from_json fails with error: Cannot parse the schema in JSON format: Unrecognized token 'array': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')

Posted 2025-02-05 06:59:34 · 1910 characters · 2 views · 0 comments


parquet_path = '/tmp/test-parquet'

The content of t2.json is:

{
   "id": "OK_good2", 
   "some-array": [
      {"array-field-1":"f1a","array-field-2":"f2a"},
      {"array-field-1":"f1b","array-field-2":"f2b"}
   ]
}

Creating a DataFrame from t2.json:

from pyspark.sql.functions import col, from_json

df = spark.read.json('t2.json')
df = df.withColumn('some-array', col('some-array').cast('string'))
df.write.mode("overwrite").parquet(parquet_path)

The schema string was obtained from:

schema = dict(df.dtypes)['some-array'] # o/p array<struct<array-field-1:string,array-field-2:string>>

Reading from parquet_path:

final_df = spark.read.parquet(parquet_path)
final_df.select('some-array').show(3, False)

+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+

When I try to parse that column back into its JSON structure with from_json, it fails, and I cannot figure out why. Any help is appreciated.

final_df.select(from_json(col('some-array'), 'array<struct<array-field-1:string,array-field-2:string>>', {'allowUnquotedFieldNames':True}).
                alias('json1')).show(2, False)

This throws the error:

AnalysisException: Cannot parse the schema in JSON format: Unrecognized token 'array': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"array<struct<array-field-1:string,array-field-2:string>>"; line: 1, column: 6]
Failed fallback parsing: Cannot parse the data type:

In case anyone is interested, I'm trying to follow this post.


百善笑为先 2025-02-12 06:59:34


There are a few issues in your code.

This line discards the field names (keys) of the JSON:

df = df.withColumn('some-array', col('some-array').cast('string'))

So when you read back the saved Parquet file, you only see this:

+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+

and not this, which is what you expected:

+-------------------------------------------------------+
|some-array                                             |
+-------------------------------------------------------+
|[{"array-field-1": "f1a", "array-field-2": "f2a"}, ...]|
+-------------------------------------------------------+

The proper way to convert the array of structs to a JSON string is to use to_json:

df = df.withColumn('some-array', to_json('some-array'))

Please check df.show() or df.take(1) to see the difference.

Your DDL string also needs backticks (`) around each column name, because the names contain hyphens. (ref: https://vincent.doba.fr/posts/20211004_spark_data_description_language_for_defining_spark_schema/)

'array<struct<`array-field-1`:string,`array-field-2`:string>>'
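If the field names are not known in advance, the quoted DDL string can be assembled rather than typed by hand. The sketch below is plain Python and does not call Spark; quote_field and array_of_struct_ddl are hypothetical helper names, and doubling an embedded backtick is an assumption based on Spark SQL's convention for quoted identifiers.

```python
from typing import Mapping


def quote_field(name: str) -> str:
    """Wrap a field name in backticks so characters like '-' are accepted.

    Assumption: an embedded backtick is escaped by doubling it.
    """
    return "`" + name.replace("`", "``") + "`"


def array_of_struct_ddl(fields: Mapping[str, str]) -> str:
    """Build 'array<struct<...>>' from a mapping of field name to DDL type."""
    columns = ",".join(f"{quote_field(name)}:{dtype}" for name, dtype in fields.items())
    return f"array<struct<{columns}>>"


schema = array_of_struct_ddl({"array-field-1": "string", "array-field-2": "string"})
print(schema)
```

The resulting string can then be passed as the schema argument, for example from_json(col('some-array'), schema), once the column has been written with to_json.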
