如何“爆炸”?单元格上的数据格式化为字符串,以便我可以将键转换为 pyspark 上的列?

发布于 2025-01-17 04:42:05 字数 809 浏览 2 评论 0原文

我正在使用的数据集之一有一个名为 json_data 的列,其中包含如下数据:

{
    "eta": "",
    "eta_value": 0,
    "schedules": [{
        "open_time": "10:15:00",
        "close_time": "14:00:00"
    }, {
        "open_time": "18:00:00",
        "close_time": "20:00:00"
    }],
    "logo": "1617723892776.png",
    "score_v2": 0,
    "id": "900371722_8339714",
    "store_id": 900371722,
    "super_store_id": 900371722,
    "index": 375,
    "brand_name": "Carl's Restaurant",
    "store_type": "restaurant",
    "has_promise": false,
    "tags": [189],
    "background": "1618349497.jpg",
    "is_enabled": false,
    "friendly_url": {
        "store_id": 90037172
    }
}

该列是字符串类型,这意味着我无法轻松地将其中的信息转换为列。这就是我来到这里的原因:如何将数据转换为列?特别是“时间表”内的嵌套数据。

我在这个专栏上遇到了困难。

One of the datasets I'm working with has a column called json_data, which contains data like this:

{
    "eta": "",
    "eta_value": 0,
    "schedules": [{
        "open_time": "10:15:00",
        "close_time": "14:00:00"
    }, {
        "open_time": "18:00:00",
        "close_time": "20:00:00"
    }],
    "logo": "1617723892776.png",
    "score_v2": 0,
    "id": "900371722_8339714",
    "store_id": 900371722,
    "super_store_id": 900371722,
    "index": 375,
    "brand_name": "Carl's Restaurant",
    "store_type": "restaurant",
    "has_promise": false,
    "tags": [189],
    "background": "1618349497.jpg",
    "is_enabled": false,
    "friendly_url": {
        "store_id": 90037172
    }
}

This column is a string type, which means I cannot easily turn the info inside it into columns. And that's what brings me here: how can I turn the data here in columns? Specially with the nested data inside "schedules".

I'm having a hard time with this column.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

吻泪 2025-01-24 04:42:05

几个月前,我也在为类似的 json 结构而苦苦挣扎。很高兴你提起它,帮助刷新了我的记忆!
我按照以下步骤解决 -

输入数据

df1.show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|jsondata                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{     "eta": "",     "eta_value": 0,     "schedules": [{         "open_time": "10:15:00",         "close_time": "14:00:00"     }, {         "open_time": "18:00:00",         "close_time": "20:00:00"     }],     "logo": "1617723892776.png",     "score_v2": 0,     "id": "900371722_8339714",     "store_id": 900371722,     "super_store_id": 900371722,     "index": 375,     "brand_name": "Carl's Restaurant",     "store_type": "restaurant",     "has_promise": false,     "tags": [189],     "background": "1618349497.jpg",     "is_enabled": false,     "friendly_url": {         "store_id": 90037172     } }|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

jsondata 列转换为 MapType 如下 -

from pyspark.sql.functions import *
from pyspark.sql.types import *

df2 = df1.withColumn("cols", from_json( "jsondata", MapType(StringType(), StringType()) )).drop("jsondata")
df2.show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cols                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{eta -> , eta_value -> 0, schedules -> [{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}], logo -> 1617723892776.png, score_v2 -> 0, id -> 900371722_8339714, store_id -> 900371722, super_store_id -> 900371722, index -> 375, brand_name -> Carl's Restaurant, store_type -> restaurant, has_promise -> false, tags -> [189], background -> 1618349497.jpg, is_enabled -> false, friendly_url -> {"store_id":90037172}}|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

现在,列 cols需要按如下方式展开 -

df3= df2.select(explode("cols").alias("col_columns", "col_rows"))
df3.show(truncate=False)

+--------------+---------------------------------------------------------------------------------------------------+
|col_columns   |col_rows                                                                                           |
+--------------+---------------------------------------------------------------------------------------------------+
|eta           |                                                                                                   |
|eta_value     |0                                                                                                  |
|schedules     |[{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}]|
|logo          |1617723892776.png                                                                                  |
|score_v2      |0                                                                                                  |
|id            |900371722_8339714                                                                                  |
|store_id      |900371722                                                                                          |
|super_store_id|900371722                                                                                          |
|index         |375                                                                                                |
|brand_name    |Carl's Restaurant                                                                                  |
|store_type    |restaurant                                                                                         |
|has_promise   |false                                                                                              |
|tags          |[189]                                                                                              |
|background    |1618349497.jpg                                                                                     |
|is_enabled    |false                                                                                              |
|friendly_url  |{"store_id":90037172}                                                                              |
+--------------+---------------------------------------------------------------------------------------------------+

一旦您将 col_columnscol_rows 作为单独的列,所需要做的就是对 col_columns 进行透视并聚合它使用其对应的第一个col_rows 如下 -

df4 = (df3.groupBy()
         .pivot("col_columns")
         .agg(first("col_rows"))
)
df4.show(truncate=False)

输出

+--------------+-----------------+---+---------+---------------------+-----------+-----------------+-----+----------+-----------------+---------------------------------------------------------------------------------------------------+--------+---------+----------+--------------+-----+
|background    |brand_name       |eta|eta_value|friendly_url         |has_promise|id               |index|is_enabled|logo             |schedules                                                                                          |score_v2|store_id |store_type|super_store_id|tags |
+--------------+-----------------+---+---------+---------------------+-----------+-----------------+-----+----------+-----------------+---------------------------------------------------------------------------------------------------+--------+---------+----------+--------------+-----+
|1618349497.jpg|Carl's Restaurant|   |0        |{"store_id":90037172}|false      |900371722_8339714|375  |false     |1617723892776.png|[{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}]|0       |900371722|restaurant|900371722     |[189]|
+--------------+-----------------+---+---------+---------------------+-----------+-----------------+-----+----------+-----------------+---------------------------------------------------------------------------------------------------+--------+---------+----------+--------------+-----+

PS - 如果您想分解诸如 schedulesFriendly_url< 等列/code> 那么您可能必须重复上述步骤。如下所示-

df4 = df4.withColumn("schedule_json", from_json("schedules", ArrayType(MapType(StringType(),StringType()))))

df4.select(explode("schedule_json").alias("schedules")).select(explode("schedules")).show(truncate=False)

+----------+--------+
|key       |value   |
+----------+--------+
|open_time |10:15:00|
|close_time|14:00:00|
|open_time |18:00:00|
|close_time|20:00:00|
+----------+--------+

Few months back, I was also struggling with similar json structure. Glad you brought it up, helped refreshing my memory!
I followed the below steps to resolve -

Input Data

df1.show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|jsondata                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{     "eta": "",     "eta_value": 0,     "schedules": [{         "open_time": "10:15:00",         "close_time": "14:00:00"     }, {         "open_time": "18:00:00",         "close_time": "20:00:00"     }],     "logo": "1617723892776.png",     "score_v2": 0,     "id": "900371722_8339714",     "store_id": 900371722,     "super_store_id": 900371722,     "index": 375,     "brand_name": "Carl's Restaurant",     "store_type": "restaurant",     "has_promise": false,     "tags": [189],     "background": "1618349497.jpg",     "is_enabled": false,     "friendly_url": {         "store_id": 90037172     } }|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Convert jsondata column to MapType as below -

from pyspark.sql.functions import *
from pyspark.sql.types import *

df2 = df1.withColumn("cols", from_json( "jsondata", MapType(StringType(), StringType()) )).drop("jsondata")
df2.show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cols                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{eta -> , eta_value -> 0, schedules -> [{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}], logo -> 1617723892776.png, score_v2 -> 0, id -> 900371722_8339714, store_id -> 900371722, super_store_id -> 900371722, index -> 375, brand_name -> Carl's Restaurant, store_type -> restaurant, has_promise -> false, tags -> [189], background -> 1618349497.jpg, is_enabled -> false, friendly_url -> {"store_id":90037172}}|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Now, column cols needs to be exploded as below -

df3= df2.select(explode("cols").alias("col_columns", "col_rows"))
df3.show(truncate=False)

+--------------+---------------------------------------------------------------------------------------------------+
|col_columns   |col_rows                                                                                           |
+--------------+---------------------------------------------------------------------------------------------------+
|eta           |                                                                                                   |
|eta_value     |0                                                                                                  |
|schedules     |[{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}]|
|logo          |1617723892776.png                                                                                  |
|score_v2      |0                                                                                                  |
|id            |900371722_8339714                                                                                  |
|store_id      |900371722                                                                                          |
|super_store_id|900371722                                                                                          |
|index         |375                                                                                                |
|brand_name    |Carl's Restaurant                                                                                  |
|store_type    |restaurant                                                                                         |
|has_promise   |false                                                                                              |
|tags          |[189]                                                                                              |
|background    |1618349497.jpg                                                                                     |
|is_enabled    |false                                                                                              |
|friendly_url  |{"store_id":90037172}                                                                              |
+--------------+---------------------------------------------------------------------------------------------------+

Once, you have col_columns and col_rows as individual columns, all that is needed to do is pivot col_columns and aggregate it using its corresponding first col_rows as below -

df4 = (df3.groupBy()
         .pivot("col_columns")
         .agg(first("col_rows"))
)
df4.show(truncate=False)

Output

+--------------+-----------------+---+---------+---------------------+-----------+-----------------+-----+----------+-----------------+---------------------------------------------------------------------------------------------------+--------+---------+----------+--------------+-----+
|background    |brand_name       |eta|eta_value|friendly_url         |has_promise|id               |index|is_enabled|logo             |schedules                                                                                          |score_v2|store_id |store_type|super_store_id|tags |
+--------------+-----------------+---+---------+---------------------+-----------+-----------------+-----+----------+-----------------+---------------------------------------------------------------------------------------------------+--------+---------+----------+--------------+-----+
|1618349497.jpg|Carl's Restaurant|   |0        |{"store_id":90037172}|false      |900371722_8339714|375  |false     |1617723892776.png|[{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}]|0       |900371722|restaurant|900371722     |[189]|
+--------------+-----------------+---+---------+---------------------+-----------+-----------------+-----+----------+-----------------+---------------------------------------------------------------------------------------------------+--------+---------+----------+--------------+-----+

P.S. - If you want to explode the columns like schedules, friendly_url as well then you might have to repeat the above steps. Something as below -

df4 = df4.withColumn("schedule_json", from_json("schedules", ArrayType(MapType(StringType(),StringType()))))

df4.select(explode("schedule_json").alias("schedules")).select(explode("schedules")).show(truncate=False)

+----------+--------+
|key       |value   |
+----------+--------+
|open_time |10:15:00|
|close_time|14:00:00|
|open_time |18:00:00|
|close_time|20:00:00|
+----------+--------+
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文