使用列表中包含字典的行扩展数据框

发布于 2025-01-15 20:01:02 字数 2354 浏览 2 评论 0原文

我有大约 300.000 行,如下所示,但我需要的只是 id 和电子邮件地址。像这样的数据框:

d = {'vid': [1201,1202], 'col2': [[{'vid': 1201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0,
 'identities': [{'type': 'EMAIL', 'value': '[email protected]', 'timestamp': 1548608578090, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': '69c4f6ec-e0e9-4632-8d16-cbc204a57b22', 'timestamp': 1548608578106}]},
{'vid': 314479851, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 183374504, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17543251, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 99700201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 65375052, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17525601, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 238128701, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []}],

[{'vid': 1202, 'saved-at-timestamp': 1548608578109, 'deleted-changed-timestamp': 0, 
'identities': [{'type': 'EMAIL', 'value': '[email protected]', 'timestamp': 1548608578088, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': 'fe6c2628-b1db-47c5-91f6-258e79ea58f0', 'timestamp': 1548608578106}]}]]}

df=pd.DataFrame(d)
df

vid   col2
1201    [{'vid': 1201, 'saved-at-timestamp': 1638824550030........
1202    [{'vid': 1202, 'saved-at-timestamp': 1548608578109......

预期输出(只有两个字段,但适用于所有行):

vid   email
1201  [email protected]
1202  [email protected]
..    ..

我尝试在这里应用解决方案,但没有成功

I have about 300.000 rows as below, but what I need is only id and email address. Dataframe like this:

d = {'vid': [1201,1202], 'col2': [[{'vid': 1201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0,
 'identities': [{'type': 'EMAIL', 'value': '[email protected]', 'timestamp': 1548608578090, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': '69c4f6ec-e0e9-4632-8d16-cbc204a57b22', 'timestamp': 1548608578106}]},
{'vid': 314479851, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 183374504, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17543251, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 99700201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 65375052, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17525601, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 238128701, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []}],

[{'vid': 1202, 'saved-at-timestamp': 1548608578109, 'deleted-changed-timestamp': 0, 
'identities': [{'type': 'EMAIL', 'value': '[email protected]', 'timestamp': 1548608578088, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': 'fe6c2628-b1db-47c5-91f6-258e79ea58f0', 'timestamp': 1548608578106}]}]]}

df=pd.DataFrame(d)
df

vid   col2
1201    [{'vid': 1201, 'saved-at-timestamp': 1638824550030........
1202    [{'vid': 1202, 'saved-at-timestamp': 1548608578109......

expected output (only two fields but for all rows):

vid   email
1201  [email protected]
1202  [email protected]
..    ..

I tried to apply the solutions here but it didn't work

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

兲鉂ぱ嘚淚 2025-01-22 20:01:02

您可以使用 pd.json_normalize

df = pd.json_normalize([sub for item in d['col2'] for sub in item], record_path='identities', meta='vid')

输出:

>>> df
        type                                 value      timestamp is-primary   vid
0      EMAIL                        [email protected]  1548608578090       True  1201
1  LEAD_GUID  69c4f6ec-e0e9-4632-8d16-cbc204a57b22  1548608578106        NaN  1201
2      EMAIL                        [email protected]  1548608578088       True  1202
3  LEAD_GUID  fe6c2628-b1db-47c5-91f6-258e79ea58f0  1548608578106        NaN  1202

现在只需使用 .loc 即可获取所需的数据:

df = df.loc[df['type'] == 'EMAIL', ['vid', 'value']]

输出:

>>> df
    vid           value
0  1201  [email protected]
2  1202  [email protected]

或者您可以在使用 json_normalize 后旋转数据帧,而不是使用 .loc

df = df.pivot(index='vid', columns='type', values='value').rename_axis(None, axis=1).reset_index()

输出:

>>> df
    vid           EMAIL                             LEAD_GUID
0  1201  [email protected]  69c4f6ec-e0e9-4632-8d16-cbc204a57b22
1  1202  [email protected]  fe6c2628-b1db-47c5-91f6-258e79ea58f0

You can use pd.json_normalize:

df = pd.json_normalize([sub for item in d['col2'] for sub in item], record_path='identities', meta='vid')

Output:

>>> df
        type                                 value      timestamp is-primary   vid
0      EMAIL                        [email protected]  1548608578090       True  1201
1  LEAD_GUID  69c4f6ec-e0e9-4632-8d16-cbc204a57b22  1548608578106        NaN  1201
2      EMAIL                        [email protected]  1548608578088       True  1202
3  LEAD_GUID  fe6c2628-b1db-47c5-91f6-258e79ea58f0  1548608578106        NaN  1202

And now just use .loc to get the data you want:

df = df.loc[df['type'] == 'EMAIL', ['vid', 'value']]

Output:

>>> df
    vid           value
0  1201  [email protected]
2  1202  [email protected]

Or you can pivot the dataframe after using json_normalize, instead of using .loc:

df = df.pivot(index='vid', columns='type', values='value').rename_axis(None, axis=1).reset_index()

Output:

>>> df
    vid           EMAIL                             LEAD_GUID
0  1201  [email protected]  69c4f6ec-e0e9-4632-8d16-cbc204a57b22
1  1202  [email protected]  fe6c2628-b1db-47c5-91f6-258e79ea58f0
浮萍、无处依 2025-01-22 20:01:02

这是使用 json_normalize 的一种方法:

out = (pd.concat(pd.json_normalize(lst, ['identities'], 'vid') for lst in d['col2'])
       .pipe(lambda x: x[x['type']=='EMAIL'])[['vid','value']]
       .rename(columns={'value':'email'}))

或者仅对“电子邮件”重复使用 str 访问器:

df=pd.DataFrame(d)
df['email'] = df['col2'].str[0].str.get('identities').str[0].str.get('value')
df = df.drop(columns='col2')

输出:

    vid           email
0  1201  [email protected]
0  1202  [email protected]

Here's one way using json_normalize:

out = (pd.concat(pd.json_normalize(lst, ['identities'], 'vid') for lst in d['col2'])
       .pipe(lambda x: x[x['type']=='EMAIL'])[['vid','value']]
       .rename(columns={'value':'email'}))

or just use str accessor repeatedly for only the "emails":

df=pd.DataFrame(d)
df['email'] = df['col2'].str[0].str.get('identities').str[0].str.get('value')
df = df.drop(columns='col2')

Output:

    vid           email
0  1201  [email protected]
0  1202  [email protected]
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文