Expanding a DataFrame whose rows contain lists of dictionaries
I have about 300,000 rows like the ones below, but all I need is the id and the email address. The dataframe looks like this:
import pandas as pd

d = {'vid': [1201,1202], 'col2': [[{'vid': 1201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0,
'identities': [{'type': 'EMAIL', 'value': '[email protected]', 'timestamp': 1548608578090, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': '69c4f6ec-e0e9-4632-8d16-cbc204a57b22', 'timestamp': 1548608578106}]},
{'vid': 314479851, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 183374504, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17543251, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 99700201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 65375052, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17525601, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 238128701, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []}],
[{'vid': 1202, 'saved-at-timestamp': 1548608578109, 'deleted-changed-timestamp': 0,
'identities': [{'type': 'EMAIL', 'value': '[email protected]', 'timestamp': 1548608578088, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': 'fe6c2628-b1db-47c5-91f6-258e79ea58f0', 'timestamp': 1548608578106}]}]]}
df = pd.DataFrame(d)
df

vid   col2
1201  [{'vid': 1201, 'saved-at-timestamp': 1638824550030, ...
1202  [{'vid': 1202, 'saved-at-timestamp': 1548608578109, ...
Expected output (only two fields, but for all rows):
vid email
1201 [email protected]
1202 [email protected]
.. ..
Comments (2)
You can use pd.json_normalize to flatten the nested records.
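The snippet that went with this answer is not shown on this page, so the following is only a sketch of how such a call might look. It assumes each element of col2 is a list of contact dicts with an 'identities' list, as in the question. The data is a trimmed-down stand-in, and the email values are placeholders, since the real addresses are redacted in the source.

```python
import pandas as pd

# Trimmed-down stand-in for the question's data. The email values are
# placeholders (assumption): the real ones are redacted in the source.
d = {
    "vid": [1201, 1202],
    "col2": [
        [
            {"vid": 1201, "identities": [
                {"type": "EMAIL", "value": "first@example.com", "is-primary": True},
                {"type": "LEAD_GUID", "value": "69c4f6ec-e0e9-4632-8d16-cbc204a57b22"},
            ]},
            {"vid": 314479851, "identities": []},
        ],
        [
            {"vid": 1202, "identities": [
                {"type": "EMAIL", "value": "second@example.com", "is-primary": True},
                {"type": "LEAD_GUID", "value": "fe6c2628-b1db-47c5-91f6-258e79ea58f0"},
            ]},
        ],
    ],
}
df = pd.DataFrame(d)

# Flatten the per-row lists into a single list of contact dicts.
records = [contact for contacts in df["col2"] for contact in contacts]

# One output row per entry in "identities"; the parent "vid" is carried along
# as a metadata column. Contacts with an empty "identities" list yield no rows.
flat = pd.json_normalize(records, record_path="identities", meta=["vid"])
print(flat)
```

The result has one row per identity, with the columns type, value, is-primary and vid, so both the EMAIL and LEAD_GUID entries are still present at this stage.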
Then use .loc to select just the data you want.
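As a rough illustration of the .loc step (not necessarily the exact code from the answer), the sketch below filters a normalized frame down to the EMAIL rows and keeps only vid and value. The records list, the placeholder emails, and the rename to an email column are assumptions made for the example.

```python
import pandas as pd

# Small stand-in for the flattened contacts; emails are placeholders (assumption).
records = [
    {"vid": 1201, "identities": [
        {"type": "EMAIL", "value": "first@example.com"},
        {"type": "LEAD_GUID", "value": "69c4f6ec-e0e9-4632-8d16-cbc204a57b22"},
    ]},
    {"vid": 314479851, "identities": []},
    {"vid": 1202, "identities": [
        {"type": "EMAIL", "value": "second@example.com"},
        {"type": "LEAD_GUID", "value": "fe6c2628-b1db-47c5-91f6-258e79ea58f0"},
    ]},
]
flat = pd.json_normalize(records, record_path="identities", meta=["vid"])

# The boolean mask picks the EMAIL rows; the column list keeps only vid and value.
emails = (
    flat.loc[flat["type"] == "EMAIL", ["vid", "value"]]
    .rename(columns={"value": "email"})
    .reset_index(drop=True)
)
print(emails)
```

Contacts without an EMAIL identity simply do not appear in the result; if they should be kept with a missing email, a left merge back onto the original vid column would be one option.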
Alternatively, instead of using .loc, you can pivot the dataframe after json_normalize.
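A minimal sketch of the pivot variant, under the assumption that each vid has at most one identity per type (DataFrame.pivot raises an error on duplicate index/column pairs). The sample records and placeholder emails are again stand-ins, not the original data.

```python
import pandas as pd

# Small stand-in for the flattened contacts; emails are placeholders (assumption).
records = [
    {"vid": 1201, "identities": [
        {"type": "EMAIL", "value": "first@example.com"},
        {"type": "LEAD_GUID", "value": "69c4f6ec-e0e9-4632-8d16-cbc204a57b22"},
    ]},
    {"vid": 1202, "identities": [
        {"type": "EMAIL", "value": "second@example.com"},
        {"type": "LEAD_GUID", "value": "fe6c2628-b1db-47c5-91f6-258e79ea58f0"},
    ]},
]
flat = pd.json_normalize(records, record_path="identities", meta=["vid"])

# Each identity type becomes its own column, with one row per vid.
wide = flat.pivot(index="vid", columns="type", values="value")

# Keep only the EMAIL column and turn the vid index back into a regular column.
emails = wide["EMAIL"].rename("email").reset_index()
print(emails)
```

If the real data can hold several identities of the same type for one vid, pivot_table with an explicit aggfunc (for example "first") would be a more forgiving choice.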
Here's one way using json_normalize. Alternatively, just apply the str accessor repeatedly to pull out only the emails.
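The code for this answer is also missing from the page, so what follows is a hedged sketch of both ideas rather than the answer's own code. The first variant explodes col2 before normalizing. The second chains .str lookups and assumes the contact of interest is the first item in col2 and its email is the first entry in identities, which holds for the sample in the question but may not hold for every row. Email values are placeholders.

```python
import pandas as pd

# Stand-in data shaped like the question's; emails are placeholders (assumption).
d = {
    "vid": [1201, 1202],
    "col2": [
        [
            {"vid": 1201, "identities": [
                {"type": "EMAIL", "value": "first@example.com"},
                {"type": "LEAD_GUID", "value": "69c4f6ec-e0e9-4632-8d16-cbc204a57b22"},
            ]},
            {"vid": 314479851, "identities": []},
        ],
        [
            {"vid": 1202, "identities": [
                {"type": "EMAIL", "value": "second@example.com"},
                {"type": "LEAD_GUID", "value": "fe6c2628-b1db-47c5-91f6-258e79ea58f0"},
            ]},
        ],
    ],
}
df = pd.DataFrame(d)

# Variant 1: explode so every contact dict gets its own row, then normalize.
# Assumes no row has an empty col2 list (explode would produce NaN for it).
contacts = df["col2"].explode().tolist()
flat = pd.json_normalize(contacts, record_path="identities", meta=["vid"])
via_normalize = flat.loc[flat["type"] == "EMAIL", ["vid", "value"]]

# Variant 2: chained .str lookups. .str[0] indexes into each list and
# .str.get(key) reads a key from each dict, so no intermediate frame is built.
first_identity = df["col2"].str[0].str.get("identities").str[0]
df["email"] = first_identity.str.get("value")

print(df[["vid", "email"]])
```

The second variant avoids building an intermediate frame, which may matter at 300,000 rows, but it silently returns whatever identity happens to come first, so it is worth checking the type field if the ordering is not guaranteed.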