导入从Mongo到Pandas DataFrame的特定数据
我在MongoDB的集合中有大量数据,我需要在Jupyter中使用Pandas和Pymongo进行分析。我正在尝试在数据框架中导入特定数据。
示例数据。
{
"stored": "2022-04-xx",
...
...
"completedQueues": [
"STATEMENT_FORWARDING_QUEUE",
"STATEMENT_PERSON_QUEUE",
"STATEMENT_QUERYBUILDERCACHE_QUEUE"
],
"activities": [
"https://example.com
],
"hash": "xxx",
"agents": [
"mailto:[email protected]"
],
"statement": { <=== I want to import the data from "statement"
"authority": {
"objectType": "Agent",
"name": "xxx",
"mbox": "mailto:[email protected]"
},
"stored": "2022-04-xxx",
"context": {
"platform": "Unknown",
"extensions": {
"http://example.com",
"xxx.com": {
"user_agent": "xxx"
},
"http://example.com": ""
}
},
"actor": {
"objectType": "xxx",
"name": "xxx",
"mbox": "mailto:[email protected]"
},
"timestamp": "2022-04-xxx",
"version": "1.0.0",
"id": "xxx",
"verb": {
"id": "http://example.com",
"display": {
"en-US": "viewed"
}
},
"object": {
"objectType": "xxx",
"id": "https://example.com",
"definition": {
"type": "http://example.com",
"name": {
"en-US": ""
},
"description": {
"en-US": "Viewed"
}
}
}
}, <=== up to here
"hasGeneratedId": true,
...
...
}
请注意,我只对嵌套在“语句”下的数据感兴趣,而不是对包含字符串的任何数据,即上面的“ statement_forwarding_queue”。
我要完成的工作是在数据框架中从“语句”(如上所述)导入数据,并以以下方式排列它们,例如:
ID | 授权objecttype | 授权授权 | 授权授权 | 存储的 | 上下文平台 | 上下文平台上下文扩展 | Actor Actions Actor Objecter Objects | Actor名称 | .. 。 |
---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | |
| | | | | | | | | |
或“ Actor Objecttype”。
我尝试过:
df = pd.DataFrame(list(collection.find(query)(filters)))
df = json_normalize(list(collection.find(query)(filters)))
有各种查询,过滤器和切片,并且还汇总和映射/减少,但没有任何产生正确的输出。
我还想根据“存储”字段(sort('$ natural',-1)?)对(最新到最旧)进行排序,也许也将限制(xx)应用于数据帧。
有什么想法吗?
提前致谢。
I have a large amount of data in a collection in mongodb which I need to analyze, using pandas and pymongo in jupyter. I am trying to import specific data in a dataframe.
Sample data.
{
"stored": "2022-04-xx",
...
...
"completedQueues": [
"STATEMENT_FORWARDING_QUEUE",
"STATEMENT_PERSON_QUEUE",
"STATEMENT_QUERYBUILDERCACHE_QUEUE"
],
"activities": [
"https://example.com
],
"hash": "xxx",
"agents": [
"mailto:[email protected]"
],
"statement": { <=== I want to import the data from "statement"
"authority": {
"objectType": "Agent",
"name": "xxx",
"mbox": "mailto:[email protected]"
},
"stored": "2022-04-xxx",
"context": {
"platform": "Unknown",
"extensions": {
"http://example.com",
"xxx.com": {
"user_agent": "xxx"
},
"http://example.com": ""
}
},
"actor": {
"objectType": "xxx",
"name": "xxx",
"mbox": "mailto:[email protected]"
},
"timestamp": "2022-04-xxx",
"version": "1.0.0",
"id": "xxx",
"verb": {
"id": "http://example.com",
"display": {
"en-US": "viewed"
}
},
"object": {
"objectType": "xxx",
"id": "https://example.com",
"definition": {
"type": "http://example.com",
"name": {
"en-US": ""
},
"description": {
"en-US": "Viewed"
}
}
}
}, <=== up to here
"hasGeneratedId": true,
...
...
}
Notice that I am only interested in data nested under "statement", and not in any data containing the string, ie the "STATEMENT_FORWARDING_QUEUE" above it.
What I am trying to accomplish is import the data from "statement" (as indicated above) in a dataframe, and arrange them in a manner, like:
id | authority objectType | authority name | authority mbox | stored | context platform | context extensions | actor objectType | actor name | ... |
---|---|---|---|---|---|---|---|---|---|
00 | Agent | xxx | mailto | 2022- | Unknown | http://1 | xxx | xxx | ... |
01 | Agent | yyy | mailto | 2022- | Unknown | http://2 | yyy | yyy | ... |
The idea is to be able to access any data like "authority name" or "actor objectType".
I have tried:
df = pd.DataFrame(list(collection.find(query)(filters)))
df = json_normalize(list(collection.find(query)(filters)))
with various queries, filter and slices, and also aggregate and map/reduce, but nothing produces the correct output.
I would also like to sort (newest to oldest) based on the "stored" field (sort('$natural',-1) ?), and maybe apply limit(xx) to the dataframe as well.
Any ideas?
Thanks in advance.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
尝试一下
Try this
感谢您的答案,@Pavel。它是现实的,几乎可以解决这个问题。
我还添加了排序和限制,因此,如果有人有兴趣,最终代码看起来像这样:
Thanks for the answer, @pavel. It is spot on and pretty much solves the problem.
I also added sorting and limit, so if anyone is interested, the final code looks like this: