导入从Mongo到Pandas DataFrame的特定数据

发布于 2025-01-21 22:42:22 字数 3179 浏览 2 评论 0原文

我在MongoDB的集合中有大量数据,我需要在Jupyter中使用Pandas和Pymongo进行分析。我正在尝试在数据框架中导入特定数据。

示例数据。

{
    "stored": "2022-04-xx",
    ...
    ...
    "completedQueues": [
        "STATEMENT_FORWARDING_QUEUE",
        "STATEMENT_PERSON_QUEUE",
        "STATEMENT_QUERYBUILDERCACHE_QUEUE"
    ],
    "activities": [
        "https://example.com
    ],
    "hash": "xxx",
    "agents": [
        "mailto:[email protected]"
    ],
    "statement": {                                  <=== I want to import the data from "statement"
        "authority": {
            "objectType": "Agent",
            "name": "xxx",
            "mbox": "mailto:[email protected]"
        },
        "stored": "2022-04-xxx",
        "context": {
            "platform": "Unknown",
            "extensions": {
                "http://example.com",
                "xxx.com": {
                    "user_agent": "xxx"
                },
                "http://example.com": ""
            }
        },
        "actor": {
            "objectType": "xxx",
            "name": "xxx",
            "mbox": "mailto:[email protected]"
        },
        "timestamp": "2022-04-xxx",
        "version": "1.0.0",
        "id": "xxx",
        "verb": {
            "id": "http://example.com",
            "display": {
                "en-US": "viewed"
            }
        },
        "object": {
            "objectType": "xxx",
            "id": "https://example.com",
            "definition": {
                "type": "http://example.com",
                "name": {
                    "en-US": ""
                },
                "description": {
                    "en-US": "Viewed"
                }
            }
        }
    },                                             <=== up to here
    "hasGeneratedId": true,
    ...
    ...
}

请注意,我只对嵌套在“语句”下的数据感兴趣,而不是对包含字符串的任何数据,即上面的“ statement_forwarding_queue”。

我要完成的工作是在数据框架中从“语句”(如上所述)导入数据,并以以下方式排列它们,例如:

ID授权objecttype授权授权授权授权存储的上下文平台上下文平台上下文扩展Actor Actions Actor Objecter ObjectsActor名称.. 。

​或“ Actor Objecttype”。

我尝试过:

df = pd.DataFrame(list(collection.find(query)(filters)))
df = json_normalize(list(collection.find(query)(filters)))

有各种查询,过滤器和切片,并且还汇总和映射/减少,但没有任何产生正确的输出。

我还想根据“存储”字段(sort('$ natural',-1)?)对(最新到最旧)进行排序,也许也将限制(xx)应用于数据帧。

有什么想法吗?

提前致谢。

I have a large amount of data in a collection in mongodb which I need to analyze, using pandas and pymongo in jupyter. I am trying to import specific data in a dataframe.

Sample data.

{
    "stored": "2022-04-xx",
    ...
    ...
    "completedQueues": [
        "STATEMENT_FORWARDING_QUEUE",
        "STATEMENT_PERSON_QUEUE",
        "STATEMENT_QUERYBUILDERCACHE_QUEUE"
    ],
    "activities": [
        "https://example.com
    ],
    "hash": "xxx",
    "agents": [
        "mailto:[email protected]"
    ],
    "statement": {                                  <=== I want to import the data from "statement"
        "authority": {
            "objectType": "Agent",
            "name": "xxx",
            "mbox": "mailto:[email protected]"
        },
        "stored": "2022-04-xxx",
        "context": {
            "platform": "Unknown",
            "extensions": {
                "http://example.com",
                "xxx.com": {
                    "user_agent": "xxx"
                },
                "http://example.com": ""
            }
        },
        "actor": {
            "objectType": "xxx",
            "name": "xxx",
            "mbox": "mailto:[email protected]"
        },
        "timestamp": "2022-04-xxx",
        "version": "1.0.0",
        "id": "xxx",
        "verb": {
            "id": "http://example.com",
            "display": {
                "en-US": "viewed"
            }
        },
        "object": {
            "objectType": "xxx",
            "id": "https://example.com",
            "definition": {
                "type": "http://example.com",
                "name": {
                    "en-US": ""
                },
                "description": {
                    "en-US": "Viewed"
                }
            }
        }
    },                                             <=== up to here
    "hasGeneratedId": true,
    ...
    ...
}

Notice that I am only interested in data nested under "statement", and not in any data containing the string, ie the "STATEMENT_FORWARDING_QUEUE" above it.

What I am trying to accomplish is import the data from "statement" (as indicated above) in a dataframe, and arrange them in a manner, like:

idauthority objectTypeauthority nameauthority mboxstoredcontext platformcontext extensionsactor objectTypeactor name...
00Agentxxxmailto2022-Unknownhttp://1xxxxxx...
01Agentyyymailto2022-Unknownhttp://2yyyyyy...

The idea is to be able to access any data like "authority name" or "actor objectType".

I have tried:

df = pd.DataFrame(list(collection.find(query)(filters)))
df = json_normalize(list(collection.find(query)(filters)))

with various queries, filter and slices, and also aggregate and map/reduce, but nothing produces the correct output.

I would also like to sort (newest to oldest) based on the "stored" field (sort('$natural',-1) ?), and maybe apply limit(xx) to the dataframe as well.

Any ideas?

Thanks in advance.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

故乡的云 2025-01-28 22:42:22

尝试一下

df = json_normalize(list(
    collection.aggregate([
        {
            "$match": query
        },
        {
            "$replaceRoot": {
                "newRoot": "$statement"
            }
        }
    ])
)

Try this

df = json_normalize(list(
    collection.aggregate([
        {
            "$match": query
        },
        {
            "$replaceRoot": {
                "newRoot": "$statement"
            }
        }
    ])
)
叹沉浮 2025-01-28 22:42:22

感谢您的答案,@Pavel。它是现实的,几乎可以解决这个问题。

我还添加了排序和限制,因此,如果有人有兴趣,最终代码看起来像这样:

df = json_normalize(list(
  statements_coll.aggregate([
    {
        "$match": query
    },
    {
        "$replaceRoot": {
            "newRoot": "$statement"
        }
    },
    { 
        "$sort": { 
            "stored": -1 
        }
    },
    {
        "$limit": 10 
    }
  ]) 
))

Thanks for the answer, @pavel. It is spot on and pretty much solves the problem.

I also added sorting and limit, so if anyone is interested, the final code looks like this:

df = json_normalize(list(
  statements_coll.aggregate([
    {
        "$match": query
    },
    {
        "$replaceRoot": {
            "newRoot": "$statement"
        }
    },
    { 
        "$sort": { 
            "stored": -1 
        }
    },
    {
        "$limit": 10 
    }
  ]) 
))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文