How to parse the collections in a "mongodump" archive output using Python?
Context
I have a MongoDB that is backed up every day using the following command:
mongodump --gzip --numParallelCollections=1 --oplog --archive=/tmp/dump.gz --readPreference=primary
I want to parse this dump file using Python only, to get all the underlying BSON documents, and then convert the BSON into JSON.
What I tried
Let's say I have a single db named my_db and a single collection named my_employees, which contains only two documents:
{"name": "john doe"}
{"name": "jane doe"}
I dumped this single collection using the following command:
mongodump --readPreference=primary --gzip --archive=/tmp/dump.gz --numParallelCollections=1 --db=my_db --collection=my_employees
I gunzip the dump file.
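For what it's worth, the decompression can also be done directly in Python with the standard gzip module, so the separate gunzip step is optional; a minimal sketch (using the same paths as below):

import gzip
import shutil

# stream-decompress the gzipped archive into a plain file
with gzip.open("/tmp/dump.gz", "rb") as src, open("/tmp/dump", "wb") as dst:
    shutil.copyfileobj(src, dst)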
Now I try to parse the file using only Python and pymongo. I took my inspiration from this Go parser.
I don't know Go, but my understanding is that the dump file contains zero or more blocks, each with the following structure:
terminator_or_size_of_bson: 4 bytes
bson_document: N bytes
Here is the code I came up with (it doesn't handle a lot of things, but it's a quick draft):
import bson

dump = open("/tmp/dump", "rb").read()  # I gunzipped the file beforehand
file_size = len(dump)
i = 0
nb_bsons_to_parse = 10  # I try to print the first 10 BSON documents
bsons_parsed = 0
while i < file_size and bsons_parsed < nb_bsons_to_parse:
    # read what I believe is a 4-byte little-endian size prefix
    bson_size = int.from_bytes(dump[i: i + 4], "little")
    print("here is the bson_size ", bson_size)
    print("here is the bson_size in bytes ", dump[i: i + 4])
    # then read that many bytes as one BSON document
    bson_document_bytes = dump[i + 4: i + 4 + bson_size]
    bson_document = bson.decode_all(bson_document_bytes)
    print(bson_document)
    bsons_parsed += 1
    i += 4 + bson_size  # skip past the prefix and the document
Here is the error I get:
here is the bson_size 2174345837
here is the bson_size in bytes b'm\xe2\x99\x81'
Traceback (most recent call last):
File "..../read_bson_from_dump.py", line 27, in <module>
bson_document = bson.decode_all(bson_document_bytes)
bson.errors.InvalidBSON: invalid message size
You can see that the first four bytes decode to the value 2174345837, which exceeds the allowed 16 MB document size limit.
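A quick sanity check in a REPL confirms the numbers (using the exact bytes from the output above):

>>> int.from_bytes(b"m\xe2\x99\x81", "little")
2174345837
>>> 16 * 1024 * 1024
16777216

So whatever these four bytes are, they cannot be a document size.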
I then tried a different BSON API:
# ... only this loop changes
while i < file_size and bsons_parsed < nb_bsons_to_parse:
    bson_size = int.from_bytes(dump[i: i + 4], "little")
    print("here is the bson_size ", bson_size)
    print("here is the bson_size in bytes ", dump[i: i + 4])
    bson_document_bytes = dump[i + 4: i + 4 + bson_size]
    # decode_iter yields the documents one by one instead of all at once
    itr = bson.decode_iter(bson_document_bytes)
    for rec in itr:
        print(rec)
    bsons_parsed += 1
    i += 4 + bson_size
And here is the result I get:
here is the bson_size 2174345837
here is the bson_size in bytes b'm\xe2\x99\x81'
{'concurrent_collections': 1, 'version': '0.1', 'server_version': '4.4.13', 'tool_version': '100.5.2'}
{'db': 'my_db', 'collection': 'my_employees', 'metadata': '{"indexes":[{"v":{"$numberInt":"2"},"key":{"_id":{"$numberInt":"1"}},"name":"_id_"}],"uuid":"525124e3292340ce92048df1bc16189c","collectionName":"my_employees","type":"collection"}', 'size': 0, 'type': 'collection'}
Traceback (most recent call last):
File "..../read_bson_from_dump.py", line 28, in <module>
for rec in itr:
File ".../env/lib/python3.9/site-packages/bson/__init__.py", line 1061, in decode_iter
yield _bson_to_dict(elements, codec_options)
bson.errors.InvalidBSON: not enough data for a BSON document
I don't want to use mongoexport or mongorestore to parse the archive dump.
Thanks for your help.
1 Answer
I have already parsed the mongodump archive format using Node.js, and here is how it is laid out:
The first 4 bytes of the archive are a magic number, not the size of a document.
After the magic number, the first BSON document is the archive header; in your output it is {'concurrent_collections': 1, 'version': '0.1', 'server_version': '4.4.13', 'tool_version': '100.5.2'}.
The next BSON documents describe all your collections, with their index information; you need to loop over these until you hit the first terminator, the 4-byte value 0xFFFFFFFF.
Then the BSON documents that follow are your actual documents, up until the last terminator.
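Putting all of that together, here is a minimal Python sketch of a parser for that layout. It assumes, like your attempts, that the archive has already been gunzipped; the magic-number bytes are the ones from your own output, and iter_archive_docs is just a name I chose. Note that it yields every BSON document in the archive (the archive header, the collection metadata and mongodump's bookkeeping documents as well as your data), so you still have to filter out the ones you don't care about:

import bson                       # ships with pymongo
from bson.json_util import dumps  # serializes to MongoDB Extended JSON

MAGIC = b"m\xe2\x99\x81"          # the 4 bytes from your output (0x8199e26d)
TERMINATOR = b"\xff\xff\xff\xff"  # the 0xffffffff block separator

def iter_archive_docs(path):
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not a mongodump archive")
        while True:
            prefix = f.read(4)
            if len(prefix) < 4:
                return            # end of file
            if prefix == TERMINATOR:
                continue          # separator only, it carries no payload
            # Anything else is the little-endian int32 length of the next
            # BSON document; the length prefix is part of the document
            # itself, so glue it back on before decoding.
            size = int.from_bytes(prefix, "little")
            yield bson.decode(prefix + f.read(size - 4))

for doc in iter_archive_docs("/tmp/dump"):  # the gunzipped archive
    print(dumps(doc))                       # BSON document -> JSON string

bson.json_util.dumps takes care of the BSON-to-JSON conversion you asked about; it produces MongoDB Extended JSON, the same format you can already see in the metadata field of your output.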