How to parse the collections in a "mongodump" archive output using Python?
Context
I have a MongoDB that is backed up every day using the following command:
mongodump --gzip --numParallelCollections=1 --oplog --archive=/tmp/dump.gz --readPreference=primary
I want to parse this dump file using Python only, to get all the underlying BSON documents, and then convert the BSON into JSON.
What I tried
Let's say I have a single db named my_db and a single collection named my_employees, which contains only two documents:
{"name": "john doe"}
{"name": "jane doe"}
I dumped this single collection using the following command:
mongodump --readPreference=primary --gzip --archive=/tmp/dump.gz --numParallelCollections=1 --db=my_db --collection=my_employees
I gunzip the dump file.
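For what it's worth, the decompression can also be done directly in Python with the standard gzip module, so the separate gunzip step is optional; a minimal sketch (using the same paths as below):

import gzip
import shutil

# stream-decompress the gzipped archive into a plain file
with gzip.open("/tmp/dump.gz", "rb") as src, open("/tmp/dump", "wb") as dst:
    shutil.copyfileobj(src, dst)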
Now I try to parse the file using only Python and pymongo. I took my inspiration from this Go parser.
I don't know Go, but my understanding is that the dump file contains zero or more blocks, each with the following structure:
terminator_or_size_of_bson: 4 bytes
bson_document: N bytes
Here is the code I came up with (it doesn't handle a lot of things, but it's a quick draft):
import bson

dump = open("/tmp/dump", "rb").read()  # I gunzipped the file beforehand
file_size = len(dump)
i = 0
nb_bsons_to_parse = 10  # I try to print the first 10 BSON documents
bsons_parsed = 0
while i < file_size and bsons_parsed < nb_bsons_to_parse:
    # read what I believe is a 4-byte little-endian size prefix
    bson_size = int.from_bytes(dump[i: i + 4], "little")
    print("here is the bson_size ", bson_size)
    print("here is the bson_size in bytes ", dump[i: i + 4])
    # then read that many bytes as one BSON document
    bson_document_bytes = dump[i + 4: i + 4 + bson_size]
    bson_document = bson.decode_all(bson_document_bytes)
    print(bson_document)
    bsons_parsed += 1
    i += 4 + bson_size  # skip past the prefix and the document
Here is the error I get:
here is the bson_size 2174345837
here is the bson_size in bytes b'm\xe2\x99\x81'
Traceback (most recent call last):
File "..../read_bson_from_dump.py", line 27, in <module>
bson_document = bson.decode_all(bson_document_bytes)
bson.errors.InvalidBSON: invalid message size
You can see that the first four bytes decode to the value 2174345837, which exceeds the allowed 16 MB document size limit.
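A quick sanity check in a REPL confirms the numbers (using the exact bytes from the output above):

>>> int.from_bytes(b"m\xe2\x99\x81", "little")
2174345837
>>> 16 * 1024 * 1024
16777216

So whatever these four bytes are, they cannot be a document size.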
I then tried a different BSON API:
# ... only this loop changes
while i < file_size and bsons_parsed < nb_bsons_to_parse:
    bson_size = int.from_bytes(dump[i: i + 4], "little")
    print("here is the bson_size ", bson_size)
    print("here is the bson_size in bytes ", dump[i: i + 4])
    bson_document_bytes = dump[i + 4: i + 4 + bson_size]
    # decode_iter yields the documents one by one instead of all at once
    itr = bson.decode_iter(bson_document_bytes)
    for rec in itr:
        print(rec)
    bsons_parsed += 1
    i += 4 + bson_size
And here is the result I get:
here is the bson_size 2174345837
here is the bson_size in bytes b'm\xe2\x99\x81'
{'concurrent_collections': 1, 'version': '0.1', 'server_version': '4.4.13', 'tool_version': '100.5.2'}
{'db': 'my_db', 'collection': 'my_employees', 'metadata': '{"indexes":[{"v":{"$numberInt":"2"},"key":{"_id":{"$numberInt":"1"}},"name":"_id_"}],"uuid":"525124e3292340ce92048df1bc16189c","collectionName":"my_employees","type":"collection"}', 'size': 0, 'type': 'collection'}
Traceback (most recent call last):
File "..../read_bson_from_dump.py", line 28, in <module>
for rec in itr:
File ".../env/lib/python3.9/site-packages/bson/__init__.py", line 1061, in decode_iter
yield _bson_to_dict(elements, codec_options)
bson.errors.InvalidBSON: not enough data for a BSON document
I don't want to use mongoexport or mongorestore to parse the archive dump.
Thanks for your help.
1 Answer
I have already parsed the mongodump archive format using Node.js, and here is how it is laid out:
The first 4 bytes of the archive are a magic number, not the size of a document.
After the magic number, the first BSON document is the archive header; in your output it is {'concurrent_collections': 1, 'version': '0.1', 'server_version': '4.4.13', 'tool_version': '100.5.2'}.
The next BSON documents describe all your collections, with their index information; you need to loop over these until you hit the first terminator, the 4-byte value 0xFFFFFFFF.
Then the BSON documents that follow are your actual documents, up until the last terminator.
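Putting all of that together, here is a minimal Python sketch of a parser for that layout. It assumes, like your attempts, that the archive has already been gunzipped; the magic-number bytes are the ones from your own output, and iter_archive_docs is just a name I chose. Note that it yields every BSON document in the archive (the archive header, the collection metadata and mongodump's bookkeeping documents as well as your data), so you still have to filter out the ones you don't care about:

import bson                       # ships with pymongo
from bson.json_util import dumps  # serializes to MongoDB Extended JSON

MAGIC = b"m\xe2\x99\x81"          # the 4 bytes from your output (0x8199e26d)
TERMINATOR = b"\xff\xff\xff\xff"  # the 0xffffffff block separator

def iter_archive_docs(path):
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not a mongodump archive")
        while True:
            prefix = f.read(4)
            if len(prefix) < 4:
                return            # end of file
            if prefix == TERMINATOR:
                continue          # separator only, it carries no payload
            # Anything else is the little-endian int32 length of the next
            # BSON document; the length prefix is part of the document
            # itself, so glue it back on before decoding.
            size = int.from_bytes(prefix, "little")
            yield bson.decode(prefix + f.read(size - 4))

for doc in iter_archive_docs("/tmp/dump"):  # the gunzipped archive
    print(dumps(doc))                       # BSON document -> JSON string

bson.json_util.dumps takes care of the BSON-to-JSON conversion you asked about; it produces MongoDB Extended JSON, the same format you can already see in the metadata field of your output.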