Reading a large JSON file with the ijson package (http.client.IncompleteRead error)
I'm trying to read a large JSON file (>1.5 GB) using the ijson package and process the results.

import ijson
import requests
from urllib.request import urlopen

response = requests.get("https://api.scryfall.com/bulk-data/all-cards")
with urlopen(response.json()["download_uri"]) as all_cards:
    for card_object in ijson.items(all_cards, "item"):
        do_something_with(card_object)

However, each time I run this, I get the following error:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 555, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 522, in _read_next_chunk_size
    return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 587, in _readinto_chunked
    chunk_left = self._get_chunk_left()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 557, in _get_chunk_left
    raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/benjamin/PycharmProjects/octavin/venv/bin/flask", line 8, in <module>
    sys.exit(main())
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/flask/cli.py", line 985, in main
    cli.main()
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/flask/cli.py", line 579, in main
    return super().main(*args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/flask/cli.py", line 427, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/app/cli.py", line 65, in update
    for card_object in ijson.items(all_cards, "item"):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 492, in readinto
    return self._readinto_chunked(b)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 603, in _readinto_chunked
    raise IncompleteRead(bytes(b[0:total_bytes]))
http.client.IncompleteRead: IncompleteRead(64016 bytes read)
Is this caused by a timeout, by the file being too big, or by something else?

Note that this works (all-cards-20220408091307.json is the locally downloaded file):

with open("all-cards-20220408091307.json") as all_cards:
    for card_object in ijson.items(all_cards, "item"):
        do_something_with(card_object)
Comments (1)
This seems to be a problem with http.client's HTTPResponse when reading data from a response with chunked encoding: https://bugs.python.org/issue39371.

Since you're already using requests, I'd suggest you use it to perform your second request as well and avoid this issue altogether. The requests response object has an iter_content method that can be used to incrementally read binary data from the incoming stream. ijson, on the other hand, expects a file-like object. To bridge the gap you can use a solution similar to the one suggested here: https://github.com/ICRAR/ijson/issues/58#issuecomment-917655522. Otherwise you can use ijson's push mechanism, where you do the reading yourself and hand the data chunks over to ijson (this is a bit more complex; see ijson's documentation for more details).
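
To make the two options concrete, here is a minimal sketch of both. IterStream, iter_cards_pull and iter_cards_push are hypothetical names introduced for this example, and the adapter is only one possible way to wrap iter_content in a file-like object; it is not necessarily the code from the linked ijson issue. The sketch assumes the request is made with stream=True (otherwise requests downloads the whole body before iteration starts) and an ijson version that provides items_coro and sendable_list (3.0 or later, as far as I know).

```python
import io


class IterStream(io.RawIOBase):
    """Read-only, file-like view over an iterator that yields bytes chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull chunks until there is data to hand out; empty chunks are skipped.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def iter_cards_pull(download_uri, chunk_size=64 * 1024):
    """Option 1: ijson pulls from a file-like wrapper around iter_content."""
    # Third-party imports are local so IterStream stays usable on its own.
    import ijson
    import requests

    # stream=True keeps requests from loading the whole body into memory.
    with requests.get(download_uri, stream=True) as response:
        response.raise_for_status()
        stream = IterStream(response.iter_content(chunk_size=chunk_size))
        yield from ijson.items(stream, "item")


def iter_cards_push(download_uri, chunk_size=64 * 1024):
    """Option 2: read the chunks ourselves and push them into ijson."""
    import ijson
    import requests

    cards = ijson.sendable_list()  # collects objects as they are completed
    parser = ijson.items_coro(cards, "item")
    with requests.get(download_uri, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            parser.send(chunk)
            yield from cards
            del cards[:]  # drop the objects that were already handed out
    parser.close()  # signal end of input to the parser
    yield from cards
```

With either function, the loop from the question would become something like: for card_object in iter_cards_pull(response.json()["download_uri"]): do_something_with(card_object). The pull variant keeps the calling code closest to the original; the push variant avoids the adapter class at the cost of managing the parser by hand.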