Reading a large JSON file with the ijson package (http.client.IncompleteRead error)

Posted 2025-01-20 04:06:55

I'm trying to read a large JSON file (>1.5 GB) using the ijson package and process the results.

import ijson
import requests
from urllib.request import urlopen

response = requests.get("https://api.scryfall.com/bulk-data/all-cards")
with urlopen(response.json()["download_uri"]) as all_cards:
    for card_object in ijson.items(all_cards, "item"):
        do_something_with(card_object)

However, each time I run this, I get the following error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 555, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 522, in _read_next_chunk_size
    return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 587, in _readinto_chunked
    chunk_left = self._get_chunk_left()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 557, in _get_chunk_left
    raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/benjamin/PycharmProjects/octavin/venv/bin/flask", line 8, in <module>
    sys.exit(main())
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/flask/cli.py", line 985, in main
    cli.main()
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/flask/cli.py", line 579, in main
    return super().main(*args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/flask/cli.py", line 427, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/app/cli.py", line 65, in update
    for card_object in ijson.items(all_cards, "item"):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 492, in readinto
    return self._readinto_chunked(b)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 603, in _readinto_chunked
    raise IncompleteRead(bytes(b[0:total_bytes]))
http.client.IncompleteRead: IncompleteRead(64016 bytes read)

Is this caused by a timeout, by the file being too big, or by something else?

Note that this works (all-cards-20220408091307.json is the locally downloaded file):

with open("all-cards-20220408091307.json") as all_cards:
    for card_object in ijson.items(all_cards, "item"):
        do_something_with(card_object)

Comments (1)

梦醒灬来后我 2025-01-27 04:06:55

This seems to be a problem with http.client's HTTPResponse when reading data from a response with chunked encoding: https://bugs.python.org/issue39371.
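
For context, the first ValueError in the traceback comes from parsing a chunk-size line that was empty, which is what readline returns once the stream has ended. The following is a simplified sketch of chunked decoding, not http.client's actual implementation; the function name read_chunked is hypothetical. It shows how a body that stops before the terminating zero-size chunk produces the same ValueError:

```python
import io


def read_chunked(stream):
    """Yield payload chunks from a chunked-transfer-encoded body (simplified sketch)."""
    while True:
        size_line = stream.readline()
        # At end of stream readline() returns b"", and int(b"", 16) raises ValueError
        size = int(size_line.split(b";")[0].strip(), 16)
        if size == 0:
            return  # zero-size chunk marks the normal end of the body
        data = stream.read(size)
        stream.read(2)  # discard the CRLF that follows each chunk
        yield data


# A complete body ends with a zero-size chunk
complete = io.BytesIO(b"5\r\nhello\r\n0\r\n\r\n")
print(list(read_chunked(complete)))

# A body cut off before the zero-size chunk fails on the empty size line
truncated = io.BytesIO(b"5\r\nhello\r\n")
try:
    list(read_chunked(truncated))
except ValueError as exc:
    print("truncated stream:", exc)
```

In http.client this ValueError is caught and re-raised as IncompleteRead, as the chained traceback in the question shows.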

Since you're already using requests, I'd suggest using it for the second request as well, which avoids this issue altogether. The requests response object has an iter_content method that reads binary data incrementally from the incoming stream, whereas ijson expects a file-like object. To bridge the gap, you can use a solution similar to the one suggested here: https://github.com/ICRAR/ijson/issues/58#issuecomment-917655522. Alternatively, you can use ijson's push mechanism, where you do the reading yourself and hand the data chunks over to ijson (this is a bit more complex; see ijson's documentation for details).
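
One way to bridge iter_content and ijson might look like the sketch below. It is not necessarily the same as the solution in the linked comment; the names IterStream and stream_cards, the 64 KiB chunk size, and the 60-second timeout are assumptions made for illustration. The adapter wraps any iterator of bytes chunks in a read-only file-like object built on io.RawIOBase:

```python
import io


class IterStream(io.RawIOBase):
    """Read-only file-like wrapper around an iterator of bytes chunks (hypothetical helper)."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Refill from the iterator, skipping any empty chunks
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # 0 bytes read signals end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def stream_cards(download_uri, chunk_size=64 * 1024):
    """Yield card objects one at a time from the bulk-data download (sketch)."""
    # Third-party imports are local so the adapter above needs only the stdlib
    import ijson
    import requests

    # stream=True defers downloading the body until iter_content is consumed
    with requests.get(download_uri, stream=True, timeout=60) as response:
        response.raise_for_status()
        raw = IterStream(response.iter_content(chunk_size))
        yield from ijson.items(io.BufferedReader(raw), "item")
```

With this in place, the loop from the question would become `for card_object in stream_cards(download_uri): do_something_with(card_object)`, where download_uri is the "download_uri" value from the bulk-data response.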
