Reading binary files in Python: reading certain bytes takes a very long time

Posted 2024-08-21 18:34:29


This is very odd.

I'm reading some (admittedly very large: ~2GB each) binary files using the numpy library in Python.
I'm using the:

thingy = np.fromfile(fileObject, np.int16, 1)

method.
This is right in the middle of a nested loop - I'm doing this loop 4096 times per 'channel', this 'channel' loop 9 times for every 'receiver', and this 'receiver' loop 4 times (there are 9 channels per receiver, of which there are 4!). This is for every 'block', of which there are ~3600 per file.
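The loop structure described above can be sketched as follows. This is scaled down and run against a small synthetic file so it is self-contained; the file name, constants, and layout are stand-ins, not the asker's actual data:

```python
import numpy as np

# Scaled-down sketch of the per-sample read loop: one np.fromfile call
# per 2-byte sample. In the real data NUM_SAMPLES would be 4096, giving
# 4096 * 9 * 4 = 147,456 calls per block.
NUM_RECEIVERS, NUM_CHANNELS, NUM_SAMPLES = 4, 9, 16

# Write one synthetic "block" of int16 samples to read back.
data = np.arange(NUM_RECEIVERS * NUM_CHANNELS * NUM_SAMPLES, dtype=np.int16)
data.tofile("block.bin")

samples = []
with open("block.bin", "rb") as fileObject:
    for receiver in range(NUM_RECEIVERS):
        for channel in range(NUM_CHANNELS):
            for sample in range(NUM_SAMPLES):
                # One file read per sample, as in the question.
                thingy = np.fromfile(fileObject, np.int16, 1)
                samples.append(int(thingy[0]))

print(len(samples))  # 4 * 9 * 16 = 576
```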

So as you can see, it's very iterative, and I know it will take a long time, but it was taking a LOT longer than I expected - on average 8.5 seconds per 'block'.

I ran some benchmarks using time.clock() etc. and found everything going as fast as it should be, except for approximately 1 or 2 samples per 'block' (so 1 or 2 in 4096*9*4) where it would seem to get 'stuck' for a few seconds. Now this should just be a case of returning a simple int16 from binary, not something that should take seconds... why is it sticking?

From the benchmarking I found it was sticking in the SAME place every time (block 2, receiver 8, channel 3, sample 1085 was one of them, for the record!), and it would get stuck there for approximately the same amount of time each run.
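A per-sample timing harness along these lines can be sketched as below. Note that time.clock(), which the question mentions, was removed in Python 3.8, so the sketch uses time.perf_counter(); the file and the one-second threshold are hypothetical stand-ins:

```python
import time
import numpy as np

# Synthetic stand-in for one channel of samples.
np.arange(1024, dtype=np.int16).tofile("bench.bin")

slow_samples = []
with open("bench.bin", "rb") as fileObject:
    for sample in range(1024):
        start = time.perf_counter()
        thingy = np.fromfile(fileObject, np.int16, 1)
        elapsed = time.perf_counter() - start
        if elapsed > 1.0:  # flag reads that get "stuck" for seconds
            slow_samples.append((sample, elapsed))

print(slow_samples)  # a small cached file should produce no slow reads
```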

Any ideas?!

Thanks,

Duncan

Comments (3)

薄荷→糖丶微凉 2024-08-28 18:34:29


Although it's hard to say without some kind of reproducible sample, this sounds like a buffering problem. The first part is buffered, and until you reach the end of the buffer it is fast; then it slows down until the next buffer is filled, and so on.
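If buffering or per-call overhead is the culprit, one common workaround (not spelled out in this answer) is to read a whole channel's 4096 samples in a single np.fromfile call and index the resulting array in memory. A minimal sketch, using a synthetic file as a stand-in:

```python
import numpy as np

# Synthetic stand-in for one 4096-sample channel.
np.arange(4096, dtype=np.int16).tofile("channel.bin")

with open("channel.bin", "rb") as fileObject:
    # One read for the whole channel instead of 4096 one-sample reads.
    channel_data = np.fromfile(fileObject, np.int16, 4096)

# Individual samples are then plain in-memory array lookups.
thingy = channel_data[1085]
print(thingy)  # 1085
```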

违心° 2024-08-28 18:34:29


Where are you storing the results? When lists/dicts/whatever get very large there can be a noticeable delay when they need to be reallocated and resized.
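One way to avoid such reallocation is to preallocate a NumPy array of the final shape and fill it in place; the shape below is assumed from the question's loop counts (blocks x receivers x channels x samples, with only 2 blocks and dummy data for the demo):

```python
import numpy as np

# Preallocate the destination once so nothing is resized as results
# accumulate; dimensions are assumptions taken from the question.
results = np.empty((2, 4, 9, 4096), dtype=np.int16)  # 2 blocks for demo

for block in range(2):
    for receiver in range(4):
        for channel in range(9):
            # Fill one channel's 4096 samples in place (dummy data here).
            results[block, receiver, channel, :] = np.arange(4096, dtype=np.int16)

print(results.shape)  # (2, 4, 9, 4096)
```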

洋洋洒洒 2024-08-28 18:34:29


Could it be that garbage collection is kicking in for the lists?

Added: is it funny data, or blockno? What happens if you read the blocks in random order, along the lines of:

import random

r = list(range(4096))  # materialize: random.shuffle needs a list in Python 3
random.shuffle(r)  # in place
for blockno in r:
    file.seek(blockno * ...)
    ...
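To test the garbage-collection theory above, one option is to disable the cyclic collector around the read loop and see whether the pauses disappear. A minimal sketch (the loop body is elided):

```python
import gc

# Temporarily disable the cyclic garbage collector; if the periodic
# stalls vanish, GC was the cause.
gc.disable()
try:
    pass  # ... run the read loop here ...
finally:
    gc.enable()

print(gc.isenabled())  # True: the collector is restored afterwards
```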