Extremely slow h5py write speed
After preparing data from a dataset, I want to save the prepared data using h5py.
The data is a float32 numpy array of shape (16813, 60, 44, 257). Preparing the data is very fast, only a few seconds to prepare 13 GB of data. But when I try to write the data to disk (500 MB/s SSD) using h5py, it gets very slow (I waited for hours) and it even freezes/crashes the computer.
hf = h5py.File('sequences.h5', 'a')
hf.create_dataset('X_train', data=X_train)
hf.create_dataset('Y_train', data=Y_train)
hf.close()
I calculated that the data in memory should be around 160 GB. Why is it so slow? I tried multiple things like compressing, chunking, predefining the shape, and writing while preparing.
Comments (1)
If you implement chunking correctly, writing this much data should be "relatively fast" (minutes, not hours). Without chunking, this could take a very long time.
To demonstrate how to use chunking (and provide timing benchmarks), I wrote a short code segment that populates a dataset with some random data (of type np.float32). I create and write the data incrementally because I don't have enough RAM to store an array of size (16_813, 60, 44, 257) in memory.

Answer updated on 2022-04-04: This update addresses code posted in comments on 2022-04-02. I modified my example to write data with shape=(1, 60, 44, 257) instead of shape=(16_813, 60, 44, 1). I think this matches the array shape you are writing. I also modified the chunk shape to match, and added variables to define the data array and chunk shapes (to simplify benchmarking runs for different chunk and data I/O sizes). I ran tests for 3 combinations:

Tests 1 and 2 show the influence of chunk size on performance. The h5py docs recommend keeping the chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. You can see performance degrades significantly in test 2 with the 258 MB chunk size. This might account for some of your problem, but should not cause your system to freeze after writing 5 GB of data (IMHO).
Tests 1 and 3 show the influence of write array size on performance. I have found that write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. In this case, you can see performance is not affected by increasing the write array size. In other words, writing 1 row at a time does not affect performance.
Note: I did not add compression. This reduces the on-disk file size, but increases the I/O time to compress/uncompress the data on the fly. The created file size is 42.7 GB.
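For reference, compression is enabled per dataset at creation time. A minimal sketch, assuming gzip with an illustrative chunk shape and file name (not part of the benchmarks above):

import numpy as np
import h5py

# Chunked dataset with gzip compression; each chunk is compressed as it is written.
with h5py.File('sequences_gzip.h5', 'w') as hf:
    hf.create_dataset('X_train', shape=(16_813, 60, 44, 257), dtype=np.float32,
                      chunks=(1, 60, 44, 257),
                      compression='gzip', compression_opts=4)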
Tests were run on an old Windows system with 24 GB RAM and a mechanical HDD (6 Gbps @ 7200 RPM). You should get much faster times with an SSD.
Updated code below:
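A sketch of the incremental, chunked write pattern described above (the dataset shape comes from the question; the chunk shape, write-block size, and variable names are illustrative, not the exact values benchmarked):

import numpy as np
import h5py
import time

# ds_shape is the full dataset shape from the question; chunk_shape and
# blk_rows are illustrative values for the chunk and per-write block sizes.
ds_shape    = (16_813, 60, 44, 257)
chunk_shape = (1, 60, 44, 257)       # ~2.6 MiB of float32 per chunk
blk_rows    = 100                    # rows written per I/O call

with h5py.File('sequences.h5', 'w') as hf:
    dset = hf.create_dataset('X_train', shape=ds_shape,
                             dtype=np.float32, chunks=chunk_shape)
    start = time.time()
    for i in range(0, ds_shape[0], blk_rows):
        n = min(blk_rows, ds_shape[0] - i)
        # random data stands in for the prepared training data
        arr = np.random.random((n,) + ds_shape[1:]).astype(np.float32)
        dset[i:i+n] = arr
    print(f'elapsed: {time.time() - start:.1f} s')

Varying chunk_shape and blk_rows reproduces the kind of chunk-size and write-block-size comparisons described in the tests above.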