Extremely slow write speed with h5py

Posted on 2025-01-18 04:34:59


After preparing data from a dataset, I want to save the prepared data using h5py.
The data is a float32 numpy array of shape (16813, 60, 44, 257). Preparing the data is very fast, only a few seconds to prepare 13GB of data. But when I try to write the data to disk (a 500 MB/s SSD) using h5py, it gets very slow (I waited for hours) and it even freezes/crashes the computer.

import h5py

hf = h5py.File('sequences.h5', 'a')
hf.create_dataset('X_train', data=X_train)
hf.create_dataset('Y_train', data=Y_train)
hf.close()

I calculated that the data in memory should be around 160GB. Why is it so slow? I tried multiple things, like compressing, chunking, predefining the shape, and writing while preparing.
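For reference, the raw in-memory size of a float32 array with this shape can be checked directly (a quick sketch using only numpy):

import numpy as np

# 16813 * 60 * 44 * 257 = 11,407,284,240 elements at 4 bytes each
n_bytes = 16813 * 60 * 44 * 257 * np.dtype(np.float32).itemsize
print(f'{n_bytes / 1024**3:.1f} GiB')  # prints ~42.5 GiB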

Comments (1)

西瑶 2025-01-25 04:34:59


If you implement chunking correctly, writing this much data should be "relatively fast" (minutes, not hours). Without chunking, this could take a very long time.

To demonstrate how to use chunking (and provide timing benchmarks), I wrote a short code segment that populates a dataset with some random data (of type np.float32). I create and write the data incrementally because I don't have enough RAM to store an array of size (16_813,60,44,257) in memory.

Answer updated on 2022-04-04: This update addresses code posted in comments on 2022-04-02. I modified my example to write data with shape=(1,60,44,257) instead of shape=(16_813,60,44,1). I think this matches the array shape you are writing. I also modified the chunk shape to match, and added variables to define the data array and chunk shapes (to simplify benchmarking runs for different chunk and data I/O sizes). I ran tests for 3 combinations:

  1. arr shape=(1,60,44,257) and chunks=(1,60,44,257) [2.58MB]; runs in 379 sec (6m 19s)
  2. arr shape=(1,60,44,257) and chunks=(100,60,44,257) [258MB]; runs in 949 sec (15m 49s)
  3. arr shape=(43,60,44,257) with nloops=391 and chunks=(1,60,44,257); runs in 377 sec (6m 17s)

Tests 1 and 2 show the influence of chunk size on performance. The h5py docs recommend keeping chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. You can see that performance degrades significantly in test 2 with the 258MB chunk size. This might account for some of your problem, but it should not cause your system to freeze after writing 5GB of data (IMHO).
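As an illustration, the chunk sizes quoted above are just the product of the chunk dimensions and the 4-byte float32 item size; if you are unsure what chunk shape to pick, h5py can also choose one automatically with chunks=True (a sketch; 'auto_chunked.h5' is a placeholder filename):

import numpy as np
import h5py

# chunk size in bytes = product of chunk dims x itemsize
chunk_bytes = 1 * 60 * 44 * 257 * np.dtype(np.float32).itemsize
print(f'{chunk_bytes / 1024**2:.2f} MiB')  # ~2.59 MiB per chunk

# let h5py pick a chunk shape automatically
with h5py.File('auto_chunked.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(16_813, 60, 44, 257),
                            chunks=True, dtype=np.float32)
    print(ds.chunks)  # the auto-chosen chunk shape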

Tests 1 and 3 show the influence of write array size on performance. I have found that write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. In this case, you can see that performance is not hurt by increasing the write array size; in other words, writing 1 row at a time does not degrade performance.
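A minimal sketch of that batching pattern, buffering several rows in RAM and writing them with one slice assignment per batch (the filename and the prepare_one_row helper are placeholders for your own pipeline):

import numpy as np
import h5py

def prepare_one_row(i):
    # placeholder for the real data-preparation step
    return np.random.random((60, 44, 257)).astype(np.float32)

batch = 43  # rows per write, as in test 3 above (43 rows x 391 loops)
with h5py.File('sequences_batched.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(16_813, 60, 44, 257),
                            chunks=(1, 60, 44, 257), dtype=np.float32)
    buf = np.empty((batch, 60, 44, 257), dtype=np.float32)
    row = 0
    while row < ds.shape[0]:
        n = min(batch, ds.shape[0] - row)  # last batch may be short
        for i in range(n):
            buf[i] = prepare_one_row(row + i)
        ds[row:row + n] = buf[:n]  # one h5py write per batch
        row += n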

Note: I did not add compression. Compression reduces the on-disk file size, but increases I/O time to compress/uncompress the data on the fly. The created file size is 42.7 GB.
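For completeness, enabling gzip compression would look like the sketch below (compressed datasets must be chunked; compression_opts is the gzip level 0-9, and the filename is a placeholder):

import numpy as np
import h5py

with h5py.File('sequences_gzip.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(16_813, 60, 44, 257),
                            chunks=(1, 60, 44, 257), dtype=np.float32,
                            compression='gzip', compression_opts=4)
    # writes to ds are now compressed chunk-by-chunk on the way to disk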

Tests were run on an old Windows system with 24GB RAM and a mechanical HDD (6 Gbps @ 7200 RPM). You should get much faster times with an SSD.

Updated code below:

import time

import h5py
import numpy as np

# dimensions of dataset
ds_a0, ds_a1, ds_a2, ds_a3 = 16_813, 60, 44, 257
# dimensions of chunk shape 
ch_a0, ch_a1, ch_a2, ch_a3 = 1, ds_a1, ds_a2, ds_a3
# dimensions of data array 
ar_a0, ar_a1, ar_a2, ar_a3 = 1, ds_a1, ds_a2, ds_a3
nloops = 16_813

with h5py.File('sequences.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(ds_a0,ds_a1,ds_a2,ds_a3),
                            chunks=(ch_a0,ch_a1,ch_a2,ch_a3), dtype=np.float32)   
    start = time.time()
    r_cnt = 0
    incr = time.time()
    for i in range(nloops):
        arr = np.random.random(ar_a0*ar_a1*ar_a2*ar_a3).astype(np.float32).reshape(ar_a0,ar_a1,ar_a2,ar_a3)            
        ds[r_cnt:r_cnt+ar_a0,:,:,:] = arr
        r_cnt += ar_a0
        if (i+1)%100 == 0 or i+1 == nloops:
            print(f'Time for 100 loops after loop {i+1}: {time.time()-incr:.3f}')
            incr = time.time()            
        
    print(f'\nTotal time: {time.time()-start:.2f}')