Extremely slow write speed with h5py

Posted on 2025-01-18 04:34:59


After preparing data from a dataset, I want to save the prepared data using h5py.
The data is a float32 numpy array of shape (16813, 60, 44, 257). Preparing the data is very fast, only a few seconds to prepare 13GB of data. But when I try to write the data to disk (a 500 MB/s SSD) using h5py, it gets very slow (I waited for hours) and it even freezes/crashes the computer.

import h5py

hf = h5py.File('sequences.h5', 'a')
hf.create_dataset('X_train', data=X_train)
hf.create_dataset('Y_train', data=Y_train)
hf.close()

I calculated that the data in memory should be around 160GB. Why is it so slow? I tried multiple things, like compressing, chunking, predefining the shape, and writing while preparing.
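For reference, the raw in-memory size of a float32 array with this shape can be checked directly (a quick sketch using only numpy):

import numpy as np

# 16813 * 60 * 44 * 257 = 11,407,284,240 elements at 4 bytes each
n_bytes = 16813 * 60 * 44 * 257 * np.dtype(np.float32).itemsize
print(f'{n_bytes / 1024**3:.1f} GiB')  # prints ~42.5 GiB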

Comments (1)

西瑶 2025-01-25 04:34:59


If you implement chunking correctly, writing this much data should be "relatively fast" (minutes, not hours). Without chunking, this could take a very long time.

To demonstrate how to use chunking (and provide timing benchmarks), I wrote a short code segment that populates a dataset with some random data (of type np.float32). I create and write the data incrementally because I don't have enough RAM to store an array of size (16_813,60,44,257) in memory.

Answer updated on 2022-04-04: This update addresses code posted in comments on 2022-04-02. I modified my example to write data with shape=(1,60,44,257) instead of shape=(16_813,60,44,1). I think this matches the array shape you are writing. I also modified the chunk shape to match, and added variables to define the data array and chunk shapes (to simplify benchmarking runs for different chunk and data I/O sizes). I ran tests for 3 combinations:

  1. arr shape=(1,60,44,257) and chunks=(1,60,44,257) [2.58MB]; runs in 379 sec (6m 19s)
  2. arr shape=(1,60,44,257) and chunks=(100,60,44,257) [258MB]; runs in 949 sec (15m 49s)
  3. arr shape=(43,60,44,257) with nloops=391 and chunks=(1,60,44,257); runs in 377 sec (6m 17s)

Tests 1 and 2 show the influence of chunk size on performance. The h5py docs recommend keeping chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. You can see that performance degrades significantly in test 2 with the 258MB chunk size. This might account for some of your problem, but it should not cause your system to freeze after writing 5GB of data (IMHO).
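As an illustration, the chunk sizes quoted above are just the product of the chunk dimensions and the 4-byte float32 item size; if you are unsure what chunk shape to pick, h5py can also choose one automatically with chunks=True (a sketch; 'auto_chunked.h5' is a placeholder filename):

import numpy as np
import h5py

# chunk size in bytes = product of chunk dims x itemsize
chunk_bytes = 1 * 60 * 44 * 257 * np.dtype(np.float32).itemsize
print(f'{chunk_bytes / 1024**2:.2f} MiB')  # ~2.59 MiB per chunk

# let h5py pick a chunk shape automatically
with h5py.File('auto_chunked.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(16_813, 60, 44, 257),
                            chunks=True, dtype=np.float32)
    print(ds.chunks)  # the auto-chosen chunk shape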

Tests 1 and 3 show the influence of write array size on performance. I have found that write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. In this case, you can see that performance is not hurt by increasing the write array size; in other words, writing 1 row at a time does not degrade performance.
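A minimal sketch of that batching pattern, buffering several rows in RAM and writing them with one slice assignment per batch (the filename and the prepare_one_row helper are placeholders for your own pipeline):

import numpy as np
import h5py

def prepare_one_row(i):
    # placeholder for the real data-preparation step
    return np.random.random((60, 44, 257)).astype(np.float32)

batch = 43  # rows per write, as in test 3 above (43 rows x 391 loops)
with h5py.File('sequences_batched.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(16_813, 60, 44, 257),
                            chunks=(1, 60, 44, 257), dtype=np.float32)
    buf = np.empty((batch, 60, 44, 257), dtype=np.float32)
    row = 0
    while row < ds.shape[0]:
        n = min(batch, ds.shape[0] - row)  # last batch may be short
        for i in range(n):
            buf[i] = prepare_one_row(row + i)
        ds[row:row + n] = buf[:n]  # one h5py write per batch
        row += n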

Note: I did not add compression. Compression reduces the on-disk file size, but increases I/O time to compress/uncompress the data on the fly. The created file size is 42.7 GB.
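For completeness, enabling gzip compression would look like the sketch below (compressed datasets must be chunked; compression_opts is the gzip level 0-9, and the filename is a placeholder):

import numpy as np
import h5py

with h5py.File('sequences_gzip.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(16_813, 60, 44, 257),
                            chunks=(1, 60, 44, 257), dtype=np.float32,
                            compression='gzip', compression_opts=4)
    # writes to ds are now compressed chunk-by-chunk on the way to disk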

Tests were run on an old Windows system with 24GB RAM and a mechanical HDD (6 Gbps @ 7200 RPM). You should get much faster times with an SSD.

Updated code below:

import time

import h5py
import numpy as np

# dimensions of dataset
ds_a0, ds_a1, ds_a2, ds_a3 = 16_813, 60, 44, 257
# dimensions of chunk shape 
ch_a0, ch_a1, ch_a2, ch_a3 = 1, ds_a1, ds_a2, ds_a3
# dimensions of data array 
ar_a0, ar_a1, ar_a2, ar_a3 = 1, ds_a1, ds_a2, ds_a3
nloops = 16_813

with h5py.File('sequences.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(ds_a0,ds_a1,ds_a2,ds_a3),
                            chunks=(ch_a0,ch_a1,ch_a2,ch_a3), dtype=np.float32)   
    start = time.time()
    r_cnt = 0
    incr = time.time()
    for i in range(nloops):
        arr = np.random.random(ar_a0*ar_a1*ar_a2*ar_a3).astype(np.float32).reshape(ar_a0,ar_a1,ar_a2,ar_a3)            
        ds[r_cnt:r_cnt+ar_a0,:,:,:] = arr
        r_cnt += ar_a0
        if (i+1)%100 == 0 or i+1 == nloops:
            print(f'Time for 100 loops after loop {i+1}: {time.time()-incr:.3f}')
            incr = time.time()            
        
    print(f'\nTotal time: {time.time()-start:.2f}')