How can I optimize sequential writes with h5py so that reading the file back afterwards is faster?

Posted 2025-01-19 21:51:17


I process some input data which, if I did it all at once, would give me a dataset of float32s and typical shape (5000, 30000000). (The length of the 0th axis is fixed, the 1st varies, but I do know what it will be before I start).

Since that's ~600GB and won't fit in memory I have to cut it up along the 1st axis and process it in blocks of (5000, blocksize). I cannot cut it up along the 0th axis, and due to RAM constraints blocksize is typically around 40000. At the moment I'm writing each block to an hdf5 dataset sequentially, creating the dataset like:

import h5py
import numpy as np

fout = h5py.File(fname, "a")   # fname is the path to the output HDF5 file

blocksize = 40000

block_to_write = np.random.random((5000, blocksize))
fout.create_dataset("data", data=block_to_write, maxshape=(5000, None))

and then looping through blocks and adding to it via

fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)
fout["data"][:, -blocksize:] = block_to_write

This works and runs in an acceptable amount of time.

The end product I need to feed into the next step is a binary file for each row of the output. It's someone else's software so unfortunately I have no flexibility there.

The problem is that reading in one row like

fin = h5py.File(fname, 'r')
data = fin['data']
a = data[0,:]

takes ~4min and with 5000 rows, that's way too long!

Is there any way I can alter my write so that my read is faster? Or is there anything else I can do instead?

Should I make each individual row its own data set within the hdf5 file? I assumed that doing lots of individual writes would be too slow but maybe it's better?
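What I mean is roughly this (a sketch only; the row_xxxxx dataset names are made up, and fout, blocksize and block_to_write are the same as above):

# Sketch of the "one dataset per row" idea (dataset names are made up).
for i in range(5000):
    fout.create_dataset(f"row_{i:05d}", shape=(0,), maxshape=(None,), dtype="float32")

# ...then inside the block loop, append each row of the current block:
for i in range(5000):
    ds = fout[f"row_{i:05d}"]
    ds.resize((ds.shape[0] + blocksize,))
    ds[-blocksize:] = block_to_write[i, :]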

I tried writing the binary files directly - opening them outside of the loop, writing to them during the loops, and then closing them afterwards - but I ran into OSError: [Errno 24] Too many open files. I haven't tried it but I assume opening the files and closing them inside the loop would make it way too slow.
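For reference, the direct attempt looked roughly like this (simplified; the filenames are placeholders and "blocks" stands in for my block-producing loop):

# Simplified sketch of the direct-binary-write attempt (filenames are placeholders).
# One handle per row means 5000 open files, which exceeds the OS limit -> Errno 24.
row_files = [open(f"row_{i:05d}.bin", "wb") for i in range(5000)]
try:
    for block_to_write in blocks:              # "blocks" stands in for the processing loop
        for i, fh in enumerate(row_files):
            fh.write(block_to_write[i, :].tobytes())
finally:
    for fh in row_files:
        fh.close()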


1 answer

我三岁 2025-01-26 21:51:17


Your question is similar to a previous SO/h5py question I recently answered: h5py extremely slow writing. Apparently you are getting acceptable write performance, and want to improve read performance.

The two most important factors that affect h5py I/O performance are: 1) the chunk size/shape, and 2) the size of the I/O data blocks. The h5py docs recommend keeping the chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. I have also found that write performance degrades when the I/O data blocks are "too small". Ref: pytables writes much faster than h5py. The size of your read data block is certainly large enough.

So, my initial hunch was to investigate the influence of chunk size on I/O performance. Setting the optimal chunk size is a bit of an art. The best way to tune it is to enable chunking, let h5py define the default size, and see if you get acceptable performance. You didn't define the chunks parameter. However, because you defined the maxshape parameter, chunking was automatically enabled with a default size (based on the dataset's initial size). (Without chunking, I/O on a file of this size would be painfully slow.) An additional consideration for your problem: the optimal chunk size has to balance the shape of the write data blocks (5000 x 40_000) against the shape of the read data blocks (1 x 30_000_000).
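To see what h5py actually picked, you can inspect the dataset's .chunks attribute and work out the chunk size in bytes, along these lines (a quick sketch; the path is a placeholder):

import numpy as np
import h5py

# Inspect the chunk layout h5py chose for the dataset (the path is a placeholder).
with h5py.File("test_data.h5", "r") as f:
    dset = f["data"]
    print("dataset shape:", dset.shape)
    print("chunk shape:  ", dset.chunks)   # None would mean a contiguous (unchunked) dataset
    if dset.chunks is not None:
        chunk_bytes = np.prod(dset.chunks) * dset.dtype.itemsize
        print(f"chunk size:    {chunk_bytes / 1024:.1f} KiB")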

I parameterized your code so I could tinker with the dimensions. When I did, I discovered something interesting. Reading the data is much faster when I run it as a separate process after creating the file. And, the default chunk size seems to give adequate read performance. (Initially I was going to benchmark different chunk size values.)

Note: I only created a 78GB file (4_000_000 columns). That takes >13 minutes to run on my Windows system, and I didn't want to wait 90 minutes to create a 600GB file. You can modify n_blocks=750 if you want to test 30_000_000 columns. :-) All of the code is at the end of this post.

Next I created a separate program to read the data. Read performance was fast with the default chunk size: (40, 625). Timing output below:

Time to read first row: 0.28 (in sec)
Time to read last row:  0.28

Interestingly, I did not get the same read times with every test. Values above were pretty consistent, but occasionally I would get a read time of 7-10 seconds. Not sure why that happens.

I ran 3 tests (in all cases block_to_write.shape=(5000, 40_000)):

  1. default chunksize=(40,625) [95KB]; for the 5000x40_000 dataset (resized)
  2. default chunksize=(10,15625) [596KB]; for the 5000x4_000_000 dataset (not resized)
  3. user-defined chunksize=(10,40_000) [1.526MB]; for the 5000x4_000_000 dataset (not resized; see the creation sketch below)

Larger chunks improve read performance, but speed with the default values is already pretty fast. (Chunk size has a very small effect on write performance.) Output for all 3 cases is below.

dataset chunkshape: (40, 625)
Time to read first row: 0.28
Time to read last row: 0.28

dataset chunkshape: (10, 15625)
Time to read first row: 0.05
Time to read last row: 0.06

dataset chunkshape: (10, 40000)
Time to read first row: 0.00
Time to read last row: 0.02
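For reference, test 3 just passed an explicit chunks argument, and for tests 2 and 3 the dataset was created at its full final size instead of being resized. A sketch of that setup, using the same variable names as the creation code below:

# Test 3 setup (sketch): pre-allocate the full-size dataset with a user-defined
# chunk shape. chunks=(10, blocksize) holds 10 rows x one full write block, so a
# single-row read touches ~100 chunks instead of 6400 with the (40, 625) default.
fout.create_dataset("data",
                    shape=(n_rows, n_blocks * blocksize),
                    dtype="float32",
                    chunks=(10, blocksize))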

Code to create my test file below:

import time

import h5py
import numpy as np

fname = "test_data.h5"  # placeholder path

with h5py.File(fname, 'w') as fout:
    blocksize = 40_000
    n_blocks = 100
    n_rows = 5_000
    block_to_write = np.random.random((n_rows, blocksize))
    start = time.time()
    for cnt in range(n_blocks):
        incr = time.time()
        print(f'Working on loop: {cnt}', end='')
        if "data" not in fout:
            # First block: create the resizable dataset (dtype defaults to float32).
            fout.create_dataset("data", shape=(n_rows, blocksize),
                                maxshape=(n_rows, None))  # optionally: chunks=(10, blocksize)
        else:
            # Later blocks: grow the dataset along axis 1 by one block width.
            fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)

        fout["data"][:, cnt*blocksize:(cnt+1)*blocksize] = block_to_write
        print(f' - Time to add block: {time.time()-incr:.2f}')
print(f'Done creating file: {fname}')
print(f'Time to create {n_blocks}x{blocksize:,} columns: {time.time()-start:.2f}\n')

Code to read 2 different arrays from the test file below:

import time
import h5py

fname = "test_data.h5"  # placeholder path (the file created above)

with h5py.File(fname, 'r') as fin:
    print(f'dataset shape: {fin["data"].shape}')
    print(f'dataset chunkshape: {fin["data"].chunks}')
    start = time.time()
    data = fin["data"][0, :]        # read the entire first row
    print(f'Time to read first row: {time.time()-start:.2f}')
    start = time.time()
    data = fin["data"][-1, :]       # read the entire last row
    print(f'Time to read last row: {time.time()-start:.2f}')
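Once single-row reads are fast, the per-row binary files for your next step can be written one row at a time, so only one output file is ever open. A minimal sketch, assuming the downstream software just wants each row's raw float32 values and using a made-up filename pattern:

import h5py

# Read one row at a time and dump it to its own raw binary file
# (assumes a plain float32 dump is what the downstream tool expects).
fname = "test_data.h5"  # placeholder path
with h5py.File(fname, "r") as fin:
    dset = fin["data"]
    for i in range(dset.shape[0]):
        row = dset[i, :]                               # one fast row read
        row.astype("float32").tofile(f"row_{i:05d}.bin")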