Concatenate a large number of HDF5 files
I have about 500 HDF5 files, each about 1.5 GB.
Each file has exactly the same structure: 7 compound (int, double, double) datasets with a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
- creates an HDF5 file with the right datasets, each with unlimited maximum size
- opens all the files in sequence
- checks the number of samples in each (as it is variable)
- resizes the datasets in the global file
- appends the data
This obviously takes many hours.
Would you have a suggestion for improving this?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
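For reference, the resize-and-append procedure listed above might look roughly like the sketch below. This is not the asker's actual script, which is not shown; the function name `append_by_resizing` and its arguments are hypothetical, and it assumes one-dimensional datasets that share the same names in every input file.

```python
import h5py


def append_by_resizing(input_paths, output_path, dataset_names):
    """Append each input dataset to a resizable output dataset, file by file."""
    with h5py.File(output_path, "w") as out:
        for path in input_paths:
            with h5py.File(path, "r") as src:
                for name in dataset_names:
                    if name not in out:
                        # maxshape=(None,) makes the dataset unlimited along its
                        # only axis; resizable datasets must be chunked, so h5py
                        # is asked to pick a chunk shape (chunks=True).
                        out.create_dataset(
                            name,
                            shape=(0,),
                            maxshape=(None,),
                            dtype=src[name].dtype,
                            chunks=True,
                        )
                    data = src[name][...]  # read the whole dataset into memory
                    count = data.shape[0]
                    if count == 0:
                        continue  # nothing to append from this file
                    dset = out[name]
                    old_length = dset.shape[0]
                    # One resize per dataset per input file: this is the step
                    # that is repeated for every file in this approach.
                    dset.resize((old_length + count,))
                    dset[old_length:old_length + count] = data
```

A call such as `append_by_resizing(paths, "all.h5", ["d0", "d1"])` would grow each output dataset once per input file, where `paths` and the dataset names are placeholders.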
Comments (3)
I found that most of the time was spent resizing the file, as I was resizing at each step, so I now first go through all my files and get their lengths (which are variable).
Then I create the global h5file, setting the total length to the sum over all the files.
Only after this phase do I fill the h5file with the data from all the small files.
Now it takes about 10 seconds per file, so the whole job should take less than 2 hours, while before it was taking much longer.
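A minimal h5py sketch of this two-pass idea is shown below. It is an illustration under assumptions, not the script actually used: the function name `concatenate_preallocated` and its arguments are hypothetical, the datasets are assumed to be one-dimensional with identical names and dtypes in every file, and each source dataset is assumed to fit in memory.

```python
import h5py


def concatenate_preallocated(input_paths, output_path, dataset_names):
    """Size the output datasets once, then copy the data file by file."""
    # Pass 1: open each file and read only the dataset shapes and dtypes.
    totals = {name: 0 for name in dataset_names}
    dtypes = {}
    for path in input_paths:
        with h5py.File(path, "r") as src:
            for name in dataset_names:
                totals[name] += src[name].shape[0]
                dtypes.setdefault(name, src[name].dtype)

    # Pass 2: create every output dataset at its final length, then fill it.
    with h5py.File(output_path, "w") as out:
        for name in dataset_names:
            out.create_dataset(name, shape=(totals[name],), dtype=dtypes[name])
        offsets = {name: 0 for name in dataset_names}
        for path in input_paths:
            with h5py.File(path, "r") as src:
                for name in dataset_names:
                    count = src[name].shape[0]
                    if count == 0:
                        continue  # nothing to copy from this file
                    start = offsets[name]
                    # Write into the slot reserved for this file; no resize.
                    out[name][start:start + count] = src[name][...]
                    offsets[name] = start + count
    return totals
```

Because the final length is known before any data is written, the output datasets here need neither `maxshape` nor resizing. The sketch expects at least one input file, since the dtypes are taken from the inputs.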
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
label$i = h5read(original_filepath$i, "/label")
h5write(data_file_path, "/label", label)
The same can be done if you have groups or more complicated HDF5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make a text file listing the files to concatenate, using bash:
Write a Julia script to concatenate multiple files into one file:
Then execute the script file above using: