Deleting data from an HDF5 file
I have an HDF5 file with a one-dimensional (N x 1) dataset of compound elements - actually it's a time series. The data is first collected offline into the HDF5 file, and then analyzed. During analysis most of the data turns out to be uninteresting, and only some parts of it are interesting. Since the datasets can be quite big, I would like to get rid of the uninteresting elements while keeping the interesting ones. For instance, keep elements 0-100, 200-300 and 350-400 of a 500-element dataset, and dump the rest. But how?
Does anybody have experience with how to accomplish this in HDF5? Apparently it could be done in several ways, at least:
- (The obvious solution) Create a fresh new file and write the necessary data there, element by element. Then delete the old file.
- Or, create a fresh dataset in the old file, write the necessary data into it, unlink the old dataset using H5Gunlink(), and get rid of the unclaimed free space by running the file through h5repack.
- Or, move the interesting elements towards the start of the existing dataset (e.g. move elements 200-300 to positions 101-201 and elements 350-400 to positions 202-252). Then call H5Dset_extent() to reduce the size of the dataset, and perhaps run the file through h5repack to release the free space (see the sketch just after this list).
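A minimal sketch of that third option, assuming HDF5 1.8+ and a chunked dataset (H5Dset_extent() cannot resize a contiguous one). The file name "series.h5" and dataset name "/timeseries" are placeholders, and error checking is omitted:

```c
#include <stdlib.h>
#include "hdf5.h"

/* Copy `count` elements from index `src` to index `dst` within `dset`.
 * Reading the whole block into memory first makes overlapping source
 * and destination ranges safe. */
static void move_block(hid_t dset, hid_t ftype, size_t elem_size,
                       hsize_t src, hsize_t dst, hsize_t count)
{
    void *buf      = malloc(count * elem_size);
    hid_t memspace = H5Screate_simple(1, &count, NULL);
    hid_t filespace;

    filespace = H5Dget_space(dset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &src, NULL, &count, NULL);
    H5Dread(dset, ftype, memspace, filespace, H5P_DEFAULT, buf);
    H5Sclose(filespace);

    filespace = H5Dget_space(dset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &dst, NULL, &count, NULL);
    H5Dwrite(dset, ftype, memspace, filespace, H5P_DEFAULT, buf);
    H5Sclose(filespace);

    H5Sclose(memspace);
    free(buf);
}

int main(void)
{
    hid_t  file  = H5Fopen("series.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t  dset  = H5Dopen2(file, "/timeseries", H5P_DEFAULT);
    hid_t  ftype = H5Dget_type(dset);  /* reuse the file's compound type, */
    size_t esz   = H5Tget_size(ftype); /* so no struct definition is needed */

    /* Elements 0-100 stay put; pull the other kept ranges forward. */
    move_block(dset, ftype, esz, 200, 101, 101); /* 200-300 -> 101-201 */
    move_block(dset, ftype, esz, 350, 202,  51); /* 350-400 -> 202-252 */

    hsize_t new_dims = 253;            /* 101 + 101 + 51 kept elements */
    H5Dset_extent(dset, &new_dims);    /* shrink; discards the tail */

    H5Tclose(ftype);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;  /* the file itself does not shrink; run h5repack afterwards */
}
```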
Since the files can be quite big even after the uninteresting elements have been removed, I'd rather not rewrite them (it would take a long time), but it seems that rewriting is required to actually release the free space. Any hints from HDF5 experts?
2 Answers
HDF5 (at least the version I am used to, 1.6.9) does not really allow deletion. More precisely, it does, but it does not free the used space, with the result that you still have a huge file. As you said, you can use h5repack, but it's a waste of time and resources.
Something you can do is keep a parallel dataset of boolean flags telling you which values are "alive" and which ones have been removed. This does not make the file smaller, but at least it gives you a fast way to perform deletion.
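A sketch of that idea for the questioner's 500-element example; the mask dataset name "/alive" and the ranges are illustrative assumptions, and error checking is omitted:

```c
#include <string.h>
#include "hdf5.h"

int main(void)
{
    hid_t   file  = H5Fopen("series.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hsize_t n     = 500;
    hid_t   space = H5Screate_simple(1, &n, NULL);
    /* One uint8 flag per element: 1 = kept, 0 = "deleted". */
    hid_t   mask  = H5Dcreate2(file, "/alive", H5T_NATIVE_UCHAR, space,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Start with everything alive, then clear the uninteresting
     * ranges (101-199, 301-349 and 401-499 in the example). */
    unsigned char flags[500];
    memset(flags, 1, sizeof flags);
    memset(flags + 101, 0, 99);
    memset(flags + 301, 0, 49);
    memset(flags + 401, 0, 99);
    H5Dwrite(mask, H5T_NATIVE_UCHAR, H5S_ALL, H5S_ALL, H5P_DEFAULT, flags);

    H5Dclose(mask);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```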
An alternative is to define a hyperslab on your array, copy the relevant data, and then delete the old array; or always access the data through the slab and redefine it as you need. (I've never done this, though, so I'm not sure whether it's possible, but it should be.)
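If I read the slab-copy variant correctly, it amounts to the questioner's second option: read the kept ranges through hyperslab selections into a new, smaller dataset, then unlink the original. A rough sketch, assuming HDF5 1.8+ (for H5Ldelete) and placeholder names, with error checking omitted:

```c
#include <stdlib.h>
#include "hdf5.h"

int main(void)
{
    hid_t  file  = H5Fopen("series.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t  src   = H5Dopen2(file, "/timeseries", H5P_DEFAULT);
    hid_t  ftype = H5Dget_type(src);   /* copy elements opaquely */
    size_t esz   = H5Tget_size(ftype);

    hsize_t kept   = 253;              /* 101 + 101 + 51 elements */
    hid_t   dspace = H5Screate_simple(1, &kept, NULL);
    hid_t   dst    = H5Dcreate2(file, "/timeseries_kept", ftype, dspace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each kept range: {source start, destination start, count}. */
    hsize_t ranges[3][3] = { {0, 0, 101}, {200, 101, 101}, {350, 202, 51} };
    for (int i = 0; i < 3; i++) {
        hsize_t n   = ranges[i][2];
        void   *buf = malloc(n * esz);
        hid_t   mem = H5Screate_simple(1, &n, NULL);

        hid_t sspace = H5Dget_space(src);
        H5Sselect_hyperslab(sspace, H5S_SELECT_SET, &ranges[i][0], NULL, &n, NULL);
        H5Dread(src, ftype, mem, sspace, H5P_DEFAULT, buf);
        H5Sclose(sspace);

        hid_t wspace = H5Dget_space(dst);
        H5Sselect_hyperslab(wspace, H5S_SELECT_SET, &ranges[i][1], NULL, &n, NULL);
        H5Dwrite(dst, ftype, mem, wspace, H5P_DEFAULT, buf);
        H5Sclose(wspace);

        H5Sclose(mem);
        free(buf);
    }

    H5Dclose(src);
    H5Ldelete(file, "/timeseries", H5P_DEFAULT); /* unlink the old dataset */

    H5Dclose(dst);
    H5Tclose(ftype);
    H5Sclose(dspace);
    H5Fclose(file);
    return 0;   /* h5repack is still needed to reclaim the space */
}
```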
Finally, you can use the HDF5 mounting strategy to keep your datasets in an "attached" HDF5 file that you mount on your root HDF5 file. When you want to delete the stuff, copy the interesting data into another mounted file, unmount the old file and remove it, then remount the new file in the proper place. This solution can be messy (as you have multiple files around), but it allows you to free space and to operate only on subparts of your data tree, instead of repacking the whole thing.
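A sketch of that mount-swap dance, assuming root.h5 already contains an empty group /mnt to use as the mount point; note that mounts exist only at runtime and are not stored in the file. All names are placeholders and error checking is omitted:

```c
#include <stdio.h>   /* remove() */
#include "hdf5.h"

int main(void)
{
    hid_t root = H5Fopen("root.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t oldf = H5Fopen("data_old.h5", H5F_ACC_RDWR, H5P_DEFAULT);

    /* Make the old data visible under /mnt in the root file. */
    H5Fmount(root, "/mnt", oldf, H5P_DEFAULT);

    /* Copy only the interesting object(s) into a fresh child file.
     * H5Ocopy copies whole objects; trimming element ranges would
     * still use hyperslab reads/writes as in the question. */
    hid_t newf = H5Fcreate("data_new.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);
    H5Ocopy(oldf, "/timeseries", newf, "/timeseries",
            H5P_DEFAULT, H5P_DEFAULT);

    /* Swap the mount: unmount and delete the old file, mount the new. */
    H5Funmount(root, "/mnt");
    H5Fclose(oldf);
    remove("data_old.h5");
    H5Fmount(root, "/mnt", newf, H5P_DEFAULT);

    /* ... work with /mnt/timeseries through `root` ... */

    H5Funmount(root, "/mnt");
    H5Fclose(newf);
    H5Fclose(root);
    return 0;
}
```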
Copying the data or using h5repack as you have described are the two usual ways of 'shrinking' the data in an HDF5 file, unfortunately.
The problem, as you may have guessed, is that an HDF5 file has a complicated internal structure (the file format specification is publicly documented, for anyone who is curious), so deleting and shrinking things just leaves holes in a file of identical size. Recent versions of the HDF5 library can track the freed space and re-use it, but your use case doesn't seem to be able to take advantage of that.
As the other answer mentioned, you might be able to use external links or the virtual dataset feature to construct HDF5 files that are more amenable to the sort of manipulation you'll be doing, but I suspect that you'll still be copying a lot of data, and this would definitely add complexity and file-management overhead.
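For what it's worth, a minimal sketch of the external-link variant: the root file holds only a link, so the bulky child file can be rebuilt and swapped without touching the root. Names are placeholders:

```c
#include "hdf5.h"

int main(void)
{
    hid_t root = H5Fopen("root.h5", H5F_ACC_RDWR, H5P_DEFAULT);

    /* /timeseries in root.h5 now resolves to /timeseries in data.h5;
     * replacing data.h5 with a compacted copy later requires no
     * change to root.h5 itself. */
    H5Lcreate_external("data.h5", "/timeseries",
                       root, "/timeseries",
                       H5P_DEFAULT, H5P_DEFAULT);

    H5Fclose(root);
    return 0;
}
```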
H5Gunlink() has been deprecated, by the way. H5Ldelete() is the preferred replacement.
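For example (a hypothetical call site; as discussed, the space is only reclaimed once h5repack rewrites the file):

```c
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fopen("series.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    H5Ldelete(file, "/old_timeseries", H5P_DEFAULT); /* unlink the dataset */
    H5Fclose(file);
    /* then, from the shell:  h5repack series.h5 series_small.h5 */
    return 0;
}
```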