HDF5: storing NumPy data

When I used NumPy, I stored its data in the native *.npy format. It's very fast and gave me some benefits, like this one:

  • I can read *.npy files from C code as
    simple binary data (I mean that *.npy files are
    binary-compatible with C structures).

Now I'm dealing with HDF5 (PyTables at the moment). As I read in the tutorial, it uses a NumPy serializer to store NumPy data - so can I read that data from C just as I would from plain *.npy files?

Is HDF5's NumPy data binary-compatible with C structures too?

UPD:

I have a Matlab client reading from hdf5, but I don't want to go through the hdf5 API from C++, because reading binary data from *.npy is many times faster - so what I really need is a way to read hdf5 from C++ with that same binary compatibility.
So I'm already using two ways of transferring data: *.npy (read from C++ as raw bytes, and natively from Python) and hdf5 (accessed from Matlab).
If possible, I'd like to use only one way - hdf5 - but to do that I have to find a way to make hdf5 binary-compatible with C++ structures. Please help: if there is some way to turn off compression in hdf5, or anything else that makes hdf5 binary-compatible with C++ structures, tell me where I can read about it...
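
(For context, here is a minimal sketch of the kind of raw *.npy read referred to above, assuming a format-version-1.0 file whose dtype and shape the reading code already knows; the file name and element type are made up for illustration.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Minimal raw read of a NumPy *.npy file (format version 1.0):
 * 6-byte magic "\x93NUMPY", 1-byte major, 1-byte minor version,
 * 2-byte little-endian header length, ASCII header, then raw data. */
int main(void)
{
    FILE *f = fopen("data.npy", "rb");   /* hypothetical file name */
    if (!f) return 1;

    unsigned char pre[10];
    if (fread(pre, 1, 10, f) != 10) { fclose(f); return 1; }
    uint16_t header_len = (uint16_t)(pre[8] | (pre[9] << 8));

    /* The raw array data starts right after the (padded) ASCII header. */
    fseek(f, 10 + header_len, SEEK_SET);

    double buf[1000];                    /* dtype and shape assumed known */
    size_t n = fread(buf, sizeof(double), 1000, f);
    printf("read %zu doubles\n", n);

    fclose(f);
    return 0;
}
```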

Comments (3)

黄昏下泛黄的笔记 2024-10-08 20:49:23

The proper way to read hdf5 files from C is to use the hdf5 API - see this tutorial. In principle, it is possible to read the raw data directly from the hdf5 file as you would from a .npy file, assuming you have not used advanced storage options such as compression in your hdf5 file. However, this essentially defeats the whole point of using the hdf5 format, and I cannot think of any advantage to doing this instead of using the proper hdf5 API. Also note that the API has a simplified high-level version, which should make reading from C relatively painless.
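
A hedged sketch of that high-level route (the HDF5 "lite" H5LT interface), assuming a file "data.h5" with a 2-D dataset of doubles at "/data" - both names are made up for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include "hdf5.h"
#include "hdf5_hl.h"   /* high-level "lite" interface; link with -lhdf5_hl -lhdf5 */

int main(void)
{
    /* Hypothetical file and dataset names. */
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) return 1;

    /* Query the dataset's dimensions, then read it in one call. */
    hsize_t dims[2];
    H5T_class_t cls;
    size_t type_size;
    H5LTget_dataset_info(file, "/data", dims, &cls, &type_size);

    double *buf = malloc((size_t)(dims[0] * dims[1]) * sizeof(double));
    H5LTread_dataset_double(file, "/data", buf);
    printf("read %llu x %llu doubles, first = %f\n",
           (unsigned long long)dims[0], (unsigned long long)dims[1], buf[0]);

    free(buf);
    H5Fclose(file);
    return 0;
}
```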

嘿哥们儿 2024-10-08 20:49:23

I feel your pain. I've been dealing extensively with massive amounts of data stored in HDF5 formatted files, and I've gleaned a few bits of information you may find useful.

If you are in "control" of the file creation (and writing the data - even if you use an API), you should be able to largely circumvent the HDF5 libraries.

If the output datasets are not chunked, they will be written contiguously. As long as you aren't specifying any byte-order conversion in your datatype definitions (i.e. you are specifying that the data should be written in native float/double/integer format), you should be able to achieve "binary-compatibility", as you put it.
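
For what it's worth, a minimal sketch of writing such a dataset with the C API: with a default creation property list there is no chunking and no compression, so the raw data is stored contiguously, and H5T_NATIVE_DOUBLE keeps native byte order (file and dataset names are illustrative):

```c
#include "hdf5.h"

int main(void)
{
    double data[4] = {1.0, 2.0, 3.0, 4.0};
    hsize_t dims[1] = {4};

    hid_t file  = H5Fcreate("plain.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* H5P_DEFAULT creation properties => no chunking, no compression,
     * so the raw data lands contiguously in the file in native byte order. */
    hid_t dset = H5Dcreate2(file, "/data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```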

To solve my problem I wrote an HDF5 file parser using the file specification http://www.hdfgroup.org/HDF5/doc/H5.format.html

With a fairly simple parser you should be able to identify the offset to (and size of) any dataset. At that point, simply fseek and fread (in C, that is; perhaps there is a higher-level approach you can take in C++).
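
A sketch of that raw read, assuming the dataset's byte offset has already been identified (for contiguous datasets the HDF5 C API can also report it once via H5Dget_offset, which returns HADDR_UNDEF for chunked layouts); the offset and element count below are placeholders:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Offset and count assumed known, e.g. from a parser following
     * the H5.format.html spec or a one-time H5Dget_offset call. */
    long offset  = 2048;           /* hypothetical raw-data offset */
    size_t count = 1000;           /* hypothetical element count   */

    FILE *f = fopen("plain.h5", "rb");
    if (!f) return 1;

    double *buf = malloc(count * sizeof(double));
    fseek(f, offset, SEEK_SET);
    size_t n = fread(buf, sizeof(double), count, f);
    printf("read %zu doubles\n", n);

    free(buf);
    fclose(f);
    return 0;
}
```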

If your datasets are chunked, then more parsing is necessary to traverse the B-trees used to organize the chunks.

The only other issue you should be aware of is handling (or eliminating) any system-dependent structure padding.

小鸟爱天空丶 2024-10-08 20:49:23

HDF5 takes care of binary compatibility of structures for you. You simply have to tell it what your structs consist of (dtype) and you'll have no problems saving/reading record arrays - this is because the type system is basically 1:1 between NumPy and HDF5. If you use h5py, I'm confident the IO should be fast enough, provided you use all native types and large batched reads/writes (up to the entire dataset, where allowable). After that, it depends on chunking and on what filters you apply (shuffle and compression, for example) - it's also worth noting that those can sometimes speed things up by greatly reducing file size, so always look at benchmarks. Note that the type and filter choices are made on the end that creates the HDF5 document.

If you're trying to parse HDF5 yourself, you're doing it wrong. Use the C++ and C APIs if you're working in C++/C. There are examples of so-called "compound types" on the HDF5 Group's website.
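
A hedged sketch of the compound-type route in the C API, where HOFFSET accounts for any structure padding automatically; the struct layout, file, and dataset names here are illustrative:

```c
#include <stddef.h>
#include <stdio.h>
#include "hdf5.h"

/* Hypothetical record mirroring a NumPy dtype like
 * np.dtype([('t', 'f8'), ('id', 'i4')]). */
typedef struct {
    double t;
    int    id;
} Record;

int main(void)
{
    /* Describe the C struct to HDF5; HOFFSET handles padding. */
    hid_t rtype = H5Tcreate(H5T_COMPOUND, sizeof(Record));
    H5Tinsert(rtype, "t",  HOFFSET(Record, t),  H5T_NATIVE_DOUBLE);
    H5Tinsert(rtype, "id", HOFFSET(Record, id), H5T_NATIVE_INT);

    hid_t file = H5Fopen("records.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/records", H5P_DEFAULT);

    Record recs[100];   /* dataset size assumed known for brevity */
    H5Dread(dset, rtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, recs);
    printf("first record: t=%f id=%d\n", recs[0].t, recs[0].id);

    H5Dclose(dset);
    H5Fclose(file);
    H5Tclose(rtype);
    return 0;
}
```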
