Google Protocol Buffers, HDF5, NumPy comparison (transferring data)
I need help making a decision. I need to transfer some data in my application and have to choose between these three technologies.
I've read a little about each of them (tutorials, documentation) but still can't decide...
How do they compare?
I need metadata support (the ability to receive a file and read it without any additional information or files) and fast read/write operations. The ability to store dynamic data (such as Python objects) would be a plus.
Things I already know:
- NumPy is pretty fast but can't store dynamic data (like Python objects). (What about metadata?)
- HDF5 is very fast, supports custom attributes, and is easy to use, but can't store Python objects. HDF5 also serializes NumPy data natively, so, IMHO, NumPy has no advantages over HDF5.
- Google Protocol Buffers are self-describing too and are pretty fast (but Python support is poor at present: slow and buggy). They CAN store dynamic data. Downsides: self-description doesn't work from Python, and messages of 1 MB or more do not serialize/deserialize very fast (read "slow").
PS: the data I need to transfer is the "result of work" of NumPy/SciPy (arrays, arrays of complicated structs, etc.)
UPD: cross-language access is required (C/C++/Python)
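On the "What about metadata?" point for NumPy, here is a minimal sketch, assuming a made-up record layout (`id`, `pos`, `label` are hypothetical field names). It shows that a `.npy` file written by `np.save` carries the dtype and shape in its header, so an array of structs can be read back without a separate schema. As far as I know, that header holds only the dtype, shape, and memory order, not user-defined attributes.

```python
import io
import numpy as np

# Hypothetical record layout standing in for an "array of complicated structs".
record = np.dtype([("id", "<i4"), ("pos", "<f8", (3,)), ("label", "S8")])

data = np.zeros(2, dtype=record)
data["id"] = [1, 2]
data["pos"][1] = [0.5, 1.5, 2.5]
data["label"] = [b"first", b"second"]

# np.save writes a small header (dtype, shape, memory order) followed by the raw bytes.
buf = io.BytesIO()
np.save(buf, data)

# No schema is passed to np.load: the field names and types come from the header.
buf.seek(0)
restored = np.load(buf)
print(restored.dtype.names)
```

An in-memory buffer is used only to keep the sketch self-contained; a file path works the same way.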
Comments (2)
There does seem to be a slight contradiction in your question - you want to be able to store Python objects, but you also want C/C++ access. I think that regardless of which choice you go with, you will need to convert your fancy Python data structures into more static structures such as arrays.
If you need cross-language access, I would suggest using HDF5, as it is a file format specifically designed to be independent of language, operating system, and system architecture (e.g. on loading it can convert between big-endian and little-endian automatically), and it is aimed specifically at users doing scientific/numerical computing. I don't know much about Google Protocol Buffers, so I can't really comment too much on that.
If you decide to go with HDF5, I would also recommend that you use h5py instead of pytables. This is because pytables creates HDF5 files with a whole lot of extra pythonic metadata which makes reading the data in C/C++ a bit more of a pain, whereas h5py doesn't create any of these extras. You can find a comparison here, and they also give a link to the pytables FAQ for their view on the matter so you can decide which suits your needs best.
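For illustration, here is a minimal h5py sketch, assuming h5py is installed. The dataset name `results`, the attribute names `units` and `run_id`, and the record fields are made up for the example. It writes a structured NumPy array plus a couple of custom attributes and reads them back.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical record layout; h5py maps it to an HDF5 compound type.
record = np.dtype([("id", "<i4"), ("value", "<f8")])
results = np.array([(1, 0.25), (2, 0.75)], dtype=record)

path = os.path.join(tempfile.mkdtemp(), "results.h5")

# Write the array and attach metadata as HDF5 attributes on the dataset.
with h5py.File(path, "w") as f:
    dset = f.create_dataset("results", data=results)
    dset.attrs["units"] = "seconds"
    dset.attrs["run_id"] = 42

# Read everything back; the file itself carries the types and attributes.
with h5py.File(path, "r") as f:
    loaded = f["results"][...]
    units = f["results"].attrs["units"]
    run_id = int(f["results"].attrs["run_id"])

print(loaded["value"], units, run_id)
```

The same file should then be readable from C/C++ through the HDF5 library as a compound dataset with two attributes.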
Another format which is very similar to HDF5 is NetCDF. It also has Python bindings; however, I have no experience with this format, so I can't really comment beyond pointing out that it exists and is also widely used in scientific computing.
I don't know about HDF5, but you can store Python objects in NumPy arrays; you just lose all the important functionality, because C-level operations can no longer be performed on the array.
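A minimal sketch of what that looks like, assuming a reasonably recent NumPy (the stored values are made up). An array with `dtype=object` holds references to arbitrary Python objects, and saving it falls back to pickle, which is why it does not help with C/C++ access.

```python
import io
import numpy as np

# An object array stores references to Python objects, not raw numeric data.
objs = np.empty(3, dtype=object)
objs[0] = {"name": "run-1"}
objs[1] = [1, 2, 3]
objs[2] = "free text"

# Saving works, but the elements are pickled rather than stored as plain bytes.
buf = io.BytesIO()
np.save(buf, objs, allow_pickle=True)

# Recent NumPy versions refuse to unpickle unless explicitly allowed.
buf.seek(0)
try:
    np.load(buf)
    needs_pickle = False
except ValueError:
    needs_pickle = True

buf.seek(0)
restored = np.load(buf, allow_pickle=True)
print(needs_pickle, restored[1])
```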