Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

Posted on 2024-07-13 19:36:56

We are evaluating technologies to store the data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per TU.

Reading the following SO answer made me think that HDF5 might be a suitable technology for us. I was wondering if people here could help me answer a few initial questions:

  1. Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?

  2. Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?

  3. A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?

Many thanks!

Comments (2)

擦肩而过的背影 2024-07-20 19:36:56

How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?

Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.

They're both meant to be high performance.
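To make the comparison a bit more concrete, here is a minimal sketch of the keyed approach the question mentions, using Python's standard `sqlite3` module. The `scope` table, its columns, and the sample scope names are all hypothetical, made up for illustration; a real schema for code-analysis data would likely look different.

```python
import sqlite3

# In-memory database; the table layout below is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE scope (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES scope(id),  -- NULL for the root scope
        name      TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO scope (id, parent_id, name) VALUES (?, ?, ?)",
    [(1, None, "::"), (2, 1, "app"), (3, 2, "Widget"), (4, 3, "draw")],
)


def ancestors(conn, scope_id):
    """Return scope names from the root down to the given scope."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, parent_id, name, depth) AS (
            SELECT id, parent_id, name, 0 FROM scope WHERE id = ?
            UNION ALL
            SELECT s.id, s.parent_id, s.name, chain.depth + 1
            FROM scope AS s JOIN chain ON s.id = chain.parent_id
        )
        SELECT name FROM chain ORDER BY depth DESC
        """,
        (scope_id,),
    ).fetchall()
    return [row[0] for row in rows]


scope_path = ancestors(conn, 4)
children_of_app = [
    row[0] for row in conn.execute("SELECT name FROM scope WHERE parent_id = 2")
]
conn.close()
```

Here every parent/child relationship goes through a key lookup, which is the kind of work SQL is built for. Whether that beats a hierarchical layout for write-once, read-several data would need measuring on real files.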

Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.

If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
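As a rough illustration of a reader that tolerates added fields, the sketch below assumes the Python `h5py` and `numpy` packages and a compound (struct-like) dataset. The dataset name `symbols` and the field names `id`, `line`, and `column` are hypothetical. Because the field names are stored with the data, an older reader can select the fields it knows by name and skip the rest.

```python
import numpy as np
import h5py

# Hypothetical "version 2" record layout: "column" was added after the
# original reader (which only knows "id" and "line") was shipped.
record_v2 = np.dtype([("id", "<u4"), ("line", "<u4"), ("column", "<u4")])
records = np.array([(1, 10, 4), (2, 27, 8)], dtype=record_v2)

KNOWN_FIELDS = ("id", "line")  # what the old reader understands


def read_known_fields(dset):
    """Read a compound dataset, keeping only fields this reader knows."""
    present = [name for name in KNOWN_FIELDS if name in dset.dtype.names]
    rows = dset[:]  # structured NumPy array carrying the stored field names
    return [tuple(int(row[name]) for name in present) for row in rows]


# The "core" driver with backing_store=False keeps the file in memory only.
with h5py.File("fields_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("symbols", data=records)
    stored_fields = list(dset.dtype.names)
    old_reader_rows = read_known_fields(dset)
```

This only covers added fields. Renamed or retyped fields would still need explicit handling in the reader.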

Is it possible to have one HDF5 object "point" to another?

Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, with one difference: folders/directories are strictly hierarchical, so a unique path describes each one's location (at least in filesystems without hard links), whereas groups form a directed graph that can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute or relative path as a string attribute (or as a string anywhere else; you could have lookup tables galore if you wanted).
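A small sketch of both options, assuming the Python `h5py` package. The group names and the attribute names (`parent_path`, `parent_ref`, `base_class`) are hypothetical. Besides the string-path approach, HDF5 has object references, which h5py exposes as `.ref` and which can be stored in attributes, so that may cover the "pointer" case as well.

```python
import h5py

# The "core" driver with backing_store=False keeps the file in memory only.
with h5py.File("scopes_demo.h5", "w", driver="core", backing_store=False) as f:
    # Nested groups model the scope hierarchy directly.
    global_scope = f.create_group("global")
    app_ns = global_scope.create_group("app")
    widget = app_ns.create_group("Widget")
    base = app_ns.create_group("Base")

    # Option 1: store the target's absolute path as a string attribute.
    widget.attrs["parent_path"] = app_ns.name

    # Option 2: store an HDF5 object reference as an attribute.
    widget.attrs["parent_ref"] = app_ns.ref

    # References can also express links outside the containment tree,
    # for example a (hypothetical) derived-class to base-class edge.
    widget.attrs["base_class"] = base.ref

    # Both kinds of link are resolved by indexing the file object.
    parent_via_path = f[widget.attrs["parent_path"]].name
    parent_via_ref = f[widget.attrs["parent_ref"]].name
    base_via_ref = f[widget.attrs["base_class"]].name

    # Children come for free from the group structure.
    children_of_app = sorted(app_ns.keys())
```

A stored path is easy to read but stops resolving if the target is moved or renamed. A reference is tied to the object itself rather than to its path.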

小红帽 2024-07-20 19:36:56

We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:

  1. We use a write-once, read-many-times model, and the format seems to handle this well. I know a project that used to write to both an Oracle database and HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.

  2. The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.

On the third question, I'll bow to Jason S's superior knowledge.

I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.
