Searching an HDF5 dataset

Posted 2024-08-10 04:23:50 · Word count 246 · Views 3 · Comments 0

I'm currently exploring HDF5. I've read the interesting comments in the thread "Evaluating HDF5", and I understand that HDF5 is a solution of choice for storing data, but how do you query it? For example, say I have a big file containing some identifiers: is there a way to quickly find out whether a given identifier is present in the file?

Comments (4)

小…楫夜泊 2024-08-17 04:23:50

I think the answer is "not directly".

Here are some of the ways I think you could achieve the functionality.

Use groups:

A hierarchy of groups could be used in the form of a radix tree to store the data. This probably doesn't scale too well, though.
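
As a rough illustration of the group idea, the sketch below maps an identifier to a nested group path by cutting it into fixed-size chunks. This is a fixed-stride prefix tree, which is a simplification of a true radix tree, and the function name `IdentifierToGroupPath` and the chunk size are assumptions made for the example, not part of HDF5. Checking whether the identifier exists then becomes checking whether the resulting path exists in the file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: turn an identifier into a nested group path by
// cutting it into fixed-size chunks, e.g. "ABCDEF" -> "/AB/CD/EF".
// Each chunk would correspond to one HDF5 group (or the final link).
std::string IdentifierToGroupPath(const std::string& id, std::size_t chunk = 2)
{
    assert(chunk > 0);  // a zero chunk size would never advance
    std::string path;
    for (std::size_t pos = 0; pos < id.size(); pos += chunk)
    {
        path += '/';
        path += id.substr(pos, chunk);  // the last chunk may be shorter
    }
    return path;
}
```

A small chunk size gives deep, narrow trees, while a large one gives shallow groups with many children, so the choice is a trade-off that would need measuring on real data.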

Use index datasets:

HDF has a reference type which could be used to link to a main table from separate index tables. After writing the main data, you can write additional datasets that are sorted on other keys and hold references back to the main table. For example:

MainDataset (sorted on identifier)
0: { A, "C", 2 }
1: { B, "B", 1 }
2: { C, "A", 3 }

StringIndex
0: { "A", Reference ("MainDataset", 2) }
1: { "B", Reference ("MainDataset", 1) }
2: { "C", Reference ("MainDataset", 0) }

IntIndex
0: { 1, Reference ("MainDataset", 1) }
1: { 2, Reference ("MainDataset", 0) }
2: { 3, Reference ("MainDataset", 2) }

To use the above, you will have to write a binary search that looks up the field in the index tables.
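
A minimal sketch of that lookup, assuming the StringIndex has already been read into memory as a vector sorted by key. The struct `StringIndexEntry` and the function `FindMainRow` are hypothetical names, and a plain row number stands in for the HDF5 reference:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One row of a hypothetical StringIndex: the key plus the row number in
// MainDataset. A plain row number stands in for an HDF5 reference here.
struct StringIndexEntry
{
    std::string key;
    std::size_t mainRow;
};

// Binary search over an index sorted by key.
// Returns true and sets `row` if the key is present, false otherwise.
bool FindMainRow(const std::vector<StringIndexEntry>& index,
                 const std::string& key, std::size_t& row)
{
    // Precondition: the index must be sorted by key.
    assert(std::is_sorted(index.begin(), index.end(),
                          [](const StringIndexEntry& a, const StringIndexEntry& b)
                          { return a.key < b.key; }));

    auto it = std::lower_bound(index.begin(), index.end(), key,
                               [](const StringIndexEntry& e, const std::string& k)
                               { return e.key < k; });
    if (it != index.end() && it->key == key)
    {
        row = it->mainRow;
        return true;
    }
    return false;
}
```

With the StringIndex from the example, looking up "A" yields row 2 of MainDataset. For an index too large to load at once, the same search could probe individual rows on disk instead, at the cost of one small read per step.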

In-memory index:

Depending on the size of the dataset, it may be just as easy to use an in-memory index that is read from and written to its own dataset using something like Boost.Serialization.
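
To make the idea concrete, here is a sketch in which a hand-rolled byte layout stands in for Boost.Serialization. The type `IdIndex`, the functions `Serialize` and `Deserialize`, and the layout itself are all assumptions for illustration. The resulting byte vector is what would be stored in (and later read back from) a one-dimensional byte dataset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory index: identifier -> row number in the main dataset.
typedef std::unordered_map<std::string, std::uint64_t> IdIndex;

// Flatten the index to bytes. Per entry: 4-byte key length, key bytes,
// 8-byte row number, all little-endian.
std::vector<unsigned char> Serialize(const IdIndex& index)
{
    std::vector<unsigned char> out;
    for (const auto& kv : index)
    {
        const std::uint32_t len = static_cast<std::uint32_t>(kv.first.size());
        for (int i = 0; i < 4; ++i)
            out.push_back(static_cast<unsigned char>((len >> (8 * i)) & 0xFF));
        out.insert(out.end(), kv.first.begin(), kv.first.end());
        for (int i = 0; i < 8; ++i)
            out.push_back(static_cast<unsigned char>((kv.second >> (8 * i)) & 0xFF));
    }
    return out;
}

// Rebuild the index from bytes produced by Serialize.
IdIndex Deserialize(const std::vector<unsigned char>& in)
{
    IdIndex index;
    std::size_t pos = 0;
    while (pos < in.size())
    {
        assert(pos + 4 <= in.size());  // room for the length field
        std::uint32_t len = 0;
        for (int i = 0; i < 4; ++i)
            len |= static_cast<std::uint32_t>(in[pos++]) << (8 * i);
        assert(pos + len + 8 <= in.size());  // room for key and row number
        const std::string key(in.begin() + pos, in.begin() + pos + len);
        pos += len;
        std::uint64_t row = 0;
        for (int i = 0; i < 8; ++i)
            row |= static_cast<std::uint64_t>(in[pos++]) << (8 * i);
        index[key] = row;
    }
    return index;
}
```

Once the index is loaded, an existence check is a single hash lookup such as `index.count(id) > 0`, which is why this approach is attractive when the set of identifiers fits in memory.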

HDF5-FastQuery:

This paper (and also this page) describe the use of bitmap indices to perform complex queries over an HDF dataset. I have not tried this.

波浪屿的海角声 2024-08-17 04:23:50

H5Lexists was introduced for this in HDF5 1.8.0:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Exists

You can also iterate over the things that are in an HDF5 file with H5Literate:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Iterate

With versions before 1.8.0, you can instead check manually by trying to open the dataset. We use code like this to deal with any version of HDF5:

bool DoesDatasetExist(const std::string& rDatasetName)
{
#if (H5_VERS_MAJOR > 1) || (H5_VERS_MAJOR == 1 && H5_VERS_MINOR >= 8)
    // This is a nice method for testing existence, introduced in HDF5 1.8.0
    htri_t dataset_status = H5Lexists(mFileId, rDatasetName.c_str(), H5P_DEFAULT);
    return (dataset_status>0);
#else
    bool result=false;
    // This is not a nice way of doing it because the error stack produces a load of 'HDF failed' output.
    // The "TRY" macros are a convenient way to temporarily turn the error stack off.
    H5E_BEGIN_TRY
    {
        hid_t dataset_id = H5Dopen(mFileId, rDatasetName.c_str());
        if (dataset_id>0)
        {
            H5Dclose(dataset_id);
            result = true;
        }
    }
    H5E_END_TRY;
    return result;
#endif
}
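
One caveat from the HDF5 documentation: H5Lexists only verifies the final link, and it fails if an intermediate element of the path does not exist, so a nested path is best checked one step at a time. The sketch below shows only that walk. `PathExists` is a hypothetical helper, and the existence test is passed in as a callback so the example does not depend on HDF5. In real code the callback would wrap something like `H5Lexists(mFileId, path.c_str(), H5P_DEFAULT) > 0`.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Checks "/a/b/c" one step at a time: "/a", then "/a/b", then "/a/b/c".
// Stops at the first missing step, so the callback is never asked about a
// path whose parent is absent.
bool PathExists(const std::string& path,
                const std::function<bool(const std::string&)>& linkExists)
{
    assert(!path.empty());
    std::string::size_type pos = 0;
    while (pos != std::string::npos)
    {
        // Find the end of the next path component (npos on the last one).
        pos = path.find('/', pos + 1);
        const std::string prefix = path.substr(0, pos);
        if (!linkExists(prefix))
        {
            return false;
        }
    }
    return true;
}
```

For a dataset at "/a/b/c" this makes three existence checks instead of one, which costs little and keeps the HDF5 error stack quiet when a parent group is missing.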

哭了丶谁疼 2024-08-17 04:23:50

Perhaps this paper will be very helpful to you.
http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf

Is this what you need? You can query HDF5 data with SQL, which is a declarative language.

Unlike FastQuery, this work uses no index, but our group also provides an open-source version with a bitmap index.

Moreover, if you want queries (especially aggregations) to complete in real time, you should consider approximate aggregation or online aggregation. I have also developed some products that work directly on HDF5.

Furthermore, some queries over HDF5 can be much more complex than what you may have seen in relational databases. Some queries are array-oriented rather than relational-table-oriented. Search for "SciQL" and you will find some complex and unique query types for array-based data models, which can certainly be applied to HDF5. Do you need to perform those kinds of queries? I have also developed a product that supports some of those complicated query types.

扬花落满肩 2024-08-17 04:23:50

What do you mean by identifier? If you mean an attribute, check this tutorial. In C:

status = H5Aread(attr_id, mem_type_id, buf);
status = H5Awrite(attr_id, mem_type_id, buf);