Searching an HDF5 dataset

Posted on 2024-08-10 04:23:50

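I am currently exploring HDF5. I have read the interesting comments in the thread "Evaluating HDF5", and I understand that HDF5 is a solution of choice for storing data, but how do you query it? For example, say I have a big file containing some identifiers: is there a way to quickly find out whether a given identifier is present in the file?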

小…楫夜泊 2024-08-17 04:23:50

I think the answer is "not directly".

Here are some of the ways I think you could achieve the functionality.

Use groups:

A hierarchy of groups could be used in the form of a Radix Tree to store the data. This probably doesn't scale too well though.
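
One way to picture the group-hierarchy idea is to cut each identifier into fixed-width chunks and use one group level per chunk, so that looking up an identifier becomes a check for a single path. The sketch below only builds that path. The name `RadixGroupPath` and the chunk width are assumptions made for illustration, not part of HDF5, and the layout assumes identifiers of equal length (otherwise a short identifier is indistinguishable from an intermediate group).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: maps an identifier to a nested group path by cutting
// it into chunks of `width` characters, one group level per chunk.
// Example layout: "ABCDEF" with width 2 becomes "/AB/CD/EF".
std::string RadixGroupPath(const std::string& id, std::size_t width)
{
    if (width == 0)
    {
        width = 1;  // guard against a zero step
    }
    std::string path;
    for (std::size_t pos = 0; pos < id.size(); pos += width)
    {
        path += '/';
        path += id.substr(pos, width);  // substr clamps at the end of the string
    }
    if (path.empty())
    {
        path = "/";  // empty identifier maps to the root group
    }
    return path;
}
```

The resulting path could then be handed to a link-existence check. The number of groups grows with the number of identifiers, which is one reason this may not scale well.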

Use index datasets:

HDF has a reference type which can be used to link to a main table from separate index tables. After writing the main data, you can write further datasets that are sorted on other keys and hold references back into the main dataset. For example:

MainDataset (sorted on identifier)
0: { A, "C", 2 }
1: { B, "B", 1 }
2: { C, "A", 3 }

StringIndex
0: { "A", Reference ("MainDataset", 2) }
1: { "B", Reference ("MainDataset", 1) }
2: { "C", Reference ("MainDataset", 0) }

IntIndex
0: { 1, Reference ("MainDataset", 1) }
1: { 2, Reference ("MainDataset", 0) }
2: { 3, Reference ("MainDataset", 2) }

To use the above, you have to write a binary search yourself for looking up a key in the index tables.
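
A minimal sketch of that lookup, assuming the index table has already been read into memory as a vector sorted by key. `IndexEntry` and `FindInIndex` are hypothetical names, and a plain row number stands in for the HDF5 reference.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One row of a hypothetical index table: the key plus the row it points to
// in the main dataset (a stand-in for an HDF5 reference).
struct IndexEntry
{
    std::string key;
    std::size_t mainRow;
};

// Binary search over entries that are sorted by key.
// Returns true and sets `row` when the key is present.
bool FindInIndex(const std::vector<IndexEntry>& index,
                 const std::string& key, std::size_t& row)
{
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& e, const std::string& k) { return e.key < k; });
    if (it == index.end() || it->key != key)
    {
        return false;  // key not in the index
    }
    row = it->mainRow;
    return true;
}
```

With the StringIndex from the example, looking up "A" would yield row 2 of MainDataset. For an index too large to load, the same search could read one element at a time from the dataset instead of from a vector.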

In-memory index:

Depending on the size of the dataset, it may be just as easy to use an in-memory index that is read from and written to its own dataset using something like Boost.Serialization.
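
As a rough illustration, the in-memory side could be as small as a hash map built from the identifier column once it has been read. `IdIndex`, `BuildIndex` and `Contains` are hypothetical names, and the step that saves the index to its own dataset is left out.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory index: identifier -> row number in the main dataset.
using IdIndex = std::unordered_map<std::string, std::size_t>;

// Builds the index from the identifier column (assumed to be read already).
IdIndex BuildIndex(const std::vector<std::string>& ids)
{
    IdIndex index;
    index.reserve(ids.size());
    for (std::size_t row = 0; row < ids.size(); ++row)
    {
        index.emplace(ids[row], row);  // keeps the first row on duplicates
    }
    return index;
}

// Average constant-time membership test.
bool Contains(const IdIndex& index, const std::string& id)
{
    return index.find(id) != index.end();
}
```

Building the map costs one pass over the identifiers, after which each existence check avoids touching the file at all.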

HDF5-FastQuery:

The HDF5-FastQuery paper (and the accompanying page) describes the use of bitmap indices to perform complex queries over an HDF5 dataset. I have not tried this.
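
To give a feel for what a bitmap index is, here is a toy version over one integer column: one bitmap per distinct value, and a range query that ORs the matching bitmaps together. This is not the HDF5-FastQuery API. All names are made up for illustration, and real implementations typically bin and compress the bitmaps.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical bitmap index over one integer column:
// for each distinct value, one bit per row saying "this row has that value".
using Bitmap = std::vector<bool>;

std::map<int, Bitmap> BuildBitmapIndex(const std::vector<int>& column)
{
    std::map<int, Bitmap> index;
    for (std::size_t row = 0; row < column.size(); ++row)
    {
        Bitmap& bits = index[column[row]];
        if (bits.empty())
        {
            bits.assign(column.size(), false);  // first time this value is seen
        }
        bits[row] = true;
    }
    return index;
}

// Rows whose value lies in [lo, hi]: OR together the bitmaps in that range.
// `rows` must equal the length of the indexed column.
Bitmap RowsInRange(const std::map<int, Bitmap>& index,
                   std::size_t rows, int lo, int hi)
{
    Bitmap result(rows, false);
    for (auto it = index.lower_bound(lo);
         it != index.end() && it->first <= hi; ++it)
    {
        for (std::size_t row = 0; row < rows; ++row)
        {
            if (it->second[row])
            {
                result[row] = true;
            }
        }
    }
    return result;
}
```

Conditions on several columns can be combined the same way, by ANDing or ORing the result bitmaps of each column.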

波浪屿的海角声 2024-08-17 04:23:50

H5Lexists was introduced for this in HDF5 1.8.0:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Exists

You can also iterate over the things that are in an HDF5 file with H5Literate:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Iterate

With versions before 1.8.0 you can still check manually by trying to open the dataset. We use code like this to deal with any version of HDF5:

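#include <string>
#include <hdf5.h>

// mFileId is the hid_t of the open file (a member of the enclosing class).
bool DoesDatasetExist(const std::string& rDatasetName)
{
#if (H5_VERS_MAJOR > 1) || (H5_VERS_MAJOR == 1 && H5_VERS_MINOR >= 8)
    // H5Lexists was introduced in HDF5 1.8.0. It returns a positive value if
    // the link exists, 0 if it does not, and a negative value on error.
    // Note that it tests for a link of any kind, not only for a dataset.
    htri_t dataset_status = H5Lexists(mFileId, rDatasetName.c_str(), H5P_DEFAULT);
    return (dataset_status > 0);
#else
    bool result = false;
    // Fallback for older versions: try to open the dataset. A failed open makes
    // the error stack produce a load of 'HDF failed' output, so the "TRY"
    // macros are used to turn the error stack off temporarily.
    H5E_BEGIN_TRY
    {
        hid_t dataset_id = H5Dopen(mFileId, rDatasetName.c_str());
        if (dataset_id >= 0)  // valid identifiers are non-negative
        {
            H5Dclose(dataset_id);
            result = true;
        }
    }
    H5E_END_TRY;
    return result;
#endif
}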
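
One caveat with H5Lexists is that it only tests the last element of a path and fails if an intermediate group is missing, so nested paths are best checked one component at a time. The sketch below shows that walk in isolation. `PathExists` is a hypothetical helper, and the `linkExists` predicate is an assumption standing in for a call such as `H5Lexists(fileId, path.c_str(), H5P_DEFAULT) > 0`.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Checks an absolute path such as "/group/sub/dataset" one component at a
// time. `linkExists` is supplied by the caller and answers whether a single
// path prefix exists (with HDF5 it could wrap H5Lexists).
bool PathExists(const std::string& path,
                const std::function<bool(const std::string&)>& linkExists)
{
    if (path.empty() || path[0] != '/')
    {
        return false;  // this sketch handles absolute paths only
    }
    std::size_t pos = 1;
    while (pos < path.size())
    {
        std::size_t next = path.find('/', pos);
        if (next == std::string::npos)
        {
            next = path.size();
        }
        // Test the prefix ending at this component, e.g. "/group", then
        // "/group/sub", and stop at the first one that is missing.
        if (next > pos && !linkExists(path.substr(0, next)))
        {
            return false;
        }
        pos = next + 1;
    }
    return true;  // every component exists (the root "/" always does)
}
```

Passing the check in as a predicate keeps the walk independent of the HDF5 version in use.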

哭了丶谁疼 2024-08-17 04:23:50

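Perhaps this paper will be helpful to you.
http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf

Is this what you need? It lets you query HDF5 data using SQL, a declarative language.

Unlike FastQuery, this work does not use an index, but our group also provides an open-source version with bitmap indexing.

Also, if you want queries (especially aggregations) to finish in real time, you should consider approximate aggregation or online aggregation. I have also developed some products that work directly on HDF5.

In addition, some queries over HDF5 can be much more complicated than what you see in relational databases. Some queries are array-oriented rather than relational-table-oriented. Just google "SciQL" and you will find some sophisticated query types specific to array-based data models, which can of course be applied to HDF5. Do you need to run such queries? I have also developed a product that supports some of these complex query types.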

扬花落满肩 2024-08-17 04:23:50

What do you mean by identifier? If you mean an attribute, check this tutorial. In C, attributes are read and written like this:

status = H5Aread(attr_id, mem_type_id, buf);
status = H5Awrite(attr_id, mem_type_id, buf);
