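How to best write out a std::vector<std::string> container to an HDF5 dataset?
Given a vector of strings, what is the best way to write them out to an HDF5 dataset? At the moment I'm doing something like the following: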
const unsigned int MaxStrLength = 512;
struct TempContainer {
char string[MaxStrLength];
};
void writeVector (hid_t group, std::vector<std::string> const & v)
{
//
// Firstly copy the contents of the vector into a temporary container
std::vector<TempContainer> tc;
for (std::vector<std::string>::const_iterator i = v.begin ()
, end = v.end ()
; i != end
; ++i)
{
TempContainer t;
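// Copy at most MaxStrLength - 1 characters so the buffer is always null-terminated
strncpy (t.string, i->c_str (), MaxStrLength - 1);
t.string[MaxStrLength - 1] = '\0';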
tc.push_back (t);
}
//
// Write the temporary container to a dataset
hsize_t dims[] = { tc.size () } ;
hid_t dataspace = H5Screate_simple(sizeof(dims)/sizeof(*dims)
, dims
, NULL);
hid_t strtype = H5Tcopy (H5T_C_S1);
H5Tset_size (strtype, MaxStrLength);
hid_t datatype = H5Tcreate (H5T_COMPOUND, sizeof (TempContainer));
H5Tinsert (datatype
, "string"
, HOFFSET(TempContainer, string)
, strtype);
hid_t dataset = H5Dcreate1 (group
, "files"
, datatype
, dataspace
, H5P_DEFAULT);
H5Dwrite (dataset, datatype, H5S_ALL, H5S_ALL, H5P_DEFAULT, &tc[0] );
H5Dclose (dataset);
H5Sclose (dataspace);
H5Tclose (strtype);
H5Tclose (datatype);
}
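At the very least, I would really like to change the above so that:
- it uses variable length strings
- I don't need to have a temporary container
I have no restrictions on how I store the data; for example, it doesn't have to be a COMPOUND datatype if there is a better way to do this.
EDIT: Just to narrow the problem down, I'm relatively familiar with handling the data on the C++ side; it's the HDF5 side where I need most of the help.
Thanks for your help.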
[Many thanks to dirkgently for his help in answering this.]
To write a variable length string in HDF5 use the following:
One solution for writing a container is to write each element individually. This can be achieved using hyperslabs.
For example:
//...
// ...
Here is some working code for writing a vector of variable length strings using the HDF5 C++ API.
I incorporate some of the suggestions in the other posts:
- Use string::c_str() to obtain pointers to the strings.
- Put the pointers into a vector of char* and pass it to the HDF5 API.
It is not necessary to create expensive copies of the strings (e.g. with strdup()). c_str() returns a pointer to the null-terminated data of the underlying string. This is precisely what the function is intended for. Of course, strings with embedded nulls will not work with this...
std::vector is guaranteed to have contiguous underlying storage, so using vector and vector::data() is the same as using raw arrays, but is of course much neater and safer than the clunky, old-fashioned C way of doing things.
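The working code for the c_str() approach is missing from this copy of the answer. As a small sketch of just the pointer-gathering part (plain C++, no HDF5 calls), something like the following would do; `toCStrPointers` is a name I made up for illustration.

```cpp
#include <string>
#include <vector>

// Sketch: builds a vector of C-string pointers that alias the strings in `v`.
// No characters are copied. The pointers stay valid only while `v` is alive
// and none of its strings are modified.
// `toCStrPointers` is a hypothetical helper name.
std::vector<const char*> toCStrPointers(const std::vector<std::string>& v)
{
    std::vector<const char*> ptrs;
    ptrs.reserve(v.size());
    for (const std::string& s : v) {
        ptrs.push_back(s.c_str());
    }
    return ptrs;
}
```

The result's data() pointer is what you would then hand to the HDF5 write call as the buffer, together with a string type whose size is H5T_VARIABLE.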
If you are looking for cleaner code: I suggest you create a functor that'll take a string and save it to the HDF5 container (in a desired mode). Richard, I used the wrong algorithm, please re-check!
Does that help get started?
I had a similar issue, with the caveat that I wanted a vector of strings stored as an attribute. The tricky thing with attributes is that we can't use fancy dataspace features like hyperslabs (at least with the C++ API).
But in either case, it may be useful to enter a vector of strings into a single entry in a dataset (if, for example, you always expect to read them together). In this case all the magic comes with the type, not with the dataspace itself.
There are basically 4 steps:
1. Make a vector<const char*> which points to the strings.
2. Create a hvl_t structure that points to the vector and contains its length.
3. Create the datatype. This is a H5::VarLenType wrapping a (variable length) H5::StrType.
4. Write the hvl_t type to a dataset.
The really nice part of this method is that you're stuffing the whole entry into what HDF5 considers a scalar value. This means that making it an attribute (rather than a dataset) is trivial.
Whether you choose this solution or the one with each string in its own dataset entry is probably also a matter of the desired performance: if you're looking for random access to specific strings, it's probably better to write the strings out in a dataset so they can be indexed. If you're always going to read them all out together this solution may work just as well.
Here's a short example of how to do this, using the C++ API and a simple scalar dataset:
I am late to the party, but I've modified Leo Goodstadt's answer based on the comments regarding segfaults. I am on Linux, but I don't have such problems. I wrote 2 functions: one to write a vector of std::string to a dataset of a given name in an open H5File, and another to read the resulting dataset back into a vector of std::string. Note there may be unnecessary copying between types a few times, which could be optimised further. Here is working code for writing and reading:
And to read:
As you know, the HDF5 write call only takes a raw pointer to the data (here a char*), which is just an address. So the most natural way is to dynamically allocate one contiguous block of memory (of a known size) and copy the values of the vector into it.
The complete code is shown below.
Don't forget to free the pointer.
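The complete code is missing from this copy, so here is a sketch of the copying step only (plain C++, no HDF5 calls). It is my own variant rather than the answer's code: `packFixedWidth` is a made-up name, and it uses a std::vector<char> for the contiguous block instead of a manual allocation, so there is no pointer to free.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch: packs `v` into one contiguous, zero-filled buffer of
// v.size() * width bytes. Each string gets a `width`-byte slot; strings longer
// than width - 1 are truncated so every slot keeps a terminating '\0'.
// `packFixedWidth` is a hypothetical helper name.
std::vector<char> packFixedWidth(const std::vector<std::string>& v, std::size_t width)
{
    std::vector<char> buf(v.size() * width, '\0');
    if (width == 0) return buf;  // nothing can be stored in zero-width slots

    for (std::size_t i = 0; i < v.size(); ++i) {
        const std::size_t n = std::min(v[i].size(), width - 1);
        std::copy_n(v[i].data(), n, buf.data() + i * width);
    }
    return buf;
}
```

The buffer's data() pointer could then be written with a fixed-length string type, i.e. H5T_C_S1 with H5Tset_size set to the same width. The trade-off versus variable-length strings is wasted space and possible truncation.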
Instead of a TempContainer, you can use a simple std::vector (you could also templatize it to match T -> basic_string).
Something like this:
In the interest of having the ability to read std::vector<std::string>, I'm posting my solution, based on the hints from Leo here: https://stackoverflow.com/a/15220532/364818.
I've mixed C and C++ APIs. Please feel free to edit this and make it simpler.
Note that the HDF5 API returns a list of char* pointers when you call read. These char* pointers must be freed after use, otherwise there is a memory leak.
Usage example
Here's the code
I don't know about HDF5, but you can use
and then copy the strings this way:
This will allocate a string with the exact size, and also improves a lot when inserting or reading from the container (in your example there's an array copied, in this case only a pointer). You can also use std::vector: