Is it possible to store multidimensional arrays of arbitrary shape in a PyTables cell?
PyTables supports the creation of tables from user-defined classes that inherit from the IsDescription class. This includes support for multidimensional cells, as in the following example from the documentation:
class Particle(IsDescription):
    name        = StringCol(itemsize=16)    # 16-character string
    lati        = Int32Col()                # integer
    longi       = Int32Col()                # integer
    pressure    = Float32Col(shape=(2, 3))  # array of floats (single-precision)
    temperature = Float64Col(shape=(2, 3))  # array of doubles (double-precision)
However, is it possible to store an arbitrarily-shaped multidimensional array in a single cell? Following the above example, something like pressure = Float32Col(shape=(x, y)), where x and y are determined upon the insertion of each row.
If not, what is the preferred approach? Storing each (arbitrarily-shaped) multidimensional array in a CArray with a unique name and then storing those names in a master index table? The application I'm imagining is storing images and associated metadata, which I'd like to be able to both query and use numexpr on.
Any pointers toward PyTables best practices are much appreciated!
The long answer is "yes, but you probably don't want to."
PyTables probably doesn't support it directly, but HDF5 does support the creation of nested variable-length datatypes, allowing ragged arrays in multiple dimensions. Should you wish to go down that path, you'll want to use h5py and browse through the HDF5 User's Guide, Datatypes chapter; see section 6.4.3.2.3, Variable-length Datatypes. (I'd link it, but they apparently chose not to put anchors that deep.)
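For the single-ragged-dimension case, h5py exposes this through variable-length dtypes. A minimal sketch (the file and dataset names here are made up, and note that plain vlen gives you one ragged dimension; the nested case the User's Guide describes is harder to reach from Python):

```python
import numpy as np
import h5py

# A variable-length float32 dtype: each cell holds a 1-D array of
# arbitrary length.
vlen_f32 = h5py.vlen_dtype(np.float32)

with h5py.File("ragged.h5", "w") as f:
    dset = f.create_dataset("pressure", shape=(3,), dtype=vlen_f32)
    dset[0] = np.arange(2, dtype=np.float32)
    dset[1] = np.arange(5, dtype=np.float32)
    dset[2] = np.arange(7, dtype=np.float32)

with h5py.File("ragged.h5", "r") as f:
    # Each row reads back with its own length.
    lengths = [len(row) for row in f["pressure"]]
```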
Personally, the way I would arrange the data you've got is into groups of datasets, not into a single table. That is, something like a group per particle, with each array stored as its own dataset inside that group, and so on. The lat and long values would be attributes on the /particles/particlename group rather than datasets, though having small datasets for them is perfectly fine too. If you want to be able to do searches based on the lat and long, then having a dataset with lat/long/name columns would be good. And if you wanted to get really fancy, there's an HDF5 datatype for references, allowing you to store a pointer to a dataset, or even to a subset of a dataset.
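A sketch of that layout in h5py (all file, group, and attribute names here are invented for illustration):

```python
import numpy as np
import h5py

with h5py.File("particles.h5", "w") as f:
    # One group per particle; arrays of any shape live as datasets inside it.
    g = f.create_group("/particles/particle_001")
    g.attrs["lati"] = 42   # scalar metadata as group attributes
    g.attrs["longi"] = 7
    g.create_dataset("pressure", data=np.zeros((4, 5), dtype=np.float32))
    g.create_dataset("temperature", data=np.zeros((2, 9), dtype=np.float64))

with h5py.File("particles.h5", "r") as f:
    g = f["/particles/particle_001"]
    shape = g["pressure"].shape   # each particle's arrays keep their own shape
    lati = g.attrs["lati"]
```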
The short answer is "no", and I think it's a "limitation" of hdf5 rather than pytables.
I think the reason is that each unit of storage (the compound dataset) must have a well-defined size, which, if one or more components can change size, it obviously won't. Note that it is totally possible to resize and extend a dataset in hdf5 (pytables makes heavy use of this), but not the units of data within that array.
I suspect the best thing to do is either:
a) make it a well-defined size and provide a flag for overflow. This works well if the largest reasonable size is still pretty small and you are okay with tail events being thrown out. Note you might be able to get rid of the unused disk space with hdf5 compression.
b) do as you suggest and create a new CArray in the same file, then just read it in when required. (To keep things tidy, you might want to put these all under their own group.)
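Option (b) might look like the following in PyTables. This is a minimal sketch, and the file, group, and column names are made up:

```python
import numpy as np
import tables as tb

# Master index: one row per image, pointing at the CArray that holds it.
class ImageIndex(tb.IsDescription):
    name = tb.StringCol(16)  # human-readable label
    node = tb.StringCol(64)  # path of the CArray with the pixel data

with tb.open_file("images.h5", "w") as f:
    arrays = f.create_group("/", "arrays")  # keep the CArrays under one group
    index = f.create_table("/", "index", ImageIndex)
    img = np.arange(12, dtype=np.float32).reshape(3, 4)  # any shape per image
    ca = f.create_carray(arrays, "img_0000", obj=img)
    row = index.row
    row["name"] = b"first"
    row["node"] = ca._v_pathname.encode()
    row.append()
    index.flush()

with tb.open_file("images.h5", "r") as f:
    # Query the index table, then follow the stored path to the array.
    rec = f.root.index[0]
    data = f.get_node(rec["node"].decode()).read()
```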
HDF5 actually has an API which is designed (and optimized) for storing images in an hdf5 file. I don't think it's exposed in pytables.