Why doesn't the size of an h5 file change when an h5py attribute value is added?
For example, I created an h5 file containing a dataset, then added one attribute to the dataset. Why did the size of the file not change? Is storage for attributes automatically allocated when the dataset is created?
The following code prints: 2848 2848 0
import os
import h5py
import numpy as np

with h5py.File('dump.h5', 'a') as fid:
    fid.create_dataset('data', data=np.zeros([10, 10]))
s1 = os.path.getsize('dump.h5')

with h5py.File('dump.h5', 'a') as fid:
    # np.string_ is an alias for np.bytes_ (removed in NumPy 2.0)
    fid['data'].attrs.modify('pi', np.string_("3.1415926"))
s2 = os.path.getsize('dump.h5')

print(s1, s2, s2 - s1)
2 Answers
My answer expands on the comments by @Homer512. Attributes were originally intended to hold small pieces of data (aka metadata). Typically these are scalars (strings, ints, floats), but they can be larger objects (for example, np.arrays). The HDF Group considers the maximum attribute size to be 64K bytes. Initially, attributes are stored in the object's header (called "compact storage"). Over time, The HDF Group added two new ways of storing "large" attributes (>64K bytes) to the HDF5 library: 1) dense attribute storage (added in version 1.8), or 2) as a separate dataset (using an object reference). Also, attribute-related performance degrades as the number of attributes grows; dense attribute storage can be used to improve performance in those situations.
All that said, most of the time you don't have to worry about the storage mechanism.
To demonstrate, I extended your example to show two different behaviors: 1) adding more attributes to increase the space required, and 2) creating a group with the empty 10x10 array as an attribute (without creating the 10x10 dataset). Run the code below and you will see the file size increase when you do this.
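A minimal sketch of that extension follows; the file name, the number of attributes, and the attribute sizes here are illustrative choices rather than the original values:

import os
import h5py
import numpy as np

# Baseline: one 10x10 dataset, no attributes.
with h5py.File('dump2.h5', 'w') as fid:
    fid.create_dataset('data', data=np.zeros([10, 10]))
s1 = os.path.getsize('dump2.h5')

# Behavior 1: add many small attributes until the header's
# preallocated space is exhausted and the file must grow.
with h5py.File('dump2.h5', 'a') as fid:
    for i in range(100):
        fid['data'].attrs.modify('attr_%03d' % i, np.zeros(100))
s2 = os.path.getsize('dump2.h5')

# Behavior 2: a group whose only content is an empty 10x10
# array stored as an attribute (no 10x10 dataset created).
with h5py.File('dump2.h5', 'a') as fid:
    grp = fid.create_group('group')
    grp.attrs.modify('empty_array', np.zeros([10, 10]))
s3 = os.path.getsize('dump2.h5')

print(s1, s2 - s1, s3 - s2)  # both deltas should be > 0

The first delta appears because the attribute data no longer fits in the header's preallocated space, so new space must be allocated; the group's array attribute likewise consumes new header space.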
If you look at the HDF5 specification, you will see that attributes are stored in the object header (until the header runs out of space and a continuation block is allocated). So, presumably, your attribute was written into preallocated space. Try writing a larger attribute, or many small ones, until you cross a reasonable limit like 64 KiB, and see if the file size changes then.
Also, space is allocated from an on-disk heap (or, to be precise, multiple heaps), which also makes file-size changes less direct.
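As a sketch of that experiment (the ~32 KiB attribute size is an arbitrary choice, large enough to overflow the preallocated header space while staying under the 64 KiB limit for compactly stored attributes):

import os
import h5py
import numpy as np

with h5py.File('dump3.h5', 'w') as fid:
    fid.create_dataset('data', data=np.zeros([10, 10]))
s1 = os.path.getsize('dump3.h5')

# One ~32 KiB attribute (4096 float64 values): far more than the
# header's preallocated space, so the file has to grow.
with h5py.File('dump3.h5', 'a') as fid:
    fid['data'].attrs.modify('big', np.zeros(4096))
s2 = os.path.getsize('dump3.h5')

print(s1, s2, s2 - s1)  # s2 - s1 should now be well above zero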