Why doesn't the size of an h5 file change when an h5py attribute value is added?
For example, I created an h5 file containing a dataset, then added one attribute to the dataset. Why did the size of the file not change? Is storage for attributes automatically allocated when the dataset is created?
The following code prints: 2848 2848 0
import os
import h5py
import numpy as np

with h5py.File('dump.h5', 'a') as fid:
    fid.create_dataset('data', data=np.zeros([10, 10]))
s1 = os.path.getsize('dump.h5')

with h5py.File('dump.h5', 'a') as fid:
    # np.string_ is an alias for np.bytes_ (removed in NumPy 2.0)
    fid['data'].attrs.modify('pi', np.string_("3.1415926"))
s2 = os.path.getsize('dump.h5')

print(s1, s2, s2 - s1)
2 Answers
My answer expands on the comments by @Homer512. Attributes were originally intended to hold small pieces of data (aka metadata). Typically these are scalars (strings, ints, floats), but they can be larger objects (for example, np.arrays). The HDF Group considers the maximum attribute size to be 64K bytes. Initially, attributes are stored in the object's header (called "compact storage"). Over time, The HDF Group added two new ways of storing "large" attributes (>64K bytes) to the HDF5 library: 1) dense attribute storage (added in version 1.8), or 2) as a separate dataset (using an object reference). Also, attribute-related performance degrades as the number of attributes grows; dense attribute storage can be used to improve performance in those situations.
All that said, most of the time you don't have to worry about the storage mechanism.
To demonstrate, I extended your example to show two different behaviors: 1) adding more attributes to increase the space required, and 2) creating a group with the empty 10x10 array as an attribute (without creating the 10x10 dataset). Run the code below and you will see the file size increase when you do this.
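A minimal sketch of that extension follows; the file name, the number of attributes, and the attribute sizes here are illustrative choices rather than the original values:

import os
import h5py
import numpy as np

# Baseline: one 10x10 dataset, no attributes.
with h5py.File('dump2.h5', 'w') as fid:
    fid.create_dataset('data', data=np.zeros([10, 10]))
s1 = os.path.getsize('dump2.h5')

# Behavior 1: add many small attributes until the header's
# preallocated space is exhausted and the file must grow.
with h5py.File('dump2.h5', 'a') as fid:
    for i in range(100):
        fid['data'].attrs.modify('attr_%03d' % i, np.zeros(100))
s2 = os.path.getsize('dump2.h5')

# Behavior 2: a group whose only content is an empty 10x10
# array stored as an attribute (no 10x10 dataset created).
with h5py.File('dump2.h5', 'a') as fid:
    grp = fid.create_group('group')
    grp.attrs.modify('empty_array', np.zeros([10, 10]))
s3 = os.path.getsize('dump2.h5')

print(s1, s2 - s1, s3 - s2)  # both deltas should be > 0

The first delta appears because the attribute data no longer fits in the header's preallocated space, so new space must be allocated; the group's array attribute likewise consumes new header space.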
If you look at the HDF5 specification, you will see that attributes are stored in the object header (until the header runs out of space and a continuation block is allocated). So, presumably, your attribute was written into preallocated space. Try writing a larger attribute, or many small ones, until you cross a reasonable limit like 64 KiB, and see if the file size changes then.
Also, space is allocated from an on-disk heap (or, to be precise, multiple heaps), which also makes file-size changes less direct.
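As a sketch of that experiment (the ~32 KiB attribute size is an arbitrary choice, large enough to overflow the preallocated header space while staying under the 64 KiB limit for compactly stored attributes):

import os
import h5py
import numpy as np

with h5py.File('dump3.h5', 'w') as fid:
    fid.create_dataset('data', data=np.zeros([10, 10]))
s1 = os.path.getsize('dump3.h5')

# One ~32 KiB attribute (4096 float64 values): far more than the
# header's preallocated space, so the file has to grow.
with h5py.File('dump3.h5', 'a') as fid:
    fid['data'].attrs.modify('big', np.zeros(4096))
s2 = os.path.getsize('dump3.h5')

print(s1, s2, s2 - s1)  # s2 - s1 should now be well above zero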