h5py: read_direct into a multi-dimensional NumPy array raises ValueError

Posted on 2025-02-01 15:12:55

I am using read_direct to copy large vectors from an h5 file into a single 2D numpy array. Large being millions of points. read_direct is apparently faster than slicing notation because it avoids an intermediate copy.

My first attempt was:

import h5py
import numpy as np

def _harvest_data(grp: h5py.Group) -> np.ndarray:
    data = np.zeros((64, grp['times'].shape[0]))
    index = 0
    for key, value in grp.items():
        if 'X' in key:
            value.read_direct(data, source_sel=None, dest_sel=np.s_[index, :])
            index += 1
    return data.mean(axis=0)

This returns an error, though:

ValueError: 2 indexing arguments for 1 dimensions

The line given is the value.read_direct line. What I don't understand is why it is giving me this error. The data array is 2D, so giving it a 2D index seems perfectly sensible. If I change to dest_sel=np.s_[:] every dataset will be copied into the first row of data, which is obviously not what I want.

A work around is to do the following:

def _harvest_data(grp: h5py.Group) -> np.ndarray:
    data = np.zeros((64, grp['times'].shape[0]))
    index = 0
    for key, value in grp.items():
        if 'X' in key:
            value.read_direct(data[index, :], source_sel=None, dest_sel=None)
            index += 1
    return data.mean(axis=0)

This works, but I do not understand why the first attempt doesn't.
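A NumPy-only sketch of why the workaround is a reasonable fit for read_direct: the h5py documentation asks for a destination that is C-contiguous and writable, and a row taken from a C-ordered 2D array is a contiguous view on the parent's memory. The array sizes below are hypothetical, and the final assignment only stands in for what read_direct would write.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
data = np.zeros((4, 6))

row = data[1, :]   # basic indexing returns a view, not a copy
col = data[:, 1]   # a column is also a view, but strided

print(np.shares_memory(row, data))   # the row view aliases data's buffer
print(row.flags['C_CONTIGUOUS'])     # a row of a C-ordered array is contiguous
print(col.flags['C_CONTIGUOUS'])     # a column is not contiguous

# Writing through the view (standing in for read_direct) updates the parent.
row[:] = np.arange(6)
print(data[1])
```

By the same reasoning, a column view such as data[:, 1] would not meet the contiguity requirement, so this workaround is specific to filling rows.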

Working from kcw78's answer, I tried this:

def _harvest_data(grp: h5py.Group) -> np.ndarray:
    data = np.zeros((64, grp['times'].shape[0]))
    index = 0
    for key, value in grp.items():
        if 'X' in key:
            value.read_direct(data,
                              source_sel=None,
                              dest_sel=np.s_[index:index+1, :])
            index += 1
    return data

It gives the same error as my first attempt, unfortunately.

时光无声 2025-02-08 15:12:55

Note: Answer updated May-27-2022 to include an example that reads from multiple 1-d datasets to a 2-d array (more closely mimics OP's workflow).

For reference, I am using h5py.__version__ '3.3.0'

A note in the numpy documentation for np.s_ caught my eye. It says:

index_exp: Predefined instance that always returns a tuple: 
index_exp = IndexExpression(maketuple=True).

s_: Predefined instance without tuple conversion: 
s_ = IndexExpression(maketuple=False).

So, initially I thought you need to use np.index_exp instead of np.s_ so that h5py always receives a tuple of slices. However, a little testing shows it's slightly more complicated. I can't explain the underlying numpy code, but this will lead to a solution.

In [4]: np.s_[:,5]
Out[4]: (slice(None, None, None), 5)

In [5]: np.index_exp[:,5]
Out[5]: (slice(None, None, None), 5)

In [6]: np.s_[:,5:6]
Out[6]: (slice(None, None, None), slice(5, 6, None))
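The tuple conversion described in the quoted note only shows up when a single index is given; with two or more indices both helpers return the same tuple, which matches the session above. A small sketch using only NumPy:

```python
import numpy as np

# Single index: np.s_ passes the item through, np.index_exp wraps it in a tuple.
single_s = np.s_[5]            # 5
single_ie = np.index_exp[5]    # (5,)
slice_s = np.s_[2:5]           # slice(2, 5, None)
slice_ie = np.index_exp[2:5]   # (slice(2, 5, None),)

# Two indices: Python already builds a tuple, so the results are identical.
same_for_2d = np.s_[4, :] == np.index_exp[4, :]

# A trailing comma also forces a tuple, even with np.s_.
trailing = np.s_[4,]           # (4,)

print(single_s, single_ie, slice_s, slice_ie, same_for_2d, trailing)
```

So for a 2D destination selection such as np.s_[index, :], switching to np.index_exp changes nothing.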

This is a new example that demonstrates reading from a 1-d dataset (shape=(10,)) into a 2-d array (shape=(5,10)). It reads values from 2 different datasets into the first and last rows of the array. It uses None for the source selection. Three different ways of specifying the destination slice were tested. All 3 work.

import h5py
import numpy as np

with h5py.File('h5py_example_1d.h5', 'w') as h5f:
    arr_in = np.arange(10)
    # Create 5 1-d datasets with shape (10,)
    for i in range(5):
        h5f.create_dataset(f'dset_{i}', data=(arr_in + 10.*i))

    arr_outr = np.zeros((5, 10))
    # Read entire 'dset_0' dataset to 1st row of the array
    arr_outr[0, :] = h5f['dset_0']  # Read entire dataset, don't need to slice
    print(arr_outr)

    source = None
    destin = np.s_[4:5, :]  # np.s_[4:5,] and np.s_[4,] also work
    print(f'source: {source}')
    print(f'dest: {destin}')
    # Read entire 'dset_4' dataset to 5th row of the array
    h5f['dset_4'].read_direct(arr_outr, source, destin)
    print(arr_outr)

This is the previous example that reads from a 2-D dataset to a 2-D array (both shape=10x10).

import h5py
import numpy as np

with h5py.File('read_direct_example.h5','w') as h5f:
    arr_in = np.arange(10*10).reshape(10,10)
    dset = h5f.create_dataset("dset", data=arr_in)
    arr_out = np.zeros((10,10))

    source = np.s_[:,5:6]
    dest = np.s_[:,0:1]
    print(f'source: {source}')
    print(f'dest: {dest}')

    arr_out[dest] = dset[source]
    print(arr_out)

    dest   = np.s_[:,9:10]
    print(f'dest: {dest}')
    dset.read_direct(arr_out, source, dest)
    print(arr_out)

Output of arr_out after both methods: (1) reading the dataset with slice notation, and (2) using read_direct() with slices from np.s_:

[[ 5.  0.  0.  0.  0.  0.  0.  0.  0.  5.]
 [15.  0.  0.  0.  0.  0.  0.  0.  0. 15.]
 [25.  0.  0.  0.  0.  0.  0.  0.  0. 25.]
 [35.  0.  0.  0.  0.  0.  0.  0.  0. 35.]
 [45.  0.  0.  0.  0.  0.  0.  0.  0. 45.]
 [55.  0.  0.  0.  0.  0.  0.  0.  0. 55.]
 [65.  0.  0.  0.  0.  0.  0.  0.  0. 65.]
 [75.  0.  0.  0.  0.  0.  0.  0.  0. 75.]
 [85.  0.  0.  0.  0.  0.  0.  0.  0. 85.]
 [95.  0.  0.  0.  0.  0.  0.  0.  0. 95.]]
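The examples above use length-1 slices (5:6, 9:10, 4:5) rather than plain integers. A NumPy-only sketch of the difference: an integer index drops that axis, while a length-1 slice keeps it, so the selected block stays 2-d. The array here mirrors arr_in from the 10x10 example; the variable names are otherwise illustrative.

```python
import numpy as np

arr = np.arange(10 * 10).reshape(10, 10)

shape_int_col = arr[np.s_[:, 5]].shape      # integer index drops the axis
shape_slice_col = arr[np.s_[:, 5:6]].shape  # length-1 slice keeps it
shape_int_row = arr[np.s_[4, :]].shape
shape_slice_row = arr[np.s_[4:5, :]].shape
print(shape_int_col, shape_slice_col, shape_int_row, shape_slice_row)

# Slice-notation copy from the example: column 5 of arr into column 0 of out.
out = np.zeros((10, 10))
out[np.s_[:, 0:1]] = arr[np.s_[:, 5:6]]
print(out[:, 0])
```

This only illustrates NumPy's indexing rules; how h5py maps the destination selection onto the array may depend on the h5py version.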
