当前位置：文江博客话题详情

HDF5将数据集标记为其他数据集中的事件

发布于 2025-02-08 08:51:09 字数 508 浏览 2 评论 0原文

我正在从各种计算机上抽样时间序列数据，并且每个机器通常都需要从另一个设备收集大型高频爆发数据并将其附加到时间序列数据中。

想象一下，我正在测量温度随着时间的流逝，然后每10度升高，我在200kHz处采样了一个微型，我希望能够将大量的微数据标记为时间序列数据中的时间戳。甚至以人物的形式。

我试图用regionRef来做到这一点，但是正在努力寻找优雅的解决方案。而且我发现自己在熊猫商店和H5py之间在杂耍，这感觉很混乱。

最初，我认为我可以从爆发数据中制作单独的数据集，然后使用参考或时间序列数据中的时间戳链接。但是到目前为止没有运气。

将大量数据引用到另一堆数据中的时间戳的任何方法将不胜感激！

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

天气好吗我好吗 2025-02-15 08:51:09

如何使用区域参考？我假设您有一系列参考文献，参考文献在一系列“标准率”和“爆发率”数据之间交替。这是一种有效的方法，它将起作用。但是，您是正确的：创建是凌乱的，而恢复数据是凌乱的。

虚拟数据集可能是一个更优雅的解决方案。...但是跟踪和创建虚拟布局定义也可能会变得混乱。 :-)但是，一旦拥有虚拟数据集，就可以使用典型的切片表示法读取它。 HDF5/H5PY处理封面下的所有内容。

为了证明，我创建了一个“简单”的示例（实现虚拟数据集并非“简单”）。就是说，如果您可以找出区域参考，则可以找出虚拟数据集。这是指向

定义虚拟布局：这是将指向其他数据集的虚拟数据集的形状和dtype。
定义虚拟来源。每个都是对HDF5文件和数据集的引用（File/DataSet组合的1个虚拟源。）
映射虚拟 source data to virtual layout （您可以使用slice表示法，这在我的示例中显示）。
重复步骤2和3对于所有源（或源切片）

注意：虚拟数据集可以在单独的文件中，也可以在与引用数据集的同一文件中。我将在示例中显示这两个。（一旦定义了布局和来源，这两种方法都同样容易。）

至少还有3个其他关于此主题的问题和答案：

H5PY，ENUMS和VIRTUALLAYOUT
h5py错误读取虚拟数据集中的numpy阵列
//stackoverflow.com/a/718628888/10462884">如何将多个HDF5文件组合到一个文件和数据集中？

示例如下：
步骤1 ：创建一些示例数据。没有您的模式，我猜想您如何存储“标准率”和“爆发率”数据。所有标准费率数据均存储在数据集'data_log'中，每个爆发都存储在名为：'sturp_log _ ##'的单独数据集中。

import numpy as np
import h5py

log_ntimes = 31
log_inc = 1e-3

arr = np.zeros((log_ntimes,2))
for i in range(log_ntimes):
    time = i*log_inc
    arr[i,0] = time
    #temp = 70.+ 100.*time
    #print(f'For Time = {time:.5f} ; Temp= {temp:.4f}')

arr[:,1] = 70.+ 100.*arr[:,0]
#print(arr)

with h5py.File('SO_72654160.h5','w') as h5f:
    h5f.create_dataset('data_log',data=arr)

n_bursts = 4    
burst_ntimes = 11
burst_inc = 5e-5

for n in range(1,n_bursts):
    arr = np.zeros((burst_ntimes-1,2))
    for i in range(1,burst_ntimes):
        burst_time = 0.01*(n)
        time = burst_time + i*burst_inc
        arr[i-1,0] = time
        #temp = 70.+ 100.*t
    arr[:,1] = 70.+ 100.*arr[:,0]    
    
    with h5py.File('SO_72654160.h5','a') as h5f:
        h5f.create_dataset(f'burst_log_{n:02}',data=arr)

步骤2 ：这是定义虚拟布局和源来创建虚拟数据集的地方。这将创建一个虚拟数据集一个新文件，而在现有文件中创建一个。（声明除了文件名和模式外，语句是相同的。）

source_file = 'SO_72654160.h5'

a0 = 0
with h5py.File(source_file, 'r') as h5f:
    for ds_name in h5f:
        a0 += h5f[ds_name].shape[0]

print(f'Total data rows in source = {a0}')

# alternate getting data from
#   dataset: data_log, get rows 0-11, 11-21, 21-31 
#   datasets: burst_log_01, burst log_02, etc (each has 10 rows)

# Define virtual dataset layout
layout = h5py.VirtualLayout(shape=(a0, 2),dtype=float)

# Map virstual dataset to logged data 
vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(41,2))
layout[0:11,:] = vsource1[0:11,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10,2))
layout[11:21,:] = vsource2

layout[21:31,:] = vsource1[11:21,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10,2))
layout[31:41,:] = vsource2

layout[41:51,:] = vsource1[21:31,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10,2))
layout[51:61,:] = vsource2
   
# Create NEW file, then add virtual dataset
with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')

# Open EXISTING file, then add virtual dataset 
with h5py.File('SO_72654160.h5', 'a') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')

How did use region references? I assume you had an array of references, with references alternating between a range of "standard rate" and "burst rate" data. That is a valid approach, and it will work. However, you are correct: it's messy to create, and messy to recover the data.

Virtual Datasets might be a more elegant solution....but tracking and creating the virtual layout definitions could get messy too. :-) However, once you have the virtual data set, you can read it with typical slice notation. HDF5/h5py handles everything under the covers.

To demonstrate, I created a "simple" example (realizing virtual datasets aren't "simple"). That said, if you can figure out region references, you can figure out virtual datasets. Here is a link to the h5py Virtual Dataset Documentation and Example for details. Here is a short summary of the process:

Define the virtual layout: this is the shape and dtype of the virtual dataset that will point to other datasets.
Define the virtual sources. Each is a reference to a HDF5 file and dataset (1 virtual source for file/dataset combination.)
Map virtual source data to the virtual layout (you can use slice notation, which is shown in my example).
Repeat steps 2 and 3 for all sources (or slices of sources)

Note: virtual datasets can be in a separate file, or in the same file as the referenced datasets. I will show both in the example. (Once you have defined the layout and sources, both methods are equally easy.)

There are at least 3 other SO questions and answers on this topic:

Example follows:
Step 1: Create some example data. Without your schema, I guessed at how you stored "standard rate" and "burst rate" data. All standard rate data is stored in dataset 'data_log' and each burst is stored in a separate dataset named: 'burst_log_##'.

import numpy as np
import h5py

log_ntimes = 31
log_inc = 1e-3

arr = np.zeros((log_ntimes,2))
for i in range(log_ntimes):
    time = i*log_inc
    arr[i,0] = time
    #temp = 70.+ 100.*time
    #print(f'For Time = {time:.5f} ; Temp= {temp:.4f}')

arr[:,1] = 70.+ 100.*arr[:,0]
#print(arr)

with h5py.File('SO_72654160.h5','w') as h5f:
    h5f.create_dataset('data_log',data=arr)

n_bursts = 4    
burst_ntimes = 11
burst_inc = 5e-5

for n in range(1,n_bursts):
    arr = np.zeros((burst_ntimes-1,2))
    for i in range(1,burst_ntimes):
        burst_time = 0.01*(n)
        time = burst_time + i*burst_inc
        arr[i-1,0] = time
        #temp = 70.+ 100.*t
    arr[:,1] = 70.+ 100.*arr[:,0]    
    
    with h5py.File('SO_72654160.h5','a') as h5f:
        h5f.create_dataset(f'burst_log_{n:02}',data=arr)

Step 2: This is where the virtual layout and sources are defined and used to create the virtual dataset. This creates a virtual dataset a new file, and one in the existing file. (The statements are identical except for the file name and mode.)

source_file = 'SO_72654160.h5'

a0 = 0
with h5py.File(source_file, 'r') as h5f:
    for ds_name in h5f:
        a0 += h5f[ds_name].shape[0]

print(f'Total data rows in source = {a0}')

# alternate getting data from
#   dataset: data_log, get rows 0-11, 11-21, 21-31 
#   datasets: burst_log_01, burst log_02, etc (each has 10 rows)

# Define virtual dataset layout
layout = h5py.VirtualLayout(shape=(a0, 2),dtype=float)

# Map virstual dataset to logged data 
vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(41,2))
layout[0:11,:] = vsource1[0:11,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10,2))
layout[11:21,:] = vsource2

layout[21:31,:] = vsource1[11:21,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10,2))
layout[31:41,:] = vsource2

layout[41:51,:] = vsource1[21:31,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10,2))
layout[51:61,:] = vsource2
   
# Create NEW file, then add virtual dataset
with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')

# Open EXISTING file, then add virtual dataset 
with h5py.File('SO_72654160.h5', 'a') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')

回复收藏 0 原文

~没有更多了~