HDF5 and ndarray: time-efficient approach for appending to large datasets

Posted on 2024-10-24 05:28:43

Background

I have k n-dimensional time series, each represented as an m x (n+1) array of float values (n columns plus one that represents the date).

Example:

k (around 4 million) time series that look like

20100101    0.12    0.34    0.45    ...
20100105    0.45    0.43    0.21    ...
...         ...     ...     ...

Each day, I want to add one additional row to a subset of the datasets (fewer than k). All datasets are stored in groups in one HDF5 file.

Question

What is the most time-efficient approach to appending the rows to the datasets?

Input is a CSV file that looks like

key1, key2, key3, key4, date, value1, value2, ...

where date is unique to the particular file and can be ignored. I have around 4 million datasets. The issue is that for each row I have to look up the key, read the complete numpy array, resize the array, add the row, and store the array again. The total size of the HDF5 file is around 100 GB. Any idea how to speed this up?
I think we can agree that using SQLite or something similar won't work: once I have all the data, the average dataset will hold over 1 million elements, across 4 million datasets.
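
For illustration, the read, resize, and rewrite cycle described above can in principle be replaced by an in-place append when a dataset is created as chunked and resizable. The following is only a minimal sketch, assuming h5py is the library in use; the helper name `append_row`, the group layout (`key1/key2/key3` as nested groups, `key4` as the dataset name), and the chunk size of 256 rows are all hypothetical choices, not something stated in the question.

```python
import h5py
import numpy as np


def append_row(group, name, row, chunk_rows=256):
    """Append one row to a resizable 2-D dataset, creating it on first use.

    `chunk_rows` is an assumed tuning parameter, not a recommended value.
    """
    row = np.asarray(row, dtype="f8")
    if name not in group:
        # maxshape=(None, ...) makes the first axis unlimited, so the
        # dataset can grow without being read back into memory.
        group.create_dataset(
            name,
            shape=(0, row.size),
            maxshape=(None, row.size),
            dtype="f8",
            chunks=(chunk_rows, row.size),
        )
    dset = group[name]
    n = dset.shape[0]
    dset.resize(n + 1, axis=0)  # grow by one row in place
    dset[n, :] = row            # write only the new row
    return dset


# Demo on an in-memory file (core driver, nothing written to disk).
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    grp = f.require_group("key1/key2/key3")  # hypothetical key layout
    append_row(grp, "key4", [20100101, 0.12, 0.34, 0.45])
    append_row(grp, "key4", [20100105, 0.45, 0.43, 0.21])
    demo_shape = grp["key4"].shape
    demo_last = grp["key4"][-1, :].tolist()
```

With this layout only the new row is written per update. Whether it is actually faster for 4 million small datasets would still depend on chunk size and on how the CSV rows are ordered relative to the groups in the file.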

Thanks!
