在 Python 中存储和重新加载大型多维数据集

发布于 2024-10-18 10:38:51 字数 1637 浏览 1 评论 0原文

我将运行大量模拟,生成大量需要存储并稍后再次访问的数据。我的模拟程序的输出数据被写入文本文件(每个模拟一个)。我计划编写一个 Python 程序来读取这些文本文件,然后以更方便以后分析的格式存储数据。经过大量搜索后,我认为我遇到了信息过载的问题,因此我将这个问题提交给 Stack Overflow 寻求一些建议。详细信息如下:

我的数据基本上采用多维数组的形式,其中每个条目看起来像这样:

data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]

每个参数大致具有以下数量的潜在值:

stringArg1: 50

stringArg2: 20

stringArg3: 6

stringArg4: 24

intArg1: 10,000

但请注意,数据集将是稀疏的。例如,对于 stringArg1 的给定值,将仅填充 stringArg2 的大约 16 个值。此外,对于 (stringArg1, stringArg2) 的给定组合,将填充 intArg1 的大约 5000 个值。第三个和第四个字符串参数是总是完全填满。

因此,有了这些数字,我的数组将大约有 50*16*6*24*5000 = 576,000,000 个结果列表。

我正在寻找存储此数组的最佳方法,以便我可以保存它并稍后重新打开它以添加更多数据、更新现有数据或查询现有数据进行分析。到目前为止,我已经研究了三种不同的方法:

  1. 关系数据库

  2. PyTables

  3. 使用元组作为字典键的Python字典(使用pickle来保存 使用元组

我在所有三种方法中都遇到一个问题,我总是最终存储 (stringArg1, stringArg2) 的每个元组组合、 stringArg3、 stringArg4、 intArg1),可以作为表中的字段,也可以作为 Python 字典中的键。从我(可能是天真的)的角度来看,这似乎没有必要。如果这些都是整数参数,那么它们只会形成数组中每个数据条目的地址,并且不需要将所有潜在的地址组合存储在单独的字段中。例如,如果我有一个 2x2 数组 = [[100, 200] , [300, 400]],您将通过询问地址数组 [0][1] 处的值来检索值。您不需要将所有可能的地址元组 (0,0) (0,1) (1,0) (1,1) 存储在其他地方。所以我希望找到解决这个问题的方法。

我希望能够在 PyTables 中定义一个表,其中第一个表中的单元格包含其他表。例如,顶级表将有两列。第一列中的条目将是 stringArg1 的可能值。第二列中的每个条目都是一个表。这些子表将有两列,第一列是 stringArg2 的所有可能值,第二列是子子表的另一列......

这种解决方案将很容易浏览和查询(特别是如果我可以使用 ViTables 浏览数据)。问题是 PyTables 似乎不支持让一个表的单元格包含其他表。所以我似乎已经走进了死胡同。

我一直在阅读数据仓库和星型模式方法,但看起来您的事实表仍然需要包含每个可能的参数组合的元组。

好吧,这就是我现在的处境。任何和所有的建议将非常感激。此时我已经搜索得太多了,以至于我的大脑受伤了。我想是时候请教一下专家了。

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:

My data will basically take the form of a multidimensional array where each entry will look something like this:

data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]

Each argument has roughly the following numbers of potential values:

stringArg1: 50

stringArg2: 20

stringArg3: 6

stringArg4: 24

intArg1: 10,000

Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.

So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.

I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:

  1. a relational database

  2. PyTables

  3. Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)

There's one issue I run into in all three approaches, I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table, or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]] you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.

What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...

That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.

I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.

Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

想挽留 2024-10-25 10:38:51

为什么不使用一个大表来保存所有 5 亿条条目呢?如果您使用动态压缩(此处推荐使用 Blosc 压缩器),则大多数重复条目都将被删除,因此存储开销将保持在最低限度。我建议尝试一下;有时简单的解决方案效果最好;-)

Why not using a big table for keep all the 500 millions of entries? If you use on-the-flight compression (Blosc compressor recommended here), most of the duplicated entries will be deduped, so the overhead in storage is kept under a minimum. I'd recommend give this a try; sometimes the simple solution works best ;-)

你げ笑在眉眼 2024-10-25 10:38:51

基本的 6 表方法不适用是否有原因?

即表 1-5 将是定义每个字段的有效值的单列表,然后最终表将是定义实际存在的条目的 5 列表。

或者,如果如您所描述的那样,第三个和第四个字符串值的每个值始终存在,则第 6 个表可能仅包含 3 列(string1、string2、int1),并且您可以通过笛卡尔连接动态生成 string3 和 string4 的组合。

Is there a reason the basic 6 table approach doesn't apply?

i.e. Tables 1-5 would be single column tables defining the valid values for each of the fields, and then the final table would be a 5 column table defining the entries that actually exist.

Alternatively, if every value always exists for the 3rd and 4th string values as you describe, the 6th table could just consist of 3 columns (string1, string2, int1) and you generate the combinations with string3 and string4 dynamically via a Cartesian join.

遇见了你 2024-10-25 10:38:51

我不完全确定您在这里要做什么,但看起来您正在尝试创建一个(可能)稀疏多维数组。因此,我不会详细介绍如何解决您的具体问题,但我知道处理此问题的最佳软件包是 Numpy Numpy 。 Numpy 可以

用作通用数据的高效多维容器。可以定义任意数据类型。这使得 NumPy 能够无缝、快速地与各种数据库集成。

我多次使用 Numpy 进行模拟数据处理,它提供了许多有用的工具,包括轻松的文件存储/访问。

希望您能在非常容易阅读的文档中找到一些内容:

带有示例的 Numpy 文档

I'm not entirely sure of what you're trying to do here, but it looks like you trying to create a (potentially) sparse multidimensional array. So I wont go into details for solving your specific problem, but the best package I know that deals with this is Numpy Numpy. Numpy can

be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

I've used Numpy many times for simulation data processing and it provides many useful tools including easy file storage/access.

Hopefully you'll find something in it's very easy to read documentation:

Numpy Documentation with Examples

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文