Storing and reloading large multidimensional data sets in Python

Posted 2024-10-18 10:38:51

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:

My data will basically take the form of a multidimensional array where each entry will look something like this:

data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]

Each argument has roughly the following numbers of potential values:

stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000

Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.

So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
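As a rough check on these numbers, the sketch below recomputes the entry count and estimates the raw size of the float results. The 8-byte float width is an assumption (the question does not say which precision is needed), and the estimate ignores any storage for keys, indexes, or file-format overhead.

```python
# Approximate counts taken from the description above.
n_arg1 = 50
n_arg2_total, n_arg2_filled = 20, 16
n_arg3 = 6
n_arg4 = 24
n_int_total, n_int_filled = 10_000, 5_000
floats_per_entry = 12
bytes_per_float = 8  # assumption: 64-bit floats

# Entries that actually hold data.
n_entries = n_arg1 * n_arg2_filled * n_arg3 * n_arg4 * n_int_filled

# Slots a fully dense array over all potential values would need.
n_dense = n_arg1 * n_arg2_total * n_arg3 * n_arg4 * n_int_total

raw_bytes = n_entries * floats_per_entry * bytes_per_float
fill_fraction = n_entries / n_dense

print(f"filled entries: {n_entries:,}")
print(f"dense slots:    {n_dense:,}")
print(f"fill fraction:  {fill_fraction:.0%}")
print(f"raw float data: {raw_bytes / 1e9:.1f} GB")
```

Under these assumptions the results alone come to roughly 55 GB, and about 40% of a fully dense array would be occupied.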

I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:

a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
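For reference, the third approach might look something like the minimal sketch below. The key values and the file name `results.pkl` are made up for illustration; only the tuple-key layout and the use of pickle come from the description above.

```python
import os
import pickle
import tempfile

# Hypothetical entries: (stringArg1, stringArg2, stringArg3, stringArg4, intArg1)
# maps to a list of 12 float results.
data = {
    ("alpha", "x", "p", "q", 17): [0.1 * i for i in range(12)],
    ("alpha", "y", "p", "q", 42): [1.0] * 12,
}

# Save the whole dictionary to disk in one shot.
path = os.path.join(tempfile.mkdtemp(), "results.pkl")
with open(path, "wb") as fh:
    pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)

# Reload it later; the entire dictionary is read back into memory.
with open(path, "rb") as fh:
    reloaded = pickle.load(fh)

# Add a new entry and update an existing one.
reloaded[("alpha", "x", "p", "q", 18)] = [2.0] * 12
reloaded[("alpha", "y", "p", "q", 42)][0] = 9.0
```

Note that pickle loads and saves the dictionary as a whole, so at the sizes described here the full data set would have to fit in memory at once, which seems unlikely on a typical machine.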

There's one issue I run into with all three approaches: I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as fields in a table or as keys in the Python dictionary. From my (possibly naive) point of view, this shouldn't be necessary. If these were all integer arguments, they would simply form the address of each data entry in the array, and there would be no need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200], [300, 400]], I would retrieve a value by asking for the entry at an address such as array[0][1]. I wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
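One way to picture the "address" idea with string arguments is to store each axis's labels once and translate a label to its position before indexing. The sketch below does this in plain Python for a tiny two-string-axis case; the labels, the axis sizes, and the `flat_index` helper are all hypothetical, and the dense layout it assumes would leave unfilled slots for the sparse combinations.

```python
# Hypothetical labels; the real ones would come from the simulation output.
axis_labels = [
    ["alpha", "beta"],   # stands in for stringArg1
    ["x", "y", "z"],     # stands in for stringArg2
]
n_int = 4  # stands in for the number of intArg1 values

# One label -> position lookup per string axis, stored once.
lookups = [{label: i for i, label in enumerate(labels)} for labels in axis_labels]
shape = [len(labels) for labels in axis_labels] + [n_int]


def flat_index(string_args, int_arg):
    """Return the row-major offset of one entry in a dense array of `shape`."""
    indices = [lookup[s] for lookup, s in zip(lookups, string_args)] + [int_arg]
    offset = 0
    for index, size in zip(indices, shape):
        if not 0 <= index < size:
            raise IndexError(f"index {index} out of range for axis of size {size}")
        offset = offset * size + index
    return offset


print(flat_index(("alpha", "x"), 0))  # 0
print(flat_index(("beta", "z"), 3))   # 23
```

With the sizes in the question, the lookups would hold only 50 + 20 + 6 + 24 = 100 strings in total, rather than one key tuple per entry. The cost is that a dense array reserves space for every potential combination, filled or not.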

What I would love to be able to do is define a table in PyTables where the cells of the first table contain other tables. For example, the top-level table would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would in turn have two columns, the first holding all the possible values of stringArg2 and the second holding another set of sub-sub-tables...
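The nested layout described above can be illustrated with plain nested dictionaries. This is only a sketch of the shape of the structure, not PyTables code; the helper names `nested_set` and `nested_get` and the key values are invented for the example.

```python
def nested_set(root, keys, value):
    """Store `value` under a path of keys, creating sub-dictionaries as needed."""
    node = root
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def nested_get(root, keys):
    """Walk down the path of keys and return the stored value."""
    node = root
    for key in keys:
        node = node[key]
    return node


tree = {}
# Keys are (stringArg1, stringArg2, stringArg3, stringArg4, intArg1).
nested_set(tree, ("alpha", "x", "p", "q", 17), [0.0] * 12)
nested_set(tree, ("alpha", "x", "p", "q", 18), [1.0] * 12)
nested_set(tree, ("alpha", "y", "p", "q", 17), [2.0] * 12)

# "alpha" is stored once at the top level, however many entries sit below it.
print(list(tree))
print(sorted(tree["alpha"]))
print(nested_get(tree, ("alpha", "x", "p", "q", 18))[:3])
```

In this layout each string value appears once per branch instead of once per entry, which is the property the nested-table idea is after.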

That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.

I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
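To make the star schema concern concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names are invented, only two of the four string arguments and two of the twelve results are shown, and an in-memory database stands in for a real file. The fact table still has one row per filled combination, but each row carries small integer ids instead of the strings themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_arg1 (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
CREATE TABLE dim_arg2 (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
CREATE TABLE fact (
    arg1_id  INTEGER REFERENCES dim_arg1(id),
    arg2_id  INTEGER REFERENCES dim_arg2(id),
    int_arg  INTEGER,
    result01 REAL,
    result02 REAL,
    PRIMARY KEY (arg1_id, arg2_id, int_arg)
);
""")


def label_id(table, label):
    """Return the integer id for a label, inserting it on first use.

    `table` is one of the fixed dimension table names above, never user input.
    """
    conn.execute(f"INSERT OR IGNORE INTO {table} (label) VALUES (?)", (label,))
    row = conn.execute(f"SELECT id FROM {table} WHERE label = ?", (label,)).fetchone()
    return row[0]


a1 = label_id("dim_arg1", "alpha")
a2 = label_id("dim_arg2", "x")

# Add an entry, then update it: the primary key makes the second write a replacement.
conn.execute("INSERT OR REPLACE INTO fact VALUES (?, ?, ?, ?, ?)", (a1, a2, 7, 0.5, 1.5))
conn.execute("INSERT OR REPLACE INTO fact VALUES (?, ?, ?, ?, ?)", (a1, a2, 7, 0.75, 1.5))

# Query by label through the dimension table.
rows = conn.execute("""
    SELECT f.int_arg, f.result01, f.result02
    FROM fact AS f JOIN dim_arg1 AS d1 ON d1.id = f.arg1_id
    WHERE d1.label = ?
""", ("alpha",)).fetchall()
n_fact = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
print(rows, n_fact)
```

So the key tuple is not eliminated in this design, only reduced to a handful of integers per row.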

Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
