Faster import of large data than Get["raggedmatrix.mx"]?
Can anybody advise an alternative to importing a couple of
GByte of numeric data (in .mx form) from a list of 60 .mx files, each about 650 MByte?
The research problem (too large to post here) involves simple statistical operations
on roughly twice as much data (around 34 GB) as available RAM (16 GB).
To handle the data size problem I just split things up and used
a Get / Clear strategy to do the math.
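A sketch of the Get/Clear strategy described above (the file names and the statistics step are placeholders, not from the original post):

```mathematica
(* process 60 chunk files one at a time, keeping only one in RAM *)
results = Table[
   data = Get[file];            (* load one ~650 MB chunk *)
   stat = Mean[Flatten[data]];  (* some simple statistical operation *)
   Clear[data];                 (* free the RAM before the next chunk *)
   stat,
   {file, FileNames["chunk*.mx"]}];
```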
It does work, but calling Get["bigfile.mx"]
takes quite some time, so I was wondering whether it would be quicker to use BLOBs with PostgreSQL, MySQL, or whatever database people use for gigabytes of numeric data.
So my question really is:
What is the most efficient way to handle truly large data set imports in Mathematica?
I have not tried it yet, but I think that SQLImport from DataBaseLink will be slower than Get["bigfile.mx"].
Does anyone have some experience to share?
(Sorry if this is not a very specific programming question, but it would really help me move on from the time-consuming business of finding out which of the 137 possible ways to tackle a problem in Mathematica is best.)
Here's an idea:
You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.
You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.

Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.
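The demonstration code did not survive in this copy; a minimal sketch of the approach (variable names are mine):

```mathematica
(* build a ragged matrix, then a flat packed array plus sublist lengths *)
ragged = Table[RandomReal[1, RandomInteger[{1, 100}]], {10^5}];
lens   = Length /@ ragged;
flat   = Developer`ToPackedArray[Flatten[ragged]];

(* starting and ending index of each sublist in the flat array *)
ends   = Accumulate[lens];
starts = ends - lens + 1;

(* reconstruct the ragged matrix after import *)
rebuilt = MapThread[Take[flat, {#1, #2}] &, {starts, ends}];
rebuilt === ragged  (* should be True *)
```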
Alternatively, store a sequence of sublist lengths and use Mr.Wizard's dynamicPartition function, which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel is going to add negligible overhead.

Importing packed arrays as MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but the import times are always a fraction of a second for packed arrays on my machine. This will solve the problem that importing data that is not packed can be slower (although, as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).
If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without the need to break it into separate MX files. Then you could import relevant parts of the file like this.

First make a test file:
Open it:
Skip the first 5 million entries:
Read 5 million entries:
Read all the remaining entries:
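The original snippets were lost in this copy; a sketch of the steps above, assuming Real64 data (file name and sizes are mine):

```mathematica
(* First make a test file: 15 million Real64 values *)
data = RandomReal[1, 15*10^6];
BinaryWrite["test.dat", data, "Real64"];
Close["test.dat"];

(* Open it: *)
str = OpenRead["test.dat", BinaryFormat -> True];

(* Skip the first 5 million entries (8 bytes per Real64): *)
SetStreamPosition[str, 8*5*10^6];

(* Read 5 million entries: *)
mid = BinaryReadList[str, "Real64", 5*10^6];

(* Read all the remaining entries: *)
rest = BinaryReadList[str, "Real64"];
Close[str];
```

SetStreamPosition jumps by byte offset on a binary stream, which is why the 8-byte size of a Real64 appears explicitly.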
(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP, by the way.)

EDIT: If you are willing to spend time on this and write some C code, another idea is to create a library function (using Library Link) that will memory-map the file (link for Windows), and copy it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of Library Link).
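A rough sketch of such a library function, shown with POSIX mmap for brevity rather than the Windows file-mapping API the link describes; the function name and error handling are mine, and this is untested:

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include "WolframLibrary.h"

DLLEXPORT mint WolframLibrary_getVersion(void) { return WolframLibraryVersion; }
DLLEXPORT int WolframLibrary_initialize(WolframLibraryData libData) {
    return LIBRARY_NO_ERROR;
}

/* Memory-map a file of raw doubles and copy it into a new rank-1 MTensor. */
DLLEXPORT int mmapToTensor(WolframLibraryData libData, mint argc,
                           MArgument *args, MArgument res) {
    char *path = MArgument_getUTF8String(args[0]);
    int fd = open(path, O_RDONLY);
    libData->UTF8String_disown(path);
    if (fd < 0) return LIBRARY_FUNCTION_ERROR;

    struct stat st;
    fstat(fd, &st);
    mint n = st.st_size / sizeof(double);

    double *src = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (src == MAP_FAILED) { close(fd); return LIBRARY_FUNCTION_ERROR; }

    MTensor t;
    mint dims[1] = { n };
    libData->MTensor_new(MType_Real, 1, dims, &t);
    memcpy(libData->MTensor_getRealData(t), src, n * sizeof(double));

    munmap(src, st.st_size);
    close(fd);
    MArgument_setMTensor(res, t);
    return LIBRARY_NO_ERROR;
}
```

On the Mathematica side this would be loaded with LibraryFunctionLoad, taking a UTF8String argument and returning a real tensor of rank 1.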
I think the two best approaches are either:
1) use Get on the *.mx file,
2) or read in that data and save it in some binary format for which you write LibraryLink code, and then read the stuff via that. That, of course, has the disadvantage that you'd need to convert your MX stuff. But perhaps this is an option.
Generally speaking Get with MX files is pretty fast.
Are you sure this is not a swapping problem?
Edit 1:
You could then also write an import converter: tutorial/DevelopingAnImportConverter
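A minimal converter registration in the spirit of that tutorial (the format name and reader function are hypothetical):

```mathematica
(* reader: turn a binary stream into {"Data" -> ...}, as the framework expects *)
myRawDataImport[stream_InputStream, opts___] :=
    {"Data" -> BinaryReadList[stream, "Real64"]};

(* register it so Import["file.raw", "RawDoubles"] dispatches to the reader *)
ImportExport`RegisterImport["RawDoubles", myRawDataImport,
    "FunctionChannels" -> {"Streams"}, "BinaryFormat" -> True];
```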