Faster import of large data than Get["raggedmatrix.mx"]?
Can anybody advise an alternative to importing a couple of
GByte of numeric data (in .mx form) from a list of 60 .mx files, each about 650 MByte?
The research problem (too large to post here) involves simple statistical operations
on roughly twice as much data (around 34 GB) as available RAM (16 GB).
To handle the data size problem I just split things up and used
a Get / Clear strategy to do the math.
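A sketch of the Get/Clear strategy described above (the file names and the statistics step are placeholders, not from the original post):

```mathematica
(* process 60 chunk files one at a time, keeping only one in RAM *)
results = Table[
   data = Get[file];            (* load one ~650 MB chunk *)
   stat = Mean[Flatten[data]];  (* some simple statistical operation *)
   Clear[data];                 (* free the RAM before the next chunk *)
   stat,
   {file, FileNames["chunk*.mx"]}];
```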
It does work, but calling Get["bigfile.mx"]
takes quite some time, so I was wondering whether it would be quicker to use BLOBs with PostgreSQL, MySQL, or whatever database people use for gigabytes of numeric data.
So my question really is:
What is the most efficient way to handle truly large data set imports in Mathematica?
I have not tried it yet, but I think that SQLImport from DataBaseLink will be slower than Get["bigfile.mx"].
Does anyone have some experience to share?
(Sorry if this is not a very specific programming question, but it would really help me move on from the time-consuming business of finding out which of the 137 possible ways to tackle a problem in Mathematica is best.)
Here's an idea:
You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.
You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.

Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.
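The demonstration code did not survive in this copy; a minimal sketch of the approach (variable names are mine):

```mathematica
(* build a ragged matrix, then a flat packed array plus sublist lengths *)
ragged = Table[RandomReal[1, RandomInteger[{1, 100}]], {10^5}];
lens   = Length /@ ragged;
flat   = Developer`ToPackedArray[Flatten[ragged]];

(* starting and ending index of each sublist in the flat array *)
ends   = Accumulate[lens];
starts = ends - lens + 1;

(* reconstruct the ragged matrix after import *)
rebuilt = MapThread[Take[flat, {#1, #2}] &, {starts, ends}];
rebuilt === ragged  (* should be True *)
```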
Alternatively, store a sequence of sublist lengths and use Mr.Wizard's dynamicPartition function, which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel is going to add negligible overhead.

Importing packed arrays as MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but the import times are always a fraction of a second for packed arrays on my machine. This will solve the problem that importing data that is not packed can be slower (although, as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).
If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without the need to break it into separate MX files. Then you could import relevant parts of the file like this.

First make a test file:
Open it:
Skip the first 5 million entries:
Read 5 million entries:
Read all the remaining entries:
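The original snippets were lost in this copy; a sketch of the steps above, assuming Real64 data (file name and sizes are mine):

```mathematica
(* First make a test file: 15 million Real64 values *)
data = RandomReal[1, 15*10^6];
BinaryWrite["test.dat", data, "Real64"];
Close["test.dat"];

(* Open it: *)
str = OpenRead["test.dat", BinaryFormat -> True];

(* Skip the first 5 million entries (8 bytes per Real64): *)
SetStreamPosition[str, 8*5*10^6];

(* Read 5 million entries: *)
mid = BinaryReadList[str, "Real64", 5*10^6];

(* Read all the remaining entries: *)
rest = BinaryReadList[str, "Real64"];
Close[str];
```

SetStreamPosition jumps by byte offset on a binary stream, which is why the 8-byte size of a Real64 appears explicitly.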
(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP, by the way.)

EDIT: If you are willing to spend time on this and write some C code, another idea is to create a library function (using Library Link) that will memory-map the file (link for Windows), and copy it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of Library Link).
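A rough sketch of such a library function, shown with POSIX mmap for brevity rather than the Windows file-mapping API the link describes; the function name and error handling are mine, and this is untested:

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include "WolframLibrary.h"

DLLEXPORT mint WolframLibrary_getVersion(void) { return WolframLibraryVersion; }
DLLEXPORT int WolframLibrary_initialize(WolframLibraryData libData) {
    return LIBRARY_NO_ERROR;
}

/* Memory-map a file of raw doubles and copy it into a new rank-1 MTensor. */
DLLEXPORT int mmapToTensor(WolframLibraryData libData, mint argc,
                           MArgument *args, MArgument res) {
    char *path = MArgument_getUTF8String(args[0]);
    int fd = open(path, O_RDONLY);
    libData->UTF8String_disown(path);
    if (fd < 0) return LIBRARY_FUNCTION_ERROR;

    struct stat st;
    fstat(fd, &st);
    mint n = st.st_size / sizeof(double);

    double *src = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (src == MAP_FAILED) { close(fd); return LIBRARY_FUNCTION_ERROR; }

    MTensor t;
    mint dims[1] = { n };
    libData->MTensor_new(MType_Real, 1, dims, &t);
    memcpy(libData->MTensor_getRealData(t), src, n * sizeof(double));

    munmap(src, st.st_size);
    close(fd);
    MArgument_setMTensor(res, t);
    return LIBRARY_NO_ERROR;
}
```

On the Mathematica side this would be loaded with LibraryFunctionLoad, taking a UTF8String argument and returning a real tensor of rank 1.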
I think the two best approaches are either:
1) use Get on the *.mx file,
2) or read in that data and save it in some binary format for which you write LibraryLink code, and then read the stuff via that. That, of course, has the disadvantage that you'd need to convert your MX stuff. But perhaps this is an option.
Generally speaking Get with MX files is pretty fast.
Are you sure this is not a swapping problem?
Edit 1:
You could then also write an import converter: tutorial/DevelopingAnImportConverter
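A minimal converter registration in the spirit of that tutorial (the format name and reader function are hypothetical):

```mathematica
(* reader: turn a binary stream into {"Data" -> ...}, as the framework expects *)
myRawDataImport[stream_InputStream, opts___] :=
    {"Data" -> BinaryReadList[stream, "Real64"]};

(* register it so Import["file.raw", "RawDoubles"] dispatches to the reader *)
ImportExport`RegisterImport["RawDoubles", myRawDataImport,
    "FunctionChannels" -> {"Streams"}, "BinaryFormat" -> True];
```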