Handling huge data structures in Java
I'm working on a Java application that needs to work with very large matrices. For example, multiplying two 10 million * 10 million matrices!
Of course the Java heap does not have enough space to store even one of these matrices.
What should I do?
Should I use a database to store my matrices, bring each needed part into memory, and multiply the parts one after another?
Answers (9)
Have a look at Hadoop.
Try using a memory-mapped file: store all your data in an external file and access it via a FileChannel object.
Check out this article for a brief introduction to MMFs.
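A minimal sketch of that idea, assuming a row-major binary file of doubles named matrixA.bin that already exists at the right size (the file name, window size, and dimensions are all made up for illustration). Note that a single MappedByteBuffer is limited to roughly 2 GB, so a real solution would keep a set of mappings and page through the matrix window by window:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch only: map one small "window" of rows from a huge row-major
// matrix file of doubles. Assumes matrixA.bin exists and is large enough.
public class MappedMatrixWindow {
    public static void main(String[] args) throws IOException {
        long cols = 10_000_000L;                 // columns in the full matrix
        long bytesPerRow = cols * Double.BYTES;  // 80 MB per row
        int windowRows = 16;                     // 16 rows ~ 1.28 GB, under the 2 GB mapping limit

        try (FileChannel ch = FileChannel.open(Paths.get("matrixA.bin"),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long firstRow = 0;                   // which window of rows to map
            MappedByteBuffer window = ch.map(FileChannel.MapMode.READ_WRITE,
                    firstRow * bytesPerRow, windowRows * bytesPerRow);

            // Read element (r, c), where r is relative to the mapped window.
            long r = 5, c = 42;
            double value = window.getDouble((int) ((r * cols + c) * Double.BYTES));
            System.out.println(value);
        }
    }
}
```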
Have a look at CGL-MapReduce
http://www.cs.indiana.edu/~jekanaya/cglmr.html#Matrix_Multiplication
First off, a 10 million x 10 million matrix is simply enormous. Assuming doubles for each cell and no storage overhead, each one of these things is going to be 800 terabytes. Just reading each cell once over from main memory (should it somehow magically fit there, which clearly isn't happening) would take days. Doing it from any sort of plausible SAN (we'll put it on 10GbE) would more likely take months. And no matrix multiply has O(n) complexity - the normal approaches are O(n^3). So... you aren't doing this with memory-mapped files, common databases, or anything of that sort.
Code doing something like this is going to live or die on cache efficiency, where "cache" includes making good use of main memory and local disk drives. Since any storage interface holding more than one 800-terabyte matrix is bound to be a SAN of some sort, you will almost certainly involve multiple servers reading and working on different parts of it, too.
There are lots of well-known ways to parallelise matrix multiplication (essentially multiplying various-sized sub-matrices and then combining the results) and to shift the layout so that the access patterns have reasonable cache locality, by organizing the data around space-filling curves instead of row/column arrangements. You're certainly going to want to look at the classic LAPACK interfaces and design, and at Intel's MKL and GotoBLAS as implementations of the BLAS functions tuned to specific modern hardware; after that you're probably venturing into unexplored territory :-)
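To make the sub-matrix idea concrete, here is a tiny in-memory sketch of blocked (tiled) multiplication - the same decomposition that out-of-core and distributed schemes are built on. The sizes and block size are arbitrary toy values so it actually runs:

```java
// Blocked (tiled) matrix multiplication: c += a * b, computed block by block
// so each block of a and b is reused while it is hot in cache (or, at larger
// scale, while it is resident in memory instead of on disk).
public class BlockedMultiply {

    static void multiply(double[][] a, double[][] b, double[][] c, int blockSize) {
        int n = a.length;
        for (int ii = 0; ii < n; ii += blockSize)
            for (int kk = 0; kk < n; kk += blockSize)
                for (int jj = 0; jj < n; jj += blockSize)
                    // multiply the (ii,kk) block of a by the (kk,jj) block of b
                    // and accumulate into the (ii,jj) block of c
                    for (int i = ii; i < Math.min(ii + blockSize, n); i++)
                        for (int k = kk; k < Math.min(kk + blockSize, n); k++) {
                            double aik = a[i][k];
                            for (int j = jj; j < Math.min(jj + blockSize, n); j++)
                                c[i][j] += aik * b[k][j];
                        }
    }

    public static void main(String[] args) {
        int n = 256;
        double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) { a[i][j] = i + j; b[i][j] = i - j; }
        multiply(a, b, c, 64);
        System.out.println(c[0][0]);
    }
}
```

At the scale in the question, each "block" would itself be a file (or a chunk on a remote server) rather than an in-memory array, but the loop structure stays the same.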
The complexity of matrix multiplication, if carried out naively, is O(n^3), but more efficient algorithms do exist. Anyway, for a 10 million * 10 million matrix this is going to take a very long time, and you may well face the same heap problem, but with recursion.
If you're into complex maths, you may find tools to help you in this article.
Consider using an in-memory DB like http://hsqldb.org/.
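For what that might look like, here is a rough JDBC sketch against an HSQLDB in-memory database; the table and column names are made up, and note that an in-memory database still lives on the Java heap, so it does not by itself solve the size problem:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch: store matrix cells as (row, column, value) triples in HSQLDB.
public class MatrixDb {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:mem:matrixdb", "SA", "")) {
            conn.createStatement().execute(
                "CREATE TABLE matrix_a (i BIGINT, j BIGINT, v DOUBLE, PRIMARY KEY (i, j))");
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO matrix_a VALUES (?, ?, ?)")) {
                ps.setLong(1, 0);      // row index
                ps.setLong(2, 0);      // column index
                ps.setDouble(3, 1.5);  // cell value
                ps.executeUpdate();
            }
        }
    }
}
```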
Since this is such a huge calculation, I think you're going to run into performance problems alongside your storage problems. So I would look at parallelising this problem and getting multiple machines/cores to process a subset of the data.
Luckily, a matrix multiplication solution will decompose naturally. But I would be looking at some form of grid or distributed computing solution.
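As a local-scale illustration of that decomposition (a sketch only, with made-up toy sizes), independent blocks of the result matrix can be handed out to a thread pool; in a real grid setup each task would instead be shipped to a worker node that loads only the row and column blocks it needs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One task per block of the result matrix c; the blocks are disjoint,
// so the tasks can run in parallel without synchronisation.
public class ParallelBlocks {
    public static void main(String[] args) throws Exception {
        int n = 512, blockSize = 128;  // toy sizes, n divisible by blockSize
        double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int ii = 0; ii < n; ii += blockSize) {
            for (int jj = 0; jj < n; jj += blockSize) {
                final int i0 = ii, j0 = jj;
                tasks.add(() -> {      // compute the (i0, j0) block of c
                    for (int i = i0; i < i0 + blockSize; i++)
                        for (int k = 0; k < n; k++)
                            for (int j = j0; j < j0 + blockSize; j++)
                                c[i][j] += a[i][k] * b[k][j];
                    return null;
                });
            }
        }
        pool.invokeAll(tasks);         // blocks until every block is done
        pool.shutdown();
    }
}
```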
Use whatever sparse matrix algorithm applies to your data. (This is on the assumption that you don't have 2.4 PB of disk space to hold three 10^7 x 10^7 non-sparse matrices of doubles, let alone that much RAM for an in-memory database - Blue Gene/Q 'only' has 1.6 PB.)
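If the data really is sparse, even a very simple structure that stores only the non-zero cells changes the picture entirely. A minimal map-backed sketch (the class and method names are made up for illustration; production code would use a proper sparse format such as CSR):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a sparse matrix that stores only non-zero cells.
// Memory use is proportional to the number of non-zeros, not rows * cols.
public class SparseMatrix {
    private final Map<Long, Double> cells = new HashMap<>();
    private final long cols;

    public SparseMatrix(long rows, long cols) {
        this.cols = cols;
    }

    private long key(long i, long j) {
        return i * cols + j;  // fits in a long for 10^7 x 10^7 matrices
    }

    public void set(long i, long j, double v) {
        if (v != 0.0) cells.put(key(i, j), v);
        else cells.remove(key(i, j));
    }

    public double get(long i, long j) {
        return cells.getOrDefault(key(i, j), 0.0);
    }
}
```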
Well, if you are forced to use Java and can't write the code that deals with this as native methods (that is, by telling Java to call some C code instead), then the most efficient thing to do would probably be to use a simple binary file. I would stay away from databases in this case because they are slower than direct file access, and you don't need the features they offer.
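A bare-bones sketch of that kind of direct file access, assuming a row-major binary file of doubles (the file layout and class name are invented for illustration):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Random access into a plain binary file holding a row-major matrix of doubles.
public class BinaryMatrixFile {
    private final RandomAccessFile file;
    private final long cols;

    public BinaryMatrixFile(String path, long cols) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.cols = cols;
    }

    public double get(long i, long j) throws IOException {
        file.seek((i * cols + j) * Double.BYTES);  // 8 bytes per double
        return file.readDouble();
    }

    public void set(long i, long j, double v) throws IOException {
        file.seek((i * cols + j) * Double.BYTES);
        file.writeDouble(v);
    }
}
```

In practice you would read and write whole blocks of rows at a time rather than single cells, to avoid paying a seek per element.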