OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with an "out of memory" error during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it.
The program first indexes a large corpus of text files that I provide. This works fine.
Then it uses this index to initialize a large 2D array. This array will have n² entries, where "n" is the number of unique words in the corpus of text. For the relatively small chunk I am testing it on (about 60 files) it needs to create approximately 30,000 x 30,000 entries. This will probably be even bigger once I run it on my full intended corpus.
It fails consistently, after the indexing step, while it is initializing the data structure (to be worked on later).
Things I have done include:
- revamping my code to use a primitive int[] instead of a TreeMap
- eliminating redundant structures, etc.
- running the program with -Xmx2g to max out the allocated memory
I am fairly confident this is not going to be a simple one-line fix, but is most likely going to require a genuinely new approach. I am looking for what that approach is. Any ideas?
Thanks,
B.
Comments (4)
It sounds like (making some assumptions about what you're using your array for) most of the entries will be 0. If so, you might consider using a sparse matrix representation.
If you really have that many entries (your current array is somewhere over 3 gigabytes already, even assuming no overhead), then you'll have to use some kind of on-disk storage, or a lazy-load/unload system.
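A minimal sketch of that sparse idea in Java (assuming, as this answer does, that most cells are 0 and that the values are int counts; the class and method names are mine, not from the question):

    import java.util.HashMap;
    import java.util.Map;

    // Sparse 2D matrix: only non-zero cells consume memory.
    class SparseMatrix {
        private final Map<Long, Integer> cells = new HashMap<>();

        // Pack (row, col) into a single long key.
        private static long key(int row, int col) {
            return ((long) row << 32) | (col & 0xFFFFFFFFL);
        }

        void increment(int row, int col) {
            cells.merge(key(row, col), 1, Integer::sum);
        }

        int get(int row, int col) {
            return cells.getOrDefault(key(row, col), 0);
        }
    }

Note that each HashMap entry carries tens of bytes of overhead, so this only wins when the matrix is genuinely sparse; a compressed sparse row layout from a numerics library would be tighter still.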
There are several causes of out of memory issues.
Firstly, the simplest case is that you simply need more heap. Perhaps you're using a 512M max heap when your program could operate correctly with 2G; increase it with -Xmx2048m as a JVM option and you're fine. Also be aware that 64-bit VMs will use up to twice the memory of 32-bit VMs, depending on the makeup of the data.

If your problem isn't that simple, then you can look at optimization: replacing objects with primitives and so on. This might be an option; I can't really say based on what you've posted.
Ultimately, however, you come to a crossroads where you have to make a choice between virtualization and partitioning.
Virtualizing in this context simply means some form of pretending there is more memory than there actually is. Operating systems do this with virtual address spaces, using hard disk space as extra memory. In your case, it could mean keeping only part of the data structure in memory at a time and persisting the rest to secondary storage (e.g. a file or database).
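One way to sketch that in Java (an illustration, not from the answer: the class name, layout, and the choice of java.nio memory mapping are all my assumptions) is to back the matrix with a memory-mapped file, so the OS pages cells in and out of RAM on demand:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Dense n x n int matrix backed by a file; the OS keeps only hot pages in RAM.
    // A single mapping is capped at 2 GB, so this toy version requires n*n*4 < 2^31;
    // a full 30,000 x 30,000 matrix would have to be split across several mappings.
    class DiskBackedMatrix {
        private final MappedByteBuffer buf;
        private final int n;

        DiskBackedMatrix(String path, int n) throws Exception {
            this.n = n;
            try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
                buf = file.getChannel()
                          .map(FileChannel.MapMode.READ_WRITE, 0, (long) n * n * 4);
            }
        }

        int get(int row, int col) {
            return buf.getInt((row * n + col) * 4);
        }

        void set(int row, int col, int value) {
            buf.putInt((row * n + col) * 4, value);
        }
    }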
Partitioning is splitting your data across multiple servers (either real or virtual). For example, if you were keeping track of stock trades on the NASDAQ, you could put stock codes starting with "A" on server1, codes starting with "B" on server2, and so on. You need to find a reasonable way to slice your data so that you reduce or eliminate the need for cross-communication, because that cross-communication is what limits your scalability.
So, as a simple case: if what you're storing is 30K words and the 30K x 30K combinations of words, you could divide it up across four servers, along the lines of the sketch below.
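The answer doesn't spell out the four-way split, so here is one purely illustrative router in Java (the alphabetical ranges are my invention, not from the answer):

    // Hypothetical partitioner: route by the first word's initial letter.
    // The ranges A-F / G-M / N-S / everything else are illustrative only.
    static int serverFor(String firstWord) {
        char c = Character.toUpperCase(firstWord.charAt(0));
        if (c >= 'A' && c <= 'F') return 1;
        if (c >= 'G' && c <= 'M') return 2;
        if (c >= 'N' && c <= 'S') return 3;
        return 4;
    }

In practice, hashing the word (e.g. hashCode() modulo the number of servers) spreads load more evenly than letter ranges, which are skewed by natural-language word frequencies.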
That's just one idea. Again, it's hard to comment without knowing specifics.
This is a common problem when dealing with large datasets. You can optimize as much as you want, but the memory will (probably) never be enough, and as soon as the dataset grows a little more you are smoked again. The most scalable solution is simply to keep less in memory, work on chunks, and persist the structure on disk (database/file).
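A sketch of that chunked style in Java, under two assumptions not stated in the answer (the matrix holds int counts, and a band of rows can be computed independently of the rest):

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.util.Arrays;

    // Build the n x n matrix one horizontal band at a time, so only
    // bandRows * n ints are ever in memory; each finished band goes to disk.
    static void buildInBands(int n, int bandRows, String outPath) throws Exception {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(outPath)))) {
            int[][] band = new int[bandRows][n];          // the only large allocation
            for (int start = 0; start < n; start += bandRows) {
                int rows = Math.min(bandRows, n - start);
                for (int[] row : band) Arrays.fill(row, 0);  // reuse the buffer
                // ... fill band[0..rows-1] from the corpus for global rows
                //     start .. start+rows-1 (application-specific) ...
                for (int r = 0; r < rows; r++)
                    for (int c = 0; c < n; c++)
                        out.writeInt(band[r][c]);         // persist, then move on
            }
        }
    }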
If you don't need a full 32 bits (the size of an int) for each value in your 2D array, perhaps a smaller type such as a byte would do the trick? Also, you should give it as much heap space as possible; 2GB is still relatively small for a modern system. RAM is cheap, especially if you're expecting to do a lot of processing in-memory.
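For scale, a quick back-of-the-envelope comparison for the 30,000-word case (illustrative arithmetic only; don't allocate both on a small heap):

    // 30,000 x 30,000 = 900 million cells.
    // As int:  900M * 4 bytes ~ 3.6 GB  -> cannot fit under -Xmx2g
    // As byte: 900M * 1 byte  ~ 0.9 GB  -> fits, but each cell maxes out at 127
    byte[][] counts = new byte[30_000][30_000];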