Java Memory Leak Caused by Massive Data Processing
I am currently developing an application that processes several files, each containing around 75,000 records (stored in binary format). When this app is run (manually, about once a month), the files together hold roughly 1 million records. The files are dropped into a folder, the user clicks Process, and the app stores the records in a MySQL database (table_1).
The records contain information that needs to be compared to another table (table_2) containing over 700k records.
I have gone about this a few ways:
METHOD 1: Import Now, Process Later
In this method, I would import the data into the database without any processing against the other table. However, when I wanted to run a report on the collected data, it would crash, presumably from a memory leak (about 1 GB used in total before the crash).
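For reference, one way such a report can avoid buffering the whole table in the JVM is MySQL Connector/J's streaming mode: a forward-only, read-only statement with a fetch size of Integer.MIN_VALUE hands rows back one at a time instead of materializing the full result set. A minimal sketch, with placeholder connection details, query, and column names:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingReport {
    public static void main(String[] args) throws Exception {
        // Connection URL and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "pass");
             // TYPE_FORWARD_ONLY + CONCUR_READ_ONLY + fetchSize(Integer.MIN_VALUE)
             // tells MySQL Connector/J to stream rows one at a time rather than
             // buffering the entire result set in the JVM.
             Statement stmt = conn.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {

            stmt.setFetchSize(Integer.MIN_VALUE);

            try (ResultSet rs = stmt.executeQuery("SELECT id, value FROM table_1")) {
                while (rs.next()) {
                    // Aggregate into the report incrementally; never collect
                    // all rows into a List or Map first.
                    process(rs.getLong("id"), rs.getString("value"));
                }
            }
        }
    }

    private static void process(long id, String value) {
        // placeholder for the actual report aggregation
    }
}
```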
METHOD 2: Import Now, Use MySQL to Process
This is what I would like to do, but in practice it didn't seem to turn out so well. Here I would write the logic for finding the correlations between table_1 and table_2 in SQL. However, the MySQL result set is massive and I couldn't get consistent output; sometimes it caused MySQL to give up.
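One hedged sketch of keeping this work entirely inside MySQL, so the big result set never travels back to Java: express the correlation as an INSERT ... SELECT join and run it in primary-key ranges so no single statement has to process all million rows at once. The results table, match_key column, chunk size, and id bounds below are assumptions for illustration only:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CorrelateInDatabase {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: results(t1_id, t2_id) and a shared match_key column.
        String sql =
            "INSERT INTO results (t1_id, t2_id) " +
            "SELECT t1.id, t2.id " +
            "FROM table_1 t1 JOIN table_2 t2 ON t1.match_key = t2.match_key " +
            "WHERE t1.id BETWEEN ? AND ?";

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(sql)) {

            long chunk = 50_000;                  // keeps each statement's work bounded
            long maxId = 1_000_000;               // placeholder; query MAX(id) in practice
            for (long start = 0; start <= maxId; start += chunk) {
                ps.setLong(1, start);
                ps.setLong(2, start + chunk - 1);
                ps.executeUpdate();               // rows never come back to the JVM
            }
        }
    }
}
```

An index on table_2's join column is usually what decides whether a join against 700k rows finishes quickly or makes MySQL appear to give up.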
METHOD 3: Import Now, Process Now
I am currently trying this method, and although the memory leak is subtle, it still only gets to about 200,000 records before crashing. I have tried numerous forced garbage collections along the way, properly destroying objects, etc. It seems something is fighting me.
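A minimal sketch of what a bounded-memory import loop could look like: read one record at a time from the binary file, add it to a JDBC batch, and flush and commit every thousand rows so nothing accumulates between batches (the comparison against table_2 could then run per batch, or be pushed into MySQL as above). The record layout (readLong/readUTF) and column names are purely hypothetical:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ImportAndProcess {
    public static void main(String[] args) throws Exception {
        // args[0]: path to one binary input file
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement insert = conn.prepareStatement(
                 "INSERT INTO table_1 (record_id, payload) VALUES (?, ?)");
             DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(args[0])))) {

            conn.setAutoCommit(false);
            int pending = 0;
            while (true) {
                long recordId;
                try {
                    recordId = in.readLong();     // hypothetical record layout
                } catch (EOFException eof) {
                    break;                        // end of file reached
                }
                String payload = in.readUTF();    // hypothetical record layout

                insert.setLong(1, recordId);
                insert.setString(2, payload);
                insert.addBatch();

                if (++pending == 1_000) {         // flush every 1,000 rows so the
                    insert.executeBatch();        // JVM never holds more than one
                    conn.commit();                // batch of records at a time
                    pending = 0;
                }
            }
            if (pending > 0) {
                insert.executeBatch();
                conn.commit();
            }
        }
    }
}
```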
I am at my wits' end trying to solve the memory leak / app crashing issue. I am no expert in Java and have yet to really deal with very large amounts of data in MySQL. Any guidance would be extremely helpful. I have put thought into these approaches:
- Breaking the processing of each line into its own class, hoping to release any memory used per line
- Some sort of stored routine where, once a row is stored into the database, MySQL does the table_1 <=> table_2 computation and stores the result (a rough sketch of this idea follows below)
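A rough sketch of that stored-routine idea, expressed here as an AFTER INSERT trigger installed once from Java, so MySQL performs the table_1 <=> table_2 lookup for every row as it arrives. The trigger name, results table, and match_key column are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InstallTrigger {
    public static void main(String[] args) throws Exception {
        // Column names (id, match_key) and the results table are assumptions.
        String trigger =
            "CREATE TRIGGER table_1_after_insert " +
            "AFTER INSERT ON table_1 FOR EACH ROW " +
            "  INSERT INTO results (t1_id, t2_id) " +
            "  SELECT NEW.id, t2.id FROM table_2 t2 " +
            "  WHERE t2.match_key = NEW.match_key";

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "pass");
             Statement stmt = conn.createStatement()) {
            stmt.execute(trigger);   // one-time setup; the comparison then runs
                                     // inside MySQL on every inserted row
        }
    }
}
```

The trade-off is that every insert now pays the lookup cost, so the join column on table_2 needs an index for this to stay fast.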
But I would like to pose the question to the many skilled Stack Overflow members to learn how this should properly be handled.
Answers (4)
I concur with the answers that say "use a profiler".
But I'd just like to point out a couple of misconceptions in your question:
The storage leak is not due to massive data processing. It is due to a bug. The "massiveness" simply makes the symptoms more apparent.
Running the garbage collector won't cure a storage leak. The JVM always runs a full garbage collection immediately before it decides to give up and throw an OOME.
It is difficult to give advice on what might actually be causing the storage leak without more information on what you are trying to do and how you are doing it.
The learning curve for a profiler like VisualVM is pretty small. With luck, you'll have an answer - at least a very big clue - within an hour or so.
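If attaching a profiler to the running job is awkward, a heap dump analyzed in VisualVM afterwards gives much of the same information. Running the app with -XX:+HeapDumpOnOutOfMemoryError writes one automatically at the crash; the sketch below triggers one programmatically through the HotSpot diagnostic MBean (the output file name is arbitrary):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);
        // live = true: dump only objects that are still reachable
        bean.dumpHeap(path, true);
    }

    public static void main(String[] args) throws Exception {
        dump("import-run.hprof");   // open this file in VisualVM afterwards
    }
}
```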
You properly handle this situation by either:
I personally prefer yjp, but there are some decent free apps as well (e.g. jvisualvm and NetBeans).
Without knowing too much about what you're doing, if you're running out of memory there's likely some point where you're storing everything in the JVM, but you should be able to do a data processing task like this without the severe memory problems you're experiencing. In the past, I've seen data processing pipelines that run out of memory because there's one class reading stuff out of the db, wrapping it all up in a nice collection, and then passing it off to another class, which of course requires all of the data to be in memory simultaneously. Frameworks are good at hiding this sort of thing.
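To make that concrete, a small sketch of the alternative shape: instead of a reader that returns a List of every row, hand each row to a callback as it is read, so only one row needs to be live at a time (the query and column name are placeholders):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.function.Consumer;

class RecordReader {
    // Instead of a readAll() that returns List<Record> -- which forces every
    // row into memory at once -- pass each row to the caller as it is read.
    static void forEachRecord(Connection conn, Consumer<String> handler) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT payload FROM table_1")) {
            while (rs.next()) {
                handler.accept(rs.getString("payload"));
            }
        }
    }
}
```

Note that with MySQL Connector/J you would also need the forward-only, read-only statement with a fetch size of Integer.MIN_VALUE; otherwise the driver buffers the whole result set client-side regardless of how the rows are consumed.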
Heap dumps / digging with VisualVM haven't been terribly helpful for me, as the details I'm looking for are often hidden - e.g. if you've got a ton of memory filled with maps of strings, it doesn't really help to be told that Strings are the largest component of your memory usage; you sort of need to know who owns them.
Can you post more detail about the actual problem you're trying to solve?