Debugging Java Out of Memory Errors
I'm still a relatively new programmer, and an issue I keep having in Java is Out of Memory Errors. I don't want to increase the memory using -Xmx, because I feel that the error is due to poor programming, and I want to improve my coding rather than rely on more memory.
The work I do involves processing lots of text files, each around 1GB when compressed. The code I have here is meant to loop through a directory where new compressed text files are being dropped. It opens the second most recent text file (not the most recent, because this is still being written to), and uses the Jsoup library to parse certain fields in the text file (fields are separated with custom delimiters: "|nTa|" designates a new column and "|nLa|" designates a new row).
I feel there should be no reason for using a lot of memory. I open a file, scan through it, parse the relevant bits, write the parsed version into another file, close the file, and move onto the next file. I don't need to store the whole file in memory, and I certainly don't need to store files that have already been processed in memory.
I'm getting errors when I start parsing the second file, which suggests that I'm not dealing with garbage collection. Please have a look at the code, and see if you can spot things that I'm doing that mean I'm using more memory than I should be. I want to learn how to do this right so I stop getting memory errors!
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Scanner;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.jsoup.Jsoup;

public class ParseHTML {

    public static int commentExtractField = 3;
    public static int contentExtractField = 4;
    public static int descriptionField = 5;

    public static void main(String[] args) throws Exception {
        File directoryCompleted = null;
        File filesCompleted[] = null;
        while(true) {
            // find second most recent file in completed directory
            directoryCompleted = new File(args[0]);
            filesCompleted = directoryCompleted.listFiles();
            if (filesCompleted.length > 1) {
                TreeMap<Long, File> timeStamps = new TreeMap<Long, File>(Collections.reverseOrder());
                for (File f : filesCompleted) {
                    timeStamps.put(getTimestamp(f), f);
                }
                File fileToProcess = null;
                int counter = 0;
                for (Long l : timeStamps.keySet()) {
                    fileToProcess = timeStamps.get(l);
                    if (counter == 1) {
                        break;
                    }
                    counter++;
                }
                // start processing file
                GZIPInputStream gzipInputStream = null;
                if (fileToProcess != null) {
                    gzipInputStream = new GZIPInputStream(new FileInputStream(fileToProcess));
                }
                else {
                    System.err.println("No file to process!");
                    System.exit(1);
                }
                Scanner scanner = new Scanner(gzipInputStream);
                scanner.useDelimiter("\\|nLa\\|");
                GZIPOutputStream output = new GZIPOutputStream(new FileOutputStream("parsed/" + fileToProcess.getName()));
                while (scanner.hasNext()) {
                    Scanner scanner2 = new Scanner(scanner.next());
                    scanner2.useDelimiter("\\|nTa\\|");
                    ArrayList<String> row = new ArrayList<String>();
                    while(scanner2.hasNext()) {
                        row.add(scanner2.next());
                    }
                    for (int index = 0; index < row.size(); index++) {
                        if (index == commentExtractField ||
                            index == contentExtractField ||
                            index == descriptionField) {
                            output.write(jsoupParse(row.get(index)).getBytes("UTF-8"));
                        }
                        else {
                            output.write(row.get(index).getBytes("UTF-8"));
                        }
                        String delimiter = "";
                        if (index == row.size() - 1) {
                            delimiter = "|nLa|";
                        }
                        else {
                            delimiter = "|nTa|";
                        }
                        output.write(delimiter.getBytes("UTF-8"));
                    }
                }
                output.finish();
                output.close();
                scanner.close();
                gzipInputStream.close();
            }
        }
    }

    public static Long getTimestamp(File f) {
        String name = f.getName();
        String removeExt = name.substring(0, name.length() - 3);
        String timestamp = removeExt.substring(7, removeExt.length());
        return Long.parseLong(timestamp);
    }

    public static String jsoupParse(String s) {
        if (s.length() == 4) {
            return s;
        }
        else {
            return Jsoup.parse(s).text();
        }
    }
}
How can I make sure that when I finish with objects, they are destroyed and not using any resources? For example, each time I close the GZIPInputStream, GZIPOutputStream and Scanner, how can I make sure they're completely destroyed?
For the record, the error I'm getting is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
    at java.lang.StringBuilder.append(StringBuilder.java:203)
    at org.jsoup.parser.TokeniserState$47.read(TokeniserState.java:1171)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:42)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:101)
    at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:53)
    at org.jsoup.parser.Parser.parse(Parser.java:24)
    at org.jsoup.Jsoup.parse(Jsoup.java:44)
    at ParseHTML.jsoupParse(ParseHTML.java:125)
    at ParseHTML.main(ParseHTML.java:81)
Comments (5)
I haven't spent very long analysing your code (nothing stands out), but a good general-purpose start would be to familiarise yourself with the free VisualVM tool. This is a reasonable guide to its use, though there are many more articles.
There are better commercial profilers in my opinion - JProfiler for one - but it will at the very least show you what objects/classes most memory is being assigned to, and possibly the method stack traces that caused that to happen. More simply it shows you heap allocation over time, and you can use this to judge whether you are failing to clear something or whether it is an unavoidable spike.
I suggest this rather than looking at the specifics of your code because it is a useful diagnostic skill to have.
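Before reaching for a profiler, a rough view of heap use over time can also be had from inside the program. The following is a minimal sketch, not part of the original code: the class name HeapLog and its methods are hypothetical, and it relies only on the standard Runtime API. Logging one line per processed file would show whether usage keeps climbing from file to file or spikes on a particular input.

```java
final class HeapLog {

    /** Returns the heap currently in use, in bytes (total minus free). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats one log line; the label identifies the current step. */
    static String describe(String label) {
        long usedMb = usedHeapBytes() / (1024 * 1024);
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        return label + ": used=" + usedMb + "MB max=" + maxMb + "MB";
    }

    public static void main(String[] args) {
        System.err.println(describe("before work"));
        byte[] block = new byte[16 * 1024 * 1024]; // stand-in for real work
        System.err.println(describe("after allocating " + block.length + " bytes"));
    }
}
```

These numbers are coarse, since they include garbage that has not yet been collected, so they complement a tool like VisualVM rather than replace it.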
Update: This issue was fixed in JSoup 1.6.2.
It looks to me like it's probably a bug in the JSoup parser that you're using. At present the documentation for JSoup.parse() has a warning: "BETA: if you do get an exception raised, or a bad parse-tree, please file a bug." This suggests they aren't confident that it's completely safe for use in production code.
I also found several bug reports mentioning out of memory exceptions, one of which suggests that it's due to parse error objects being held statically by JSoup, and that downgrading from JSoup 1.6.1 to 1.5.2 may be a work-around.
I am wondering if your parse is failing because bad HTML (e.g. unclosed tags, unpaired quotes or whatnot) is being parsed. You could add some println output to see how far you are getting in the document, if at all. The Java library may not recognise the end of the document/file before running out of memory.
parse
public static Document parse(String html)
Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parse(java.lang.String)
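The println suggestion could look roughly like the sketch below. It is only an illustration: ProgressLog and parseWithLogging are hypothetical names, and the parser is passed in as a plain Function so the snippet does not depend on JSoup. The last line printed before the OutOfMemoryError would identify the row and field that triggered it, and how large that field was.

```java
import java.util.function.Function;

final class ProgressLog {

    /** Logs the position and size of a field, then hands it to the parser. */
    static String parseWithLogging(long rowNumber, int fieldIndex, String field,
                                   Function<String, String> parser) {
        System.err.println("row " + rowNumber
                + ", field " + fieldIndex
                + ", length " + field.length());
        return parser.apply(field);
    }
}
```

In the question's code the parser argument would be ParseHTML::jsoupParse, with the row counter incremented in the outer while loop.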
It's a little hard to tell what's going on but two things come to my mind.
1) In some weird circumstances (depending on the input file), the following loop might load the entire file into memory:
2) By looking at the stack trace it seems that jsoupParse is the problem. I believe that this line
Jsoup.parse(s).text();
loads s into memory first, and depending on the string size (which again depends on the particular file input) this might cause the OutOfMemoryError.
Maybe a combination of the two points above is the issue. Again, it's hard to tell by just looking at the code.
Does this always happen with the same file? Did you check the input content and the custom delimiters in it?
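To make the first point concrete: a Scanner returns everything up to the next delimiter as a single token, so if a file never contains "|nLa|" the first call to next() yields the whole input as one String. The sketch below shows that behaviour on small strings and adds a simple size check. TokenSizeCheck, firstTokenLength, isSuspiciouslyLarge and the MAX_FIELD_CHARS threshold are all hypothetical and chosen only for illustration.

```java
import java.util.Scanner;

final class TokenSizeCheck {

    // Hypothetical limit; pick a value that fits the real data.
    static final int MAX_FIELD_CHARS = 1_000_000;

    /** Length of the first token when the input is split on the row delimiter. */
    static int firstTokenLength(String input) {
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter("\\|nLa\\|");
            return scanner.hasNext() ? scanner.next().length() : 0;
        }
    }

    /** True if a token is larger than anything a sane row or field should be. */
    static boolean isSuspiciouslyLarge(String token) {
        return token.length() > MAX_FIELD_CHARS;
    }

    public static void main(String[] args) {
        // Row delimiter present: the first token is just the first row.
        System.out.println(firstTokenLength("a|nTa|b|nLa|c|nTa|d"));
        // Row delimiter missing: the first token is the entire input.
        System.out.println(firstTokenLength("a|nTa|b c|nTa|d"));
    }
}
```

A check like isSuspiciouslyLarge before calling the HTML parser would turn a silent memory blow-up into a logged, skippable row.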
Assuming the problem is not in JSoup code, we can do some memory optimization. For example, ArrayList<String> row could be stripped, as it holds all parsed lines in memory, but only one line is needed for parsing. Inner loop with row removed:
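A minimal sketch of such a loop, under stated assumptions: RowWriter and writeRow are hypothetical names, the HTML-to-text step is passed in as a Function (in the question's code it would be ParseHTML::jsoupParse), and the field indices 3, 4 and 5 are copied from the question. Each field is written as soon as it is read, so no list of fields is kept.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.function.Function;

final class RowWriter {

    /** Streams the fields of one row to the output without buffering them. */
    static void writeRow(String rowText, OutputStream output,
                         Function<String, String> htmlToText) throws IOException {
        try (Scanner fields = new Scanner(rowText)) {
            fields.useDelimiter("\\|nTa\\|");
            int index = 0;
            while (fields.hasNext()) {
                String field = fields.next();
                if (index > 0) {
                    // Column delimiter goes between fields, not after the last one.
                    output.write("|nTa|".getBytes(StandardCharsets.UTF_8));
                }
                boolean isHtml = index == 3 || index == 4 || index == 5;
                String text = isHtml ? htmlToText.apply(field) : field;
                output.write(text.getBytes(StandardCharsets.UTF_8));
                index++;
            }
            if (index > 0) {
                // Row delimiter closes any row that had at least one field.
                output.write("|nLa|".getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeRow("a|nTa|b|nTa|c|nTa|<i>d</i>", out, s -> s.replaceAll("<[^>]*>", ""));
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The saving is modest, since the list only ever held the fields of one row; if a single field is huge, this change alone will not prevent the error.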