Is Jsoup stuck in full GC because of too many ParseError objects?
Jsoup is a very convenient tool for parsing HTML, and we use it as a basic utility in our crawler project. Recently, though, I noticed that the crawler sometimes gets stuck doing full GCs over and over.
After dumping the object histogram with jmap, I was amazed to find a huge number of ParseError objects. From reading the source code, ParseError is not an exception but a plain object. When an HTML document is malformed, it is likely to trigger a lot of errors, so the number of these objects should be bounded to keep them from being created without limit.
Some details follow; I hope they help in finding a solution.
java.lang.Thread.State: RUNNABLE
    at org.jsoup.parser.Tokeniser.error(Tokeniser.java:211)
    at org.jsoup.parser.TokeniserState$47.read(TokeniserState.java:1170)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:42)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:101)
    at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:53)
    at org.jsoup.parser.Parser.parse(Parser.java:24)
    at org.jsoup.Jsoup.parse(Jsoup.java:44)
 num   #instances       #bytes  class name
----------------------------------------------
   1:    30110820   1204432800  org.jsoup.parser.ParseError
   2:       33076    156025088  [Ljava.lang.Object;
   3:       68836     98796360  [C
   4:       65808      9778264  <constMethodKlass>
   5:       65808      8959520  <methodKlass>
   6:       12044      8524088  [B
   7:        6424      7447912  <constantPoolKlass>
   8:      102203      5494560  <symbolKlass>
   9:        6424      4909064  <instanceKlassKlass>
  10:        5271      4171032  <constantPoolCacheKlass>
  11:      105257      3368224  java.lang.String
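To put the first row of the histogram in perspective, the instance and byte counts can be divided to get the average size of one ParseError. The snippet below is only a small sketch for that arithmetic; `HistoRow` is a hypothetical helper name, and it assumes each data row has the four whitespace-separated columns shown above.

```java
// Hypothetical helper for reading one data row of a class histogram
// such as the jmap output above. Not part of jsoup or the JDK.
final class HistoRow {
    final long instances;
    final long bytes;
    final String className;

    HistoRow(long instances, long bytes, String className) {
        this.instances = instances;
        this.bytes = bytes;
        this.className = className;
    }

    // Parses a row like "1: 30110820 1204432800 org.jsoup.parser.ParseError".
    static HistoRow parse(String line) {
        String[] parts = line.trim().split("\\s+", 4);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Unexpected histogram row: " + line);
        }
        return new HistoRow(Long.parseLong(parts[1]), Long.parseLong(parts[2]), parts[3]);
    }

    // Average shallow size of one instance, in bytes (integer division).
    long bytesPerInstance() {
        if (instances == 0) {
            throw new IllegalStateException("No instances in row");
        }
        return bytes / instances;
    }

    public static void main(String[] args) {
        HistoRow row = parse("1: 30110820 1204432800 org.jsoup.parser.ParseError");
        // prints "org.jsoup.parser.ParseError: 40 bytes per instance"
        System.out.println(row.className + ": " + row.bytesPerInstance() + " bytes per instance");
    }
}
```

Each ParseError is small (40 bytes here), so the roughly 1.2 GB comes purely from the count of about 30 million instances, which is consistent with errors being recorded without any limit.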
Comments (1)
@BalusC Thanks for your hint!
After reading the source code carefully, I found that trackErrors is always on and there is no API to set it to false. What's more, trackErrors seems to be useless.
I fixed this and republished the package, but I still find it strange. Is it a mistake?
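For readers hitting the same problem: as far as I know, later jsoup releases leave error tracking off by default and let callers opt in with a cap through `Parser.setTrackErrors(int maxErrors)`, so upgrading may be enough. The sketch below only illustrates that general idea in isolation. `BoundedErrorLog` and its methods are hypothetical names, not jsoup's real internals, and it stores plain strings instead of ParseError objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a parser's error collector; not jsoup's actual class.
final class BoundedErrorLog {
    private final int maxErrors; // 0 disables tracking entirely
    private final List<String> errors = new ArrayList<>();

    BoundedErrorLog(int maxErrors) {
        if (maxErrors < 0) {
            throw new IllegalArgumentException("maxErrors must be >= 0");
        }
        this.maxErrors = maxErrors;
    }

    // True while there is still room for another error.
    boolean canAdd() {
        return errors.size() < maxErrors;
    }

    // Builds and stores the error only if the cap has not been reached,
    // so a badly malformed document cannot allocate without limit.
    void record(int position, String message) {
        if (canAdd()) {
            errors.add("pos " + position + ": " + message);
        }
    }

    List<String> errors() {
        return Collections.unmodifiableList(errors);
    }

    public static void main(String[] args) {
        BoundedErrorLog log = new BoundedErrorLog(3);
        // Simulate a tokeniser that hits an error at every character.
        for (int pos = 0; pos < 1_000_000; pos++) {
            log.record(pos, "unexpected character");
        }
        System.out.println(log.errors().size()); // prints 3
    }
}
```

With a cap of 0 nothing is ever stored, which corresponds to turning tracking off; a small positive cap keeps the first few errors for debugging without letting the list grow with the input.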