Is Jsoup stuck in continuous full GC because of too many ParseError objects?

Posted on 2024-12-19 06:31:51

Jsoup is a very convenient tool for parsing HTML, and we use it as a basic utility in our crawler project. Recently, though, I found that our crawler sometimes ends up doing full GCs continuously.

After dumping the object histogram with jmap, I was amazed to find a huge number of ParseError objects. From reading the source code, ParseError is not an exception but a plain object. When an HTML document is malformed, it is likely to trigger a lot of errors, so the creation of these objects should be kept under control instead of growing without limit.

Some details follow; I hope they help in finding a solution.

   java.lang.Thread.State: RUNNABLE
        at org.jsoup.parser.Tokeniser.error(Tokeniser.java:211)
        at org.jsoup.parser.TokeniserState$47.read(TokeniserState.java:1170)
        at org.jsoup.parser.Tokeniser.read(Tokeniser.java:42)
        at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:101)
        at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:53)
        at org.jsoup.parser.Parser.parse(Parser.java:24)
        at org.jsoup.Jsoup.parse(Jsoup.java:44)

 num     #instances         #bytes  class name
----------------------------------------------
   1:      30110820     1204432800  org.jsoup.parser.ParseError
   2:         33076      156025088  [Ljava.lang.Object;
   3:         68836       98796360  [C
   4:         65808        9778264  <constMethodKlass>
   5:         65808        8959520  <methodKlass>
   6:         12044        8524088  [B
   7:          6424        7447912  <constantPoolKlass>
   8:        102203        5494560  <symbolKlass>
   9:          6424        4909064  <instanceKlassKlass>
  10:          5271        4171032  <constantPoolCacheKlass>
  11:        105257        3368224  java.lang.String
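
To put the first histogram row in perspective, here is a small sketch of the arithmetic. The class and method names (`HistogramMath`, `bytesPerInstance`, `toGiB`) are made up for illustration; only the two numbers come from the jmap output above.

```java
/** Back-of-the-envelope arithmetic on the jmap histogram row for ParseError. */
final class HistogramMath {

    /** Average shallow size of one instance, in bytes. */
    static long bytesPerInstance(long instances, long totalBytes) {
        return totalBytes / instances;
    }

    /** Converts a byte count to GiB (2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Values copied from row 1 of the histogram (org.jsoup.parser.ParseError).
        long instances = 30_110_820L;
        long totalBytes = 1_204_432_800L;

        System.out.println("bytes per ParseError: " + bytesPerInstance(instances, totalBytes));
        System.out.println("shallow size of all ParseError objects, GiB: " + toGiB(totalBytes));
    }
}
```

Each ParseError is only about 40 bytes, but 30 million of them add up to roughly 1.1 GiB, far more than every other class in the listing combined.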

Comments (1)

小巷里的女流氓 2024-12-26 06:31:51

@BalusC thanks for your hint!

After reading the source code carefully, I found that trackErrors is always on and there is no API to set it to false; what's more, the collected errors are never used for anything.
I fixed this and republished the package, but I still find it strange. Is this a bug?

code1:
    private boolean trackErrors = true;

code2:
    void error(TokeniserState state) {
        if (trackErrors)
            errors.add(new ParseError("Unexpected character in input", reader.current(), state, reader.pos()));
    }
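
One way to keep such an error list from growing without limit is to cap it and to check the cap before allocating the error object. The following is only a sketch of that idea, not jsoup's code: `BoundedErrorLog`, `maxErrors`, `canRecord` and `dropped` are hypothetical names, and a plain String stands in for ParseError.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical bounded error sink; a stand-in for illustration, not a jsoup class. */
final class BoundedErrorLog {
    private final int maxErrors;      // 0 disables tracking entirely
    private final List<String> errors;
    private long dropped = 0;         // errors seen after the cap was reached

    BoundedErrorLog(int maxErrors) {
        if (maxErrors < 0) {
            throw new IllegalArgumentException("maxErrors must be >= 0");
        }
        this.maxErrors = maxErrors;
        this.errors = new ArrayList<>(Math.min(maxErrors, 16));
    }

    /** True while there is still room under the cap. */
    boolean canRecord() {
        return errors.size() < maxErrors;
    }

    /** Records an error only if the cap allows it; nothing is allocated otherwise. */
    void error(char current, int pos) {
        if (canRecord()) {
            errors.add("Unexpected character '" + current + "' at position " + pos);
        } else {
            dropped++;
        }
    }

    List<String> errors() {
        return Collections.unmodifiableList(errors);
    }

    long dropped() {
        return dropped;
    }

    public static void main(String[] args) {
        BoundedErrorLog log = new BoundedErrorLog(3);
        for (int i = 0; i < 10; i++) {
            log.error('<', i);        // simulate a badly malformed document
        }
        System.out.println("kept=" + log.errors().size() + ", dropped=" + log.dropped());
    }
}
```

With a cap of 0 the sink behaves like trackErrors = false, so a crawler that never reads the errors pays nothing for them, while a small positive cap still leaves a few errors available for debugging.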