Parsing HTML web pages with Java
I need to parse/read a lot of HTML web pages (100+) for specific content (a few lines of text that are almost the same on every page).
I used Scanner objects with regular expressions, and jsoup with its HTML parser.
Both methods are slow, and with jsoup I get the following error:
java.net.SocketTimeoutException: Read timed out (this happens on multiple computers with different connections)
Is there anything better?
EDIT:
Now that I've gotten jsoup to work, I think a better question is how do I speed it up?
Comments (3)
A great skill to learn would be XPath. It would be perfect for this job! I just started learning it myself for automation testing. If you have questions, shoot me a message. I'd be glad to help you out, even though I'm not an expert.
Here's a nice link since you are interested in Java:
http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
XPath is also a good thing to know when you're not using Java, which is why I would choose that route.
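To make the XPath suggestion concrete, here is a minimal sketch using the javax.xml.xpath API that ships with the JDK. The class name XPathSketch, the sample markup, and the id "price" are made up for illustration. It also assumes the input is well-formed XML (XHTML); real-world HTML often is not, so it would normally need to be cleaned up by an HTML parser first.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Evaluates an XPath expression against a well-formed XML/XHTML string
    // and returns the string value of the first matching node.
    static String evaluate(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Raised for an invalid expression or input that is not well-formed.
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page content; the structure and the id are assumptions.
        String page = "<html><body>"
                + "<div id=\"price\">42 EUR</div>"
                + "<div id=\"other\">ignore me</div>"
                + "</body></html>";
        System.out.println(evaluate(page, "//div[@id='price']/text()"));
    }
}
```

If the pages really do share almost the same structure, one expression like this could be reused across all of them.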
Did you try lengthening the timeout on jsoup? It's only 3 seconds by default, I believe. See e.g. this.
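A rough sketch of what lengthening the timeout could look like, assuming jsoup is on the classpath. The class name FetchWithTimeout, the URL, the CSS selector, and the 10-second value are placeholders, not anything from the question. As far as I know, newer jsoup releases document a longer default (30 seconds) than the 3 seconds mentioned above, so it is worth checking the version in use.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchWithTimeout {

    // Fetches a page with an explicit timeout in milliseconds
    // instead of relying on the library default.
    static Document fetch(String url, int timeoutMillis) throws IOException {
        return Jsoup.connect(url)
                .timeout(timeoutMillis)
                .get();
    }

    // Pulls the text of all elements matching a CSS selector out of HTML
    // that is already in memory (no network access involved).
    static String extract(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).text();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and selector; replace with the real pages and markup.
        Document doc = fetch("https://example.com/", 10_000);
        System.out.println(extract(doc.outerHtml(), "h1"));
    }
}
```

A longer timeout only avoids the SocketTimeoutException on slow responses; it does not make the fetches themselves any faster.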
I would suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature library. It uses Lucene under the hood, and I find it to be a very reliable crawler.