Some help scraping a page in Java
I need to scrape a web page using Java and I've read that regex is a pretty inefficient way of doing it and one should put it into a DOM Document to navigate it.
I've tried reading the documentation but it seems too extensive and I don't know where to begin.
Could you show me how to scrape this table into an array? I can try to figure out my way from there. A snippet or example would do just fine too.
Thanks.
Comments (4)
You can try jsoup, a Java HTML parser. It is an excellent library with good sample code.
Here is a working example using JTidy and the web page you provided; it extracts all the file names from the table.
The result will be
[Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:]
as expected.
Another cool tool that you can use is Web Harvest. It basically does everything I did above, but uses an XML file to configure the extraction pipeline.
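The JTidy code from this answer is not reproduced here. As a rough illustration of the same DOM-based approach, the following is a minimal sketch that uses only the JDK's built-in DOM parser and XPath rather than JTidy. It assumes the markup is already well-formed XHTML (cleaning up real-world HTML is the step a tool like JTidy would handle). The class name `DomTableScraper`, the method `firstColumn`, and the sample markup are hypothetical and not taken from the original page.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTableScraper {

    // Hypothetical, well-formed sample markup standing in for the real page.
    static final String SAMPLE =
        "<html><body><table>"
        + "<tr><td>Integer Processing:</td><td>a.java</td></tr>"
        + "<tr><td>Image Processing:</td><td>b.java</td></tr>"
        + "</table></body></html>";

    /** Returns the trimmed text of the first cell of every table row. */
    static String[] firstColumn(String xhtml) {
        try {
            // Build an in-memory DOM tree from the markup.
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Select the first <td> of each <tr> inside any <table>.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList cells = (NodeList) xpath.evaluate(
                "//table//tr/td[1]", doc, XPathConstants.NODESET);

            String[] result = new String[cells.getLength()];
            for (int i = 0; i < cells.getLength(); i++) {
                result[i] = cells.item(i).getTextContent().trim();
            }
            return result;
        } catch (Exception e) {
            // Malformed markup ends up here; real HTML usually needs tidying first.
            throw new IllegalStateException("Could not parse the markup", e);
        }
    }

    public static void main(String[] args) {
        // Prints [Integer Processing:, Image Processing:]
        System.out.println(Arrays.toString(firstColumn(SAMPLE)));
    }
}
```

Changing the XPath expression (for example to `//table//tr/td`) would collect every cell instead of only the first column.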
Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
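For comparison, a regex-based version might look like the sketch below. It assumes simple, predictable markup: no nested tables and every `<tr>` and `<td>` explicitly closed. On less regular HTML it is likely to return wrong results. The class name `RegexTableScraper`, the method `scrape`, and the sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTableScraper {

    // Hypothetical sample markup standing in for the real page.
    static final String SAMPLE = "<table>\n"
        + "<tr><td>Integer Processing:</td><td>a.java</td></tr>\n"
        + "<tr><td class=\"name\">Image Processing:</td><td>b.java</td></tr>\n"
        + "</table>";

    // Reluctant (.*?) keeps each match inside a single row or cell.
    private static final Pattern ROW = Pattern.compile(
        "<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern CELL = Pattern.compile(
        "<td\\b[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one String[] per table row, one element per cell. */
    static String[][] scrape(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Strip any inline tags left inside the cell, then trim whitespace.
                cells.add(cell.group(1).replaceAll("<[^>]+>", "").trim());
            }
            rows.add(cells.toArray(new String[0]));
        }
        return rows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        // Prints [[Integer Processing:, a.java], [Image Processing:, b.java]]
        System.out.println(Arrays.deepToString(scrape(SAMPLE)));
    }
}
```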
If all you are doing is scraping a table into a data file, regex will be just fine, and may even be better than using a DOM document. A DOM document holds the whole page in memory, which adds up quickly for really large data tables, so you probably want a SAX parser for large documents.
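A minimal sketch of the SAX approach, using the JDK's built-in `SAXParser`, is shown below. It assumes well-formed XHTML input, and the class name `SaxTableScraper`, the method `scrape`, and the sample markup are hypothetical. The parser reports elements one at a time, so no document tree is built.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTableScraper {

    // Hypothetical, well-formed sample markup standing in for the real page.
    static final String SAMPLE =
        "<html><body><table>"
        + "<tr><td>Integer Processing:</td><td>a.java</td></tr>"
        + "<tr><td>Image Processing:</td><td>b.java</td></tr>"
        + "</table></body></html>";

    /** Streams through the markup and returns the cell texts, one list per row. */
    static List<List<String>> scrape(String xhtml) {
        List<List<String>> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private List<String> currentRow;   // non-null while inside a <tr>
            private StringBuilder cellText;    // non-null while inside a <td>

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (qName.equalsIgnoreCase("tr")) {
                    currentRow = new ArrayList<>();
                } else if (qName.equalsIgnoreCase("td") && currentRow != null) {
                    cellText = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so accumulate it.
                if (cellText != null) {
                    cellText.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equalsIgnoreCase("td") && cellText != null) {
                    currentRow.add(cellText.toString().trim());
                    cellText = null;
                } else if (qName.equalsIgnoreCase("tr") && currentRow != null) {
                    rows.add(currentRow);
                    currentRow = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the markup", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints [[Integer Processing:, a.java], [Image Processing:, b.java]]
        System.out.println(scrape(SAMPLE));
    }
}
```

This sketch still collects the rows in a list for convenience. For a very large table, the handler could write each finished row to the data file in `endElement` instead, so that only one row is held in memory at a time.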