Some help scraping a page in Java

Posted on 2024-10-10 03:41:30

I need to scrape a web page using Java and I've read that regex is a pretty inefficient way of doing it and one should put it into a DOM Document to navigate it.

I've tried reading the documentation but it seems too extensive and I don't know where to begin.

Could you show me how to scrape this table into an array? I can try figuring out my way from there. A snippet/example would do just fine too.

Thanks.

Comments (4)

鲸落 2024-10-17 03:41:30

You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.

还在原地等你 2024-10-17 03:41:30

  1. Transform the web page you are trying to scrape into an XHTML document. There are several options to do this with Java, such as JTidy and HTMLCleaner. These tools will also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API;
  2. Extract required information using XPath expressions.

Here is a working example that uses JTidy on the web page you provided to extract all file names from the table.

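import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class TableScraper {
    public static void main(String[] args) throws Exception {
        // Create a new JTidy instance and ask it to emit XHTML
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);

        // Parse the HTML page into a DOM document; the second argument
        // receives the cleaned-up markup (here it is echoed to System.out)
        URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
        Document doc;
        try (InputStream in = url.openStream()) {
            doc = tidy.parseDOM(in, System.out);
        }

        // Use XPath to select the text of every link inside a top-aligned cell
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

        // Copy the matched text nodes into a list of strings
        List<String> filenames = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            filenames.add(nodes.item(i).getNodeValue());
        }

        System.out.println(filenames);
    }
}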

The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
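
Since the question asks for the whole table as an array, the same DOM plus XPath idea can be extended to walk every row and cell. The following is only a minimal sketch: it uses the JDK's own XML parser on an inline, made-up XHTML snippet (SAMPLE) as a stand-in for the cleaned-up document that JTidy would produce, and the class and method names (TableToArray, extract) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableToArray {

    // Made-up, well-formed XHTML used only for illustration.
    static final String SAMPLE =
            "<html><body><table>"
            + "<tr><th>Name</th><th>Size</th></tr>"
            + "<tr><td><a href=\"a.txt\">a.txt</a></td><td>12</td></tr>"
            + "<tr><td><a href=\"b.txt\">b.txt</a></td><td>34</td></tr>"
            + "</table></body></html>";

    // Returns one String[] per <tr>, one element per <th> or <td>.
    static String[][] extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Select every table row in the document.
            NodeList rows = (NodeList) xpath.evaluate(
                    "//table//tr", doc, XPathConstants.NODESET);

            List<String[]> table = new ArrayList<>();
            for (int r = 0; r < rows.getLength(); r++) {
                // Relative expression: header or data cells of this row only.
                NodeList cells = (NodeList) xpath.evaluate(
                        "th | td", rows.item(r), XPathConstants.NODESET);
                String[] row = new String[cells.getLength()];
                for (int c = 0; c < row.length; c++) {
                    // getTextContent() also picks up text nested in <a>, <b>, etc.
                    row[c] = cells.item(c).getTextContent().trim();
                }
                table.add(row);
            }
            return table.toArray(new String[0][]);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.deepToString(extract(SAMPLE)));
    }
}
```

With a real page, the Document returned by tidy.parseDOM could be passed to the same two XPath expressions instead of parsing a string.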

Another cool tool that you can use is Web Harvest. It basically does everything I did above but uses an XML file to configure the extraction pipeline.

(り薆情海 2024-10-17 03:41:30

Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
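
For comparison, a regex version might look roughly like the sketch below. It assumes simple, regular markup (no nested tables, every row and cell explicitly closed), and the class name, method name and sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTableScraper {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // \b stops <tr from matching <track> and <th from matching <thead>.
    private static final Pattern ROW =
            Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
    private static final Pattern CELL =
            Pattern.compile("<t[dh]\\b[^>]*>(.*?)</t[dh]>", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns one list per row, one string per cell, with inner tags stripped.
    static List<List<String>> extract(String html) {
        List<List<String>> table = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop markup such as <a href=...> that sits inside the cell.
                cells.add(TAG.matcher(cell.group(1)).replaceAll("").trim());
            }
            table.add(cells);
        }
        return table;
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><th>Name</th><th>Size</th></tr>"
                + "<tr><td><a href=\"a.txt\">a.txt</a></td><td> 12 </td></tr>"
                + "</table>";
        System.out.println(extract(html));
    }
}
```

This only holds up while the markup stays that regular; unclosed tags or a table nested inside a cell would break the patterns.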

久夏青 2024-10-17 03:41:30

If all you are doing is scraping a table into a data file, regex will be just fine, and may even be better than using a DOM document. DOM documents use up a lot of memory (especially for really large data tables), so you probably want a SAX parser for large documents.
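
A SAX-based version could be sketched as follows, using the parser bundled with the JDK. It assumes the input is already well-formed XHTML (for example after a cleanup pass with a tool like JTidy); the class names and the sample string are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTableScraper {

    // Receives parser events one at a time; no document tree is built.
    static class TableHandler extends DefaultHandler {
        final List<List<String>> rows = new ArrayList<>();
        private List<String> currentRow;   // non-null only inside a <tr>
        private StringBuilder cellText;    // non-null only inside a <td>/<th>

        private static boolean isCell(String name) {
            return name.equalsIgnoreCase("td") || name.equalsIgnoreCase("th");
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            if (qName.equalsIgnoreCase("tr")) {
                currentRow = new ArrayList<>();
            } else if (isCell(qName)) {
                cellText = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            if (cellText != null) {
                cellText.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (isCell(qName) && currentRow != null && cellText != null) {
                currentRow.add(cellText.toString().trim());
                cellText = null;
            } else if (qName.equalsIgnoreCase("tr") && currentRow != null) {
                rows.add(currentRow);
                currentRow = null;
            }
        }
    }

    static List<List<String>> extract(String xhtml) {
        try {
            TableHandler handler = new TableHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<table>"
                + "<tr><th>Name</th><th>Size</th></tr>"
                + "<tr><td><a href=\"a.txt\">a.txt</a></td><td> 12 </td></tr>"
                + "</table>";
        System.out.println(extract(xhtml));
    }
}
```

This sketch still collects every row in a list for simplicity. For a really large table, each finished row could instead be written to the data file inside endElement, so that memory use stays flat.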
