Some help scraping a page in Java

Posted on 2024-10-10 03:41:30

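I need to scrape a web page using Java, and I've read that regular expressions are a pretty inefficient way of doing it; the page should instead be put into a DOM document and navigated from there.

I tried reading the documentation, but it seems too extensive and I don't know where to begin.

Could you show me how to scrape this table into an array? I can try to find my way from there. A snippet or example would do just fine too.

Thanks.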


鲸落 2024-10-17 03:41:30

You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.


还在原地等你 2024-10-17 03:41:30

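1. Transform the web page you are trying to scrape into an XHTML document. There are several ways to do this in Java, such as JTidy and HTMLCleaner. These tools also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API.
2. Extract the required information using XPath expressions.

Here is a working example that uses JTidy and the web page you provided to extract all file names from the table.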

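import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class TableScraper {
    public static void main(String[] args) throws Exception {
        // Create a new JTidy instance and ask it to produce XHTML
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);

        // Parse the HTML page into a DOM document, closing the stream afterwards
        URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
        Document doc;
        try (InputStream in = url.openStream()) {
            doc = tidy.parseDOM(in, null); // null: do not pretty-print the tidied markup
        }

        // Use XPath to select the text of every link inside a top-aligned cell
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        List<String> filenames = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            filenames.add(nodes.item(i).getNodeValue());
        }

        System.out.println(filenames);
    }
}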

The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
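
The example above collects only the link text. To get a whole table into an array, as the question asks, the same XPath API can be applied row by row. The following is only a minimal sketch under some assumptions: the page has already been tidied into well-formed XHTML (a small inline string stands in for JTidy's output here, so the sketch runs without JTidy or network access), and the class name TableToArray, the helper names, and the sample markup are all made up for illustration.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableToArray {
    // Hypothetical helper: one String[] per <tr>, one entry per <td> or <th>.
    public static String[][] tableToArray(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList rows = (NodeList) xpath.evaluate("//table//tr", doc, XPathConstants.NODESET);
        String[][] table = new String[rows.getLength()][];
        for (int r = 0; r < rows.getLength(); r++) {
            // Evaluate relative to the current row so only its own cells are selected
            NodeList cells = (NodeList) xpath.evaluate("td | th", rows.item(r), XPathConstants.NODESET);
            table[r] = new String[cells.getLength()];
            for (int c = 0; c < cells.getLength(); c++) {
                // getTextContent() flattens nested markup such as <a> into plain text
                table[r][c] = cells.item(c).getTextContent().trim();
            }
        }
        return table;
    }

    // Hypothetical helper: parses well-formed XHTML held in a string (stand-in for JTidy's output).
    public static Document parseXhtml(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup, not the page from the question
        String xhtml = "<html><body><table>"
                + "<tr><th>Name</th><th>Size</th></tr>"
                + "<tr><td><a href=\"a.txt\">a.txt</a></td><td>12 KB</td></tr>"
                + "<tr><td><a href=\"b.txt\">b.txt</a></td><td>3 KB</td></tr>"
                + "</table></body></html>";
        String[][] table = tableToArray(parseXhtml(xhtml));
        System.out.println(Arrays.deepToString(table));
        // prints [[Name, Size], [a.txt, 12 KB], [b.txt, 3 KB]]
    }
}
```

With a real page, the Document returned by tidy.parseDOM could be passed to tableToArray instead of the inline string. If the page contains several tables, the "//table//tr" expression would need to be narrowed to the one you want.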

Another cool tool you can use is Web Harvest. It does basically everything I did above, but uses an XML file to configure the extraction pipeline.
