HTML Android web-scraping

What is the fastest way to scrape an HTML web page in Android?

Posted 2024-09-04 13:23:29

I need to extract information from an unstructured web page in Android. The information I want is embedded in a table that doesn't have an id.

<table> 
<tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr> 
</table>

Should I use

Pattern Matching?
Use BufferedReader to extract the information?

Or is there a faster way to get that information?

情痴 2024-09-11 13:23:30

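Why not write a script that scrapes the page with cURL and Simple HTML DOM Parser, then pulls out the value you need? Those tools work with PHP, but equivalent tools exist for whatever language you need.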

べ映画 2024-09-11 13:23:30

One way of doing this is to put the HTML into a String and then manually search and parse through the String. If you know that the tags will come in a specific order, you should be able to crawl through it and find the data. This is kind of sloppy, though, so the question is whether you want it to work now or to work well.

int position = html.indexOf("<table>");  // html is the String holding the HTML source
int cell = html.indexOf("<td>", position);  // first cell ("Description")
for (int i = 0; i < 2; i++) cell = html.indexOf("<td>", cell + 1);  // skip ahead to the third cell
String field = html.substring(cell + 4, html.indexOf("</td>", cell));

Like I said, really sloppy. But if you're only doing this once and just need it to work, this might do the trick.
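If the position of the table on the page isn't fixed, the same string-search idea can be keyed on the label cell instead. The following is only a rough sketch: the class name `CellFinder` and the helper `cellAfterLabel` are made up for illustration, and it assumes plain `<td>...</td>` cells without attributes or nested markup.

```java
class CellFinder {
    /**
     * Returns the text of the cell that sits `offset` cells after the cell
     * containing exactly `label`, or null if the markup does not match.
     * Assumes plain <td>...</td> cells without attributes or nesting.
     */
    static String cellAfterLabel(String html, String label, int offset) {
        int cell = html.indexOf("<td>" + label + "</td>");
        if (cell < 0) return null;
        // Step forward one opening <td> at a time
        for (int i = 0; i < offset; i++) {
            cell = html.indexOf("<td>", cell + 1);
            if (cell < 0) return null;
        }
        int start = cell + "<td>".length();
        int end = html.indexOf("</td>", start);
        return end < 0 ? null : html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<table>\n"
                + "<tr><td>Description</td><td></td>"
                + "<td>I want this field next to the description cell</td></tr>\n"
                + "</table>";
        System.out.println(cellAfterLabel(html, "Description", 2));
    }
}
```

It returns null rather than throwing when the label or a following cell is missing, which makes a changed page layout easier to notice.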

江湖彼岸 2024-09-11 13:23:29

I don't think it makes sense to look for a fast way to extract the information here: the performance difference between the methods already suggested in other answers is negligible compared with the time it takes to download the HTML.

So assuming that by fastest you mean most convenient, readable and maintainable code, I suggest you use a DocumentBuilder to parse the relevant HTML and extract data using XPathExpressions:

Document doc = DocumentBuilderFactory.newInstance()
  .newDocumentBuilder().parse(new InputSource(new StringReader(html)));

XPathExpression xpath = XPathFactory.newInstance()
  .newXPath().compile("//td[text()=\"Description\"]/following-sibling::td[2]");

String result = (String) xpath.evaluate(doc, XPathConstants.STRING);

If you happen to retrieve invalid HTML, I recommend isolating the relevant portion (e.g. using substring(indexOf("<table")..)) and, if necessary, correcting the remaining HTML errors with String operations before parsing. If this gets too complex, however (i.e. very bad HTML), just go with the hacky pattern matching approach suggested in other answers.
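As a rough sketch of the isolate-then-parse idea, the snippet below cuts out the first table and runs the same XPath on that fragment only. The class and method names (`TableXPath`, `isolateTable`, `descriptionValue`) and the sample page are assumptions for illustration. It only works if the table itself is well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TableXPath {
    /** Cuts out the first <table>...</table> block so the XML parser only sees that fragment. */
    static String isolateTable(String html) {
        int start = html.indexOf("<table");
        int end = html.indexOf("</table>", start);
        if (start < 0 || end < 0) throw new IllegalArgumentException("no table found");
        return html.substring(start, end + "</table>".length());
    }

    /** Returns the text of the second cell after the "Description" cell ("" if there is none). */
    static String descriptionValue(String html) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(isolateTable(html))));
            return XPathFactory.newInstance().newXPath()
                    .evaluate("//td[text()='Description']/following-sibling::td[2]", doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse table", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page: the unclosed <p> and <br> would break an XML parse of the whole document
        String page = "<html><body><p>Intro<br>\n"
                + "<table>\n"
                + "<tr><td>Description</td><td></td><td>I want this field</td></tr>\n"
                + "</table>\n"
                + "</body></html>";
        System.out.println(descriptionValue(page));
    }
}
```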

Remarks

XPath has been available since API Level 8 (Android 2.2). If you develop for lower API levels, you can use DOM methods and conditionals to navigate to the node you want to extract.
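A minimal sketch of that DOM-only navigation, assuming the table has already been reduced to well-formed XML and that cells contain plain text only. The names `TableDom`, `text` and `descriptionValue` are illustrative, not part of any API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableDom {
    /** Text of a cell that holds plain text only; "" for an empty cell. */
    static String text(Node cell) {
        Node child = cell.getFirstChild();
        return child == null ? "" : child.getNodeValue();
    }

    /** Returns the text of the second <td> after the "Description" cell, or null if not found. */
    static String descriptionValue(String tableXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(tableXml)));
            NodeList cells = doc.getElementsByTagName("td");
            for (int i = 0; i < cells.getLength(); i++) {
                Node cell = cells.item(i);
                if (!"Description".equals(text(cell))) continue;
                // Walk the following siblings, counting only <td> elements
                int remaining = 2;
                for (Node n = cell.getNextSibling(); n != null; n = n.getNextSibling()) {
                    boolean isTd = n.getNodeType() == Node.ELEMENT_NODE
                            && "td".equals(n.getNodeName());
                    if (isTd && --remaining == 0) return text(n);
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse table", e);
        }
    }

    public static void main(String[] args) {
        String table = "<table><tr><td>Description</td><td></td>"
                + "<td>I want this field</td></tr></table>";
        System.out.println(descriptionValue(table));
    }
}
```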

手心的温暖 2024-09-11 13:23:29

The fastest way will be parsing the specific information yourself. You seem to know the HTML structure precisely beforehand. The BufferedReader, String and StringBuilder methods should suffice. Here's a kickoff example which displays the first paragraph of your own question:

public static void main(String... args) throws Exception {
    URL url = new URL("http://stackoverflow.com/questions/2971155");
    BufferedReader reader = null;
    StringBuilder builder = new StringBuilder();
    try {
        reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = reader.readLine()) != null;) {
            builder.append(line.trim());
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
    }

    String start = "<div class=\"post-text\"><p>";
    String end = "</p>";
    String part = builder.substring(builder.indexOf(start) + start.length());
    String question = part.substring(0, part.indexOf(end));
    System.out.println(question);
}

Parsing is, in practically all cases, faster than pattern matching. Pattern matching is easier, but there is a certain risk that it may yield unexpected results, especially when using complex regex patterns.
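For comparison, a pattern matching version for the table in the question might look like the sketch below. The class name `CellRegex` and the exact pattern are assumptions for illustration. It also shows the risk: the pattern simply stops matching as soon as a cell gains an attribute or nested markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CellRegex {
    // Matches the "Description" cell, one intermediate cell, then captures the third cell.
    // [^<]* keeps each cell from running past its own closing tag.
    private static final Pattern DESCRIPTION_ROW = Pattern.compile(
            "<td>Description</td>\\s*<td>[^<]*</td>\\s*<td>([^<]*)</td>");

    /** Returns the captured cell text, or null if the row does not match the pattern. */
    static String descriptionValue(String html) {
        Matcher m = DESCRIPTION_ROW.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<table>\n"
                + "<tr><td>Description</td><td></td>"
                + "<td>I want this field next to the description cell</td></tr>\n"
                + "</table>";
        System.out.println(descriptionValue(html));
    }
}
```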

You can also consider using a more flexible third-party HTML parser instead of writing one yourself. It will not be as fast as parsing the page yourself using what you already know about its structure, but it will be more concise and flexible. With decent HTML parsers the difference in speed is pretty negligible. I strongly recommend Jsoup for this. It supports jQuery-like CSS selectors. Extracting the first paragraph of your question would then be as easy as:

public static void main(String... args) throws Exception {
    Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155").get();
    String question = document.select("#question .post-text p").first().text();
    System.out.println(question);
}

It's unclear what web page you're talking about, so I can't give a more detailed example of how you could select the specific information from the specific page using Jsoup. If you still can't figure it out on your own using Jsoup and CSS selectors, feel free to post the URL in a comment and I'll suggest how to do it.
