How can I scrape a web page on Google App Engine for Java to find specific linked pages?

Posted on 2024-09-27 19:04:25

I need to retrieve text from a remote web site that does not provide an RSS feed.

What I know is that the data I need is always on pages linked to from the main page (http://www.example.com/) with a link that contains the text " Invoices Report ".

For example:

<a href="http://www.example.com/data/invoices/2010/10/invoices-report---tuesday-october-12.html">Invoices Report - Tuesday, October 12</a>

So, I need to find all of the links on the main page that match this pattern and then retrieve all of the text from those pages that is situated inside a tag called <div class="invoice-body">.

Are there Java tools that help with this and is there anything specifically for Google App Engine for Java that can be used to do this?

划一舟意中人 2024-10-04 19:04:25

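See the URL Fetch overview: http://code.google.com/appengine/docs/java/urlfetch/overview.html

You can use the URL Fetch service to read www.example.com/index.html line by line and use a regular expression to look for "Invoices Report". On App Engine, connections opened through java.net.URL are handled by URL Fetch, so the standard java.net API is enough: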

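import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL("http://www.example.com/index.html");
// try-with-resources closes the reader even if reading fails.
// "UTF-8" is an assumption about the page's encoding.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(url.openStream(), "UTF-8"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // Placeholder: test the line for "Invoices Report" and record the link.
        checkLineForTextAndAddLinkOrWhatever(line);
    }
}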
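
The helper checkLineForTextAndAddLinkOrWhatever is only a placeholder. As a rough sketch of what it might do, the class below applies a regular expression to each line and collects the href of every link whose text contains "Invoices Report". The class name InvoiceLineScanner and its methods are hypothetical, and the pattern assumes double-quoted href values and link text without nested tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects invoice-report links found on individual lines.
class InvoiceLineScanner {
    // Matches <a ... href="URL" ...>...Invoices Report...</a>.
    // Assumes a double-quoted href and no nested tags inside the link text.
    private static final Pattern INVOICE_LINK = Pattern.compile(
            "<a\\b[^>]*\\bhref=\"([^\"]*)\"[^>]*>[^<]*Invoices Report[^<]*</a>",
            Pattern.CASE_INSENSITIVE);

    private final List<String> links = new ArrayList<String>();

    // Call once per line read from the page.
    void checkLine(String line) {
        Matcher m = INVOICE_LINK.matcher(line);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the href value
        }
    }

    List<String> getLinks() {
        return links;
    }
}
```

Inside the read loop, the placeholder call would become scanner.checkLine(line), and scanner.getLinks() returns the matching URLs once the loop finishes.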

If a link can span multiple lines, reading line by line will miss it. In that case, read the whole page into a single string before matching.
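
When a link may span several lines, one option is to read the entire response into a string and match against that. The sketch below does this and also pulls the text out of <div class="invoice-body">, as the question asks. The class InvoicePageScraper and its method names are hypothetical. The patterns assume double-quoted attributes, a div whose only attribute is class, and no <div> nested inside the invoice body. Regular expressions are fragile on real-world HTML, so a proper HTML parser would likely be more robust.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: works on the whole page instead of single lines.
class InvoicePageScraper {
    // [^>]* and [^<]* also match newlines, and \s+ allows a line break
    // between "Invoices" and "Report", so multi-line links are found.
    private static final Pattern INVOICE_LINK = Pattern.compile(
            "<a\\b[^>]*\\bhref=\"([^\"]*)\"[^>]*>[^<]*Invoices\\s+Report[^<]*</a>",
            Pattern.CASE_INSENSITIVE);

    // Assumption: the invoice body contains no nested <div> elements.
    private static final Pattern INVOICE_BODY = Pattern.compile(
            "<div\\s+class=\"invoice-body\"\\s*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Reads everything from the reader into one string.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Returns the href of every link whose text contains "Invoices Report".
    static List<String> findInvoiceLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = INVOICE_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the text of the first invoice-body div with tags removed and
    // whitespace collapsed, or null if the page has no such div.
    // HTML entities such as &amp; are left undecoded.
    static String extractInvoiceText(String html) {
        Matcher m = INVOICE_BODY.matcher(html);
        if (!m.find()) {
            return null;
        }
        return m.group(1)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

To use it, pass an InputStreamReader over url.openStream() to readAll, call findInvoiceLinks on the result, then fetch each returned URL the same way and call extractInvoiceText on its content.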
