How to perform web scraping to find specific linked pages in Java on Google App Engine?
I need to retrieve text from a remote web site that does not provide an RSS feed.
What I know is that the data I need is always on pages linked to from the main page (http://www.example.com/) with a link that contains the text "Invoices Report".
For example:
<a href="http://www.example.com/data/invoices/2010/10/invoices-report---tuesday-october-12.html">Invoices Report - Tuesday, October 12</a>
So, I need to find all of the links on the main page that match this pattern, and then retrieve all of the text on those pages that sits inside a tag called <div class="invoice-body">.
Are there Java tools that help with this, and is there anything specific to Google App Engine for Java that can be used to do it?
Comments (1)
Check out http://code.google.com/appengine/docs/java/urlfetch/overview.html
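You can use the UrlFetch service to read www.example.com/index.html line by line, and use a regular expression to look for "Invoices Report".
If a link's markup can span multiple lines, line-by-line matching will miss it, so you might need a different kind of reader, for example one that reads the whole response into a single string before matching.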
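As a rough illustration of the approach above, here is a minimal sketch using only the standard library. It assumes that `java.net.URL` is available on App Engine (where it is served by the URL Fetch service), that the links on the main page are absolute URLs as in the question's example, and that the `invoice-body` div contains no nested `<div>` elements. The class name `InvoiceScraper` and its method names are hypothetical, chosen only for this example. The page is read into one string first, so a link that spans several lines is still matched.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not part of any App Engine API.
class InvoiceScraper {

    // Captures the href and the label of each <a> element.
    // DOTALL lets the label span several lines.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*?href=\"([^\"]+)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Captures everything up to the first </div>.
    // Assumption: the invoice-body div has no nested <div> elements.
    private static final Pattern BODY = Pattern.compile(
            "<div\\s+class=\"invoice-body\"\\s*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Reads the whole page into one string (assumes the page is UTF-8).
    static String fetch(String url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Returns the href of every link whose label contains "Invoices Report".
    static List<String> findInvoiceLinks(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // Collapse whitespace so a label broken across lines still matches.
            String label = m.group(2).replaceAll("\\s+", " ");
            if (label.contains("Invoices Report")) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }

    // Returns the text inside <div class="invoice-body">, with tags stripped,
    // or an empty string if the div is not found.
    static String extractInvoiceText(String html) {
        Matcher m = BODY.matcher(html);
        if (!m.find()) {
            return "";
        }
        return m.group(1)
                .replaceAll("<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Fetches the main page, follows each matching link, and collects the text.
    static List<String> scrape(String homeUrl) throws IOException {
        List<String> texts = new ArrayList<>();
        for (String link : findInvoiceLinks(fetch(homeUrl))) {
            texts.add(extractInvoiceText(fetch(link)));
        }
        return texts;
    }
}
```

Calling `InvoiceScraper.scrape("http://www.example.com/")` from a servlet would return one string per matching invoice page. Regular expressions are fragile against markup changes such as single-quoted attributes, extra classes on the div, or nested divs, so treat this as a starting point rather than a robust parser.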