How do I scrape text from a web page (Java)?
I'm planning to write a simple J2SE application to aggregate information from multiple web sources.
The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.
What technique/library would you advise?
Updates/Remarks
- Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
- It should be really simple.
10 Answers
Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
If you absolutely MUST scrape content, look for microformats in the markup; most blogs (especially WordPress-based blogs) have these by default. There are also libraries and parsers available for locating and extracting microformats from web pages.
Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.
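As a concrete starting point, here is a minimal sketch of reading a feed with the Rome library (Rome is my suggestion, not something named in this answer, and the feed URL is just a placeholder):

    import java.net.URL;
    import com.sun.syndication.feed.synd.SyndEntry;
    import com.sun.syndication.feed.synd.SyndFeed;
    import com.sun.syndication.io.SyndFeedInput;
    import com.sun.syndication.io.XmlReader;

    public class FeedReader {
        public static void main(String[] args) throws Exception {
            // Fetch and parse the feed in one step; Rome handles RSS and Atom alike
            SyndFeed feed = new SyndFeedInput().build(
                    new XmlReader(new URL("https://stackoverflow.com/feeds")));
            for (Object o : feed.getEntries()) {
                SyndEntry entry = (SyndEntry) o;
                System.out.println(entry.getTitle() + " -> " + entry.getLink());
            }
        }
    }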
Check this out http://www.alchemyapi.com/api/demo.html
They return pretty good results and have an SDK for most platforms. Not only text extraction, but they also do keyword analysis, etc.
You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#openStream() to parse the content of HTML pages hosted on the Internet.
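A rough sketch of that combination; the target URL and the choice of <h3> as the interesting tag are placeholder assumptions:

    import java.net.URL;
    import org.htmlparser.Parser;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.util.NodeList;

    public class QuestionTitles {
        public static void main(String[] args) throws Exception {
            // The Parser reads from the connection's input stream
            Parser parser = new Parser(
                    new URL("https://stackoverflow.com/").openConnection());
            // Keep only the tags that wrap the data you care about
            NodeList headings = parser.extractAllNodesThatMatch(new TagNameFilter("h3"));
            for (int i = 0; i < headings.size(); i++) {
                System.out.println(headings.elementAt(i).toPlainTextString().trim());
            }
        }
    }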
You could look at how HttpUnit does it. They use a couple of decent HTML parsers, one of which is NekoHTML.
As for getting the data, you can use what's built into the JDK (HttpURLConnection), or use Apache's HttpClient:
http://hc.apache.org/httpclient-3.x/
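Fetching the raw HTML with the JDK's HttpURLConnection could look roughly like this (the User-Agent string is a made-up example; set whatever identifies your aggregator):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PageFetcher {
        public static String fetch(String address) throws Exception {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(address).openConnection();
            conn.setRequestProperty("User-Agent", "SimpleAggregator/1.0"); // hypothetical name
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            in.close();
            return page.toString(); // raw HTML, ready to hand to NekoHTML or another parser
        }
    }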
If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code (they output HTML, which is, of course, not required).
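As a rough sketch of that approach, assuming TagSoup for the HTML-to-XML step and the JDK's built-in XPath standing in for a full XQuery engine (both substitutions are mine, not the article's exact code):

    import java.net.URL;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class XmlScraper {
        public static void main(String[] args) throws Exception {
            // TagSoup is a SAX parser that turns messy HTML into well-formed XML events
            XMLReader tagsoup = new org.ccil.cowan.tagsoup.Parser();
            DOMResult dom = new DOMResult();
            TransformerFactory.newInstance().newTransformer().transform(
                    new SAXSource(tagsoup, new InputSource(
                            new URL("https://stackoverflow.com/").openStream())),
                    dom);
            // TagSoup puts elements in the XHTML namespace, hence local-name()
            NodeList headings = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//*[local-name()='h3']", dom.getNode(), XPathConstants.NODESET);
            for (int i = 0; i < headings.getLength(); i++) {
                System.out.println(headings.item(i).getTextContent().trim());
            }
        }
    }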
In short, you may either parse the whole page and pick out the things you need (for speed I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the tags... You can also convert it all into a DOM, but that's going to be expensive, especially if you're shooting for decent throughput.
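The regexp route can be as crude as the sketch below; it is fine for quick-and-dirty extraction but fragile around scripts, comments, and malformed markup:

    public class TagStripper {
        // Crude plain-text extraction: drop script/style bodies, then all tags,
        // then collapse the leftover whitespace.
        public static String strip(String html) {
            return html
                    .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                    .replaceAll("<[^>]+>", " ")
                    .replaceAll("\\s+", " ")
                    .trim();
        }
    }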
You seem to want to screen scrape. You would probably want to write a framework with an adapter/plugin per source site (as each site's format will differ) that lets you parse the HTML source and extract the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
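The plugin contract could be as small as the following; all names here are hypothetical, just to illustrate the shape of such a framework:

    import java.util.List;

    // One implementation per source site, registered with the aggregator.
    public interface SiteAdapter {
        /** A stable identifier for the source, e.g. "stackoverflow". */
        String sourceName();

        /** Turn one page of raw HTML into the items this site exposes. */
        List<String> extractItems(String html);
    }

The aggregator core then only knows this interface and iterates over whatever adapters are registered.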
If you want to do it the old-fashioned way, you need to connect with a socket to the web server's port and then send a raw HTTP GET request. Then use Socket#getInputStream, read the data using a BufferedReader, and parse the data using whatever you like.
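A minimal sketch of that, assuming plain HTTP on port 80 with a hand-written GET request (host and path are placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;

    public class RawHttpGet {
        public static void main(String[] args) throws Exception {
            Socket socket = new Socket("stackoverflow.com", 80);
            PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
            // A minimal HTTP/1.0 request; the blank line terminates the headers
            out.print("GET / HTTP/1.0\r\n");
            out.print("Host: stackoverflow.com\r\n");
            out.print("\r\n");
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // status line and headers first, then the HTML
            }
            socket.close();
        }
    }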
You can use NekoHTML to parse your HTML document. You will get a DOM document. You may use XPath to retrieve the data you need.
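Roughly like this; note that NekoHTML upper-cases HTML element names by default, and the URL and XPath expression are placeholders:

    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class NekoExample {
        public static void main(String[] args) throws Exception {
            DOMParser parser = new DOMParser(); // tolerant of real-world broken HTML
            parser.parse("https://stackoverflow.com/");
            Document doc = parser.getDocument();
            // Element names are upper-cased by NekoHTML's default configuration
            NodeList links = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//A/@href", doc, XPathConstants.NODESET);
            for (int i = 0; i < links.getLength(); i++) {
                System.out.println(links.item(i).getNodeValue());
            }
        }
    }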
If your "web sources" are regular websites using HTML (as opposed to structured XML format like RSS) I would suggest to take a look at HTMLUnit.
This library, while targeted for testing, is a really general purpose "Java browser". It is built on a Apache httpclient, Nekohtml parser and Rhino for Javascript support. It provides a really nice API to the web page and allows to traverse website easily.
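A small sketch of that API (URL and XPath are placeholders; method names follow the HtmlUnit 2.x API):

    import java.util.List;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HtmlUnitExample {
        public static void main(String[] args) throws Exception {
            WebClient client = new WebClient(); // a headless "browser"
            HtmlPage page = (HtmlPage) client.getPage("https://stackoverflow.com/");
            // Query the parsed page model with XPath
            List<?> anchors = page.getByXPath("//a");
            for (Object o : anchors) {
                HtmlAnchor a = (HtmlAnchor) o;
                System.out.println(a.asText() + " -> " + a.getHrefAttribute());
            }
        }
    }

Since HtmlUnit runs the page's JavaScript through Rhino, it can also handle pages that build part of their content dynamically.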