How do I grab text from a web page (Java)?

Posted 2024-07-05 10:32:18

I'm planning to write a simple J2SE application to aggregate information from multiple web sources.

The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.

What technique/library would you advise?

Updates/Remarks

  • Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
  • It should be really simple.


Comments (10)

情定在深秋 2024-07-12 10:32:18

Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.

If you absolutely MUST scrape content, look for microformats in the markup, most blogs (especially WordPress based blogs) have this by default. There are also libraries and parsers available for locating and extracting microformats from webpages.

Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.
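The RSS route needs no third-party library at all: a feed is plain XML, so the JDK's own DOM parser can pull item titles straight out of it. A minimal sketch (the feed string below is a made-up example, not a real source):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RssTitles {
    // Extract the <title> of every <item> in an RSS 2.0 document.
    public static List<String> itemTitles(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));
        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            // Look for <title> inside each <item>, skipping the channel title.
            NodeList t = ((Element) items.item(i)).getElementsByTagName("title");
            if (t.getLength() > 0) {
                titles.add(t.item(0).getTextContent());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<rss version=\"2.0\"><channel>"
                + "<title>Example</title>"
                + "<item><title>First post</title></item>"
                + "<item><title>Second post</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed)); // [First post, Second post]
    }
}
```

For real feeds you would feed the parser the stream from the feed URL instead of a string.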

挽容 2024-07-12 10:32:18

Check this out http://www.alchemyapi.com/api/demo.html

They return pretty good results and have an SDK for most platforms. Not only text extraction but they do keywords analysis etc.

情徒 2024-07-12 10:32:18

You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#getInputStream() to parse the content of HTML pages hosted on the Internet.
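The fetching half of that combination is plain JDK. A sketch of reading everything behind a URL via URLConnection#getInputStream(); the demo uses a file: URL so it runs offline, but an http: URL works identically (the resulting string would then go to HTMLParser or whatever parser you choose):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class UrlFetch {
    // Read the full content behind any URL (http:, file:, ...) as a String.
    public static String fetch(URL url) throws Exception {
        URLConnection conn = url.openConnection();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // A temp file stands in for a remote page so the example runs offline.
        Path tmp = Files.createTempFile("page", ".html");
        Files.write(tmp, "<html><body>hello</body></html>"
                .getBytes(StandardCharsets.UTF_8));
        System.out.println(fetch(tmp.toUri().toURL()));
    }
}
```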

莫相离 2024-07-12 10:32:18

You could look at how httpunit does it. They use a couple of decent HTML parsers, one of which is NekoHTML.
As far as getting data, you can use what's built into the JDK (HttpURLConnection), or use Apache's

http://hc.apache.org/httpclient-3.x/

不气馁 2024-07-12 10:32:18

If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):

<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td> 
           <td>{data($row/td[2])}</td>              
           <td>{$row/td[3]//img}</td> </tr>
}
</table>

帅气尐潴 2024-07-12 10:32:18

In short, you may either parse the whole page and pick out the things you need (for speed I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the HTML... you can also convert it all into a DOM, but that's going to be expensive, especially if you're shooting for decent throughput.
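The regexp approach can be sketched in a few lines. It is deliberately crude (regular expressions cannot handle arbitrary HTML), but for trimming the tags out of reasonably tame markup it often suffices:

```java
public class TagStripper {
    // Crude text extraction: drop <script>/<style> blocks, then all remaining
    // tags. Fragile by design; fine for a quick one-off, not for production.
    public static String stripTags(String html) {
        // Remove script/style elements together with their contents.
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        // Replace every remaining tag with a space.
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        // Collapse the leftover whitespace.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><style>p{color:red}</style></head>"
                + "<body><p>Hello, <b>world</b>!</p></body></html>";
        System.out.println(stripTags(page)); // Hello, world !
    }
}
```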

素罗衫 2024-07-12 10:32:18

You seem to want to screen scrape. You would probably want to write a framework with an adapter/plugin per source site (as each site's format will differ), through which you can parse the HTML source and extract the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
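One way to shape that framework is a small adapter interface with one implementation per site. The QuestionListAdapter below is hypothetical, assuming a site that marks its questions up as <h3 class="q"> elements; a real adapter would match that site's actual markup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScraperFramework {
    // One adapter per source site, since every site's markup differs.
    interface SiteAdapter {
        List<String> extractItems(String html);
    }

    // Hypothetical adapter for a site listing questions as <h3 class="q">...</h3>.
    static class QuestionListAdapter implements SiteAdapter {
        private static final Pattern Q = Pattern.compile("<h3 class=\"q\">(.*?)</h3>");

        public List<String> extractItems(String html) {
            List<String> out = new ArrayList<>();
            Matcher m = Q.matcher(html);
            while (m.find()) {
                out.add(m.group(1));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        SiteAdapter adapter = new QuestionListAdapter();
        String html = "<div><h3 class=\"q\">How to parse HTML?</h3>"
                + "<h3 class=\"q\">Why is my regex slow?</h3></div>";
        System.out.println(adapter.extractItems(html));
        // [How to parse HTML?, Why is my regex slow?]
    }
}
```

The aggregator core then only deals with the SiteAdapter interface and never with any site's markup directly.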

稍尽春風 2024-07-12 10:32:18

If you want to do it the old-fashioned way, you need to connect with a socket to the webserver's port, and then send the following data:

GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>

then use Socket#getInputStream, read the data using a BufferedReader, and parse the data using whatever you like.
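In Java that old-fashioned exchange looks roughly like the following sketch. It spins up a throwaway local server so the example runs offline; against a real host you would connect to port 80 instead:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class RawHttpGet {
    // Issue a bare-bones HTTP/1.0 GET over a plain socket; return the raw response.
    public static String get(String host, int port, String path) throws Exception {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            // HTTP needs CRLF line endings and a blank line to end the headers.
            out.print("GET " + path + " HTTP/1.0\r\n");
            out.print("Host: " + host + "\r\n");
            out.print("\r\n");
            out.flush();
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        // A tiny single-shot server stands in for a real website.
        try (ServerSocket server = new ServerSocket(0)) {
            Thread t = new Thread(() -> {
                try (Socket c = server.accept()) {
                    BufferedReader r = new BufferedReader(
                            new InputStreamReader(c.getInputStream()));
                    String l; // consume the request up to the blank line
                    while ((l = r.readLine()) != null && !l.isEmpty()) { }
                    c.getOutputStream().write(
                            "HTTP/1.0 200 OK\r\n\r\n<html>hi</html>".getBytes());
                } catch (Exception ignored) { }
            });
            t.start();
            System.out.println(get("localhost", server.getLocalPort(), "/file.html"));
            t.join();
        }
    }
}
```

Note this handles neither redirects, chunked encoding, nor HTTPS; HttpURLConnection or Apache HttpClient do all of that for you.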

謌踐踏愛綪 2024-07-12 10:32:18

You can use NekoHTML to parse your HTML document. You will get a DOM document. You may use XPath to retrieve the data you need.
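The XPath half of that pipeline is pure JDK (javax.xml.xpath). The sketch below uses the JDK's XML parser in place of NekoHTML, so it only accepts well-formed markup; with NekoHTML's DOMParser you would obtain the Document from real-world tag soup instead, and the XPath step would be identical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtract {
    // Evaluate an XPath expression against a document and return its string value.
    public static String evaluate(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"main\"><h1>Title</h1>"
                + "<p>Body text</p></div></body></html>";
        System.out.println(evaluate(page, "//div[@id='main']/h1")); // Title
    }
}
```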

居里长安 2024-07-12 10:32:18

If your "web sources" are regular websites using HTML (as opposed to a structured XML format like RSS), I would suggest taking a look at HTMLUnit.

This library, while targeted at testing, is a really general-purpose "Java browser". It is built on Apache HttpClient, the NekoHTML parser, and Rhino for JavaScript support. It provides a really nice API to the web page and allows you to traverse websites easily.
