Can I use Hpricot to find the main article text of any/most websites?

Posted 2024-09-10 00:48:57

Comments (2)

夏有森光若流苏 2024-09-17 00:48:57

You certainly can use Hpricot to scrape content from any given HTML page.

Here is a step-by-step tutorial: http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/

Hpricot is ideal for parsing a file with a known HTML structure using XPath expressions.

However, you will struggle to write anything generic that can read any web page and identify the main article text. I think you'd need at least some sort of rudimentary AI for that, which is well outside the scope of what Hpricot can do.

What you could do instead is write a separate scraper for each of the common HTML formats you want to handle (perhaps WordPress, Tumblr, Blogger, etc.), if your sources fall into such a set.

I'm also sure you could come up with some heuristics for attempting it. Judging by how Readability behaves, I'd guess that is what it does, and it seems to work far from perfectly.

First stab at a heuristic:

1) Identify a fixed set of tags which could be considered part of "the main block of text" (e.g. <p> <br> <img> etc).

2) Scrape page and find the largest block of text on the page that only contains tags in (1).

3) Return text from (2) with tags from (1) removed.

Looking at the results of Readability, I reckon this heuristic would work about as well.
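
To make the three steps above concrete, here is a minimal sketch in plain Ruby. It deliberately avoids Hpricot and uses a crude regex tokenizer so that it runs without any gems; with a real parser you would walk the DOM instead. The `main_text` method name, the `ALLOWED_TAGS` list, and the decision to skip `script`/`style` contents are all my own assumptions for illustration, not part of any library.

```ruby
# Step 1: tags treated as part of a running block of article text.
# This set is an assumption; tune it for the pages you scrape.
ALLOWED_TAGS = %w[p br img a b i em strong span].freeze

# Elements whose contents are never article text.
SKIPPED_ELEMENTS = %w[script style].freeze

# Returns the longest run of text that is interrupted only by ALLOWED_TAGS.
# Limitation: the tokenizer is regex-based, so attributes containing ">"
# and other malformed markup will confuse it.
def main_text(html)
  best = ""
  current = String.new
  skipping = nil # name of the script/style element we are inside, if any

  # Close the current run and keep it if it is the longest seen so far.
  flush = lambda do
    candidate = current.gsub(/\s+/, " ").strip
    best = candidate if candidate.length > best.length
    current = String.new
  end

  # Step 2: split the markup into comments, tags, and text runs.
  html.scan(/<!--.*?-->|<[^>]+>|[^<]+/m) do |token|
    next if token.start_with?("<!--")

    unless token.start_with?("<")
      current << token if skipping.nil?
      next
    end

    name = token[/\A<\/?\s*([a-zA-Z][a-zA-Z0-9]*)/, 1].to_s.downcase
    closing = token.start_with?("</")

    if skipping
      skipping = nil if closing && name == skipping
    elsif SKIPPED_ELEMENTS.include?(name)
      flush.call
      skipping = name unless closing
    elsif ALLOWED_TAGS.include?(name)
      # Step 3: drop the tag itself, leaving a space so words do not fuse.
      current << " "
    else
      # Any other tag ends the current block.
      flush.call
    end
  end

  flush.call
  best
end
```

You would call it as `main_text(File.read("page.html"))`. On a page with a short navigation bar and a longer post body, it should return the post body's text, because that is the longest run broken only by the allowed tags.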

无需解释 2024-09-17 00:48:57

Actually, Readability is an open-source project hosted at: http://code.google.com/p/arc90labs-readability/

After reading the main file, I don't see any reason why you couldn't reimplement it in Ruby. This is the main file:
http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js

I suggest you have a look at the grabArticle function to see which metrics they use and how they do it.
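
I haven't ported grabArticle here, but to give an idea of the kind of metric such a function might combine, here is a rough Ruby sketch that scores a candidate block from its text and its class/id string. Everything in it (the `score_block` and `best_block` names, the hint patterns, and the weights) is hypothetical and chosen for illustration; check readability.js for what it really does.

```ruby
# Hypothetical hint patterns and weights; not taken from readability.js.
POSITIVE_HINT = /article|post|entry|content|body/i
NEGATIVE_HINT = /comment|footer|sidebar|nav|meta/i

# Score one candidate block given its plain text and its class/id string.
def score_block(text, class_and_id = "")
  score = 0
  score += 25 if class_and_id.match?(POSITIVE_HINT)
  score -= 25 if class_and_id.match?(NEGATIVE_HINT)
  score += text.count(",")            # prose tends to contain commas
  score += [text.length / 100, 3].min # reward longer text, capped at 3
  score
end

# Pick the highest-scoring block from an array of { text:, hint: } hashes.
def best_block(blocks)
  blocks.max_by { |block| score_block(block[:text], block[:hint]) }
end
```

The blocks themselves would come from whichever parser you choose, for example one hash per `div` with its text and its combined class and id attributes.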

As for which library you should use to parse and process the DOM, you have multiple choices:
Nokogiri, libxml-ruby, Hpricot, ...

All of these have pretty decent documentation too.
