You certainly can use Hpricot to scrape content from any given HTML page.
Here is a step-by-step tutorial: http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
Hpricot is ideal for parsing a file with a known HTML structure using XPath expressions.
However, you will struggle to write anything generic that can read any web page and identify the main article text. I think you'd need at least some sort of rudimentary AI for that, which is well outside the scope of what Hpricot can do.
What you could do is write scraping code for each of the common HTML formats you want to handle (perhaps WordPress, Tumblr, Blogger, etc.), if there is such a set.
I am also sure you could come up with some heuristics for attempting it. Judging by how well Readability works, I guess that is what they do; it seems to work far from perfectly.
First stab at a heuristic:
1) Identify a fixed set of tags which could be considered to be part of "the main block of text" (e.g. <p>, <br>, <img> etc).
2) Scrape the page and find the largest block of text on the page that only contains tags in (1).
3) Return text from (2) with tags from (1) removed.
Looking at the results of Readability, I reckon this heuristic would work about as well.
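To make the three steps concrete, here is a rough sketch in plain Ruby. It is only an illustration under some loud assumptions: the tag list in ALLOWED_TAGS is my own guess at a reasonable set, the name main_text is made up, and it uses regular expressions instead of a real parser, so it will trip over malformed markup or a ">" inside an attribute value. With Hpricot or Nokogiri you would walk the parsed tree instead.

```ruby
# Step 1: tags allowed inside "the main block of text" (an illustrative choice).
ALLOWED_TAGS = %w[p br img a b i em strong].freeze

# Matches an opening or closing tag and captures its name.
TAG = /<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*>/

# Marker inserted wherever one block of text ends and another begins.
BOUNDARY = "\x00"

def main_text(html)
  # Cut out comments and script/style bodies so they are never counted as text.
  cleaned = html.gsub(/<!--.*?-->/m, BOUNDARY)
                .gsub(/<(script|style)\b[^>]*>.*?<\/\1\s*>/mi, BOUNDARY)

  # Step 2: an allowed tag becomes a space, any other tag becomes a boundary.
  marked = cleaned.gsub(TAG) { ALLOWED_TAGS.include?($1.downcase) ? " " : BOUNDARY }

  # Step 3: the allowed tags are already gone, so tidy the whitespace in each
  # block and return the longest one (empty string if there is no text at all).
  marked.split(BOUNDARY)
        .map { |block| block.gsub(/\s+/, " ").strip }
        .max_by(&:length)
        .to_s
end
```

Calling main_text on a page whose navigation, article and footer each sit in their own <div> should hand back the article text, since that is the longest run not interrupted by a tag outside the allowed set. Replacing inline tags with a space is a simplification; it will split a word that has a tag in the middle of it.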
Actually, Readability is an open-source project hosted at: http://code.google.com/p/arc90labs-readability/
After reading the main file, I don't see any reason why you couldn't reimplement it in Ruby. This is the main file:
http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js
I suggest you have a look at the grabArticle function to see which metrics they use and how they do it.
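To give an idea of what a port of that kind of scoring could look like in Ruby, here is a small sketch. Be warned that the signals it uses (comma count, text length, share of text inside links) and the name score_fragment are my own assumptions for illustration. It is not a transcription of readability.js, so check grabArticle for the metrics it really uses.

```ruby
# Hypothetical score for one candidate HTML fragment: more prose-like text
# scores higher, text that is mostly links (menus, footers) scores lower.
def score_fragment(fragment)
  strip_tags = ->(s) { s.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip }

  text = strip_tags.call(fragment)
  return 0.0 if text.empty?

  # Text found inside <a>...</a>, used to estimate how "link heavy" this is.
  link_text = strip_tags.call(fragment.scan(/<a\b[^>]*>(.*?)<\/a>/mi).flatten.join(" "))
  link_density = [link_text.length.to_f / text.length, 1.0].min

  # One point per comma, plus up to three points for every 100 characters.
  base = text.count(",") + [text.length / 100, 3].min

  base * (1.0 - link_density)
end
```

You would run this over each candidate container and keep the one with the highest score. A navigation bar made only of links ends up at zero, while a paragraph of ordinary sentences gets a positive score.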
As for which library you should use to parse and process the DOM, you have multiple choices:
Nokogiri, libxml-ruby, Hpricot, ...
All of these have pretty decent documentation too.