What algorithm does Readability use to extract text from a URL?
For a while, I've been trying to find a way of intelligently extracting the "relevant" text from a URL by eliminating the text related to ads and all the other clutter. After several months of research, I gave it up as a problem that cannot be solved accurately. (I've tried different approaches, but none were reliable.)
A week back, I stumbled across Readability - a plugin that converts any URL into readable text. It looks pretty accurate to me. My guess is that they somehow have an algorithm that's smart enough to extract the relevant text.
Does anyone know how they do it? Or how I could do it reliably?
Comments (4)
Readability mainly consists of heuristics that "just somehow work well" in many cases.
I have written some research papers on this topic, and I would like to explain why it is easy to come up with a solution that works well, and when it gets hard to get close to 100% accuracy.
There seems to be a linguistic law underlying human language that is also (but not exclusively) manifest in Web page content, and which already quite clearly separates two types of text (full text vs. non-full text or, roughly, "main content" vs. "boilerplate").
To get the main content from HTML, it is in many cases sufficient to keep only those HTML text elements (i.e. blocks of text that are not interrupted by markup) which have more than about 10 words. It appears that humans choose between two types of text ("short" and "long", measured by the number of words they emit) depending on their motivation for writing. I would call these the "navigational" and "informational" motivations.
If an author wants you to quickly get what is written, he/she uses "navigational" text, i.e. few words (like "STOP", "Read this", "Click here"). This is the most prominent type of text in navigational elements (menus, etc.).
If an author wants you to deeply understand what he/she means, he/she uses many words. This way, ambiguity is removed at the cost of an increase in redundancy. Article-like content usually falls into this class as it has more than only a few words.
While this separation seems to work in a plethora of cases, it gets tricky with headlines, short sentences, disclaimers, copyright footers, etc.
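As a minimal sketch of that word-count heuristic (illustrative code only, not Readability's or boilerpipe's actual implementation), one could split a document into text blocks at tag boundaries and keep only the blocks with more than roughly ten words:

```typescript
// Sketch of the "keep blocks with more than ~10 words" idea described above.
// Not the real Readability/boilerpipe code; the regexes are deliberately simplistic.
function extractLongBlocks(html: string, minWords = 10): string[] {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ") // drop script/style bodies
    .split(/<[^>]+>/)                                // text between tags = one block
    .map((block) => block.replace(/\s+/g, " ").trim())
    .filter((block) => block.split(" ").length > minWords);
}

// "Click here" (navigational) is dropped; the long sentence (informational) survives.
const sample = `<div><a href="#">Click here</a><p>This paragraph has comfortably more
than ten words, so the simple heuristic keeps it as main content.</p></div>`;
console.log(extractLongBlocks(sample));
```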
There are more sophisticated strategies and features that help separate main content from boilerplate: for example, the link density (the number of linked words in a block versus the total number of words in the block), the features of the previous/next blocks, the frequency of a particular block's text across the "whole" Web, the DOM structure of the HTML document, the visual rendering of the page, etc.
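The link-density feature, for instance, could be computed along these lines (a sketch with made-up names, assuming each block carries its full text and the concatenated text of its links):

```typescript
// Illustrative link-density computation: the share of a block's words that sit
// inside links. Values near 1.0 suggest boilerplate (menus, footers); low values
// suggest main content. Names and types are examples, not any library's API.
interface Block {
  text: string;        // full visible text of the block
  anchorText: string;  // concatenated text of all <a> elements in the block
}

function words(s: string): number {
  const t = s.trim();
  return t === "" ? 0 : t.split(/\s+/).length;
}

function linkDensity(block: Block): number {
  const total = words(block.text);
  return total === 0 ? 0 : words(block.anchorText) / total;
}

console.log(linkDensity({ text: "Home News Sports Finance", anchorText: "Home News Sports Finance" })); // 1.0
console.log(linkDensity({ text: "The quick brown fox jumps over the lazy dog today.", anchorText: "fox" })); // 0.1
```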
You can read my latest article "Boilerplate Detection using Shallow Text Features" to get some insight from a theoretical perspective. You may also watch the video of my paper presentation on VideoLectures.net.
"Readability" uses some of these features. If you look carefully at the SVN changelog, you will see that the number of strategies varied over time, and so did Readability's extraction quality. For example, the introduction of link density in December 2009 helped improve it considerably.
In my opinion, it therefore makes no sense to say "Readability does it like that" without mentioning the exact version number.
I have published an open-source HTML content extraction library called boilerpipe, which provides several different extraction strategies. Depending on the use case, one or another extractor works better. You can try these extractors on pages of your choice using the companion boilerpipe-web app on Google AppEngine.
To let the numbers speak, see the "Benchmarks" page on the boilerpipe wiki, which compares some extraction strategies, including boilerpipe, Readability, and Apple Safari.
I should mention that these algorithms assume that the main content is actually full text. There are cases where the "main content" is something else, e.g. an image, a table, a video etc. The algorithms won't work well for such cases.
Readability is a JavaScript bookmarklet, meaning it's client-side code that manipulates the DOM. Look at the JavaScript and you should be able to see what's going on.
Readability's workflow and code: if you follow the JS and CSS files that the bookmarklet pulls in, you'll get the whole picture:
http://lab.arc90.com/experiments/readability/js/readability.js (this is pretty well commented, interesting reading)
http://lab.arc90.com/experiments/readability/css/readability.css
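If it helps to picture it, a bookmarklet of this kind is essentially a javascript: URL that injects the library's script and stylesheet into the current page. The sketch below only illustrates that pattern and is not the actual arc90 bookmarklet code:

```typescript
// Illustrative DOM injection, in the spirit of what such a bookmarklet does
// (not the actual arc90 bookmarklet): load the library's JS and CSS so the
// loaded script can rewrite the page with the extracted article.
function injectReadability(baseUrl: string): void {
  const script = document.createElement("script");
  script.src = baseUrl + "/js/readability.js";
  document.head.appendChild(script);

  const style = document.createElement("link");
  style.rel = "stylesheet";
  style.href = baseUrl + "/css/readability.css";
  document.head.appendChild(style);
}

// Wrapped as javascript:(function(){ ... })(); this becomes a bookmarklet.
injectReadability("http://lab.arc90.com/experiments/readability");
```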
There's no 100% reliable way to do this, of course. You can have a look at the Readability source code here.
Basically, what they're doing is trying to identify positive and negative blocks of text. Positive identifiers (i.e. div IDs) would be something like:
Negative identifiers would be:
And then they have unlikely and maybe candidates.
What they do is determine what is most likely to be the main content of the site; see line 678 in the Readability source. This is done mostly by analyzing the length of paragraphs, their identifiers (see above), and the DOM tree (e.g. whether the paragraph is a last child node), then stripping out everything unnecessary, removing formatting, etc. The code is 1792 lines. It does seem like a non-trivial problem, so maybe you can get your inspiration from there.
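To make the positive/negative identifier idea concrete, here is a toy scorer in the same spirit; the patterns and weights below are invented for illustration and are not the ones in readability.js:

```typescript
// Toy scoring of candidate containers by their id/class names, in the spirit of
// Readability's positive/negative identifiers. Patterns and weights are illustrative.
const POSITIVE = /article|body|content|entry|main|post|text/i;
const NEGATIVE = /comment|footer|menu|nav|sidebar|sponsor|advert/i;

function idClassScore(idAndClass: string): number {
  let score = 0;
  if (POSITIVE.test(idAndClass)) score += 25; // hints at main content
  if (NEGATIVE.test(idAndClass)) score -= 25; // hints at boilerplate
  return score;
}

// A <div id="main-content"> gets boosted, a <div class="sidebar advert"> gets penalized.
console.log(idClassScore("main-content"));   // 25
console.log(idClassScore("sidebar advert")); // -25
```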
Interesting. I have developed a similar PHP script. It basically scans articles and attaches parts of speech to all text (Brill Tagger). Then, grammatically invalid sentences are instantly eliminated. Sudden shifts in pronouns or past tense indicate the article is over, or hasn't started yet. Repeated phrases are searched for and eliminated, like "Yahoo news sports finance" appearing ten times on the page. You can also get statistics on tone with a plethora of word banks relating to various emotions. Sudden changes in tone, from active/negative/financial to passive/positive/political, indicate a boundary. It's endless, really, however deep you want to dig.
The major issues are links, embedded anomalies, scripting styles and updates.
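As a rough aside, the repeated-phrase step mentioned above could be sketched as follows (illustrative only; the script discussed is PHP and also relies on POS tagging, tense shifts, and tone statistics, none of which are shown here):

```typescript
// Blocks whose text repeats many times across a page ("Yahoo news sports finance"
// appearing ten times) are treated as boilerplate; unique sentences are kept.
function dropRepeatedBlocks(blocks: string[], maxRepeats = 3): string[] {
  const counts = new Map<string, number>();
  for (const block of blocks) {
    const key = block.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return blocks.filter((block) => (counts.get(block.trim().toLowerCase()) ?? 0) <= maxRepeats);
}
```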