Any ideas on how to identify the main content of a page?
If you had to identify the main text of a page (e.g. the post's content on a blog page), what would you do? What do you think is the simplest way to do it?
- Get the page content with cURL
- Maybe use a DOM parser to identify the elements of the page
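The two steps above could be sketched in Python roughly as follows. This is only an illustration using the standard library instead of cURL; `fetch_html`, `ElementLister`, `list_elements` and the User-Agent string are hypothetical names chosen for this sketch.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and decode it (the role cURL plays in the list above)."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "main-content-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ElementLister(HTMLParser):
    """Records every start tag with its attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def list_elements(html):
    """Return a list of (tag, attributes) pairs found in the HTML string."""
    lister = ElementLister()
    lister.feed(html)
    lister.close()
    return lister.elements
```

Something like `list_elements(fetch_html(url))` would then give a flat view of the page's elements to build heuristics on.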
Comments (7)
That's a pretty hard task, but I would start by counting spaces inside DOM elements. A telltale sign of human-readable content is spaces and periods. Most articles seem to wrap their content in paragraph tags, so you could look at all p tags with n spaces and at least one punctuation mark.
You could also use the number of paragraph tags grouped inside an element. So if a div has N paragraph children, it could very well be the content you want to extract.
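A minimal Python sketch of both heuristics, assuming reasonably well-formed HTML with explicit closing tags. The thresholds (`min_spaces`, `min_paragraphs`) and all names here are made up for illustration and would need tuning on real pages.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be put on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> together with its parent element."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements, as (tag, serial number)
        self.serial = 0
        self.paragraphs = []  # (parent element, paragraph text)
        self.buffer = None    # text pieces of the <p> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "p" and self.buffer is None:
            self.buffer = []
        self.serial += 1
        self.stack.append((tag, self.serial))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                if tag == "p" and self.buffer is not None:
                    parent = self.stack[i - 1] if i > 0 else ("(root)", 0)
                    text = " ".join("".join(self.buffer).split())
                    self.paragraphs.append((parent, text))
                    self.buffer = None
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)


def looks_like_prose(text, min_spaces=10):
    """At least `min_spaces` spaces and at least one sentence punctuation mark."""
    return text.count(" ") >= min_spaces and any(ch in ".!?" for ch in text)


def best_paragraph_container(html, min_spaces=10, min_paragraphs=2):
    """Return the prose paragraphs of the parent that holds the most of them."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    groups = defaultdict(list)
    for parent, text in collector.paragraphs:
        if looks_like_prose(text, min_spaces):
            groups[parent].append(text)
    candidates = [texts for texts in groups.values() if len(texts) >= min_paragraphs]
    return max(candidates, key=len) if candidates else []
```

Short paragraphs such as navigation labels fail the space count, and containers with too few prose paragraphs are ignored.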
There are some frameworks that can achieve this; one of them is http://code.google.com/p/boilerpipe/, which uses some statistics.
Some features that can help detect the HTML block containing the main content:
You might consider:
It seems like the best answer is "it depends". As in, it depends on how the site in question is marked up.
- An element ID'd as "content" or "main."
- An <article> element, if it's a page with only one "story" to tell.
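As a rough Python sketch of the markup-based idea above (look for an <article> element, or an element ID'd as "content" or "main"), assuming the page's tags are properly balanced. The names `ContainerTextExtractor`, `extract` and `main_content`, and the order in which the rules are tried, are choices made for this sketch only.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not affect the nesting counter.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContainerTextExtractor(HTMLParser):
    """Collects the text inside the first element accepted by `matches`."""

    def __init__(self, matches):
        super().__init__()
        self.matches = matches  # function (tag, attrs_dict) -> bool
        self.depth = 0          # nesting depth inside the match; 0 means outside
        self.done = False       # True once the matched element has been closed
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif not self.done and self.matches(tag, dict(attrs)):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.pieces.append(data)


def extract(html, matches):
    """Return the whitespace-normalized text of the first matching element."""
    parser = ContainerTextExtractor(matches)
    parser.feed(html)
    parser.close()
    # Pieces are joined with a space so adjacent block elements do not run together.
    return " ".join(" ".join(parser.pieces).split())


def main_content(html):
    """Try <article> first, then an element with id "content" or "main"."""
    rules = [
        lambda tag, attrs: tag == "article",
        lambda tag, attrs: attrs.get("id") in ("content", "main"),
    ]
    for rule in rules:
        text = extract(html, rule)
        if text:
            return text
    return ""
```

Since "it depends" on the site, the rule list is the part that would change from one site to the next.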
Recently I faced the same problem. I developed a news article scraper and had to detect the main textual content of the article pages. Many news sites display lots of other textual content beside the "main article" (e.g. 'read next', 'you might be interested in'). My first approach was to collect all text between <p> tags. But this didn't work, because some news sites also used <p> for other elements like navigation, 'read more', etc. Some time ago I stumbled on the Boilerpipe library.
That sounded like the perfect solution for my problem, but it wasn't. It failed on many news sites, because it was often unable to parse the whole text of the news article. I don't know why, but I think the Boilerpipe algorithm can't deal with badly written HTML. So in many cases it just returned an empty string rather than the main content of the news article.
After this bad experience I tried to develop my own "article text extractor" algorithm. The main idea was to split the html into different depths, for example:
As you can see, to filter out the surplus "clutter" with this algorithm, things like navigation elements, "you may like" sections, etc. must be at a different depth than the main content. In other words: the surplus "clutter" must be wrapped in more (or fewer) HTML tags than the main textual content.
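One possible reading of the depth idea, sketched in Python. This is not the Ruby script mentioned in this comment; it is an assumption-laden illustration that groups text by how many tags enclose it and keeps the depth holding the most characters. All names are hypothetical, and balanced tags are assumed.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the depth counter.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling rather than readable content.
SKIPPED_TAGS = {"script", "style"}


class DepthTextCollector(HTMLParser):
    """Groups text snippets by the number of open tags around them."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.skip = 0  # > 0 while inside <script> or <style>
        self.text_by_depth = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if tag in SKIPPED_TAGS:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in SKIPPED_TAGS and self.skip > 0:
            self.skip -= 1
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.skip == 0:
            self.text_by_depth[self.depth].append(text)


def text_at_dominant_depth(html):
    """Return the text found at the depth that holds the most characters."""
    collector = DepthTextCollector()
    collector.feed(html)
    collector.close()
    if not collector.text_by_depth:
        return ""
    best = max(collector.text_by_depth.values(),
               key=lambda parts: sum(len(part) for part in parts))
    return " ".join(best)
```

A known weakness of this naive version: inline tags such as links inside a paragraph sit one level deeper than the paragraph text, so a real implementation would probably need to treat inline elements specially.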
To prove this concept I wrote a Ruby script, which works well with most news sites. In addition to the Ruby script I also developed the textracto.com API, which you can use for free.
Greetings,
David
It depends very much on the page. Do you know anything about the page's structure beforehand? If you are in luck, it might provide an RSS feed that you could use, or it might be marked up with some of the new HTML5 tags like <article>, <section>, etc. (which carry more semantic power than pre-HTML5 tags).
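If the site does offer an RSS feed, reading it can be much simpler than scraping the page. A small Python sketch, assuming a plain RSS 2.0 feed without XML namespaces; `rss_items` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def rss_items(feed_xml):
    """Return title, link and description for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child element is missing.
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items
```

Note that many feeds only carry a summary in the description, so this may not yield the full post text.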
I've ported the original Boilerpipe Java code to a pure Ruby implementation, Ruby Boilerpipe. There is also a JRuby version wrapping the original Java code, JRuby Boilerpipe.