Any thoughts on how to identify the main content of a page?

Posted 2024-12-01 21:59:10 · 123 words · 0 views · 0 comments

If you had to identify the main text of a page (e.g. on a blog page, the post's content), what would you do? What do you think is the simplest way to do it?

  1. Get the page content with cURL
  2. Maybe use a DOM parser to identify the elements of the page
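A minimal sketch of those two steps, using only the Ruby standard library (Net::HTTP in place of cURL). The URL is hypothetical, and the regex pass is a deliberately naive stand-in for a real DOM parser such as Nokogiri:

```ruby
require "net/http"
require "uri"

# Step 1: fetch the page (the cURL step).
def fetch_page(url)
  Net::HTTP.get(URI(url))
end

# Step 2: naive stand-in for a DOM parser — collect the text of
# every <p>…</p> element and strip any inline tags.
def paragraph_texts(html)
  html.scan(%r{<p(?:\s[^>]*)?>(.*?)</p>}mi)
      .map { |(inner)| inner.gsub(/<[^>]+>/, "").strip }
      .reject(&:empty?)
end

# html = fetch_page("https://example.com/some-post")  # hypothetical URL
html = "<div><p>First paragraph.</p><nav>menu</nav><p>Second one.</p></div>"
puts paragraph_texts(html).inspect  # → ["First paragraph.", "Second one."]
```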

Comments (7)

泛泛之交 2024-12-08 21:59:11

That's a pretty hard task, but I would start by counting spaces inside DOM elements. A telltale sign of human-readable content is spaces and periods. Most articles seem to wrap the content in paragraph tags, so you could look at all p tags with n spaces and at least one punctuation mark.

You could also use the number of grouped paragraph tags inside an element. So if a div has N paragraph children, it could very well be the content you're wanting to extract.
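That heuristic is easy to sketch. In the fragment below, the MIN_SPACES cutoff is an arbitrary assumption, not a tuned value:

```ruby
MIN_SPACES = 5  # hypothetical cutoff for "enough words to be prose"

# A paragraph "looks human-readable" if it has n spaces and at
# least one sentence-ending punctuation mark.
def looks_like_prose?(text)
  text.count(" ") >= MIN_SPACES && !!(text =~ /[.!?]/)
end

# Keep only the <p> elements whose text passes the check.
def prose_paragraphs(html)
  html.scan(%r{<p(?:\s[^>]*)?>(.*?)</p>}mi)
      .map { |(inner)| inner.gsub(/<[^>]+>/, "").strip }
      .select { |t| looks_like_prose?(t) }
end

html = "<p>Home</p><p>A real sentence has plenty of spaces and a period.</p>"
puts prose_paragraphs(html).inspect
# → ["A real sentence has plenty of spaces and a period."]
```

The second idea — a div with N paragraph children — could be layered on top by grouping these matches by their parent element.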

锦上情书 2024-12-08 21:59:11

There are some frameworks that can achieve this; one of them is http://code.google.com/p/boilerpipe/ which uses some statistics.
Some features that can detect the html block with the main content:

  1. p, div tags
  2. amount of text inside/outside
  3. amount of links inside/outside (i.e. to filter out menus)
  4. some css class names and ids (frequently these blocks have classes or ids containing main, main_block, content, etc.)
  5. relation between title and text inside content
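Feature 3, the inside/outside link ratio, can be sketched as a link-density score — what fraction of a block's visible text sits inside `<a>` tags. Navigation menus score near 1.0, article prose near 0.0 (the exact cutoff you'd use is a judgment call):

```ruby
# Link density of an html block: the fraction of its visible text
# that sits inside <a> tags.
def link_density(block_html)
  text_len = block_html.gsub(/<[^>]+>/, "").length.to_f
  return 0.0 if text_len.zero?
  link_len = block_html.scan(%r{<a[^>]*>(.*?)</a>}mi)
                       .sum { |(inner)| inner.gsub(/<[^>]+>/, "").length }
  link_len / text_len
end

nav  = '<a href="/">Home</a> <a href="/about">About</a>'
para = 'Long article text with just <a href="#">one link</a> in it.'
puts link_density(nav).round(2)  # → 0.9
puts link_density(para) < 0.3    # → true
```
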

尐偏执 2024-12-08 21:59:11

You might consider:

  • Boilerpipe: "The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings."
  • Ruby Readability: "Ruby Readability is a tool for extracting the primary readable content of a webpage. It is a Ruby port of arc90's readability project."
  • The Readability API: "If you'd like access to the Readability parser directly, the Content API is available upon request. Contact us if you're interested."

苏大泽ㄣ 2024-12-08 21:59:11

It seems like the best answer is "it depends". As in, it depends on how the site in question is marked up.

  1. If the author uses "common" tags, you could look for a container element ID'd as "content" or "main".
  2. If the author is using HTML5, you should in theory be able to query for the <article> element, if it's a page with only one "story" to tell.
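A sketch of that lookup order — the HTML5 `<article>` element first, then common container ids. Regex stands in for a real DOM query to keep the example dependency-free:

```ruby
# Try the HTML5 <article> element first; fall back to containers
# whose id is one of the "common" names.
def find_main_container(html)
  if (m = html.match(%r{<article[^>]*>(.*?)</article>}mi))
    return m[1]
  end
  %w[content main].each do |id|
    pattern = %r{<(?:div|section)[^>]*\bid=["']#{id}["'][^>]*>(.*?)</(?:div|section)>}mi
    if (m = html.match(pattern))
      return m[1]
    end
  end
  nil
end

puts find_main_container("<article>The story.</article>")       # → The story.
puts find_main_container(%(<div id="content">Fallback.</div>))  # → Fallback.
```
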

残月升风 2024-12-08 21:59:11

Recently I faced the same problem. I developed a news article scraper and I had to detect the main textual content of the article pages. Many news sites display lots of other textual content besides the "main article" (e.g. 'read next', 'you might be interested in'). My first approach was to collect all text between <p> tags. But this didn't work, because some news sites used <p> for other elements like navigation, 'read more', etc. too. Some time ago I stumbled on the Boilerpipe library.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

That sounded like the perfect solution for my problem, but it wasn't. It failed at many news sites, because it was often not able to parse the whole text of the news article. I don't know why, but I think the boilerpipe algorithm can't deal with badly written html. So in many cases it just returned an empty string instead of the main content of the news article.

After this bad experience I tried to develop my own "article text extractor" algorithm. The main idea was to split the html into different depths, for example:

<html>
<!-- depth: 1 -->
<nav>
  <!-- depth: 2 -->
  <ul>
    <!-- depth: 3 -->
    <li><a href="/mhh">Site<!-- depth: 5 --></a></li>
    <li><a href="/bla">Site<!-- depth: 5 --></a></li>
  </ul>
</nav>
<div id='text'>
  <!-- depth: 2 -->
  <p>That's the main content...<!-- depth: 3 --></p>
  <p>main content, bla, bla bla ... <!-- depth: 3 --></p>
  <p>bla bla bla interesting bla bla! <!-- depth: 3 --></p>
  <p>whatever, bla... <!-- depth: 3 --></p>
</div>
</html>

As you can see, for this algorithm to filter out the surplus "clutter", things like navigation elements, "you may like" sections, etc. must sit at a different depth than the main content. In other words: the surplus "clutter" must be wrapped in more (or fewer) html tags than the main textual content.

  1. Calculate the depth of every html element.
  2. Find the depth with the highest amount of textual content.
  3. Select all textual content with this depth.
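The three steps above can be sketched with a naive tag scanner (no external gems; comments and a few void tags are handled). This is a proof of concept under those simplifying assumptions, not David's actual textracto implementation:

```ruby
VOID_TAGS = /\A<(?:br|hr|img|meta|link|input)\b/i

# Bucket every text node by the nesting depth of its enclosing element,
# then return the text of the depth that holds the most characters.
def extract_main_text(html)
  depth = 0
  buckets = Hash.new { |h, k| h[k] = [] }
  html.scan(/<[^>]+>|[^<]+/) do |tok|
    if tok.start_with?("<!--")       # comment: carries no depth
      next
    elsif tok.start_with?("</")      # closing tag
      depth -= 1
    elsif tok.start_with?("<")       # opening tag
      depth += 1 unless tok.end_with?("/>") || tok =~ VOID_TAGS
    else                             # text node
      text = tok.strip
      buckets[depth] << text unless text.empty?
    end
  end
  best = buckets.max_by { |_, texts| texts.join.length }
  best ? best.last.join(" ") : ""
end

html = "<nav><ul><li><a href='/'>Site</a></li></ul></nav>" \
       "<div><p>Main content, bla bla.</p><p>More main content here.</p></div>"
puts extract_main_text(html)  # → Main content, bla bla. More main content here.
```

The paragraph texts sit at one depth and dominate by character count, so the navigation link text at a deeper level is dropped.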

To prove this concept I wrote a Ruby script, which works well with most news sites. In addition to the Ruby script I also developed the textracto.com api, which you can use for free.

Greetings,
David

手长情犹 2024-12-08 21:59:11

It depends very much on the page. Do you know anything about the page's structure beforehand? If you are in luck, it might provide an RSS feed that you could use or it might be marked up with some of the new HTML5 tags like <article>, <section> etc. (which carry more semantic power than pre-HTML5 tags).

如梦初醒的夏天 2024-12-08 21:59:11

I've ported the original Boilerpipe Java code to a pure Ruby implementation, Ruby Boilerpipe; there is also a JRuby version wrapping the original Java code, JRuby Boilerpipe.
