Programmatically detecting the "most important content" on a page

Published 2024-07-24 22:48:50

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.

How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?

Note: Ideally, the method would work with both well-formed markup and terrible markup, whether somebody uses paragraph tags to make paragraphs or just a series of breaks.

12 Answers

滥情哥ㄟ 2024-07-31 22:48:50

Readability does a decent job of exactly this.

It's open source and posted on Google Code.


UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.

琉璃繁缕 2024-07-31 22:48:50

think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.

How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?

I would probably try something like this:

  • open the URL
  • read in all links to the same website from that page
  • follow all links and build a DOM tree for each URL (HTML file)
  • this should help you identify redundant content (shared templates and such)
  • compare the DOM trees of all documents on the same site (tree walking)
  • strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
  • try to identify similar nodes and strip them if possible
  • find the largest unique text blocks that are not found in other DOMs on that website (i.e. the unique content)
  • add them as candidates for further processing

This approach seems promising because it would be fairly simple to implement, yet still have good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes between all pages on the same website.

This could probably be improved further by simply using a scoring system to keep track of DOM nodes that were previously identified as containing unique content, so that these nodes are prioritized for other pages.
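
A minimal sketch of this idea in Python, assuming the requests and beautifulsoup4 packages (not part of the original answer; the function names are made up for illustration). Rather than full tree-walking, it approximates "redundant nodes" by collecting the text of block-level elements across sibling pages and discarding any block that repeats:

from collections import Counter
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

BLOCK_TAGS = ["p", "div", "td", "li", "article", "section"]

def text_blocks(html):
    # Text of every block-level element that actually contains text.
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(" ", strip=True)
            for el in soup.find_all(BLOCK_TAGS)
            if el.get_text(strip=True)]

def extract_unique_content(url, max_pages=10):
    base = urlparse(url).netloc
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Follow a handful of same-site links to learn the shared template.
    links = {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}
    links = [l for l in links if urlparse(l).netloc == base and l != url]

    seen = Counter()
    for link in links[:max_pages]:
        try:
            seen.update(text_blocks(requests.get(link, timeout=10).text))
        except requests.RequestException:
            continue

    # Blocks that never appear on sibling pages are candidates for the
    # unique content; return the largest one.
    unique = [b for b in text_blocks(html) if seen[b] == 0]
    return max(unique, default="", key=len)

The scoring-system refinement from the last paragraph could then be layered on top by persisting the seen counts between runs.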

花想c 2024-07-31 22:48:50

Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.

http://www.w3.org/TR/CSS2/media.html

I would try to read this style, and then scrape whatever is left visible.
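
As a rough sketch of how that could look in Python (assuming beautifulsoup4 with soupsieve for CSS selector support; the helper names are invented, and a real CSS parser would be more robust than the regex used here), one could pull the display: none selectors out of the @media print block and strip the matching elements before scraping:

import re
from bs4 import BeautifulSoup

def print_hidden_selectors(css):
    # Find the body of the @media print block by tracking brace depth.
    m = re.search(r"@media\s+print\s*{", css)
    if not m:
        return []
    depth, i = 1, m.end()
    start = i
    while i < len(css) and depth:
        depth += {"{": 1, "}": -1}.get(css[i], 0)
        i += 1
    block = css[start:i - 1]
    # Collect every selector whose rule body hides the element.
    selectors = []
    for sel, body in re.findall(r"([^{}]+)\{([^}]*)\}", block):
        if re.search(r"display\s*:\s*none", body):
            selectors.extend(s.strip() for s in sel.split(","))
    return selectors

def print_visible_text(html, css):
    soup = BeautifulSoup(html, "html.parser")
    for sel in print_hidden_selectors(css):
        try:
            for el in soup.select(sel):
                el.decompose()
        except Exception:  # soupsieve rejects some exotic selectors
            continue
    return soup.get_text(" ", strip=True)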

挽袖吟 2024-07-31 22:48:50

You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element such as a div as a document), gather some properties of each section, and convert it to a vector. (As other people suggested, this could be the number of words, number of links, number of images; the more the better.)

First start with a large set of documents (100-1000) for which you have already chosen which part is the main part. Then use this set to train your SVM.

Then, for each new document, you just need to convert it to a vector and pass it to the SVM.

This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You could use a simpler Bayesian model as well.

If you are interested, you can find more details in Introduction to Information Retrieval (freely available online).
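
A hedged sketch of that pipeline with scikit-learn and beautifulsoup4 (neither is mentioned in the original answer; the feature set is a small illustrative subset, and `labelled` stands in for the hand-built training corpus described above):

from bs4 import BeautifulSoup
from sklearn.svm import SVC

def block_features(el):
    text = el.get_text(" ", strip=True)
    return [
        len(text.split()),        # number of words
        len(el.find_all("a")),    # number of links
        len(el.find_all("img")),  # number of images
    ]

def page_blocks(html):
    return BeautifulSoup(html, "html.parser").find_all("div")

def train(labelled):
    # `labelled` is the hand-built corpus: (html, index of the main div).
    X, y = [], []
    for html, main_idx in labelled:
        for i, el in enumerate(page_blocks(html)):
            X.append(block_features(el))
            y.append(1 if i == main_idx else 0)
    return SVC(kernel="linear").fit(X, y)

def predict_main(clf, html):
    # Pick the block the SVM is most confident about.
    blocks = page_blocks(html)
    scores = clf.decision_function([block_features(b) for b in blocks])
    return blocks[int(scores.argmax())]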

抚你发端 2024-07-31 22:48:50

I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.

You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to check whether the browser has entered quirks mode before trying it.

You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going this route would obviously impact performance.
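
One way the first check could be sketched in Python, assuming beautifulsoup4 (the excluded-tag list and function name are illustrative only): walk every text node, skip those inside excluded tags, and total the text length per parent element.

from collections import defaultdict
from bs4 import BeautifulSoup

EXCLUDE = {"a", "img", "script", "style"}  # doesn't count as content

def largest_text_block(html, exclude=EXCLUDE):
    soup = BeautifulSoup(html, "html.parser")
    totals = defaultdict(int)
    for node in soup.find_all(string=True):
        # Skip text that sits anywhere inside an excluded tag.
        if any(p.name in exclude for p in node.parents):
            continue
        # Only direct text counts, so this finds markup-free blocks.
        totals[node.parent] += len(node.strip())
    return max(totals, key=totals.get, default=None)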

初见你 2024-07-31 22:48:50

I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"

Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.

The content area is almost always the area with the greatest width on the page.
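
That heuristic translates into a few lines of Python, sketched here with beautifulsoup4 (an assumption, as is the exact scoring formula): score each div by its text length, discounted by the share of that text that sits inside links.

from bs4 import BeautifulSoup

def best_div(html):
    soup = BeautifulSoup(html, "html.parser")
    best, best_score = None, 0.0
    for div in soup.find_all("div"):
        text_len = len(div.get_text(strip=True))
        if not text_len:
            continue
        link_len = sum(len(a.get_text(strip=True)) for a in div.find_all("a"))
        # Long text with a low share of link text scores highest. Note that
        # nested divs inherit their children's text, so in practice you may
        # want to prefer the deepest div above some score threshold.
        score = text_len * (1.0 - link_len / text_len)
        if score > best_score:
            best, best_score = div, score
    return best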

风筝在阴天搁浅。 2024-07-31 22:48:50

I would probably start with the title and anything else in the head tag, then filter down through the heading tags in order (i.e. h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how the page is styled, it may be a safe bet to assume the page title has an ID or a unique class.
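
A small illustrative sketch of that top-down order, assuming beautifulsoup4:

from bs4 import BeautifulSoup

def heading_outline(html):
    soup = BeautifulSoup(html, "html.parser")
    # Title first, then heading tags in rank order.
    outline = [soup.title.get_text(strip=True)] if soup.title else []
    for level in ("h1", "h2", "h3", "h4", "h5", "h6"):
        outline += [h.get_text(" ", strip=True) for h in soup.find_all(level)]
    return outline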

小嗲 2024-07-31 22:48:50

I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.

You could look for the first and last elements containing sentences with punctuation, and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
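
Sketched in Python with beautifulsoup4 (the regex and the block-tag list are assumptions, not part of the answer):

import re
from bs4 import BeautifulSoup

SENTENCE = re.compile(r",|[.!?](\s|$)")  # commas or sentence-final punctuation

def content_between_sentences(html):
    soup = BeautifulSoup(html, "html.parser")
    # Leaf-ish blocks only, so nested containers don't double-count.
    blocks = soup.find_all(["p", "li", "h1", "h2", "h3", "h4"])
    hits = [i for i, el in enumerate(blocks)
            if SENTENCE.search(el.get_text(" ", strip=True))]
    if not hits:
        return ""
    start = hits[0]
    # Pull in an Hn header sitting immediately before the first sentence.
    if start and blocks[start - 1].name.startswith("h"):
        start -= 1
    return "\n".join(el.get_text(" ", strip=True)
                     for el in blocks[start:hits[-1] + 1])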

所谓喜欢 2024-07-31 22:48:50

While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.

A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.

笔落惊风雨 2024-07-31 22:48:50

Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.

望她远 2024-07-31 22:48:50

Today most news/blog websites are using a blogging platform, so I would create a set of rules by which to search for content. For example, two of the most popular blogging platforms are WordPress and Google Blogspot.

WordPress posts are marked by:

<div class="entry">
    ...
</div>

Blogspot posts are marked by:

<div class="post-body">
    ...
</div>

If the search by CSS classes fails, you could turn to the other solutions, such as identifying the biggest chunk of text and so on.
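
A minimal sketch of such a rule set in Python, assuming beautifulsoup4; the selector list contains just the two classes mentioned above and is meant to be extended:

from bs4 import BeautifulSoup

PLATFORM_SELECTORS = [
    "div.entry",      # WordPress
    "div.post-body",  # Blogspot
]

def extract_by_rules(html):
    soup = BeautifulSoup(html, "html.parser")
    for selector in PLATFORM_SELECTORS:
        hit = soup.select_one(selector)
        if hit:
            return hit.get_text(" ", strip=True)
    return None  # fall back to e.g. a largest-text-block heuristic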

っ左 2024-07-31 22:48:50

As Readability is not available anymore:

  • If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
  • If you're interested in some code how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
  • If you prefer Java, you can take a look at Crux, which also does a pretty good job.
  • Or if Kotlin is more your language, then you can take a look at Readability4J, a port of the above Readability.js.