Finding the content section of an HTML document
This is not really a programming question, more of an algorithmic one.
The problem: Finding the "content" section of an HTML page.
By "content" I mean the dom that contains the page content as seen by humans, without the noise, simply the "page actual content".
I know the problem is not well defined, but let's continue...
For example, in blog sites this is usually easy: when browsing to a specific post you usually have some toolbars at the top of the page, maybe some navigation elements on the LHS, and then you have the div that contains the content. Trying to figure this out from the HTML can be tricky. Luckily, however, most blogs have RSS feeds, and in the feed for that specific post you'd find a <description> section (or <content:encoded>), which is exactly what you want.
So, to refine the definition of content, this is the actual thing on the page that contains the interesting part, removing all the ads, navigation elements etc.
So finding content from blogs is relatively easy, assuming they have RSS. Same goes for other RSS supportive sites.
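To make the RSS route concrete, here is a minimal sketch of pulling the description text out of a feed, using only the standard library. The feed snippet is invented for illustration; a real feed's <content:encoded> element would additionally need namespace handling.

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 snippet standing in for a real blog feed;
# the feed data here is invented for illustration.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>My Post</title>
      <description>The actual post content, free of page chrome.</description>
    </item>
  </channel>
</rss>"""

def item_descriptions(rss_text):
    """Return the description text of every item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [item.findtext("description") for item in root.iter("item")]

print(item_descriptions(RSS))
# → ['The actual post content, free of page chrome.']
```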
What about news sites? In many cases news sites have RSS, but not always. How does one find content on news sites then?
What about more general sites? Many web pages (of course not all of them) have a content section and other sections. Can you think of a good algorithm to find the sections that are "interesting" versus the less interesting ones? Perhaps by separating the sections that change from those that do not?
Hope I've made myself clear... Thanks!
2 Answers
I haven't done this, but this would be my general approach.
As you indicate, the lack of structure in the visible content parts of HTML (i.e. there are no tags such as header, navigation or ads) means it is harder to home in on the key part of the page. My approach would be to first remove distinct elements which you have definitely decided are not interesting. A possible list of exclusions could be: !doctype, head (taking the title as a separate piece of data), object, embed, applet, script, img, form, input, textarea, label, legend, select, option.
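This first pass could be sketched with the standard library's html.parser, tracking nesting depth so that everything inside an excluded element is dropped. The exclusion set below covers the container tags from the list above; void tags such as img and input contribute no text, so they need no depth tracking. The sample HTML is invented.

```python
from html.parser import HTMLParser

# Container tags to strip wholesale; a subset of the exclusion list above.
EXCLUDED = {"head", "object", "embed", "applet", "script", "style",
            "form", "textarea", "label", "legend", "select", "option"}

class FirstPassFilter(HTMLParser):
    """Collect visible text, skipping everything inside excluded elements."""
    def __init__(self):
        super().__init__()
        self.skip = 0        # nesting depth inside excluded elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in EXCLUDED and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.skip == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = FirstPassFilter()
    parser.feed(html)
    return parser.chunks

html = "<head><title>T</title></head><body><p>Hello</p><script>var x;</script></body>"
print(visible_text(html))  # → ['Hello']
```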
A second pass could then start to exclude commonly occurring div or ul id/class names, and all tags within them, such as: header, footer, meta, nav, navigation, topnav, sidebar, ad, ads, adu (and other names commonly used for ads).

This will hopefully remove a significant amount of decoration from the page. The next challenge is to try to identify the main content from what's left. I would suggest initially assuming that the site author is using semantic HTML properly, and so is principally using the h1 and h2 heading tags and the p paragraph tag.

To identify content, I would look for any heading tag that is followed by one or more paragraph tags. (For your main content this may be h2; the h1 tag is often, and arguably incorrectly, used to display the site name or logo, but it will hopefully have been eliminated when the header parts of the page were excluded.) Each subsequent paragraph should be added to the current content until you reach a break, which could be either the end of the enclosing div or td element, or a heading element of the same level you started from.

As there may still be several sets of content gathered from the page (perhaps the main content plus a blurb about the author), you need a decision-making step here, to be tested and refined, that chooses the most likely candidate. This will often simply be the largest, both in length and in the number of paragraph elements used.
As you gather more examples of content, you can add supporting measures to your algorithm; for instance, you might notice that many pages use div id="content" or id="maincontent". It may also be useful to retain the secondary items of content that you detect, so that if certain sites structure their content in a curious way, then once you've added a catcher to your algorithm it can be re-run against just that site's content.
A well-structured site will have its common areas reuse the same code, e.g. navigation, header, and so on.
When you have a target page that you would like to analyze, browse through a few other pages under the same domain/subdomain and find the elements which are common to all of them. Those are the noise you want to get rid of.
Then look at what remains to see whether any noise slipped in. Once you have collected a reasonable amount of this data, try to find patterns in it. Refine your logic and repeat.
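A minimal sketch of that cross-page idea, with each page represented as a list of extracted text blocks (the site data here is invented): any block appearing on every page is treated as boilerplate and removed.

```python
def strip_common(pages):
    """Remove text blocks shared by all pages; what remains is candidate content."""
    common = set(pages[0])
    for page in pages[1:]:
        common &= set(page)
    return [[block for block in page if block not in common] for page in pages]

site = [
    ["Home | News | Sport", "Big storm hits the coast", "Copyright 2009"],
    ["Home | News | Sport", "Election results are in", "Copyright 2009"],
    ["Home | News | Sport", "Local team wins the cup", "Copyright 2009"],
]
print(strip_common(site))
# → [['Big storm hits the coast'], ['Election results are in'], ['Local team wins the cup']]
```

In practice, near-duplicate rather than exact matching would be needed, since navigation blocks often vary slightly (e.g. a highlighted current-page link).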