Finding the content section of an HTML document

Posted on 2024-07-30 06:14:54

This is not really a programming question, more of an algorithmic one.

The problem: Finding the "content" section of an HTML page.

By "content" I mean the part of the DOM that holds the page content as a human sees it, without the noise: simply the "actual page content".
I know the problem is not well defined, but let's continue...
For example, on blog sites this is usually easy. When browsing to a specific post you usually have some toolbars at the top of the page, maybe some navigation elements on the left-hand side, and then the div that contains the content. Trying to figure this out from the HTML can be tricky. Luckily, however, most blogs have RSS feeds, and in the feed entry for this specific post you'd find a <description> section (or <content:encoded>), and this is exactly what you want.
So, to refine the definition of content: it is the actual thing on the page that contains the interesting part, with all the ads, navigation elements, etc. removed.
So finding content on blogs is relatively easy, assuming they have RSS. The same goes for other sites that support RSS.
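
To make the RSS route concrete, here is a minimal sketch using only Python's standard library. It assumes an RSS 2.0 feed whose items carry a <link> equal to the post URL; the function name post_body_from_feed and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Qualified name of <content:encoded> from the RSS "content" module.
CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"


def post_body_from_feed(feed_xml, post_url):
    """Return the feed's HTML body for the item linking to post_url, or None."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link != post_url:
            continue
        # Prefer the full body; fall back to the (often shorter) description.
        return item.findtext(CONTENT_ENCODED) or item.findtext("description")
    return None


# Hypothetical feed used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
  <title>Example blog</title>
  <item>
    <link>https://example.com/post-1</link>
    <description>Short summary</description>
    <content:encoded><![CDATA[<p>Full post body</p>]]></content:encoded>
  </item>
</channel></rss>"""

print(post_body_from_feed(SAMPLE_FEED, "https://example.com/post-1"))
```

Atom feeds use different element names and a default namespace, so they would need their own lookup.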

What about news sites? In many cases news sites have RSS, but not always. How does one find content on news sites then?
What about more general sites? Many web pages (of course not all of them) have a content section and other sections. Can you think of a good algorithm to tell the "interesting" sections from the less interesting ones? Perhaps by separating the sections that change from those that do not?

Hope I've made myself clear... Thanks!

Comments (2)

孤君无依 2024-08-06 06:14:54

I haven't done this, but this would be my general approach.

As you indicate, the lack of structure in the visible content parts (i.e. it doesn't have tags such as header, navigation, ads) of HTML means it is harder to home in on the key part of the page. My approach would be to first remove distinct elements which you have definitely decided are not interesting. A possible list of exclusions could be:

  • meta elements such as !doctype, head (take the title as a separate piece of data)
  • dynamic elements such as object, embed, applet, script
  • images (depending on whether you want to retain them or not), img
  • form elements, i.e. form, input, textarea, label, legend, select, option
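
A rough sketch of this first pass, using only html.parser from the Python standard library. The names Stripper and strip_noise are invented for this example, and it assumes the excluded elements are explicitly closed in the markup. option is not tracked separately because it sits inside select, which is already skipped.

```python
from html.parser import HTMLParser

# Excluded elements that wrap content and have a closing tag.
SKIP_TAGS = {"head", "object", "applet", "script", "form",
             "textarea", "label", "legend", "select"}
# Excluded void elements: no closing tag, so they never change the depth.
SKIP_VOID = {"img", "input", "embed"}


class Stripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # markup and text that survive the pass
        self.title = ""       # title kept as a separate piece of data
        self.skip_depth = 0   # > 0 while inside an excluded element
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        if tag in SKIP_VOID:
            return
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> open and close in one step.
        if (self.skip_depth == 0 and tag not in SKIP_VOID
                and tag not in SKIP_TAGS):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in SKIP_VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.skip_depth == 0:
            self.out.append(data)


def strip_noise(html):
    """Return (title, html_without_excluded_elements)."""
    parser = Stripper()
    parser.feed(html)
    parser.close()
    return parser.title, "".join(parser.out)
```

The doctype and comments disappear as a side effect, because the parser's default handlers for them do nothing.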

A second pass could then start to exclude commonly occurring div or ul id/class names, and all tags within them, such as:

  • header, footer, meta
  • nav, navigation, topnav, sidebar
  • ad, ads, adu (and other names commonly used for ads)
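
The second pass could look something like the sketch below, again with the standard library only. NoiseFilter and drop_noise_blocks are hypothetical names, the name list is just the one from the bullets above, and the depth counting assumes every non-void element inside a dropped block is explicitly closed.

```python
import re
from html.parser import HTMLParser

# id/class names treated as decoration (taken from the list above).
NOISE = re.compile(
    r"^(header|footer|meta|nav|navigation|topnav|sidebar|ad|ads|adu)$", re.I)
# Void elements never get a closing tag, so they must not affect the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


def is_noise(attrs):
    """True if any id or class token matches a known noise name."""
    names = []
    for key, value in attrs:
        if key in ("id", "class") and value:
            names.extend(value.split())
    return any(NOISE.match(name) for name in names)


class NoiseFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0   # nesting depth inside a dropped div/ul

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
        elif tag in ("div", "ul") and is_noise(attrs):
            self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags never change the depth.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def drop_noise_blocks(html):
    parser = NoiseFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Matching whole tokens rather than substrings avoids throwing away something like class="download" just because it contains "ad".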

This will hopefully remove a significant amount of decoration from the page. The next challenge is to try to identify the main content from what's left, and I would suggest initially assuming that the site author is using semantic HTML properly, and so is principally using the h1 and h2 heading tags and the p paragraph tag.

To identify content, I would look for any heading tag that is followed by one or more paragraph tags. (This may be h2 for your main content; the h1 tag is often, and arguably incorrectly, used to display the site name or logo, but this will hopefully have been eliminated by excluding the header parts of the page.) Each subsequent paragraph should be added to the current content until you reach a break, which could be either the end of the div or td element, or a heading element of the same level you started from.
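
Here is one way the heading-then-paragraphs grouping might be sketched with html.parser. BlockCollector and collect_blocks are invented names, and the sketch makes two assumptions beyond the description above: a heading of a higher level also ends the current block, and headings that are never followed by a paragraph are discarded.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class BlockCollector(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished blocks
        self.current = None   # block being built, or None
        self.capture = None   # "heading" or "p" while inside such a tag
        self.text = []

    def _close_block(self):
        # Keep a block only if at least one paragraph followed its heading.
        if self.current and self.current["paragraphs"]:
            self.blocks.append(self.current)
        self.current = None

    def handle_starttag(self, tag, attrs):
        level = HEADINGS.get(tag)
        if level and (self.current is None
                      or level <= self.current["level"]):
            # A heading of the same (or higher) level starts a new block.
            self._close_block()
            self.current = {"heading": "", "level": level, "paragraphs": []}
            self.capture, self.text = "heading", []
        elif tag == "p" and self.current is not None:
            self.capture, self.text = "p", []

    def handle_data(self, data):
        if self.capture:
            self.text.append(data)

    def handle_endtag(self, tag):
        text = " ".join("".join(self.text).split())
        if tag in HEADINGS and self.capture == "heading":
            self.current["heading"] = text
            self.capture = None
        elif tag == "p" and self.capture == "p":
            if text:
                self.current["paragraphs"].append(text)
            self.capture = None
        elif tag in ("div", "td"):
            # The end of a div or td is treated as a break in the content.
            self._close_block()
            self.capture = None


def collect_blocks(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser._close_block()   # flush a block still open at end of input
    return parser.blocks
```

Because any closing div ends the block, a nested div inside the article body will cut it short; that is one of the first things you would probably need to refine.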

As there may still be several sets of content that you've gathered on the page (maybe the main content plus the blurb about the author), you need to test and refine a decision-making step here which chooses the most likely candidate. This will often simply be the largest, both in terms of length and number of paragraph elements used.
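
As a starting point for that decision step, the sketch below simply ranks candidates by total text length and then by paragraph count. The block shape (a dict with "heading" and "paragraphs") and the name pick_main_block are assumptions made for this example.

```python
def pick_main_block(blocks):
    """Return the block with the most text, ties broken by paragraph count."""
    def score(block):
        paragraphs = block["paragraphs"]
        return (sum(len(p) for p in paragraphs), len(paragraphs))

    return max(blocks, key=score) if blocks else None


# Hypothetical candidates gathered from one page.
candidates = [
    {"heading": "About the author", "paragraphs": ["Short bio."]},
    {"heading": "The post",
     "paragraphs": ["First paragraph of the post.", "Second paragraph."]},
]
print(pick_main_block(candidates)["heading"])
```

Keeping the scoring in one small function makes it easy to swap in other signals later, such as link density or position on the page.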

As you gather more examples of content, you can add supporting measures to your algorithm; for instance, you might notice that many of the pages use div id="content" or id="maincontent". It may also be useful to retain the secondary items of content that you detected, so that if certain sites have a curious way of structuring their content, then once you've added a rule for it to your algorithm, it can be re-run against just that site's content.
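
Such a supporting measure could be tried before the heuristics, roughly as in this sketch. KNOWN_IDS holds only the two ids mentioned above, ContentById and text_by_known_id are made-up names, and the depth counting assumes explicitly closed tags.

```python
from html.parser import HTMLParser

KNOWN_IDS = {"content", "maincontent"}
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class ContentById(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0       # nesting depth inside the matched element
        self.text = []
        self.done = False    # True once the first match has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag not in VOID:
                self.depth += 1
        elif dict(attrs).get("id") in KNOWN_IDS:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def text_by_known_id(html):
    """Return the text of the first element with a known id, or None."""
    parser = ContentById()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text).split())
    return text or None
```

When this returns None, the page falls through to the general heading-and-paragraph approach.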

沧桑㈠ 2024-08-06 06:14:54

A well-structured site will reuse the same code for its common areas, e.g. navigation, header, etc.

When you have a target page that you would like to analyze, try browsing a few other pages under the same domain/subdomain and find the elements that are common to all of them. That is the noise you want to get rid of.

Then you can take a look at what remains, to see whether some noise slipped through. Once you have collected a reasonable amount of this data, try to find patterns in it. Refine your logic and repeat.
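
A very small sketch of the comparison idea: split each page into text chunks and drop the chunks of the target page that also show up on the sibling pages. It compares text rather than DOM elements, which is a simplification; the names text_chunks and unique_chunks and the threshold parameter are assumptions for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = " ".join(data.split())   # normalize whitespace
        if text:
            self.chunks.append(text)


def text_chunks(html):
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return parser.chunks


def unique_chunks(target_html, sibling_htmls, threshold=1.0):
    """Keep target chunks seen on fewer than `threshold` of the siblings."""
    seen = Counter()
    for html in sibling_htmls:
        seen.update(set(text_chunks(html)))   # count each page once
    limit = threshold * len(sibling_htmls)
    return [chunk for chunk in text_chunks(target_html)
            if not sibling_htmls or seen[chunk] < limit]


# Hypothetical pages from the same site.
target = "<div>Home | About</div><p>Post A text</p><div>Copyright</div>"
others = [
    "<div>Home | About</div><p>Post B text</p><div>Copyright</div>",
    "<div>Home | About</div><p>Post C text</p><div>Copyright</div>",
]
print(unique_chunks(target, others))
```

Lowering threshold below 1.0 treats a chunk as noise even if it is missing from a few sibling pages, which may help with navigation that varies slightly between sections.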
