Finding the content section of an HTML document
This is not really a programming question, more of an algorithmic one.
The problem: Finding the "content" section of an HTML page.
By "content" I mean the dom that contains the page content as seen by humans, without the noise, simply the "page actual content".
I know the problem is not well defined, but let's continue...
For example, in blog sites this is usually easy: when browsing to a specific post you usually have some toolbars at the top of the page, maybe some navigation elements on the LHS, and then you have the div that contains the content. Trying to figure this out from the HTML can be tricky. Luckily, however, most blogs have RSS feeds, and in the feed for that specific post you'd find a <description> section (or <content:encoded>), which is exactly what you want.
So, to refine the definition of content, this is the actual thing on the page that contains the interesting part, removing all the ads, navigation elements etc.
So finding content from blogs is relatively easy, assuming they have RSS. Same goes for other RSS supportive sites.
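To make the RSS route concrete, here is a minimal sketch of pulling the description text out of a feed, using only the standard library. The feed snippet is invented for illustration; a real feed's <content:encoded> element would additionally need namespace handling.

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 snippet standing in for a real blog feed;
# the feed data here is invented for illustration.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>My Post</title>
      <description>The actual post content, free of page chrome.</description>
    </item>
  </channel>
</rss>"""

def item_descriptions(rss_text):
    """Return the description text of every item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [item.findtext("description") for item in root.iter("item")]

print(item_descriptions(RSS))
# → ['The actual post content, free of page chrome.']
```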
What about news sites? In many cases news sites have RSS, but not always. How does one find content on news sites then?
What about more general sites? Many web pages (of course not all of them) have a content section and other sections. Can you think of a good algorithm to find the sections that are "interesting" versus the less interesting ones? Perhaps by separating the sections that change from those that do not?
Hope I've made myself clear... Thanks!
2 Answers
I haven't done this, but this would be my general approach.
As you indicate, the lack of structure in the visible content parts of HTML (i.e. there are no tags such as header, navigation or ads) means it is harder to home in on the key part of the page. My approach would be to first remove distinct elements which you have definitely decided are not interesting. A possible list of exclusions could be: !doctype, head (taking the title as a separate piece of data), object, embed, applet, script, img, form, input, textarea, label, legend, select, option.
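This first pass could be sketched with the standard library's html.parser, tracking nesting depth so that everything inside an excluded element is dropped. The exclusion set below covers the container tags from the list above; void tags such as img and input contribute no text, so they need no depth tracking. The sample HTML is invented.

```python
from html.parser import HTMLParser

# Container tags to strip wholesale; a subset of the exclusion list above.
EXCLUDED = {"head", "object", "embed", "applet", "script", "style",
            "form", "textarea", "label", "legend", "select", "option"}

class FirstPassFilter(HTMLParser):
    """Collect visible text, skipping everything inside excluded elements."""
    def __init__(self):
        super().__init__()
        self.skip = 0        # nesting depth inside excluded elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in EXCLUDED and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.skip == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = FirstPassFilter()
    parser.feed(html)
    return parser.chunks

html = "<head><title>T</title></head><body><p>Hello</p><script>var x;</script></body>"
print(visible_text(html))  # → ['Hello']
```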
A second pass could then start to exclude commonly occurring div or ul id/class names, and all tags within them, such as: header, footer, meta, nav, navigation, topnav, sidebar, ad, ads, adu (and other names commonly used for ads).

This will hopefully remove a significant amount of decoration from the page. The next challenge is to try to identify the main content from what's left. I would suggest initially assuming that the site author is using semantic HTML properly, and so is principally using the h1 and h2 heading tags and the p paragraph tag.

To identify content, I would look for any heading tag that is followed by one or more paragraph tags. (For your main content this may be h2; the h1 tag is often, and arguably incorrectly, used to display the site name or logo, but it will hopefully have been eliminated when the header parts of the page were excluded.) Each subsequent paragraph should be added to the current content until you reach a break, which could be either the end of the enclosing div or td element, or a heading element of the same level you started from.

As there may still be several sets of content gathered from the page (perhaps the main content plus a blurb about the author), you need a decision-making step here, to be tested and refined, that chooses the most likely candidate. This will often simply be the largest, both in length and in the number of paragraph elements used.
As you gather more examples of content, you can add supporting measures to your algorithm; for instance, you might notice that many pages use div id="content" or id="maincontent". It may also be useful to retain the secondary items of content that you detect, so that if certain sites structure their content in a curious way, then once you've added a catcher to your algorithm it can be re-run against just that site's content.
A well-structured site will have its common areas reuse the same code, e.g. navigation, header, and so on.
When you have a target page that you would like to analyze, browse through a few other pages under the same domain/subdomain and find the elements which are common to all of them. Those are the noise you want to get rid of.
Then look at what remains to see whether any noise slipped in. Once you have collected a reasonable amount of this data, try to find patterns in it. Refine your logic and repeat.
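A minimal sketch of that cross-page idea, with each page represented as a list of extracted text blocks (the site data here is invented): any block appearing on every page is treated as boilerplate and removed.

```python
def strip_common(pages):
    """Remove text blocks shared by all pages; what remains is candidate content."""
    common = set(pages[0])
    for page in pages[1:]:
        common &= set(page)
    return [[block for block in page if block not in common] for page in pages]

site = [
    ["Home | News | Sport", "Big storm hits the coast", "Copyright 2009"],
    ["Home | News | Sport", "Election results are in", "Copyright 2009"],
    ["Home | News | Sport", "Local team wins the cup", "Copyright 2009"],
]
print(strip_common(site))
# → [['Big storm hits the coast'], ['Election results are in'], ['Local team wins the cup']]
```

In practice, near-duplicate rather than exact matching would be needed, since navigation blocks often vary slightly (e.g. a highlighted current-page link).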