Heuristics for finding the main content
Wondering if anybody could point me in the direction of academic papers or related implementations of heuristic approaches to finding the real meat content of a particular webpage.
Obviously this is not a trivial task, since the problem description is so vague, but I think that we all have a general understanding about what is meant by the primary content of a page.
For example, it may include the story text for a news article, but might not include any navigational elements, legal disclaimers, related story teasers, comments, etc. Article titles, dates, author names, and other metadata fall in the grey category.
I imagine that the application value of such an approach is large, and would expect Google to be using it in some way in their search algorithm, so it would appear to me that this subject has been treated by academics in the past.
Any references?
Comments (1)
One way to look at this would be as an information extraction problem.
As such, one high-level algorithm would be to collect multiple examples of the same page type and deduce parsing (or extraction) rules for the parts of the page that differ between them (these are likely to be the main content). The intuition is that common boilerplate (header, footer, etc.) and ads will appear on multiple examples of those web pages, so by training on a few of them, you can quickly start to reliably identify this boilerplate and additional code and subsequently ignore it. It's not foolproof, but this is also the basis of web scraping technologies, both commercial and academic, like RoadRunner:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8672&rep=rep1&type=pdf
The citation is:
There's also a well-cited survey of extraction technologies:
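
To make the boilerplate intuition described earlier a bit more concrete, here is a minimal sketch in Python. It is not RoadRunner's algorithm (which infers a wrapper by aligning pages); it is a much cruder stand-in that treats any text block appearing on most sample pages as boilerplate and keeps the rest. The names `text_blocks`, `strip_shared_blocks`, and the `max_share` threshold are hypothetical, chosen only for illustration, and the sketch assumes the pages come from the same template and that boilerplate text repeats verbatim.

```python
from collections import Counter
from html.parser import HTMLParser


class _TextBlocks(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the page's text chunks in document order (hypothetical helper)."""
    parser = _TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def strip_shared_blocks(pages, max_share=0.5):
    """For each page, keep only blocks seen on at most max_share of all pages.

    Assumption: text that repeats across many pages of the same type is
    boilerplate (navigation, footer, ads); text unique to a page is content.
    """
    per_page = [text_blocks(page) for page in pages]
    doc_freq = Counter()
    for blocks in per_page:
        doc_freq.update(set(blocks))  # count each block once per page
    limit = max_share * len(pages)
    return [[b for b in blocks if doc_freq[b] <= limit] for blocks in per_page]
```

Calling `strip_shared_blocks` on a handful of article pages from one site should leave mostly the per-article text. Exact string matching is a deliberate simplification: boilerplate that varies slightly from page to page (dates, "logged in as ..." banners) would slip through, which is one reason the real systems learn structural rules instead.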