Heuristics for finding the main content of a webpage

Posted 2024-10-17 19:11:41 · 283 characters · 4 views · 0 comments

Wondering if anybody could point me in the direction of academic papers or related implementations of heuristic approaches to finding the real meat content of a particular webpage.

Obviously this is not a trivial task, since the problem description is so vague, but I think that we all have a general understanding about what is meant by the primary content of a page.

For example, it may include the story text for a news article, but might not include any navigational elements, legal disclaimers, related story teasers, comments, etc. Article titles, dates, author names, and other metadata fall in the grey category.

I imagine that the application value of such an approach is large, and would expect Google to be using it in some way in their search algorithm, so it would appear to me that this subject has been treated by academics in the past.

Any references?


Comments (1)

嘿嘿嘿 2024-10-24 19:11:41

One way to look at this would be as an information extraction problem.

As such, one high-level algorithm would be to collect multiple examples of the same page type and deduce parsing (or extraction) rules for the parts of the page which are different (this is likely to be the main topic). The intuition is that common boilerplate (header, footer, etc) and ads will eventually appear on multiple examples of those web pages, so by training on a few of them, you can quickly start to reliably identify this boilerplate/additional code and subsequently ignore it. It's not foolproof, but this is also the basis of web scraping technologies, both commercial and academic, like RoadRunner:
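The cross-page intuition above can be sketched in a few lines. This is a minimal illustration, not RoadRunner itself: it compares plain text lines across several example pages of the same type and treats any line that recurs on most of them as boilerplate, keeping the rest as candidate main content. The function names and the 50% threshold are my own choices for the sketch.

```python
from collections import Counter


def split_lines(page: str) -> list[str]:
    """Split a page's text into stripped, non-empty lines."""
    return [ln.strip() for ln in page.splitlines() if ln.strip()]


def extract_main_content(pages: list[str], target: str,
                         threshold: float = 0.5) -> list[str]:
    """Return the lines of `target` that look like unique content.

    A line is treated as boilerplate if it appears on more than
    `threshold` of the example pages (headers, footers, nav, legal
    disclaimers tend to repeat; the story text does not).
    """
    counts: Counter[str] = Counter()
    for page in pages:
        counts.update(set(split_lines(page)))  # count each line once per page
    cutoff = threshold * len(pages)
    return [ln for ln in split_lines(target) if counts[ln] <= cutoff]


# Three hypothetical pages sharing the same template:
pages = [
    "Site Nav\nStory A text\nFooter legal",
    "Site Nav\nStory B text\nFooter legal",
    "Site Nav\nStory C text\nFooter legal",
]
print(extract_main_content(pages, pages[0]))  # → ['Story A text']
```

Real systems work on the DOM tree rather than raw text lines and infer structured extraction rules, but the frequency-based filtering idea is the same.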

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8672&rep=rep1&type=pdf

The citation is:

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB 2001: 109-118

There's also a well-cited survey of extraction technologies:

Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira: A brief survey of web data extraction tools. ACM SIGMOD Record 31(2), June 2002. doi:10.1145/565117.565137
