Extracting meaningful, complete content from a web page

Posted on 2024-07-14 01:05:26

I am doing some analysis by mining web content using my crawlers. Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content.

As I understand it, extracting the meaningful content is a hard problem, because no standard defines where a news story, blog post, forum comment, or article actually sits within a web page.

I could find some open source solutions like this: https://metacpan.org/pod/HTML::ContentExtractor

But I am curious whether anyone has dealt with this and achieved a reasonable success rate. It seems a fairly common problem, and I would like to believe many experts are out there. I would prefer a Java-based solution, but that is not a hard rule. Please share your thoughts. I would deeply appreciate it.

Comments (2)

開玄 2024-07-21 01:05:26

Ideally, you would look for an RSS feed to get the raw content.
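As a minimal sketch of the feed approach, the snippet below reads the items out of an RSS 2.0 document using only the Python standard library. The sample feed, the `example.com` URLs, and the `parse_rss_items` helper are made up for illustration; a real crawler would first have to discover and download the site's feed, and many feeds carry only a summary rather than the full article.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; in practice it would be fetched from the site's feed URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>Body of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <description>Body of the second post.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(feed_xml):
    """Return one dict (title, link, description) per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


for entry in parse_rss_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```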

There is no standard for overall structure and meaning in HTML. Authors define different elements in their pages. Search engines have invested a lot in this area, and they have their own secret sauce for indexing content and getting some kind of meaning and structure out of it for search ranking.

Until we have the long-foretold "semantic web", we can only make educated guesses about the structure and meaning of arbitrary HTML pages.

But, in theory:

Look for heading tags. These should give you a clue for where to start reading, and hopefully an outline for the order of importance for the content.

Look for common element ids and classes. A well-structured site might have things like <div id="content"> and <div class="article">, which is about as semantic as it gets these days. Also get to know the standard element names used by common CMS platforms like WordPress ("post") or Drupal ("node"). Often these will be used to mark up the content.

Last but not least, look for microformats.
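The heading and id/class heuristics above can be sketched roughly as follows, using only Python's standard `html.parser`. The hint set, the `ContentGuesser` class, the `guess_content` helper, and the sample page are all assumptions made for illustration. This is an educated guess rather than a robust extractor: it assumes reasonably well-formed markup, and unclosed tags such as a bare `<p>` will throw off the nesting count.

```python
from html.parser import HTMLParser

# Assumed hints, taken from the id/class names mentioned above; real sites vary.
CONTENT_HINTS = {"content", "article", "post", "node"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class ContentGuesser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []    # (level, text) pairs in document order
        self.content = []     # text chunks found inside hinted containers
        self._depth = 0       # nesting depth inside a hinted container (0 = outside)
        self._skip = 0        # nesting depth inside <script>/<style>
        self._heading = None  # [level, chunks] while inside a heading tag

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self._skip += 1
        if self._depth:
            self._depth += 1
        else:
            attr = dict(attrs)
            names = set((attr.get("class") or "").split())
            names.add(attr.get("id") or "")
            if names & CONTENT_HINTS:
                self._depth = 1  # entered a likely content container
        if tag in HEADING_TAGS:
            self._heading = [int(tag[1]), []]

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        if self._depth:
            self._depth -= 1
        if tag in HEADING_TAGS and self._heading is not None:
            level, chunks = self._heading
            self.headings.append((level, " ".join(chunks)))
            self._heading = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or self._skip:
            return
        if self._heading is not None:
            self._heading[1].append(text)
        if self._depth:
            self.content.append(text)


def guess_content(html):
    """Return (headings, text) where text comes only from hinted containers."""
    parser = ContentGuesser()
    parser.feed(html)
    parser.close()
    return parser.headings, " ".join(parser.content)


# Hypothetical page: clutter in the sidebar and footer, the article in #content.
SAMPLE_PAGE = """
<html><body>
  <div id="sidebar"><a href="/ads">Buy now!</a></div>
  <div id="content">
    <h1>Main story</h1>
    <p>First paragraph.<br>Second line.</p>
    <script>var tracking = 1;</script>
  </div>
  <div id="footer">Copyright notice</div>
</body></html>
"""

headings, text = guess_content(SAMPLE_PAGE)
print(headings)
print(text)
```

Microformat class names could be added to `CONTENT_HINTS` in the same way, for example hAtom's `entry-content`, on the assumption that the target sites actually use them.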

街道布景 2024-07-21 01:05:26

There are now a number of projects with this task as their primary goal.

The npm package WCE (JavaScript) is interesting because it uses a number of other content-extraction modules under the hood.

Sorry I meant to reply to this question earlier but I was busy.
