How to extract only the main textual content from an HTML page?

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard these short texts.

So if an API can do this, i.e. return the different textual parts/blocks split from one another in some manner rather than merged into a single text (all in only one text is not useful), please report it.


The Question

I downloaded some pages from random sites, and now I want to analyze the textual content of those pages.

The problem is that a web page has a lot of content like menus, advertising, banners, etc.

I want to try to exclude everything that is not related to the content of the page.

Taking this page as an example, I don't want the menus at the top, nor the links in the footer.

Important: all pages are HTML and come from various different sites. I need suggestions on how to exclude this content.

At the moment, I am thinking of excluding content inside "menu" and "banner" classes in the HTML, as well as consecutive words that look like a proper name (first letter capitalized).

The solution can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).

Edit: I want to do this inside my Java code, not with an external application (if that is possible).

I tried an approach to parsing the HTML content described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
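
As a rough sketch of the class-exclusion idea above, here is what that could look like with Jsoup; the selectors are only illustrative assumptions, since class names differ from site to site:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.parse(html); // "html" holds the downloaded page source
// drop elements whose class suggests navigation or advertising
// (these class names are just examples; real sites vary widely)
doc.select(".menu, .banner, nav, header, footer").remove();
String text = doc.body().text(); // the remaining visible text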

我的奇迹 2024-12-05 21:58:35

Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

There are a few ways to feed HTML into Boilerpipe and extract the text.

You can use a URL:

ArticleExtractor.INSTANCE.getText(url);

You can use a String:

ArticleExtractor.INSTANCE.getText(myHtml);

You can also use a Reader, which opens up a large number of options.
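
For instance, a minimal self-contained version of the URL variant could look like this (the URL is just a placeholder):

import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        // any article page will do; this URL is only a placeholder
        URL url = new URL("http://example.com/some-article.html");
        String text = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(text);
    }
}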

梦回梦里 2024-12-05 21:58:35

You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).

Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:

import java.io.Reader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ... // e.g. a StringReader or FileReader over your HTML
InputSource is = new InputSource(reader);

// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();

// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);

// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}

TextBlock has some more exciting methods, feel free to play around!
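
For the update in the question (keeping short teaser/link descriptions instead of discarding everything that isn't the main article), the per-block statistics are handy. A sketch, reusing the "doc" from above; the thresholds are purely illustrative assumptions:

for (TextBlock block : doc.getTextBlocks()) {
    // keep full-text content, but also keep short blocks that are
    // not dominated by anchor text (thresholds are made up)
    if (block.isContent()
            || (block.getNumWords() >= 5 && block.getLinkDensity() < 0.5)) {
        System.out.println("--- block ---");
        System.out.println(block.getText());
    }
}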

南街九尾狐 2024-12-05 21:58:35

There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.

So one can crudely classify web pages into three kinds with respect to Boilerpipe:

  1. a web page with a single article in it (Boilerpipe worthy!)
  2. a web page with multiple articles in it, such as the front page of The New York Times
  3. a web page that doesn't really have any article in it, but has some content around links, and may also have some degree of clutter

Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied to case #1. Case #2 is a problem, and case #3 is a problem as well - it might require an aggregate of related web pages to determine what is clutter and what isn't.
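
One rough way to attack the "which bucket is this page?" question is to run boilerpipe's segmentation anyway and look at the block statistics. This is only a heuristic sketch, not an established technique, and the thresholds are invented for illustration:

import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;

// crude page-type guess from boilerpipe's blocks; thresholds are invented
static String guessPageType(TextDocument doc) {
    int contentBlocks = 0;
    int totalWords = 0;
    for (TextBlock block : doc.getTextBlocks()) {
        if (block.isContent()) {
            contentBlocks++;
            totalWords += block.getNumWords();
        }
    }
    if (contentBlocks == 0) {
        return "no article (case #3)";
    }
    if (contentBlocks <= 3 && totalWords > 200) {
        return "single article (case #1)"; // one dominant body of text
    }
    return "multiple articles / front page (case #2)";
}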

无名指的心愿 2024-12-05 21:58:35

You can use libraries like goose. It works best on articles/news.
You can also look at the readability bookmarklet, JavaScript code that does extraction similar to goose.

救星 2024-12-05 21:58:35

My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.

On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the MIN_WORDS_SEQUENCE value or be more selective with your selectors (i.e. not retrieve div elements)?
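
A small sketch of that selector-based route (the "p" selector is just the example above; which selectors make sense depends on the site):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(html); // "html" is the downloaded page source
Elements posts = doc.select("p"); // paragraph elements only
for (Element p : posts) {
    System.out.println(p.text()); // text of each paragraph, tags stripped
}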

别在捏我脸啦 2024-12-05 21:58:35

http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php

Proprietary software, but it makes it very easy to extract from web pages and integrates well with Java.

You use a provided application to design XML files that are read by the RoboServer API to parse web pages. The XML files are built by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and database integration using the provided Java API.

If you're against using such software and would rather do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate the tags and then build rules per site.

绝對不後悔。 2024-12-05 21:58:35

You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:

Tag Soup

HTML Unit
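
As a minimal sketch of the HtmlUnit route (API names as in the classic HtmlUnit 2.x releases; the URL is a placeholder), note that this yields all visible text, menus included, so some filtering would still be needed:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient webClient = new WebClient()) {
    webClient.getOptions().setJavaScriptEnabled(false); // static HTML is enough here
    HtmlPage page = webClient.getPage("http://example.com/some-article.html");
    System.out.println(page.asText()); // all visible text, clutter included
}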

执妄 2024-12-05 21:58:35

You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below for filtering the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/

雪花飘飘的天空 2024-12-05 21:58:35

You could use the textracto API; it extracts the main "article" text, and there is also the option to extract all other textual content. By "subtracting" these texts you could split the navigation texts, preview texts, etc. from the main textual content.
