Is there a tool that can isolate web page content?

Posted 2024-10-04 01:52:08

I'm working on a school project in which we would like to analyze the content of webpages. We don't, however, want to deal with things like nav bars and comments. If we were looking at a specific website, we could write a parser that filters out that sort of extraneous material for that site, but we are hoping to work on arbitrary sites that we may never have encountered before.

I feel like it's a bit much to hope for, so I won't be surprised if nothing like this exists already, but does anyone know of a tool that can do that sort of content isolation on arbitrary websites? I've had a bit of luck diffing pages with others from the same site, but it's imperfect and leaves comments and such.

I am working in Java, but would welcome anything open source in any language that I can use for ideas.

Comments (4)

铜锣湾横着走 2024-10-11 01:52:08

I'm a little late to this one (especially for a school project), but if anyone finds this at some future point, the following may be helpful.

I stumbled across a Java library to do exactly this. Performance, in my simple tests, is similar to Readability.

http://code.google.com/p/boilerpipe/

听闻余生 2024-10-11 01:52:08

You could try an unofficial API of arc90's Readability.

Basically, Readability extracts the content of a webpage and presents it to you as a nicely formatted article. Nav bars, comments, and all the other stuff that surrounds the content on a webpage are gone.

暮年 2024-10-11 01:52:08

I'm also a bit late to this conversation, but...

The Java Boilerpipe extractors are probably what you want (most likely ArticleSentencesExtractor), although there is at least one port of arc90's Readability to Java on GitHub.

If you want to build a poor man's boilerpipe, you might try diffing two pages from the same site. Assuming they use the same template, you will likely get an interesting result.

The main difference between boilerpipe, Readability, and a diff-based hack is that boilerpipe will strip out all HTML but preserve some structure.
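
For anyone curious what the diff-based hack might look like, here is a minimal sketch in plain Java. It is not a real diff algorithm and it is not how boilerpipe or Readability work: it simply treats every line that also appears in a sibling page as template and keeps the rest, with tags stripped by a crude regex. The class name `TemplateDiff`, its methods, and the sample pages are all made up for illustration, and it assumes both pages come from the same template and have roughly one element per line.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: keeps the text of lines found in one page but not in a sibling page. */
class TemplateDiff {

    /** Splits HTML into trimmed, non-empty lines. */
    static List<String> lines(String html) {
        List<String> out = new ArrayList<>();
        for (String line : html.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    /** Returns the lines of {@code page} that occur nowhere in {@code sibling}, with tags removed. */
    static List<String> uniqueText(String page, String sibling) {
        Set<String> template = new HashSet<>(lines(sibling));
        List<String> out = new ArrayList<>();
        for (String line : lines(page)) {
            if (template.contains(line)) {
                continue; // identical line in both pages: assume it is template
            }
            // Crude tag stripping; a real HTML parser would be more robust.
            String text = line.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two made-up pages that share a nav bar and a footer.
        String first = String.join("\n",
                "<html><body>",
                "<ul class=\"nav\"><li>Home</li><li>About</li></ul>",
                "<h1>First article</h1>",
                "<p>Body of the <b>first</b> article.</p>",
                "<div class=\"footer\">Copyright Example</div>",
                "</body></html>");
        String second = String.join("\n",
                "<html><body>",
                "<ul class=\"nav\"><li>Home</li><li>About</li></ul>",
                "<h1>Second article</h1>",
                "<p>Body of the second article.</p>",
                "<div class=\"footer\">Copyright Example</div>",
                "</body></html>");
        for (String line : uniqueText(first, second)) {
            System.out.println(line);
        }
    }
}
```

As the question already notes, this kind of comparison is imperfect: anything that differs between the two pages, such as reader comments, survives the filter along with the article text.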

杯别 2024-10-11 01:52:08

I doubt that anything exists that would do what you want. Without some sort of semantic markup it is next to impossible to distinguish "real" content from the other stuff. This is a task that requires real intelligence.

There are of course good tools for parsing HTML of varying degrees of correctness, and it is often possible to cobble together some pattern-based solution for dealing with pages on a particular site ... assuming that there are common structures / patterns to be elicited.
