What is the state of the art in HTML content extraction?
There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice?
Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for.
Postscript the first: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever) that discusses both criteria from the scholarly literature, and a number of existing implementations, and analyses how unsuccessful the implementations are from the viewpoint of the criteria. And, really, a post to a mailing list would work for me too.
Postscript the second: To be clear, after Peter Rowell's answer, which I have accepted, we can see that this question leads to two subquestions: (i) the solved problem of cleaning up non-conformant HTML, for which Beautiful Soup is the most recommended solution, and (ii) the unsolved problem of separating cruft (mostly site-added boilerplate and promotional material) from meat (the content that the kind of people who think the page might be interesting in fact find relevant). To address the state of the art, new answers need to address the cruft-from-meat problem explicitly.
Extraction can mean different things to different people. It's one thing to be able to deal with all of the mangled HTML out there, and Beautiful Soup is a clear winner in this department. But BS won't tell you what is cruft and what is meat.
Things look different (and ugly) when considering content extraction from the point of view of a computational linguist. When analyzing a page I'm interested only in the specific content of the page, minus all of the navigation/advertising/etc. cruft. And you can't begin to do the interesting stuff -- co-occurrence analysis, phrase discovery, weighted attribute vector generation, etc. -- until you have gotten rid of the cruft.
The first paper referenced by the OP indicates that this was what they were trying to achieve -- analyze a site, determine the overall structure, then subtract that out and Voila! you have just the meat -- but they found it was harder than they thought. They were approaching the problem from an improved accessibility angle, whereas I was an early search engine guy, but we both came to the same conclusion:
Separating cruft from meat is hard. And (to read between the lines of your question) even once the cruft is removed, without carefully applied semantic markup it is extremely difficult to determine 'author intent' of the article. Getting the meat out of a site like citeseer (cleanly & predictably laid out with a very high Signal-to-Noise Ratio) is 2 or 3 orders of magnitude easier than dealing with random web content.
BTW, if you're dealing with longer documents you might be particularly interested in work done by Marti Hearst (now a prof at UC Berkeley). Her PhD thesis and other papers on doing subtopic discovery in large documents gave me a lot of insight into doing something similar in smaller documents (which, surprisingly, can be more difficult to deal with). But you can only do this after you get rid of the cruft.
For the few who might be interested, here's some backstory (probably Off Topic, but I'm in that kind of mood tonight):
In the 80's and 90's our customers were mostly government agencies whose eyes were bigger than their budgets and whose dreams made Disneyland look drab. They were collecting everything they could get their hands on and then went looking for a silver bullet technology that would somehow ( giant hand wave ) extract the 'meaning' of the document. Right. They found us because we were this weird little company doing "content similarity searching" in 1986. We gave them a couple of demos (real, not faked) which freaked them out.
One of the things we already knew (and it took a long time for them to believe us) was that every collection is different and needs its own special scanner to deal with those differences. For example, if all you're doing is munching straight newspaper stories, life is pretty easy. The headline mostly tells you something interesting, and the story is written in pyramid style - the first paragraph or two has the meat of who/what/where/when, and then following paras expand on that. Like I said, this is the easy stuff.
How about magazine articles? Oh God, don't get me started! The titles are almost always meaningless and the structure varies from one mag to the next, and even from one section of a mag to the next. Pick up a copy of Wired and a copy of Atlantic Monthly. Look at a major article and try to figure out a meaningful 1 paragraph summary of what the article is about. Now try to describe how a program would accomplish the same thing. Does the same set of rules apply across all articles? Even articles from the same magazine? No, they don't.
Sorry to sound like a curmudgeon on this, but this problem is genuinely hard.
Strangely enough, a big reason for google being as successful as it is (from a search engine perspective) is that they place a lot of weight on the words in and surrounding a link from another site. That link-text represents a sort of mini-summary done by a human of the site/page it's linking to, exactly what you want when you are searching. And it works across nearly all genre/layout styles of information. It's a positively brilliant insight and I wish I had had it myself. But it wouldn't have done my customers any good because there were no links from last night's Moscow TV listings to some random teletype message they had captured, or to some badly OCR'd version of an Egyptian newspaper.
/mini-rant-and-trip-down-memory-lane
One word: boilerpipe.
For the news domain, on a representative corpus, we're now at 98% / 99% extraction accuracy (avg/median).
Also quite language independent (today, I've learned it works for Nepali, too).
Disclaimer: I am the author of this work.
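For anyone who wants to try it from Python rather than Java, a minimal sketch along these lines should work, assuming the third-party `boilerpipe` wrapper package from PyPI (which drives the Java library through JPype); the wrapper's exact class and argument names may differ from the version you install:

```python
# Sketch only: assumes the Python wrapper for the Java boilerpipe library
# (pip install boilerpipe), which needs a JVM available via JPype.
from boilerpipe.extract import Extractor

# 'ArticleExtractor' is the extractor tuned for news-style pages.
extractor = Extractor(extractor='ArticleExtractor',
                      url='http://example.com/some-news-article')

print(extractor.getText())  # main article text with boilerplate stripped
```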
Have you seen boilerpipe? Found it mentioned in a similar question.
I have come across http://www.keyvan.net/2010/08/php-readability/
There are a few open source tools available that do similar article extraction tasks.
https://github.com/jiminoc/goose which was open sourced by Gravity.com
It has info on the wiki as well as the source you can view. There are dozens of unit tests that show the text extracted from various articles.
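If you would rather call it from Python than from the Scala/Java code base linked above, a port exists; here is a rough sketch assuming the `goose3` package (a Python port of Gravity's Goose) with its default configuration -- the original project has its own API:

```python
# Sketch only: goose3 is a Python port of Gravity's Goose article extractor.
from goose3 import Goose

g = Goose()
article = g.extract(url='http://example.com/some-news-article')

print(article.title)         # extracted headline
print(article.cleaned_text)  # main article body with boilerplate removed
```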
I've worked with Peter Rowell down through the years on a wide variety of information retrieval projects, many of which involved very difficult text extraction from a diversity of markup sources.
Currently I'm focused on knowledge extraction from "firehose" sources such as Google, including their RSS pipes that vacuum up huge amounts of local, regional, national and international news articles. In many cases titles are rich and meaningful, but are only "hooks" used to draw traffic to a Web site where the actual article is a meaningless paragraph. This appears to be a sort of "spam in reverse" designed to boost traffic ratings.
To rank articles even with the simplest metric of article length you have to be able to extract content from the markup. The exotic markup and scripting that dominates Web content these days breaks most open source parsing packages such as Beautiful Soup when applied to large volumes characteristic of Google and similar sources. I've found that 30% or more of mined articles break these packages as a rule of thumb. This has caused us to refocus on developing very low level, intelligent, character based parsers to separate the raw text from the markup and scripting. The more fine grained your parsing (i.e. partitioning of content) the more intelligent (and hand made) your tools must be. To make things even more interesting, you have a moving target as web authoring continues to morph and change with the development of new scripting approaches, markup, and language extensions. This tends to favor service based information delivery as opposed to "shrink wrapped" applications.
Looking back over the years there appears to have been very few scholarly papers written about the low level mechanics (i.e. the "practice of the former" you refer to) of such extraction, probably because it's so domain and content specific.
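To make the character-level idea described above concrete, here is a deliberately naive Python sketch of that style of pass over raw markup. It is only an illustration of the approach, not the hand-tuned parsers the answer refers to, and it will mishandle comments, CDATA sections, attributes containing '>', and plenty of real-world pages:

```python
def strip_markup(raw):
    """Toy character-level pass: drop tags and skip the contents of
    script/style blocks, keeping everything else as text."""
    out = []
    i, n = 0, len(raw)
    skipping = None  # name of the element whose contents we are skipping
    while i < n:
        if raw[i] == '<':
            end = raw.find('>', i)
            if end == -1:                 # unterminated tag: keep the rest as text
                out.append(raw[i:])
                break
            tag = raw[i + 1:end].strip().lower()
            parts = tag.lstrip('/').split()
            name = parts[0] if parts else ''
            if skipping:
                if tag.startswith('/') and name == skipping:
                    skipping = None       # reached </script> or </style>
            elif name in ('script', 'style') and not tag.startswith('/'):
                skipping = name
            i = end + 1
        else:
            if not skipping:
                out.append(raw[i])
            i += 1
    return ''.join(out)

print(strip_markup('<p>meat</p><script>var cruft = 1;</script>'))  # -> meat
```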
Beautiful Soup is a robust HTML parser written in Python.
It gracefully handles HTML with bad markup and is also well-engineered as a Python library, supporting generators for iteration and search, dot-notation for child access (e.g., access `<foo><bar/></foo>` using `doc.foo.bar`), and seamless Unicode.
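A small sketch of the behaviour described above, using the current bs4 package (the answer predates bs4, but the dot-notation access works the same way); the markup here is made up purely for illustration:

```python
# Sketch only: bs4 is the current incarnation of Beautiful Soup.
from bs4 import BeautifulSoup

broken_html = '<foo><bar>some meat<p>unclosed tags are fine'
doc = BeautifulSoup(broken_html, 'html.parser')

print(doc.foo.bar)     # first <bar> inside the first <foo>
print(doc.get_text())  # all text with the markup stripped
```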
If you are out to extract content from pages that heavily utilize javascript, selenium remote control can do the job. It works for more than just testing. The main downside of doing this is that you'll end up using a lot more resources. The upside is you'll get a much more accurate data feed from rich pages/apps.
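A rough sketch of that approach with today's Selenium (the WebDriver API has since replaced the old Remote Control interface mentioned above); it assumes a local Chrome/chromedriver install, and the rendered HTML could then be handed to Beautiful Soup or any of the extractors above:

```python
# Sketch only: modern Selenium WebDriver in place of the old Remote Control.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')      # no visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get('http://example.com/some-js-heavy-page')
    rendered_html = driver.page_source  # DOM after JavaScript has run
finally:
    driver.quit()

print(len(rendered_html))
```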