Intelligently extracting tags from blogs and other web pages
I'm not talking about HTML tags, but the tags used to describe blog posts, YouTube videos, or questions on this site.
If I were crawling just a single website, I'd just use an XPath to extract the tags, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with an id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
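Something like this minimal sketch is what I have in mind (assuming BeautifulSoup and requests; the 'tag'-in-class/id rule is just the naive heuristic described above, not a tested solution):

```python
import re
import requests
from bs4 import BeautifulSoup

# Matches class/id values like "tag", "tags", "post-tags"
TAGGISH = re.compile(r'\btags?\b', re.I)

def extract_tags(url):
    """Brittle heuristic: collect link text from any element whose
    id or class mentions 'tag'."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
    candidates = soup.find_all(class_=TAGGISH) + soup.find_all(id=TAGGISH)
    tags = set()
    for el in candidates:
        for a in el.find_all('a'):
            text = a.get_text(strip=True)
            if text and len(text) < 40:  # crude filter against long nav links
                tags.add(text)
    return sorted(tags)
```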
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting that a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 readability tool. This tool is able to extract the article text for any given article on the web with surprising accuracy, using some sort of heuristic algorithm, I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Comments (8)
Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics. There is sufficient difference between the text content of a page and the surrounding ads/menus etc. Other examples include tools that scrape emails or addresses: there are patterns that can be detected, locations that can be recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text; it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
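To make the contrast concrete, here is a toy version of that text-to-markup scoring (a sketch using lxml, not arc90's actual algorithm): ranking nodes by how much text they hold per unit of markup finds the article body, but nothing comparable singles out a tag list.

```python
from lxml import html

def densest_block(page_html):
    """Toy readability-style scorer: rank block elements by how much
    text they contain relative to how much markup they contain."""
    doc = html.fromstring(page_html)
    best, best_score = None, 0.0
    for node in doc.iter('div', 'article', 'td'):
        text_len = len(node.text_content().strip())
        markup_len = len(list(node.iterdescendants())) + 1
        score = text_len / markup_len  # crude text/tag ratio
        if score > best_score:
            best, best_score = node, score
    return best
```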
Some blogs, like Tumblr, do have tags whose URLs contain the word "tagged", which you could use. WordPress similarly has ".../tag/..."-style URLs for tags. Solutions like this would work for a large number of blogs regardless of their individual page layout, but they won't work everywhere.
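A rough sketch of that URL heuristic (assuming BeautifulSoup; path fragments beyond /tag/ and /tagged/ are extra guesses on my part):

```python
import re
from bs4 import BeautifulSoup

# WordPress uses /tag/..., Tumblr uses /tagged/...; the rest are guesses
TAG_URL = re.compile(r'/(tag|tags|tagged|label)/', re.I)

def tags_from_links(page_html):
    """Collect anchor text from links whose URL follows a known
    tag-page convention."""
    soup = BeautifulSoup(page_html, 'html.parser')
    return sorted({a.get_text(strip=True)
                   for a in soup.find_all('a', href=TAG_URL)
                   if a.get_text(strip=True)})
```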
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
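For example, with feedparser, which normalizes RSS and Atom <category> elements into a tags list on each entry (a sketch; finding the feed URL for a given page is left out):

```python
import feedparser

def tags_from_feed(feed_url):
    """Map each entry's link to the tags declared in the feed;
    feedparser exposes <category> elements as entry.tags."""
    feed = feedparser.parse(feed_url)
    return {entry.get('link'): [t.term for t in entry.get('tags', [])]
            for entry in feed.entries}
```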
Another option is to parse each web page and look for tags formatted according to the rel=tag microformat.
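A sketch for the rel=tag case (assuming BeautifulSoup; note that per the microformat, the tag name is the last segment of the link's URL, not the anchor text):

```python
from bs4 import BeautifulSoup

def microformat_tags(page_html):
    """Extract rel-tag microformat tags: the tag value is defined as
    the final path segment of the tagged URL."""
    soup = BeautifulSoup(page_html, 'html.parser')
    return sorted({a['href'].rstrip('/').rsplit('/', 1)[-1]
                   for a in soup.find_all('a', rel='tag')
                   if a.get('href')})
```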
Damn, I was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for WordPress, then look at its link structure, and again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyAPI. They have similar entity-extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product which makes this a lot easier than writing XPaths, by using simple visual constraints to identify pieces of a web page.
This is impossible because there isn't a well-known, widely followed specification. Even different versions of the same engine can produce different output - hey, using WordPress a user can create his own markup.
If you're really interested in doing something like this, you should know it's going to be a really time-consuming, ongoing project: you're going to create a lib that detects which "engine" a page is using and parses it accordingly. If you can't detect a page for some reason, you create new rules to parse it and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this, since it's a complete scraping framework: well documented and really extensible.
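To illustrate the detect-and-dispatch idea (a sketch: the generator meta tag is a common WordPress fingerprint, and the per-engine parsers are hypothetical stubs you would grow rule by rule):

```python
from bs4 import BeautifulSoup

def parse_wordpress(soup):
    # Many WordPress themes link tags with rel="tag"
    return [a.get_text(strip=True) for a in soup.find_all('a', rel='tag')]

ENGINE_PARSERS = {'wordpress': parse_wordpress}  # add engines as you meet them

def detect_engine(soup):
    """Fingerprint the publishing engine, e.g. via the generator meta tag."""
    meta = soup.find('meta', attrs={'name': 'generator'})
    if meta and 'wordpress' in meta.get('content', '').lower():
        return 'wordpress'
    return None

def extract_tags(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    parser = ENGINE_PARSERS.get(detect_engine(soup))
    if parser is None:
        raise ValueError('unknown engine; write a new rule and move on')
    return parser(soup)
```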
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90, it seems they are also asking publishers to use semantically meaningful markup [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so they can parse it rather easily. But presumably they must either have developed generic rules, such as the tag/text ratios @dunelmtech suggested, which can work for article detection, or they might be using a combination of text-segmentation algorithms (from the Natural Language Processing field) such as TextTiling and C99, which could be quite useful for article detection - see http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and Google for more info on both [published in the academic literature - see Google Scholar].
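For what it's worth, NLTK ships a TextTiling implementation, so the segmentation step is easy to experiment with (a sketch; it expects paragraphs separated by blank lines and needs the stopwords corpus downloaded):

```python
import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download('stopwords', quiet=True)  # TextTiling uses the stopword list

def segment_page_text(raw_text):
    """Split extracted page text into topically coherent segments;
    input paragraphs must be separated by blank lines."""
    return TextTilingTokenizer().tokenize(raw_text)
```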
It seems, however, that detecting "tags" the way you want is a difficult problem (for the reasons already mentioned in the comments above). One approach I would try is to use one of the text-segmentation algorithms (C99 or TextTiling) to detect the article start/end, and then look for DIVs/SPANs/ULs with CLASS or ID attributes containing ..tag.. in them; since, in terms of page layout, tags tend to sit underneath the article and just above the comment feed, this might work surprisingly well.
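A rough sketch of the positional part of that idea (assuming BeautifulSoup, with a simple <article> lookup standing in for the real segmentation step): only the markup after the article body is searched for 'tag'-flavored containers.

```python
import re
from bs4 import BeautifulSoup

TAGGISH = re.compile(r'tag', re.I)

def tags_below_article(page_html, article_selector='article'):
    """Search only DIV/SPAN/UL blocks that follow the article body
    and whose class or id mentions 'tag'."""
    soup = BeautifulSoup(page_html, 'html.parser')
    article = soup.select_one(article_selector)  # stand-in for C99/TextTiling
    if article is None:
        return []
    tags = []
    for sib in article.find_all_next(['div', 'span', 'ul']):
        attrs = ' '.join(sib.get('class', [])) + ' ' + (sib.get('id') or '')
        if TAGGISH.search(attrs):
            tags.extend(a.get_text(strip=True) for a in sib.find_all('a'))
    return tags
```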
Anyway, it would be interesting to see whether you get anywhere with the tag detection.
Martin
EDIT: I just found something that might really be helpful. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html] and stands for Vision-Based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information such as navigation, advertisements, and decoration can easily be removed, because it is often placed in certain positions on a page. This could help you detect the tag block quite accurately!
There is a term-extractor module for Drupal (http://drupal.org/project/extractor), but it's only for Drupal 6.