Entity extraction/recognition with free tools while feeding a Lucene index
I am currently looking into options for extracting person names, locations, technical terms and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is added as metadata and should increase the precision of the search.
For example, when someone queries "wicket", he should be able to decide whether he means the cricket sport or the Apache project. I have tried to implement this myself with minor success so far. Now I have found a lot of tools, but I am not sure whether they are suited to this task, which of them integrate well with Lucene, and whether the precision of the entity extraction is high enough. (A rough sketch of the Lucene side I have in mind follows my questions below.)
- DBpedia Spotlight, the demo looks very promising
- OpenNLP requires training. Which training data to use?
- OpenNLP tools
- Stanbol
- NLTK
- balie
- UIMA
- GATE -> example code
- Apache Mahout
- Stanford CRF-NER
- maui-indexer (http://code.google.com/p/maui-indexer)
- Mallet
- Illinois Named Entity Tagger, not open source but free
- wikipedianer data
My questions:
- Does anyone have experience with some of the tools listed above and their precision/recall? Or whether training data is required and available?
- Are there articles or tutorials that can get me started with entity extraction (NER) for each tool?
- How do they integrate with Lucene?
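To make the last point concrete, here is a rough sketch of how I imagine attaching the extracted metadata on the Lucene side. Field names such as `category` and `entity_org` are just illustrative placeholders, and it assumes a recent Lucene API (roughly 5.x or later); the extraction step in front of this is what I am looking for:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class EntityIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("./entity-index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // Article text plus whatever the NER/disambiguation step produced.
        String text = "Apache Wicket is a component-based web framework for Java.";
        Document doc = new Document();
        doc.add(new TextField("content", text, Field.Store.YES));
        // Entity metadata goes into separate, exactly-matchable fields so a
        // query for "wicket" can later be restricted to e.g. category:software.
        doc.add(new StringField("category", "software", Field.Store.YES));
        doc.add(new StringField("entity_org", "Apache Software Foundation", Field.Store.YES));

        writer.addDocument(doc);
        writer.close();
    }
}
```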
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both types would fall outside the typically recognized types: person, organization, location).
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I provide more explanation and mention several tools for disambiguation, including:
- Extractiv (my company)
These tools often use a language-independent API such as REST, and I do not know that they provide Lucene support directly, but I hope my answer has been useful for the problem you are trying to solve.
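As a concrete example of that REST-style integration, here is a rough sketch of calling the public DBpedia Spotlight annotation service from Java. The endpoint URL and the `confidence` parameter reflect the public demo service and may have changed, so treat them as assumptions and check the current Spotlight documentation:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SpotlightClient {
    public static void main(String[] args) throws Exception {
        String text = "Wicket is a component-based web framework for the Java platform.";
        // Endpoint and parameters follow the public Spotlight demo service;
        // the hosted service has moved over time, so verify before relying on it.
        String endpoint = "https://api.dbpedia-spotlight.org/en/annotate"
                + "?text=" + URLEncoder.encode(text, "UTF-8")
                + "&confidence=0.4";

        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestProperty("Accept", "application/json");

        // The JSON response lists candidate DBpedia resources (e.g. Apache_Wicket)
        // with surface forms, offsets and similarity scores.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```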
You can use OpenNLP to extract names of people, places and organisations without training. You just use pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/
For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
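For what it's worth, here is a minimal sketch of using one of those pre-trained models with the OpenNLP 1.5+ API. It assumes the person-name model `en-ner-person.bin` has been downloaded to the working directory:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class OpenNlpNerExample {
    public static void main(String[] args) throws Exception {
        // Pre-trained person-name model from the models-1.5 download page.
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            String[] tokens = SimpleTokenizer.INSTANCE
                    .tokenize("Pierre Vinken joined the board as a nonexecutive director.");
            Span[] spans = finder.find(tokens);

            // Convert the detected spans back into the matched strings.
            for (String name : Span.spansToStrings(spans, tokens)) {
                System.out.println(name);
            }
            finder.clearAdaptiveData(); // reset adaptive data between documents
        }
    }
}
```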
Rosoka is a commercial product that provides a "salience" computation, which measures the importance of a term or entity to the document. Salience is based on linguistic usage rather than frequency. Using the salience values, you can determine the primary topic of the document as a whole.
The output is in your choice of XML or JSON, which makes it very easy to use with Lucene.
It is written in Java.
There is an Amazon Cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0. The cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features that the full Rosoka does.
Yes, both versions perform entity and term disambiguation based on linguistic usage.
Disambiguation, whether done by a human or by software, requires enough contextual information to determine the difference. The context may be contained within the document, within a corpus constraint, or within the context of the user. The former is more specific; the latter has greater potential ambiguity. For example, typing the keyword "wicket" into a Google search could refer to cricket, the Apache software, or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a place name, etc.
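On the Lucene side, once the extracted entities or categories sit in dedicated index fields, the "wicket" ambiguity can be handled at query time with a simple filtered query. A rough sketch, with illustrative field names that assume an indexing layout like the one described in the question and a recent Lucene API:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class EntityFilteredSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher(
                DirectoryReader.open(FSDirectory.open(Paths.get("./entity-index"))));

        // "wicket" in the full text, restricted to documents whose extracted
        // category metadata marks them as being about software.
        BooleanQuery query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("content", "wicket")), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("category", "software")), BooleanClause.Occur.MUST)
                .build();

        TopDocs hits = searcher.search(query, 10);
        System.out.println("matching documents: " + hits.totalHits);
    }
}
```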
Lately I have been fiddling with Stanford CRF NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml
The good thing is that you can train your own classifier. You should follow the link with the guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a
Unfortunately, in my case, the named entities are not extracted effectively from the documents. Most of the entities go undetected.
Just in case you find it useful.
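In case it helps anyone trying it out, here is a minimal sketch against the Stanford NER API using one of the pre-trained serialized classifiers that ship with the download (the classifier path below is an assumption about where it was unpacked):

```java
import java.util.List;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;

public class StanfordNerExample {
    public static void main(String[] args) {
        // One of the serialized 3-class classifiers bundled with the download.
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(
                "classifiers/english.all.3class.distsim.crf.ser.gz");

        String text = "Shane Warne took his 700th wicket at the Melbourne Cricket Ground.";

        // Each triple is (entity type, begin offset, end offset) into the text.
        List<Triple<String, Integer, Integer>> spans =
                classifier.classifyToCharacterOffsets(text);
        for (Triple<String, Integer, Integer> span : spans) {
            System.out.println(span.first() + ": "
                    + text.substring(span.second(), span.third()));
        }
    }
}
```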