I'm trying to do entity extraction (more like matching) in Lucene. Here is a sample workflow:
Given some text (from a URL) and a list of people's names, try to extract the names of people from the text.
Note:
Names of people are not completely normalized, e.g. some are Mr. X, Mrs. Y and some are just John Doe, X and Y.
Other prefixes and suffixes to think about are Jr., Sr., Dr., I, II, etc. (don't get me started on non-US names).
I am using Lucene MemoryIndex to create an in-memory index of the text from each URL (stripping HTML tags) and am using StandardAnalyzer to query for the list of all names, one at a time (100k names; is there any other way to do this?). On average this takes about 8 seconds on the typical text I have.
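One possible alternative to issuing 100k separate queries, sketched here in plain Python rather than Lucene: tokenize the text once, store every known name as a tuple of lowercase tokens in a dictionary, and slide a window of each occurring name length over the text. The `tokenize`, `build_name_index`, and `find_names` helpers and the sample names are assumptions made for illustration, not part of the setup described above.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude stand-in for an analyzer)."""
    return re.findall(r"\w+", text.lower())

def build_name_index(names):
    """Map each name's token tuple to the original name and collect the tuple lengths."""
    index = {}
    for name in names:
        key = tuple(tokenize(name))
        if key:
            index.setdefault(key, name)
    lengths = sorted({len(key) for key in index}, reverse=True)
    return index, lengths

def find_names(text, index, lengths):
    """Return listed names that occur in the text as exact token sequences, in order of first appearance."""
    tokens = tokenize(text)
    found = []
    for start in range(len(tokens)):
        for n in lengths:
            window = tuple(tokens[start:start + n])
            # A short window near the end of the text cannot match a name of length n.
            if len(window) == n and window in index:
                name = index[window]
                if name not in found:
                    found.append(name)
    return found

# Hypothetical sample data.
names = ["John Doe", "Jane Roe", "John Edward"]
index, lengths = build_name_index(names)
text = "Yesterday Mr. John Doe met Jane Roe in Boston."
matches = find_names(text, index, lengths)
print(matches)
```

The work here grows with the number of tokens in the text times the number of distinct name lengths, not with the size of the name list, and because matches are exact token sequences there is no score threshold to tune. It does not handle the title and suffix variations mentioned in the note.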
A major problem is that, to eliminate noise, I'm using a score of 0.01 as a base score. Queries like "Mr. John Doe" get a significantly lower score than "John Doe" if the text contains "John Doe", and in many cases they miss the 0.01 threshold.
The other problem is that if I normalize all names and start removing all occurrences of Dr., Mr., Mrs., etc., then I start missing good matches like "Dr. John Edward II" and end up with a lot of junk matches like "Mr. John Edward".
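One way to soften this trade-off, sketched under assumptions: instead of deleting titles from the names themselves, derive a separate matching key for each name that drops leading honorifics and trailing generational suffixes, and keep the original string alongside it. The `HONORIFICS` and `SUFFIXES` sets and the `match_key` helper are hypothetical and far from complete.

```python
import re

# Hypothetical, incomplete lists; real data would need many more entries.
HONORIFICS = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "i", "ii", "iii", "iv"}

def match_key(name):
    """Build a lowercase key without leading honorifics or trailing suffixes."""
    tokens = re.findall(r"\w+", name.lower())
    # Always keep at least one token so a name never collapses to an empty key.
    while len(tokens) > 1 and tokens[0] in HONORIFICS:
        tokens.pop(0)
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

# Group the original names by key so the full form is still available after a match.
names = ["Mr. John Doe", "Dr. John Edward II", "Jane Roe Jr."]
by_key = {}
for name in names:
    by_key.setdefault(match_key(name), []).append(name)

# A bare mention in the text resolves to the listed name that carries a title.
candidates = by_key.get(match_key("John Doe"), [])
print(candidates)
```

Note that the key deliberately conflates "John Edward" with "Dr. John Edward II"; because the original strings are retained, a later step can still check whether the title or suffix actually appears next to the match in the text.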
I understand that Lucene might not be the right tool for the job either, but so far it hasn't proved to be too bad. Any help is appreciated.
Comments (5)
NEE (named entity extraction) is an NLP task that is not part of Lucene. For open source, you can look at LingPipe, GATE, and OpenNLP. There are various commercial alternatives.
GATE is entirely rule-based, and will be hard to use for high precision. You'll need a statistical engine for that; LingPipe has one, but you have to supply the training data. I'm not up to date on what OpenNLP offers in this area.
Disambiguation of human names is notoriously difficult. If you have other information, such as locations or co-occurrence of names, it will be valuable. But a lot of work is still going into author disambiguation, and it cannot normally be solved from a list of names alone.
Here is a typical project: http://code.google.com/p/bibapp/wiki/AuthorAuthorities. And a typical publication: http://www.springerlink.com/content/lk07h1m311t130w4/.
Here is a project on record deduplication which we find useful for author disambiguation: http://datamining.anu.edu.au/projects/linkage.html
These projects could be useful for you:
http://nlp.stanford.edu/ner/index.shtml
http://cogcomp.cs.illinois.edu/page/software_view/4
OpenNLP is useful: http://opennlp.apache.org/
The site has documentation and examples.
For the completely uninitiated, the book Taming Text (http://www.manning.com/ingersoll/) provides a good overview. You can also download the book's source code from the above link.
You can try this:
http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
The documentation is clear. You can also use the DBpedia Spotlight web service:
http://spotlight.dbpedia.org/rest/spot/?text=
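As a small illustration of how the text would be passed to that URL, the sketch below only builds a request URL by percent-encoding the text and appending it to the endpoint quoted above. The `spot_url` helper is hypothetical, no request is sent, and whether the endpoint is still available or what it returns is not verified here.

```python
from urllib.parse import quote

# Endpoint exactly as quoted above; its current availability is an assumption.
SPOT_ENDPOINT = "http://spotlight.dbpedia.org/rest/spot/?text="

def spot_url(text):
    """Percent-encode the text and append it as the value of the `text` parameter."""
    return SPOT_ENDPOINT + quote(text, safe="")

url = spot_url("Dr. John Doe met Jane Roe")
print(url)
```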