How do I do word stemming or lemmatization?
I've tried PorterStemmer and Snowball, but neither works on all words; they miss some very common ones.
My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.
See also:
If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.
Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:
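For example, from a Python session:

    import nltk
    nltk.download('wordnet')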
You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:
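A minimal sketch (the output comments assume the standard WordNet corpus; verbs need an explicit POS hint, since the default is noun):

    from nltk.stem.wordnet import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()
    print(lmtzr.lemmatize('cats'))      # cat
    print(lmtzr.lemmatize('cacti'))     # cactus
    print(lmtzr.lemmatize('ran', 'v'))  # run (the 'v' marks it as a verb)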
There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.
I use Stanford NLP to perform lemmatization. I was stuck with a similar problem in the last few days. All thanks to stackoverflow for helping me solve the issue.

It also might be a good idea to use stopwords to minimize the output lemmas if they're used later in a classifier. Please take a look at the coreNlp extension written by John Conwell.
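If you'd rather drive the Stanford pipeline from Python than from Java, here is a minimal sketch using Stanford NLP's stanza package (an assumption on my part; the answer above likely used CoreNLP from Java directly):

    import stanza

    # One-time model download: stanza.download('en')
    nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')
    doc = nlp('cats running ran cactus cactuses cacti community communities')
    for word in doc.sentences[0].words:
        print(word.text, '->', word.lemma)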
I tried your list of terms on this snowball demo site and the results look okay.
A stemmer is supposed to turn inflected forms of words down to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.
I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.
The stemmer vs lemmatizer debate goes on. It's a matter of preferring precision over efficiency. You should lemmatize to achieve linguistically meaningful units, and stem to use minimal computing juice while still indexing a word and its variations under the same key.
See Stemmers vs Lemmatizers
Here's an example with Python NLTK:
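A minimal sketch contrasting the two approaches on the question's test words (assumes the wordnet corpus has been downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    porter = PorterStemmer()
    wnl = WordNetLemmatizer()
    for w in 'cats running ran cactus cactuses cacti community communities'.split():
        print(w, '| stem:', porter.stem(w), '| lemma:', wnl.lemmatize(w))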
Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.
If you're really serious about good stemming though, you're going to need to start with something like the Porter Algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries) where the key is the word to look up and the value is the stemmed word to replace the original. A commercial search engine I worked on once ended up with some 800 exceptions to a modified Porter algorithm.
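A sketch of that key/value exception layer in Python; the exception entries here are made-up examples, not the ones from the commercial engine:

    from nltk.stem import PorterStemmer

    # Hypothetical correction table: word -> stem that overrides the algorithm.
    EXCEPTIONS = {'ran': 'run', 'cacti': 'cactus', 'communities': 'community'}

    porter = PorterStemmer()

    def stem(word):
        # Look the word up first; fall back to the Porter algorithm otherwise.
        w = word.lower()
        return EXCEPTIONS.get(w, porter.stem(w))

    print([stem(w) for w in 'cats ran cacti communities'.split()])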
Based on various answers on Stack Overflow and blogs I've come across, this is the method I'm using, and it seems to return real words quite well. The idea is to split the incoming text into an array of words (use whichever method you'd like), and then find the parts of speech (POS) for those words and use that to help stem and lemmatize the words.
Your sample above doesn't work too well, because the POS can't be determined. However, if we use a real sentence, things work much better.
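A hedged sketch of the POS-guided approach described above, using NLTK (assumes the punkt, tagger, and wordnet data packages are installed; the sentence is just an illustration):

    import nltk
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    def wordnet_pos(treebank_tag):
        # Map Penn Treebank tags onto the tag set WordNet's lemmatizer expects.
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        if treebank_tag.startswith('V'):
            return wordnet.VERB
        if treebank_tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN

    wnl = WordNetLemmatizer()
    tokens = nltk.word_tokenize('The cats were running while the cacti grew.')
    for word, tag in nltk.pos_tag(tokens):
        print(word, '->', wnl.lemmatize(word, wordnet_pos(tag)))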
http://wordnet.princeton.edu/man/morph.3WN
For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive Porter stemming.
http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.
Look into WordNet, a large lexical database for the English language:
http://wordnet.princeton.edu/
There are APIs for accessing it in several languages.
This looks interesting:
MIT Java WordnetStemmer:
http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html
Take a look at LemmaGen, an open-source library written in C# 3.0.

Results for your test words can be checked at the demo service (http://lemmatise.ijs.si/Services).
The top Python packages (in no specific order) for lemmatization are: spacy, nltk, gensim, pattern, CoreNLP and TextBlob. I prefer spaCy's and gensim's implementations (gensim's is based on pattern) because they identify the POS tag of the word and assign the appropriate lemma automatically. This gives more relevant lemmas, keeping the meaning intact. If you plan to use nltk or TextBlob, you need to take care of finding the right POS tag manually and then find the right lemma.
Lemmatization Example with spaCy:
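A minimal sketch (assumes the small English model has been installed with python -m spacy download en_core_web_sm):

    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp('cats running ran cactus cactuses cacti community communities')
    print([token.lemma_ for token in doc])
    # Roughly: ['cat', 'run', 'run', 'cactus', 'cactus', 'cactus', 'community', 'community']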
Lemmatization Example with Gensim:
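A minimal sketch; note that gensim.utils.lemmatize was removed in gensim 4.x, so this assumes gensim < 4.0 with the pattern package installed:

    from gensim.utils import lemmatize

    print(lemmatize('cats running ran cactus cactuses cacti community communities'))
    # Tokens come back as b'lemma/POS' byte strings, e.g. b'cat/NN', b'run/VB'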
The above examples were borrowed from this lemmatization page.
If I may quote my answer to the question StompChicken mentioned:
As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".
If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.
The most current version of the stemmer in NLTK is Snowball.
You can find examples on how to use it here:
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball2-pysrc.html#demo
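A minimal usage sketch as well:

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer('english')
    for w in 'cats running ran cactus cactuses cacti community communities'.split():
        print(w, '->', stemmer.stem(w))  # stems need not be dictionary words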
You could use the Morpha stemmer. UW has uploaded the Morpha stemmer to Maven central if you plan to use it from a Java application. There's a wrapper that makes it much easier to use. You just need to add it as a dependency and use the edu.washington.cs.knowitall.morpha.MorphaStemmer class. Instances are thread-safe (the original JFlex had class fields for local variables unnecessarily). Instantiate a class and run morpha with the word you want to stem.
Do a search for Lucene; I'm not sure if there's a PHP port, but I do know Lucene is available for many platforms. Lucene is an OSS (from Apache) indexing and search library. Naturally it and community extras might have something interesting to look at. At the very least you can learn how it's done in one language so you can translate the "idea" into PHP.
.Net Lucene has an inbuilt Porter stemmer. You can try that. But note that Porter stemming does not consider word context when deriving the lemma. (Go through the algorithm and its implementation and you will see how it works.)
Martin Porter wrote Snowball (a language for stemming algorithms) and rewrote the "English Stemmer" in Snowball. There is an English Stemmer for C and Java.
He explicitly states that the Porter Stemmer has been reimplemented only for historical reasons, so testing stemming correctness against the Porter Stemmer will get you results that you (should) already know.
Dr. Porter suggests using the English or Porter2 stemmer instead of the Porter stemmer. The English stemmer is what's actually used in the demo site from @StompChicken's earlier answer.
In Java, I use tartarus-snowball to stem words.
Maven:
Sample code:
Try this one here: http://www.twinword.com/lemmatizer.php
I entered your query "cats running ran cactus cactuses cacti community communities" in the demo and got ["cat", "running", "run", "cactus", "cactus", "cactus", "community", "community"] with the optional flag ALL_TOKENS.

Sample Code
This is an API so you can connect to it from any environment. Here is what the PHP REST call may look like.
I highly recommend using Spacy (base text parsing & tagging) and Textacy (higher level text processing built on top of Spacy).
Lemmatized words are available by default in Spacy as a token's .lemma_ attribute, and text can be lemmatized while doing a lot of other text preprocessing with textacy, for example while creating a bag of terms or words, or generally just before performing some processing that requires it.

I'd encourage you to check out both before writing any code, as this may save you a lot of time!