Stemming algorithm that produces real words
I need to take a paragraph of text and extract a list of "tags" from it. Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities
I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way):
http://tartarus.org/~martin/PorterStemmer/php.txt
This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".
I've tried "Snowball" (suggested within another Stack Overflow thread).
http://snowball.tartarus.org/demo.php
For my example (community / communities), Snowball stems to "communiti".
Question
Are there any other stemming algorithms that will do this? Has anyone else solved this problem?
My current thinking is that I could use a stemming algorithm to avoid duplicates and then pick the shortest word I encounter to be the actual word to display.
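The idea in the last paragraph can be sketched as follows. This is only an illustration in Python (the same structure ports directly to PHP arrays). `toy_stem` is a hypothetical stand-in written for this example, not the Porter or Snowball algorithm; in practice you would pass in a real stemmer function instead.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer (NOT the Porter algorithm)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "i"          # communities -> communiti
    if w.endswith("y") and len(w) > 2:
        return w[:-1] + "i"          # community -> communiti
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # tags -> tag
    return w


def dedupe_tags(words, stem=toy_stem):
    """Group words by stem and display the shortest member of each group."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word.lower())
    # Shortest form wins; ties are broken alphabetically for stable output.
    return sorted(min(forms, key=lambda f: (len(f), f)) for forms in groups.values())


print(dedupe_tags(["Community", "communities", "tags", "tag", "stemmer"]))
```

The stem is used only as a grouping key and is never shown, so it does not matter that "communiti" is not a real word.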
Comments (3)
If I understand correctly, then what you need is not a stemmer but a lemmatizer. A lemmatizer is a tool with knowledge about endings like -ies, -ed, etc., and about exceptional word forms like "written". It maps the input word form to its lemma, which is guaranteed to be a "real" word.
There are many lemmatizers for English, though I've only used morpha. Morpha is just a big lex file which you can compile into an executable.
Usage example:
You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html
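To make the difference from a stemmer concrete, here is a toy sketch of the lemmatizer idea in Python. It is not how morpha works internally. The exception table, the suffix rules, and the `VOCABULARY` set are all tiny hypothetical samples invented for this illustration; a real lemmatizer ships far larger ones.

```python
# Irregular forms that suffix rules cannot handle (tiny illustrative sample).
EXCEPTIONS = {"written": "write", "wrote": "write", "children": "child"}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ed", ""), ("ing", "")]

# Known lemmas; a candidate is accepted only if it appears here.
VOCABULARY = {"community", "tag", "walk", "box", "write", "child"}


def lemmatize(word, vocabulary=VOCABULARY):
    """Return a lemma from the vocabulary, or the word unchanged if none fits."""
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    if w in vocabulary:
        return w
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix):
            candidate = w[: -len(suffix)] + replacement
            if candidate in vocabulary:
                return candidate
    return w  # unknown word: leave it alone rather than invent a form


for form in ["Communities", "boxes", "walked", "written", "foos"]:
    print(form, "->", lemmatize(form))
```

Because every rewritten candidate is checked against the vocabulary, the function returns either a known real word or the input unchanged, never a truncated stem such as "communiti".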
The core issue here is that stemming algorithms operate on a phonetic basis, purely based on the language's spelling rules, with no actual understanding of the language they're working with. To produce real words, you'll probably have to merge the stemmer's output with some form of lookup function to convert the stems back to real words. I can basically see two potential ways to do this:

Personally, I think the way I would do it would be a dynamic form of #1: build up a custom dictionary database by recording every word examined along with what it stemmed to, and then assume that the most common word is the one that should be used. (E.g., if my body of source text uses "communities" more often than "community", then map communiti -> communities.)

A dictionary-based approach will be more accurate in general, and building it from the stemmer's input will provide results customized to your texts. The primary drawback is the space required, which is generally not an issue these days.
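A minimal sketch of the dynamic dictionary described above, in Python. The class name `StemDictionary` and the `toy_stem` function are hypothetical and exist only for this example; `toy_stem` is a crude suffix stripper, not a real stemmer, and an in-memory dict stands in for the database.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer (NOT the Porter algorithm)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "i"
    if w.endswith("y") and len(w) > 2:
        return w[:-1] + "i"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


class StemDictionary:
    """Records every word seen under its stem and counts how often it occurs."""

    def __init__(self, stem=toy_stem):
        self.stem = stem
        self.forms = defaultdict(Counter)  # stem -> Counter of surface forms

    def add_text(self, words):
        for word in words:
            w = word.lower()
            self.forms[self.stem(w)][w] += 1

    def display_word(self, word):
        """Map a word to the most common form recorded for its stem."""
        counts = self.forms.get(self.stem(word.lower()))
        if not counts:
            return word.lower()  # stem never seen: fall back to the word itself
        # Highest count wins; ties go to the shorter, then alphabetically first, form.
        return min(counts, key=lambda f: (-counts[f], len(f), f))


d = StemDictionary()
d.add_text("communities community communities tag".split())
print(d.display_word("community"))
```

Here "communities" was recorded twice and "community" once, so both forms map to "communities", matching the example in the answer.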
Hey, I don't know if this is perhaps too late, but there is only one PHP stemming script that produces real words: http://phpmorphy.sourceforge.net/ (it took me ages to find it). All other stemmers have to be compiled, and even after that they only work according to the Porter algorithm, which produces stems, not lemmas (e.g. community -> communiti). phpMorphy works perfectly well, is easy to install and initialize, and has English, Russian, German, Ukrainian and Estonian dictionaries. It also comes with a script that you can use to compile other dictionaries. The documentation is in Russian, but put it through Google Translate and it should be easy to follow.