Stemming algorithm that produces real words

Posted 2024-07-06 20:41:29

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities

I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way):

http://tartarus.org/~martin/PorterStemmer/php.txt

This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".

I've tried "Snowball" (suggested within another Stack Overflow thread).

http://snowball.tartarus.org/demo.php

For my example (community / communities), Snowball stems to "communiti".

Question

Are there any other stemming algorithms that will do this? Has anyone else solved this problem?

My current thinking is that I could use a stemming algorithm to avoid duplicates and then pick the shortest word I encounter to be the actual word to display.
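That shortest-word heuristic can be sketched in PHP. The `$stem` callable below stands in for any stemmer (e.g. the `PorterStemmer::Stem()` from the linked php.txt); the inline toy stemmer is for illustration only and is not a real Porter implementation:

```php
<?php
// Group tag candidates by stem, then keep the shortest surface form
// from each group as the tag to display.
function dedupeTags(array $words, callable $stem): array {
    $groups = [];
    foreach ($words as $word) {
        $key = $stem(strtolower($word));
        // Keep the shortest word seen so far for this stem.
        if (!isset($groups[$key]) || strlen($word) < strlen($groups[$key])) {
            $groups[$key] = $word;
        }
    }
    return array_values($groups);
}

// Toy stemmer for demonstration: collapses "community"/"communities" to one key.
$toyStem = function (string $w): string {
    return preg_replace('/(ies|y|s)$/', '', $w);
};

print_r(dedupeTags(['community', 'communities', 'tag'], $toyStem));
// "community" and "communities" both reduce to "communit", so only the
// shorter form "community" survives alongside "tag".
```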

Comments (3)

北笙凉宸 2024-07-13 20:41:29

If I understand correctly, then what you need is not a stemmer but a lemmatizer. A lemmatizer is a tool with knowledge of endings like -ies, -ed, etc., and of exceptional word forms like written. It maps an input word form to its lemma, which is guaranteed to be a "real" word.

There are many lemmatizers for English; I've only used morpha, though.
Morpha is just a big lex file which you can compile into an executable.
Usage example:

$ cat test.txt 
Community
Communities
$ cat test.txt | ./morpha -uc
Community
Community

You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html

追我者格杀勿论 2024-07-13 20:41:29

The core issue here is that stemming algorithms operate purely on the language's spelling rules, with no actual understanding of the language they're working with. To produce real words, you'll probably have to merge the stemmer's output with some form of lookup function to convert the stems back to real words. I can see basically two potential ways to do this:

  1. Locate or create a large dictionary which maps each possible stem back to an actual word. (e.g., communiti -> community)
  2. Create a function which compares each stem to a list of the words that were reduced to that stem and attempts to determine which is most similar. (e.g., comparing "communiti" against "community" and "communities" in such a way that "community" will be recognized as the more similar option)
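The second option can lean on PHP's built-in `levenshtein()`: among the surface words that collapsed to a stem, display the one with the smallest edit distance to the stem. A minimal sketch (the input words are illustrative):

```php
<?php
// Option 2 sketch: pick the word closest to the stem by edit distance,
// using PHP's built-in levenshtein().
function closestToStem(string $stem, array $words): string {
    $best = $words[0];
    $bestDist = levenshtein($stem, $best);
    foreach (array_slice($words, 1) as $word) {
        $dist = levenshtein($stem, $word);
        if ($dist < $bestDist) {
            $best = $word;
            $bestDist = $dist;
        }
    }
    return $best;
}

echo closestToStem('communiti', ['community', 'communities']);
// levenshtein('communiti', 'community') is 1, vs. 2 for "communities",
// so "community" is chosen.
```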

Personally, I think the way I would do it would be a dynamic form of #1, building up a custom dictionary database by recording every word examined along with what it stemmed to and then assuming that the most common word is the one that should be used. (e.g., If my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general and building it based on the stemmer input will provide results customized to your texts, with the primary drawback being the space required, which is generally not an issue these days.
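The dynamic-dictionary idea above can be sketched like this: count how often each surface form occurs per stem, then map each stem to its most frequent form. `$stem` is any stemmer callable; the inline toy stemmer is a stand-in for a real Porter implementation:

```php
<?php
// Build a stem -> most-frequent-surface-form dictionary from a word list.
function buildStemDictionary(array $words, callable $stem): array {
    $counts = [];
    foreach ($words as $word) {
        $key = $stem($word);
        $counts[$key][$word] = ($counts[$key][$word] ?? 0) + 1;
    }
    $dict = [];
    foreach ($counts as $key => $forms) {
        arsort($forms);                       // most frequent form first
        $dict[$key] = array_key_first($forms);
    }
    return $dict;
}

// Toy stemmer for demonstration only.
$toyStem = fn(string $w) => preg_replace('/(ies|y|s)$/', '', $w);

$words = ['communities', 'communities', 'community'];
print_r(buildStemDictionary($words, $toyStem));
// "communities" appears more often in this corpus, so the stem "communit"
// maps to "communities".
```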

愚人国度 2024-07-13 20:41:29

Hey, I don't know if this is perhaps too late, but there is only one PHP stemming script that produces real words: http://phpmorphy.sourceforge.net/ – it took me ages to find it. All other stemmers have to be compiled, and even then they only work according to the Porter algorithm, which produces stems, not lemmas (i.e. community → communiti). The phpMorphy one works perfectly well; it's easy to install and initialize, and has English, Russian, German, Ukrainian and Estonian dictionaries. It also comes with a script you can use to compile other dictionaries. The documentation is in Russian, but put it through Google Translate and it should be easy.
