Word base/stem dictionary

Posted on 2024-09-29 07:57:22

It seems my Google-fu is failing me.

Does anyone know of a freely available dictionary that contains only the base forms of words? So, for something like "strawberries", it would have "strawberry". It should NOT contain abbreviations, misspellings, or alternate spellings (like UK versus US). Anything quickly usable in Java would be good, but even a text file of mappings or anything else that could be read in would be helpful.

Comments (3)

此岸叶落 2024-10-06 07:57:22

This is called lemmatization, and what you call the "base of a word" is called a lemma. morpha and its reimplementation in the Stanford POS tagger do this. Both, however, require POS tagged input to resolve the inherent ambiguity in natural language.

(POS tagging means determining the word categories, e.g. noun, verb. I've been assuming you want a tool that handles English.)
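To illustrate why the part of speech matters, here is a minimal Java sketch of a lemma lookup keyed on both the word and its tag. The class name `PosLemmaTable`, the tab-separated `word<TAB>tag<TAB>lemma` line format, the tag names, and the sample entries are all assumptions made up for this illustration; this is not how morpha or the Stanford tools work internally.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup table mapping (word, POS tag) to a lemma. */
public class PosLemmaTable {
    private final Map<String, String> table = new HashMap<>();

    /** Each line is assumed to look like: word<TAB>tag<TAB>lemma */
    public PosLemmaTable(List<String> lines) {
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 3) {
                continue; // skip malformed lines
            }
            table.put(key(parts[0], parts[1]), parts[2]);
        }
    }

    // Words are lower-cased so that lookups are case-insensitive.
    private static String key(String word, String tag) {
        return word.toLowerCase(Locale.ROOT) + "/" + tag;
    }

    /** Returns the lemma for the tagged word, or empty if it is unknown. */
    public Optional<String> lemma(String word, String tag) {
        return Optional.ofNullable(table.get(key(word, tag)));
    }

    public static void main(String[] args) {
        PosLemmaTable t = new PosLemmaTable(List.of(
            "saw\tNOUN\tsaw",
            "saw\tVERB\tsee",
            "strawberries\tNOUN\tstrawberry"));
        System.out.println(t.lemma("saw", "NOUN"));
        System.out.println(t.lemma("saw", "VERB"));
        System.out.println(t.lemma("Strawberries", "NOUN"));
    }
}
```

With the sample data, "saw" resolves to "saw" as a noun and to "see" as a verb, which is exactly the ambiguity that an untagged word list cannot resolve.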

Edit: since you're going to use this for search, here are a few tips:

  • Simple stemming for English has a mixed reputation in the search engine world. Sometimes it works, often it doesn't.
  • Automatic spelling correction may work better. This is what Google does. It's expensive in terms of computing time, though, if you want to do it right.
  • Lemmatization may provide benefits, but probably only if you index and search for both the words and the lemmas. (Same advice goes for stemming.)
  • Here's a plugin for Lucene that does lemmatization.
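The "index both the words and the lemmas" tip can be sketched as a toy inverted index in Java. Everything here is hypothetical: `DualIndex` is a made-up class, the lemmatizer is just a function passed in by the caller (a tiny map in the demo), and the tokenizer only handles ASCII letters. A real setup would do this inside the search engine's analysis chain rather than by hand.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

/** Toy inverted index that posts each token under its surface form and its lemma. */
public class DualIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final UnaryOperator<String> lemmatizer; // stand-in for a real lemmatizer

    public DualIndex(UnaryOperator<String> lemmatizer) {
        this.lemmatizer = lemmatizer;
    }

    /** Indexes a document under both the words it contains and their lemmas. */
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            post(token, docId);
            post(lemmatizer.apply(token), docId);
        }
    }

    /** Looks up the term as typed and also by its lemma. */
    public Set<Integer> search(String term) {
        String token = term.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<>(postings.getOrDefault(token, Set.of()));
        hits.addAll(postings.getOrDefault(lemmatizer.apply(token), Set.of()));
        return hits;
    }

    private void post(String key, int docId) {
        postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
    }

    public static void main(String[] args) {
        // Assumed one-entry lemma map, for demonstration only.
        Map<String, String> lemmas = Map.of("strawberries", "strawberry");
        DualIndex index = new DualIndex(w -> lemmas.getOrDefault(w, w));
        index.add(1, "Fresh strawberries for sale");
        index.add(2, "Strawberry jam recipe");
        System.out.println(index.search("strawberries"));
        System.out.println(index.search("jam"));
    }
}
```

Because the surface forms are kept alongside the lemmas, a ranking step could still prefer exact matches over lemma-only matches; the sketch leaves ranking out.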

(Preceding remarks are based on my own research; I wrote my master's thesis about lemmatization in search engines for very noisy data.)

你的他你的她 2024-10-06 07:57:22

This isn't exactly what you're asking for, but Wikipedia on stemming was enlightening and contains a number of links to free stemming programs, which presumably include lists of word stems.

狼性发作 2024-10-06 07:57:22

http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start

The Merriam-Webster's Collegiate 9th Edition link on this page leads to a word file containing only the root forms of words. "strawberry" is in there; "strawberries" is not. Likewise, "add" is in there but "adding" is not. Not sure if this is what you are after, but it was helpful for me.
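If such a word file is downloaded, reading it into Java could look roughly like the sketch below. Note the assumptions: the file is taken to be plain UTF-8 text with one word per line (I have not verified the actual format of the file behind the link), and `RootWordList` and the path `rootwords.txt` are made-up names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical in-memory set of root-form words, one word per input line. */
public class RootWordList {
    private final Set<String> words = new HashSet<>();

    private RootWordList() {}

    /** Builds the list from already-read lines; blank lines are ignored. */
    public static RootWordList fromLines(List<String> lines) {
        RootWordList list = new RootWordList();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                list.words.add(word);
            }
        }
        return list;
    }

    /** Reads a word file, assumed to be UTF-8 with one word per line. */
    public static RootWordList load(Path file) throws IOException {
        return fromLines(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    /** True if the word appears in the list, ignoring case. */
    public boolean contains(String word) {
        return words.contains(word.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws IOException {
        // Pass a real file path as the first argument, or fall back to sample data.
        RootWordList list = args.length > 0
            ? load(Path.of(args[0]))
            : fromLines(List.of("add", "strawberry"));
        System.out.println(list.contains("strawberry"));
        System.out.println(list.contains("strawberries"));
    }
}
```

This only answers "is this a root form?". Mapping an inflected word back to its root still needs a stemmer or lemmatizer, since the file itself contains no mappings.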
