Word base/stem dictionary
It seems my Google-fu is failing me.
Does anyone know of a freely available dictionary that contains only the base forms of words? For strawberries, for example, it would list strawberry. It should NOT contain abbreviations, misspellings, or alternate spellings (such as UK versus US). Anything quickly usable in Java would be good, but a text file of mappings, or anything else that can be read in, would also help.
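To make "a text file of mappings" concrete, here is a minimal Java sketch of reading one. The class name `LemmaTable` and the file format (one tab-separated "inflected form, base form" pair per line) are assumptions for illustration, not the format of any particular published word list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: the class name and the file format are assumptions.
class LemmaTable {
    // Reads "inflected<TAB>base" lines into a map; blank or malformed lines are skipped.
    static Map<String, String> load(Reader source) {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length != 2) {
                    continue;
                }
                table.put(parts[0].trim().toLowerCase(Locale.ROOT), parts[1].trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    // Returns the mapped base form, or the lower-cased word itself when it is not listed.
    static String baseOf(Map<String, String> table, String word) {
        String key = word.toLowerCase(Locale.ROOT);
        return table.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        // In-memory sample data standing in for a real mapping file.
        Map<String, String> table =
                load(new StringReader("strawberries\tstrawberry\nadding\tadd\n"));
        System.out.println(baseOf(table, "Strawberries"));
        System.out.println(baseOf(table, "add"));
    }
}
```

For a real file, the same `load` method could be given a reader from `java.nio.file.Files.newBufferedReader`. Words that are missing from the table fall through unchanged, which may or may not be what a search application wants.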
Comments (3)
This is called lemmatization, and what you call the "base of a word" is called a lemma.
morpha and its reimplementation in the Stanford POS tagger do this. Both, however, require POS-tagged input to resolve the inherent ambiguity in natural language. (POS tagging means determining each word's category, e.g. noun or verb. I've been assuming you want a tool that handles English.)
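To illustrate why the part-of-speech tag matters, here is a toy Java sketch in which the lookup key is the word together with its tag. It is not the API of morpha or of the Stanford tagger; the class `TaggedLemmas` and its four-entry table are invented for the example, and only the tag names (NN, NNS, VBD, VBZ) follow the Penn Treebank convention.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy lemmatizer: the table below is illustrative, not real lexicon data.
class TaggedLemmas {
    private static final Map<String, String> TABLE = new HashMap<>();

    static {
        TABLE.put("saw/NN", "saw");       // the tool
        TABLE.put("saw/VBD", "see");      // past tense of "see"
        TABLE.put("leaves/NNS", "leaf");  // plural noun
        TABLE.put("leaves/VBZ", "leave"); // third-person singular verb
    }

    // Looks up the lemma for a word/tag pair; unknown pairs return the lower-cased word.
    static String lemma(String word, String tag) {
        String lower = word.toLowerCase(Locale.ROOT);
        return TABLE.getOrDefault(lower + "/" + tag, lower);
    }

    public static void main(String[] args) {
        System.out.println(lemma("saw", "NN"));
        System.out.println(lemma("saw", "VBD"));
        System.out.println(lemma("leaves", "NNS"));
        System.out.println(lemma("leaves", "VBZ"));
    }
}
```

The same surface form maps to different lemmas depending on the tag, so a plain word-to-lemma table cannot represent these cases without picking one reading.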
Edit: since you're going to use this for search, here are a few tips:
(Preceding remarks are based on my own research; I wrote my master's thesis about lemmatization in search engines for very noisy data.)
This isn't exactly what you're asking for, but the Wikipedia article on stemming was enlightening and contains a number of links to free stemming programs, which presumably include lists of word stems.
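For a rough feel of what a stemmer does, here is a deliberately naive suffix-stripping sketch in Java. It is not the Porter algorithm or any of the programs linked from Wikipedia; the class `NaiveStemmer` and its three rules are assumptions chosen only to show the idea and its limits.

```java
import java.util.Locale;

// Hypothetical toy stemmer: three hand-picked rules, not a published algorithm.
class NaiveStemmer {
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ies") && w.length() > 4) {
            // strawberries -> strawberry
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.endsWith("ing") && w.length() > 5) {
            // adding -> add, but running -> runn
            return w.substring(0, w.length() - 3);
        }
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            // stems -> stem; glass is left alone
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"strawberries", "adding", "running", "glass"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

Rules like these produce stems such as "runn" that are not dictionary words, which is the main practical difference from a lemma list of the kind the question asks for.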
http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start
The Merriam-Webster's Collegiate 9th Edition link on this page leads to a word file that contains only the root forms of words. "Strawberry" is in there; "strawberries" is not. Likewise, "add" is in there, but "adding" is not. I'm not sure if this is what you are after, but it was helpful for me.