
Is there an implementation of a Croatian stemming algorithm?

Posted 2024-11-17 11:35:53

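I'm looking for an implementation of a Croatian stemming algorithm. Ideally in Java, but I would accept any other language.

Is there a community of English-speaking developers building search applications for Croatian?

Thanks,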


冰雪之触 2024-11-24 11:35:53

Slavic languages are highly inflected. The most accurate and fastest approach would be a combination of rules and large mappings/dictionaries.
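To make the "rules plus mappings" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `EXCEPTIONS` mapping, the `SUFFIXES` list and the `MIN_STEM_LEN` cutoff are toy values, not a real Croatian lexicon or rule set. It looks a word up in the mapping first and falls back to stripping the longest matching suffix.

```python
# Toy data for illustration only; not a real Croatian lexicon or rule set.
EXCEPTIONS = {
    "ljudi": "čovjek",  # irregular plural ("people" -> "person")
}

# Hypothetical inflectional endings, tried longest first.
SUFFIXES = ["ama", "ima", "om", "a", "e", "i", "u"]

# Refuse to strip a suffix if fewer than this many characters would remain.
MIN_STEM_LEN = 3


def stem(word):
    """Return the mapped form if the word is listed, else strip one suffix."""
    word = word.lower()
    # 1. Exact lookup in the mapping handles irregular forms.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # 2. Otherwise fall back to rule-based suffix stripping.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[:-len(suffix)]
    # 3. Nothing matched: leave the word unchanged.
    return word


if __name__ == "__main__":
    for w in ["knjiga", "knjige", "knjigom", "ljudi", "pas"]:
        print(w, "->", stem(w))
```

The mapping takes priority so that irregular forms never reach the rules; a real system would need a far larger mapping and more careful rules than this.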

Work has been done, but it has been held back. The Croatian morphological lexicon would help, but it sits behind a slow API. More work exists for Bosnian, Serbian and Croatian taken together than for Croatian alone.

Large mappings aren't always convenient, and one could build a better rule-based transformer from the mappings, dictionaries, or a corpus.

Implementing it with Hunspell and affix files could be a great way to get community and Java support. E.g. Google search: hr_hr.aff

Not tested: one should be able to reverse all the words, build a trie of the ending characters, traverse it using some rules (e.g. LCS), and build an accurate statistical transformer from corpus text.
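A rough sketch of that untested idea, under several assumptions of my own: the training input is a list of (inflected form, base form) pairs, the rewrite rule for each pair is derived from the longest common prefix (a stand-in, since the post does not say which "LCS" is meant), and the names `Node`, `make_rule`, `train`, `lemmatize` and `PAIRS` are hypothetical. Each word is inserted into the trie from its last character backwards, every node counts the rules seen beneath it, and prediction uses the most frequent applicable rule at the deepest node reached.

```python
from collections import Counter


class Node:
    """One node of a trie keyed on word endings (characters read right to left)."""

    def __init__(self):
        self.children = {}
        self.rules = Counter()  # (strip, append) -> number of training pairs


def make_rule(word, lemma):
    """Derive a (strip, append) suffix rewrite from the longest common prefix."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]


def train(pairs, max_depth=4):
    """Build the ending trie, counting each pair's rule along its path."""
    root = Node()
    for word, lemma in pairs:
        rule = make_rule(word, lemma)
        node = root
        node.rules[rule] += 1
        for ch in reversed(word[-max_depth:]):
            node = node.children.setdefault(ch, Node())
            node.rules[rule] += 1
    return root


def lemmatize(root, word):
    """Apply the most frequent applicable rule from the deepest matching node."""
    node = root
    path = [root]
    for ch in reversed(word):
        node = node.children.get(ch)
        if node is None:
            break
        path.append(node)
    for node in reversed(path):
        for (strip, append), _count in node.rules.most_common():
            if word.endswith(strip) and len(strip) < len(word):
                return word[:len(word) - len(strip)] + append
    return word  # no rule applies


# Toy training pairs for illustration; a real system would use a large lexicon.
PAIRS = [
    ("knjige", "knjiga"), ("knjigu", "knjiga"), ("knjigom", "knjiga"),
    ("žene", "žena"), ("ženu", "žena"), ("ženom", "žena"),
]
root = train(PAIRS)

if __name__ == "__main__":
    for w in ["kuće", "kuću", "kućom", "grad"]:
        print(w, "->", lemmatize(root, w))
```

With only six training pairs this just generalises the feminine "-a" endings to unseen words; the counts stored at each node are where corpus statistics would come in.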

The best I can do is some Python:

import hunspell

# Dictionary and affix file locations vary by system; these are the paths from the post.
hs = hunspell.HunSpell(
    '/usr/share/myspell/hr_HR.dic',
    '/usr/share/myspell/hr_HR.aff')

# The following should return ['hrvatska']:
print(hs.stem('hrvatski'))
