I am looking for a stemming algorithm for the Slovenian language that I can use with Sphinx search.
What I'm trying to achieve is, for example, that a search for 'jabolka' also returns documents containing 'jabolko', 'jabolki', 'jabolk', etc.
I found some references to the existence of a Slovenian stemmer, but I can't find where to download it; it's not even for sale anywhere...
Another option I've come across is the wordforms option in the Sphinx config (http://sphinxsearch.com/docs/manual-0.9.9.html#conf-wordforms), but building my own dictionary would be too difficult, so I'm wondering whether any publicly accessible dictionaries are already available.
If a Slovenian stemmer is not available, can somebody suggest another approach to achieving similar search results?
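For reference, a Sphinx wordforms file is a plain-text list with one `form > normal form` mapping per line. A minimal sketch for the example above, assuming 'jabolko' is chosen as the normal form (the choice of normal form and the file path below are illustrative assumptions, not taken from any existing dictionary):

```
jabolka > jabolko
jabolki > jabolko
jabolk > jabolko
```

Such a file would then be referenced from the index definition with a line like `wordforms = /path/to/wordforms.txt`, where the path is hypothetical.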
Comments (3)
I managed to compile the Slovenian stemmer with the following steps:

stem_ISO_8859_2.sbl
stem_Unicode.sbl
(you have to find the UTF character codes for Slovenian special characters such as ČŠŽĆ)

Edit both .txt files in the /libstemmer folder and add entries for Slovenian:

Go to the /libstemmer folder and run:

    ./mkmodules.pl modules.h src_c modules.txt ../mkinc.mak
    ./mkmodules.pl modules_utf8.h src_c modules_utf8.txt ../mkinc_utf8.mak

This will generate the files needed for compiling later.

Run make (from the root of the unpacked files). If there were no errors during compilation, you should have a /src_c folder containing the code for the Slovenian stemmer (next to the others).

Unpack the latest Sphinx and copy all files from your Snowball project to the Sphinx /libstemmer_c folder (excluding libstemmer.o and GNUmakefile).

Compile Sphinx:

If all went well, you should have a working Slovene stemmer for Sphinx. You just have to enable it in your Sphinx index configuration (on my Debian it is in /usr/local/etc/sphinx.conf):
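A minimal sketch of such a configuration, assuming Sphinx was built with `./configure --with-libstemmer` and that the stemmer was registered in the Snowball module lists under the name `slovene` (both are assumptions; the index name `myindex` is hypothetical):

```
index myindex
{
    # ... source, path and other settings as in your existing configuration ...

    # Assumed stemmer name: the part after "libstemmer_" has to match
    # the algorithm name registered when building libstemmer.
    morphology = libstemmer_slovene
}
```

The index has to be rebuilt with indexer after changing the morphology setting, since stemming is applied at indexing time as well as at query time.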
Hope this helps someone. I had no prior experience with autoconf, so it took me a while to figure this out.
This Slovene stemmer is not officially released on http://snowball.tartarus.org, but from my tests it works well enough for my project.
I'm not sure if this will do what you want, but I came across this reference to a tool called spelldump in the Sphinx documentation:
It requires "a dictionary file that uses ispell or MySpell". I found a reference to a Slovenian ispell dictionary file, which might be suitable.
I was also trying to find a stemmer for the Slovene language but didn't come across any existing solutions.
I've built my own stemmer in Ruby, using the never-implemented Snowball version as inspiration.
It's available on GitHub as hajkr/slovene-stemmer. It's far from perfect, but it works for most cases.