Slovenian stemmer for Sphinx

Posted 2024-12-24 18:15:25

I am looking for a stemming algorithm for Slovenian that I can use with Sphinx search.

What I want to achieve is that, for example, a search for "jabolka" also returns documents containing "jabolko", "jabolki", "jabolk", and so on.
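To make the goal concrete, here is a toy illustration of what a stemmer does to these forms. The suffix list and the minimum stem length are hypothetical values chosen only to fit this one example; this is not the Snowball Slovene algorithm and not a real set of Slovenian morphology rules.

```python
# Toy illustration only: SUFFIXES is a hypothetical list picked to fit the
# "jabolk*" example, not a real Slovenian stemming rule set.
SUFFIXES = ("a", "o", "i")
MIN_STEM_LENGTH = 3  # assumption: never strip a word below three letters


def toy_stem(word: str) -> str:
    """Strip one known suffix so that inflected forms share a common stem."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= MIN_STEM_LENGTH:
            return lowered[: -len(suffix)]
    return lowered


forms = ["jabolka", "jabolko", "jabolki", "jabolk"]
stems = {form: toy_stem(form) for form in forms}

for form, stem in stems.items():
    print(f"{form} -> {stem}")
```

A search engine that applies the same reduction to both the indexed text and the query will match any of these forms against the others.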

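I have found a few references suggesting that a Slovenian stemmer exists, but I cannot find anywhere to download it, and it does not even seem to be sold anywhere.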

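Another option I came across is the wordforms option in the Sphinx configuration (http://sphinxsearch.com/docs/manual-0.9.9.html#conf-wordforms), but building my own dictionary would be too difficult, so I would like to know whether a publicly available dictionary already exists.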
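For reference, a Sphinx wordforms file is plain text with one `source form > destination form` mapping per line. The sketch below generates that text from a small hand-written mapping. The function name and the sample dictionary are made up for illustration, and a real dictionary would need far more entries than this.

```python
def build_wordforms(lemma_to_forms: dict[str, list[str]]) -> str:
    """Return wordforms text with one 'form > lemma' mapping per line."""
    lines = []
    for lemma, forms in sorted(lemma_to_forms.items()):
        for form in sorted(set(forms)):
            if form != lemma:  # mapping a word to itself adds nothing
                lines.append(f"{form} > {lemma}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data: the forms from the question mapped to one base form.
sample = {"jabolko": ["jabolka", "jabolki", "jabolk", "jabolko"]}
wordforms_text = build_wordforms(sample)

print(wordforms_text, end="")
```

Saved to a file, this output would be referenced from the index definition through the `wordforms` directive.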

If no Slovenian stemmer is available, can anyone suggest another way to achieve similar search results?

放手` 2024-12-31 18:15:25

I managed to compile a Slovenian stemmer with the following steps:

1. Download http://snowball.tartarus.org/dist/snowball_code.tgz (the Snowball source code) and unpack it.
2. Download the Slovenian algorithm from http://snowball.tartarus.org/archives/snowball-discuss/0725.html and save it in the /algorithms/slovene folder of the project unpacked in step 1. The file must be named stem_ISO_8859_2.sbl.
3. The algorithm is in ISO 8859-2 encoding, so I converted it to UTF-8 and saved it as stem_Unicode.sbl (you have to look up the Unicode character codes for Slovenian special characters such as ČŠŽĆ).
4. Edit both .txt files in the /libstemmer folder and add an entry for Slovenian:
```
slovene         UTF_8,ISO_8859_2        slovene,sl,slv
```
5. Edit /GNUmakefile and add slovene twice: once to the list of UTF-8 languages and once to ISO_8859_2_algorithms.
6. Go to the /libstemmer folder and run the commands below. This generates the files needed for compiling later.
```
./mkmodules.pl modules.h src_c modules.txt ../mkinc.mak
./mkmodules.pl modules_utf8.h src_c modules_utf8.txt ../mkinc_utf8.mak
```
7. Run make from the root of the unpacked files. If there were no errors during compilation, the /src_c folder should now contain the code for the Slovenian stemmer (next to the others):
```
stem_UTF_8_slovene.c
stem_ISO_8859_2_slovene.c
...
```
8. Unpack the latest Sphinx and copy all files from your Snowball project into the Sphinx /libstemmer_c folder (excluding libstemmer.o and GNUmakefile).
9. Compile Sphinx:
```
touch NEWS README AUTHORS ChangeLog
autoreconf --force --install
./configure --with-libstemmer
make
make install
```

If all went well, you should now have a working Slovene stemmer for Sphinx. You just have to enable it in your Sphinx index configuration (on my Debian it is in /usr/local/etc/sphinx.conf):
```
charset_type = utf-8
morphology = libstemmer_slovene
```
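For the encoding step above, the character codes can be looked up programmatically instead of by hand. This sketch uses Python's built-in `iso8859_2` codec to print the ISO 8859-2 byte and the Unicode code point of each special character the answer mentions, in upper and lower case. The answer does not show how the codes are written into the .sbl file, so that part is not covered here.

```python
SPECIAL_CHARS = "ČŠŽĆčšžć"  # the characters named in the answer, plus lowercase


def char_codes(chars: str) -> list[tuple[str, int, int]]:
    """Return (character, ISO 8859-2 byte value, Unicode code point) triples."""
    rows = []
    for ch in chars:
        iso_byte = ch.encode("iso8859_2")[0]  # each of these is one byte in ISO 8859-2
        rows.append((ch, iso_byte, ord(ch)))
    return rows


codes = char_codes(SPECIAL_CHARS)
for ch, iso_byte, code_point in codes:
    print(f"{ch}  ISO 8859-2: 0x{iso_byte:02X}  Unicode: U+{code_point:04X}")
```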

Hope this helps someone. I had no prior experience with autoconf, so it took me a while to figure this out.

This Slovene stemmer has not been officially released on http://snowball.tartarus.org, but in my tests it works well enough for my project.
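The registration step in the answer (adding the slovene line to the .txt files in /libstemmer) can be scripted so that repeated runs do not add duplicates. This is a minimal sketch: the helper name is made up, and the demonstration runs against a throwaway file with hypothetical contents, not a real Snowball checkout. The answer does not spell out whether the identical line belongs in both files, so the entry is passed in as a parameter.

```python
import tempfile
from pathlib import Path

# Entry line copied from the answer.
SLOVENE_ENTRY = "slovene         UTF_8,ISO_8859_2        slovene,sl,slv"


def register_language(modules_file: Path, entry: str) -> bool:
    """Append entry to modules_file unless that language already has a line.

    Returns True if the file was modified.
    """
    language = entry.split()[0]
    text = modules_file.read_text(encoding="utf-8")
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == language:
            return False  # already registered, leave the file untouched
    separator = "" if (not text or text.endswith("\n")) else "\n"
    modules_file.write_text(text + separator + entry + "\n", encoding="utf-8")
    return True


# Demonstration on a temporary file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "modules.txt"
    demo.write_text("english         UTF_8,ISO_8859_1        english,en,eng\n", encoding="utf-8")
    first_run = register_language(demo, SLOVENE_ENTRY)
    second_run = register_language(demo, SLOVENE_ENTRY)
    demo_lines = demo.read_text(encoding="utf-8").splitlines()

print(first_run, second_run, len(demo_lines))
```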

断桥再见 2024-12-31 18:15:25

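I am not sure whether this will do what you want, but I came across a reference to a tool called spelldump:

"spelldump is one of the helper tools within the Sphinx package. It is used to extract the contents of a dictionary file that uses ispell or MySpell format, which can help build word lists for wordforms - all of the possible forms are pre-built for you."

http://sphinxsearch.com/docs/current.html#ref-spelldump

It takes "a dictionary file that uses ispell or MySpell" format. I found a reference to a Slovenian ispell dictionary file, which might be suitable.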
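If spelldump does produce a usable result for a Slovenian dictionary, the output may still need some cleanup before it goes into a wordforms file. The sketch below assumes the result uses `form > base` lines, in line with the manual's description of the output as being in wordforms format, and drops any line that maps a word to itself, since such a line adds nothing. The function name is hypothetical, and the sample text was written by hand; it is not real spelldump output.

```python
def clean_wordforms(raw_text: str) -> list[str]:
    """Keep well-formed 'form > base' lines whose two sides differ."""
    kept = []
    for line in raw_text.splitlines():
        if ">" not in line:
            continue  # skip blank or malformed lines
        form, base = (part.strip() for part in line.split(">", 1))
        if form and base and form != base:
            kept.append(f"{form} > {base}")
    return kept


# Hand-written sample in the assumed format, not real spelldump output.
sample_output = """\
jabolko > jabolko
jabolka > jabolko
jabolki > jabolko

jabolk > jabolko
"""

cleaned = clean_wordforms(sample_output)
print("\n".join(cleaned))
```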