Slovenian stemmer for Sphinx

Posted 2024-12-24 18:15:25

I am looking for a stemming algorithm for the Slovenian language that I can use with Sphinx search.

What I'm trying to achieve: when searching for 'jabolka', I also want results for documents containing 'jabolko', 'jabolki', 'jabolk', etc.

I found some references to the existence of a Slovenian stemmer, but I can't find where to download it; it isn't even for sale anywhere.

Another option I've come across is the wordforms option in the Sphinx configuration (http://sphinxsearch.com/docs/manual-0.9.9.html#conf-wordforms), but building my own dictionary would be too difficult, so I'm wondering whether any publicly accessible dictionaries already exist.

If a Slovenian stemmer is not available, can somebody suggest another approach for achieving similar search results?

Comments (3)

放手` 2024-12-31 18:15:25

I managed to compile the Slovenian stemmer with the following steps:

  1. Download http://snowball.tartarus.org/dist/snowball_code.tgz (the Snowball source code) and unpack it.
  2. Download the Slovenian algorithm from http://snowball.tartarus.org/archives/snowball-discuss/0725.html and save it in the unpacked project from step 1, in the folder /algorithms/slovene. The file must be named stem_ISO_8859_2.sbl.
  3. The algorithm is in ISO encoding, so I converted it to UTF-8 and saved it as stem_Unicode.sbl (you have to find the UTF character codes for Slovenian special characters such as ČŠŽĆ).
  4. Edit both .txt files in the /libstemmer folder and add an entry for Slovenian:

    slovene         UTF_8,ISO_8859_2        slovene,sl,slv
    
  5. Edit /GNUmakefile and add slovene (once to the list of UTF languages and once to ISO_8859_2_algorithms).
  6. Go to the /libstemmer folder and run:

    ./mkmodules.pl modules.h src_c modules.txt ../mkinc.mak
    ./mkmodules.pl modules_utf8.h src_c  modules_utf8.txt ../mkinc_utf8.mak
    

    This will generate files needed for compiling later.

  7. Run make (from the root of the unpacked files).
  8. If there were no errors during compilation, you should have a /src_c folder containing the code for the Slovenian stemmer (alongside the others):

    stem_UTF_8_slovene.c
    stem_ISO_8859_2_slovene.c
    ...
    
  9. Unpack the latest Sphinx and copy all files from your Snowball project to the Sphinx /libstemmer_c folder (excluding libstemmer.o and GNUmakefile).

  10. Compile Sphinx:

    touch NEWS README AUTHORS ChangeLog
    autoreconf --force --install
    ./configure --with-libstemmer
    make
    make install
    
  11. If all went well, the Slovene stemmer should now work in Sphinx. You just have to enable it in your Sphinx index configuration (on my Debian it is in /usr/local/etc/sphinx.conf):

    charset_type = utf-8
    morphology = libstemmer_slovene
    

Hope this helps someone. I had no prior experience with autoconf, so it took me a while to figure this out.

This Slovene stemmer is not officially released on http://snowball.tartarus.org, but in my tests it works well enough for my project.
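
For step 3, the re-encoding and the character-code lookup can be sketched in a few lines of Python. This is only an illustration, not part of Snowball: the function names `latin2_to_utf8` and `code_points` are made up for this example, and it assumes the downloaded file really is ISO-8859-2. How the resulting codes are written into stem_Unicode.sbl is left to the Snowball documentation.

```python
# Sketch: re-encode an ISO-8859-2 file as UTF-8 and list the Unicode code
# points of the Slovenian special characters mentioned in step 3.
# The helper names are hypothetical; they are not Snowball or Sphinx APIs.
from pathlib import Path

SPECIAL_CHARS = "ČŠŽĆčšžć"


def latin2_to_utf8(src, dst):
    """Decode src as ISO-8859-2 and write the same text to dst as UTF-8."""
    text = Path(src).read_bytes().decode("iso-8859-2")
    Path(dst).write_bytes(text.encode("utf-8"))


def code_points(chars=SPECIAL_CHARS):
    """Map each character to its Unicode code point as a 4-digit hex string."""
    return {ch: f"{ord(ch):04X}" for ch in chars}


if __name__ == "__main__":
    # Print a small lookup table, one character per line.
    for ch, cp in code_points().items():
        print(f"{ch}  U+{cp}")
```

Calling `latin2_to_utf8("stem_ISO_8859_2.sbl", "stem_UTF_8_copy.sbl")` would produce a UTF-8 copy to work from (the output name is arbitrary), and the printed table gives the codes to look for.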

断桥再见 2024-12-31 18:15:25

I'm not sure if this will do what you want, but I came across this reference to a tool called spelldump in the Sphinx documentation:

spelldump is one of the helper tools within the Sphinx package.

It is used to extract the contents of a dictionary file that uses
ispell or MySpell format, which can help build word lists for
wordforms - all of the possible forms are pre-built for you.

http://sphinxsearch.com/docs/current.html#ref-spelldump

It requires "a dictionary file that uses ispell or MySpell" - I found a reference to a Slovenian ispell dictionary file, which might be suitable.
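
Once you have a list of word forms per base word, turning it into a wordforms file is straightforward, since Sphinx expects one `form > normal form` mapping per line. The sketch below assumes you have already loaded the forms into a Python dict (the parsing of spelldump output is not shown, and `wordforms_lines` is a name invented for this example); the sample data is the 'jabolko' example from the question.

```python
# Sketch: build Sphinx wordforms lines ("form > lemma") from a mapping of
# lemma -> inflected forms. The input dict is assumed to come from some
# dictionary source; loading it is outside the scope of this example.


def wordforms_lines(lemma_forms):
    """Return sorted 'form > lemma' lines, skipping forms equal to the lemma."""
    lines = []
    for lemma, forms in sorted(lemma_forms.items()):
        for form in sorted(set(forms)):
            if form != lemma:
                lines.append(f"{form} > {lemma}")
    return lines


if __name__ == "__main__":
    sample = {"jabolko": ["jabolka", "jabolki", "jabolk", "jabolko"]}
    # Write the result as UTF-8 so it matches a utf-8 index charset.
    with open("wordforms.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(wordforms_lines(sample)) + "\n")
```

The resulting file would then be referenced from the index's wordforms setting.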

人心善变 2024-12-31 18:15:25

I was also trying to find a stemmer for the Slovene language but didn't come across any existing solutions.

I've built my own stemmer in Ruby, using the never-implemented Snowball version as inspiration.

It's available on GitHub as hajkr/slovene-stemmer. It's far from perfect, but it works for most cases.
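
To show what a suffix-stripping stemmer does with the example from the question, here is a deliberately tiny toy in Python. It is not the algorithm used in hajkr/slovene-stemmer or in the Snowball proposal: it is an assumption-laden illustration that strips a single trailing vowel, and the names `toy_stem`, `VOWEL_ENDINGS` and `MIN_STEM_LENGTH` are invented here.

```python
# Toy illustration of suffix stripping. NOT a real Slovenian stemmer: it only
# removes one trailing vowel so that different forms collapse to one stem.

VOWEL_ENDINGS = ("a", "e", "i", "o", "u")
MIN_STEM_LENGTH = 3  # never shorten a word below this many characters


def toy_stem(word):
    """Lowercase the word and strip one trailing vowel if the stem stays long enough."""
    word = word.lower()
    if word.endswith(VOWEL_ENDINGS) and len(word) - 1 >= MIN_STEM_LENGTH:
        return word[:-1]
    return word


if __name__ == "__main__":
    for form in ["jabolka", "jabolko", "jabolki", "jabolk"]:
        print(form, "->", toy_stem(form))
```

All four forms from the question reduce to the same stem, which is why a search for one of them can match documents containing the others. A real stemmer needs many more rules to avoid conflating unrelated words.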
