Xapian multi-language search with stop words?

Posted on 2024-07-26 21:59:22

I have two Xapian databases, let's call one "EN" and the other "DE", and let's say the former contains some documents in English, and the latter in German.

If I want users to be able to search both at once, I can easily load both databases. However, it seems I can use only one stemmer and one set of stop words?

Is there no way to instantiate an English-language stemmer and have it apply only to results that come from the "EN" database? Is there no way to create a Stopper with English words and have it apply only to results that come from the "EN" database?

Can this be right?

Comments (1)

独自唱情﹋歌 2024-08-02 21:59:22

Stemming is only useful if you know the language of the text you're stemming. If you've created your Xapian databases with stemming (i.e., the Xapian databases are storing stemmed forms of the original words) then you would have specified a language.

However, at search time you also need to know the language in order to stem correctly. If your users enter a query in English, you must stem it in English before applying the query to the English database. The same applies to German. If you want to search both databases, perhaps you should create two separate, language-specific queries from each user request.
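A rough sketch of that idea with the Xapian Python bindings follows. It assumes each database was indexed with the matching stemmer; the database paths, the tiny stop-word lists, and the helper names `search_language` and `merge_ranked` are placeholders invented for illustration, not part of Xapian.

```python
import heapq

# Hypothetical configuration: paths and stop-word lists are placeholders.
LANGUAGES = {
    "EN": {"path": "db_en", "stem": "english", "stopwords": ["the", "a", "of", "and"]},
    "DE": {"path": "db_de", "stem": "german", "stopwords": ["der", "die", "das", "und"]},
}


def search_language(text, config, limit=10):
    """Parse `text` with one language's stemmer and stopper, then search that database."""
    import xapian  # imported lazily so merge_ranked() can be used without the bindings

    stopper = xapian.SimpleStopper()
    for word in config["stopwords"]:
        stopper.add(word)

    parser = xapian.QueryParser()
    parser.set_stemmer(xapian.Stem(config["stem"]))
    parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    parser.set_stopper(stopper)

    enquire = xapian.Enquire(xapian.Database(config["path"]))
    enquire.set_query(parser.parse_query(text))
    return [(match.weight, match.docid) for match in enquire.get_mset(0, limit)]


def merge_ranked(hits_by_language, limit=10):
    """Merge per-language (weight, docid) lists into one list, highest weight first."""
    tagged = (
        (weight, lang, docid)
        for lang, hits in hits_by_language.items()
        for weight, docid in hits
    )
    return heapq.nlargest(limit, tagged, key=lambda hit: hit[0])
```

A call might look like `merge_ranked({lang: search_language("running shoes", cfg) for lang, cfg in LANGUAGES.items()})`. Weights computed against separate databases depend on each collection's own statistics, so treating them as directly comparable is a simplification.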

Bear in mind, however, that a query entered in German but stemmed with an English stemmer may produce some odd results. If you have any way of finding out which language your users are writing in at query time, you can use that to apply the correct stemmer.
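One possible way to guess the query language, shown only as a crude heuristic: count how many words of the query appear in each language's stop-word list. The function name `guess_language` and the short word lists are assumptions made up for this sketch; very short queries will often give no signal at all, in which case searching both databases remains the fallback.

```python
# Illustrative stop-word samples only; real lists would be much longer.
EN_STOPWORDS = frozenset({"the", "and", "of", "is", "to", "with", "for"})
DE_STOPWORDS = frozenset({"der", "das", "und", "ist", "mit", "nicht", "für"})


def guess_language(query):
    """Return "english" or "german" by stop-word overlap, or None if undecided."""
    words = query.lower().split()
    english_hits = sum(word in EN_STOPWORDS for word in words)
    german_hits = sum(word in DE_STOPWORDS for word in words)
    if english_hits > german_hits:
        return "english"
    if german_hits > english_hits:
        return "german"
    return None  # tie or no stop words seen: the caller should search both
```

The returned names match the language names accepted by `xapian.Stem`, so the result can be used to pick a stemmer directly.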

HTH - by the way, the Xapian-discuss mailing list (see www.xapian.org) is a good place to ask this kind of question.

Charlie
