Which analyzer suits my case? A Hibernate Search use case

Posted 2024-11-13 07:01:57

We run a book search application built on Hibernate Search.

The Book entity is defined as follows:

@Entity
@Indexed
public class Book {
    @DocumentId
    private Integer UID;

    @Field
    private String title;

    @Field
    private String description;
    ...
}

If a user searches for a book by name, say Microsoft access 2007, the search returns books whose title or description contains microsoft, access, or 2007. That is what we expected, but some of the returned books are totally unrelated and match only because of the keyword 2007. I am looking for a way to take the importance of each keyword into account. In this case, 2007 should matter less, yet the search currently makes no difference between microsoft, access, and 2007.

The second use case: is there a good analyzer that can be used for both indexing and querying to support multi-word phrases? I thought the default analyzer of Hibernate Search just tokenizes the search text into single words.

If the search text is microsoft access 2007, results that contain "microsoft access" should get the best score.

Other search examples are "salt lake city" and "united states": results that match only salt, lake, or city are not what we expect, or at the very least they should rank behind results that contain "salt lake city".

Can anyone offer me some clues?

thanks!

Comments (2)

彩虹直至黑白 2024-11-20 07:01:57

Lucene should already discount terms that occur frequently and thus don't discriminate well among documents. If you want to increase that effect, you have a few choices:

  1. Change the similarity function from the default, and use the new function to weight terms differently
  2. Boost low-df (high idf) terms in the query by first looking up the number of documents that contain a given term, and adjusting that term's weight accordingly
  3. Write a classifier that can a priori decide which terms are not going to be as effective (e.g., year numbers), and adjust their weight accordingly
  4. Use something like WordNet or Wikipedia as a source of phrases (e.g., leadership skills) that you index as a single token. This will involve a modified TokenStream as configured by your analyzer.
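As a rough illustration of option 2, here is a minimal, library-free sketch of turning document frequencies into per-term boosts. The class `IdfBoost` and its method names are made up for this example, and the formula follows the shape of Lucene's classic idf, `1 + ln(numDocs / (docFreq + 1))`. In a real application the document frequencies would presumably come from the index rather than from a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Hibernate Search.
class IdfBoost {
    /** Classic idf shape: the rarer the term, the larger the weight. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Maps each query term to a boost derived from its document frequency. */
    static Map<String, Double> boosts(String[] terms, Map<String, Integer> docFreqs, int numDocs) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String term : terms) {
            // Terms missing from the map are treated as occurring in no document.
            int df = docFreqs.getOrDefault(term, 0);
            result.put(term, idf(df, numDocs));
        }
        return result;
    }
}
```

If 2007 occurs in far more books than access, it ends up with a smaller boost, which is the effect the question asks for. The resulting values could then be applied as query-time boosts on the individual terms.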

层林尽染 2024-11-20 07:01:57

I don't know how to differentiate a good 2007 from a bad one.

One thing you could do is use an analyzer that ignores numbers for the description but a regular analyzer for the title. That way only numbers in the title are picked up. In practice this does not need a whole new analyzer, just a simple filter that you write and add to the analyzer stack.

You can also index description twice, once ignoring numbers and once not ignoring them. You can then play with the boost factor at query time to search both fields but give the one with numbers a low priority.
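To show how the two copies of the description and the boost factors interact, here is a toy, library-free scorer. It is only a sketch: the class `TwoFieldScore`, the term-counting score, and the boost values in the usage note are assumptions for illustration, not how Lucene actually scores documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy scorer, not Lucene's scoring formula.
class TwoFieldScore {
    /** Counts how many (lowercase) query terms occur in the given field text. */
    static int matches(String[] queryTerms, String fieldText) {
        Set<String> tokens = new HashSet<>(
                Arrays.asList(fieldText.toLowerCase(Locale.ROOT).split("\\s+")));
        int count = 0;
        for (String term : queryTerms) {
            if (tokens.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Combines both copies of the description: withoutNumbers is the copy
     * indexed with numbers stripped, withNumbers is the untouched copy.
     */
    static double score(String[] queryTerms, String withoutNumbers, String withNumbers,
                        double mainBoost, double numberBoost) {
        return mainBoost * matches(queryTerms, withoutNumbers)
                + numberBoost * matches(queryTerms, withNumbers);
    }
}
```

With a main boost of 1.0 and a number boost of 0.2, a book that matches only 2007 still shows up, but it scores well below a book that matches microsoft and access.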

Another solution is to ignore only some number patterns in your custom filter (i.e., year-style numbers, single-digit numbers, etc.): these would be the most common kinds of noisy numbers that you want ignored (I think that's what I would go for first).
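The filtering rule itself could look something like the sketch below. It is plain Java rather than a real Lucene TokenFilter, and the class `NoisyNumberFilter`, the whitespace tokenization, and the definition of "year-style" as a four-digit number from 1900 to 2099 are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper showing only the filtering rule, not a Lucene filter.
class NoisyNumberFilter {
    // Assumption: "year-style" means a four-digit number from 1900 to 2099.
    private static final Pattern YEAR = Pattern.compile("(19|20)\\d{2}");
    private static final Pattern SINGLE_DIGIT = Pattern.compile("\\d");

    /** True for tokens that are year-style numbers or single digits. */
    static boolean isNoisy(String token) {
        return YEAR.matcher(token).matches() || SINGLE_DIGIT.matcher(token).matches();
    }

    /** Lowercases, splits on whitespace, and drops the noisy numbers. */
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !isNoisy(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Other numbers, such as the 95 in "Windows 95", pass through untouched, so only the patterns you explicitly list are dropped. In an actual analyzer chain the same `isNoisy` check would sit inside a custom token filter.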

As for phrase search, simply use Lucene's PhraseQuery or the friendlier Hibernate Search query DSL:

Query luceneQuery = mythQB
    .phrase()
    .onField("history")
    .matching("Thou shalt not kill")
    .createQuery();

The query DSL is documented in full in the Hibernate Search reference documentation.
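A phrase query on its own returns only exact phrase matches, whereas the question wants single-word matches kept but ranked lower. The toy sketch below shows that ranking idea in plain Java: one point per matching term plus a bonus when the terms appear consecutively and in order. The class `PhraseBonus` and the scoring scheme are assumptions for illustration, not Lucene's scoring. In a real setup you would presumably combine a keyword query and a phrase query as optional clauses of one boolean query.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical toy scorer, not Lucene's scoring formula.
class PhraseBonus {
    /** Lowercases the text and splits it on whitespace. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    /** True if the phrase tokens occur consecutively and in order in the text. */
    static boolean containsPhrase(String text, String phrase) {
        return Collections.indexOfSubList(tokens(text), tokens(phrase)) >= 0;
    }

    /** One point per matching term, plus a bonus when the whole phrase matches. */
    static double score(String text, String phrase, double phraseBonus) {
        List<String> docTokens = tokens(text);
        double score = 0;
        for (String term : tokens(phrase)) {
            if (docTokens.contains(term)) {
                score += 1;
            }
        }
        if (containsPhrase(text, phrase)) {
            score += phraseBonus;
        }
        return score;
    }
}
```

With this scheme "Visit Salt Lake City today" outranks "city by the salt lake" for the query "salt lake city", even though both contain all three words.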
