"2D search" in Solr, or how to get the best items of the multivalued field "items"?

Posted on 2024-09-04 18:08:03

The title is a bit awkward, but I couldn't find a better one. My problem is as follows:

I have several users stored as documents, and for each document I store several key-value pairs or items (each of which has an id). Now, if I apply highlighting with hl.snippets=5 I can get the first 5 items. But every user could have several hundred items, so

  • you will not get the most relevant 5 items. You will get the first 5 items ...

Another problem is that

  • the highlighted text won't contain the id and so retrieving additional information of the highlighted item text is ugly.

Example where items are emails:

user1 has item1 { text:"developers developers developers", id:1, title:"ms" }
          item2 { text:"c# development",                   id:2, title:"nice!" }
          ...
          item77 ...

user2 has item1 { text:"nice restaurant", id:3, title:"bla"}
          item2 { text:"best cafe",       id:4, title:"blup"}
          ...
          item223 ...

Now if I use highlighting for the text field and query against "restaurant" I get user2 and the text nice <b>restaurant</b>. But how can I determine the id of the highlighted text to display e.g. the title of this item? And what happens if more relevant items are listed at the end of the item-list? Highlighting won't display those ...

So how can I find the best items of a document with multiple such items?

I added my two findings as answers, but as I will point out, each of them has its own drawbacks.

Could anyone point me to a better solution?

Comments (3)

記柔刀 2024-09-11 18:08:03

One of my rules of thumb for designing Solr schemas is: the document is what you will search for.

If you want to search for 'items', then these 'items' are your documents. How you store other stuff, like 'users', is secondary. So 'users' could be in another index like you mentioned, they could be "denormalized" (e.g. their information duplicated in each document), in a relational database, etc. depending on RDBMS availability, how many 'users' there are, how many fields these 'users' have, etc.

EDIT: now you explain that the 'items' are emails, and a possible search is 'restaurant X' and you want to find the best 'items' (emails). Therefore, the document is the email. The schema could be as simple as this: (id, title, text, user).
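To illustrate the (id, title, text, user) schema, here is a minimal sketch of flattening the nested user/item structure from the question into one document per email. The `users` dictionary and the helper name `to_email_docs` are assumptions made up for this example; how the documents are then sent to Solr is left out.

```python
# Hypothetical input: users with nested items, mirroring the question's example.
users = {
    "user1": [
        {"id": 1, "title": "ms", "text": "developers developers developers"},
        {"id": 2, "title": "nice!", "text": "c# development"},
    ],
    "user2": [
        {"id": 3, "title": "bla", "text": "nice restaurant"},
        {"id": 4, "title": "blup", "text": "best cafe"},
    ],
}


def to_email_docs(users):
    """Flatten user -> items into one flat document per email."""
    docs = []
    for user, items in users.items():
        for item in items:
            docs.append({
                "id": item["id"],
                "title": item["title"],
                "text": item["text"],
                "user": user,  # link back to the owning user
            })
    return docs


email_docs = to_email_docs(users)
```

Because every document now carries its own id and title, a highlighted snippet is always attached to exactly one email, which addresses the "which item does this snippet belong to" problem from the question.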

You could enable highlighting to get snippets of the 'text' or 'title' fields matching the 'restaurant X' query.

If you want to give the end-user information about the users that wrote about 'restaurant X', you could facet on the 'user' field. Then the end-user would see that John wrote 10 emails about 'restaurant X' and Robert wrote 6. The end-user thinks "This John dude must know a lot about this restaurant", so he drills down into a search for 'restaurant X' with a filter query user:John
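As a rough sketch, the highlighting and faceting described above map onto standard Solr request parameters (`hl`, `hl.fl`, `facet`, `facet.field`, `fq`). The helper `build_params` and the exact query string are assumptions for illustration; the endpoint the parameters are sent to depends on your setup and is not shown.

```python
from urllib.parse import urlencode


def build_params(query, user=None):
    """Build a Solr query string with highlighting and a facet on 'user'."""
    params = [
        ("q", query),
        ("hl", "true"),           # enable highlighting
        ("hl.fl", "text,title"),  # fields to take snippets from
        ("facet", "true"),        # enable faceting
        ("facet.field", "user"),  # count matching emails per user
    ]
    if user is not None:
        # Drill-down: restrict results to one user's emails.
        params.append(("fq", f"user:{user}"))
    return urlencode(params)


first_search = build_params('text:"restaurant X"')
drill_down = build_params('text:"restaurant X"', user="John")
```

The first request returns the facet counts per user; the second repeats the same query with the filter query added once the end-user picks John.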

请爱~陌生人 2024-09-11 18:08:03

You could use two indices: users->items as described in the question, and an index with 'pure items' referencing back to the user.

Then you will need 2 queries (that's the reason I called the question '2d Search in Solr'):

  1. query the user index => list of e.g. 10 users
  2. query the items index for each user from step 1 => best items

Assume the following example:

userA emails are "restaurant X is bad but restaurant X is cheap", "different topic", "different topicB" and

userB emails are "restaurant X is not nice", "revisited restaurant X and it was ok now", "again in restaurant X and I think it is the best".

Now I query the user index for "restaurant X" and the first user will be userB, which is what I want. If I queried only the item index, I would get item1 of the less relevant userA.
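The example can be played through in memory. The sketch below is not Solr's scoring: it assumes a toy relevance function that simply counts occurrences of the phrase, models the user index as the sum over a user's emails, and uses made-up ids. It only shows why the two rankings disagree.

```python
def score(text, phrase):
    """Toy relevance: number of case-insensitive occurrences of the phrase."""
    return text.lower().count(phrase.lower())


# Hypothetical item index built from the example emails.
emails = [
    {"id": 1, "user": "userA", "text": "restaurant X is bad but restaurant X is cheap"},
    {"id": 2, "user": "userA", "text": "different topic"},
    {"id": 3, "user": "userA", "text": "different topicB"},
    {"id": 4, "user": "userB", "text": "restaurant X is not nice"},
    {"id": 5, "user": "userB", "text": "revisited restaurant X and it was ok now"},
    {"id": 6, "user": "userB", "text": "again in restaurant X and I think it is the best"},
]


def rank_users(emails, phrase):
    """Step 1: rank users by the combined score of all their emails."""
    totals = {}
    for e in emails:
        totals[e["user"]] = totals.get(e["user"], 0) + score(e["text"], phrase)
    return sorted(totals, key=totals.get, reverse=True)


def best_items(emails, phrase, user, limit=2):
    """Step 2: rank one user's matching emails."""
    own = [e for e in emails if e["user"] == user and score(e["text"], phrase) > 0]
    return sorted(own, key=lambda e: score(e["text"], phrase), reverse=True)[:limit]


top_users = rank_users(emails, "restaurant X")
top_item_overall = max(emails, key=lambda e: score(e["text"], "restaurant X"))
user_b_items = best_items(emails, "restaurant X", top_users[0])
```

Under this toy scoring, `top_users` starts with userB, while `top_item_overall` is userA's first email, which is the mismatch described above.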

Drawbacks:

  • bad performance, because you will need one query against the user index and e.g. 10 more to get the most relevant items for each user.
  • you have to maintain two indices.

Update: to avoid many queries, I will try the following: use the user index to get some highlighted snippets, then offer a 'get relevant items' button for every user that triggers a query against the item index.

云胡 2024-09-11 18:08:03

You can use the collapse patch and store each item as a separate document linking back to the user.

The problem with that approach is that you won't get the most relevant user, i.e. the most relevant item is not necessarily from the most relevant user (because that user may instead have several slightly less relevant items).

See the "Assume the following example:" part in my second answer.
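A small in-memory sketch of that drawback, assuming collapsing keeps each user's best-scoring item and orders the groups by that score. The scores and ids are made up (they are plain phrase counts for the userA/userB example) and `collapse_by_user` is a hypothetical helper, not the patch's actual implementation.

```python
def collapse_by_user(scored_items):
    """Keep each user's best-scoring item and order groups by that score."""
    best = {}
    for item in scored_items:
        current = best.get(item["user"])
        if current is None or item["score"] > current["score"]:
            best[item["user"]] = item
    return sorted(best.values(), key=lambda i: i["score"], reverse=True)


# Hypothetical per-item scores for the query "restaurant X".
scored_items = [
    {"id": 1, "user": "userA", "score": 2},  # mentions the restaurant twice
    {"id": 4, "user": "userB", "score": 1},
    {"id": 5, "user": "userB", "score": 1},
    {"id": 6, "user": "userB", "score": 1},
]

groups = collapse_by_user(scored_items)
```

Here userA ends up first because of a single strong item, even though userB wrote three relevant emails.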
