Lucene group by

Posted on 2024-12-03 12:27:56

Hi, I have an index of simple documents with 2 fields:

1. profileId as long

2. profileAttribute as long.

I need to know how many profileIds have a certain set of attributes.

For example, I index:

doc1: profileId:1 , profileAttribute = 55
doc2: profileId:1 , profileAttribute = 57
doc3: profileId:2 , profileAttribute = 55

and I want to know how many profiles have both attributes 55 and 57.
In this example the answer is 1, because only profile id 1 has both attributes.

Thanks in advance for your help.


Comments (1)

单身情人 2024-12-10 12:27:56

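You can search for profileAttribute:(57 OR 55), then iterate over the results and collect their profileId values. Putting the ids in a set counts the unique profileIds that have at least one of the attributes; to count only the profiles that have both, keep track of which attributes were seen for each profileId and count the ids for which every requested attribute appeared.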
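A minimal sketch of that counting step, in plain Java with no Lucene dependency. It assumes the hits of the OR query have already been read into an array of {profileId, profileAttribute} pairs; the class and method names (ProfileCounter, countProfilesWithAll) are made up for illustration and are not part of Lucene.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ProfileCounter {
    // Each row of hits is {profileId, profileAttribute}, standing in for
    // one document returned by the OR query (an assumption of this sketch).
    static int countProfilesWithAll(long[][] hits, Set<Long> wanted) {
        // profileId -> the wanted attributes seen so far for that profile
        Map<Long, Set<Long>> seen = new HashMap<>();
        for (long[] hit : hits) {
            long profileId = hit[0];
            long attribute = hit[1];
            if (wanted.contains(attribute)) {
                seen.computeIfAbsent(profileId, k -> new HashSet<>()).add(attribute);
            }
        }
        // A profile qualifies only if every wanted attribute was seen for it.
        int count = 0;
        for (Set<Long> attrs : seen.values()) {
            if (attrs.size() == wanted.size()) {
                count++;
            }
        }
        return count;
    }
}
```

With the three documents from the question and the wanted set {55, 57}, only profileId 1 collects both attributes, so the method returns 1.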

But you need to know that Lucene will perform poorly at this compared to, say, an RDBMS. This is because Lucene is an inverted index, meaning it is very good at retrieving the top documents that match a query. It is, however, not very good at iterating over the stored fields of a large number of documents.

However, if profileId is single-valued and indexed, you can get its values using Lucene's fieldCache which will prevent you from performing costly disk accesses. The drawback is that this fieldCache will use a lot of memory (depending on the size of your index) and take time to load every time you (re-)open your index.
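To illustrate the idea only, here is a plain-Java sketch of what such a cache amounts to: one in-memory array holding the profileId of every document, indexed by internal document id, so that looking up an id for a hit never touches the disk. This is not Lucene's actual fieldCache API; ProfileIdCache and countUnique are hypothetical names, and the matching document ids are assumed to come from a collector.

```java
import java.util.HashSet;
import java.util.Set;

class ProfileIdCache {
    // profileIdByDoc[docId] is the single profileId value of that document.
    // Memory use grows with the number of documents in the index.
    private final long[] profileIdByDoc;

    ProfileIdCache(long[] profileIdByDoc) {
        this.profileIdByDoc = profileIdByDoc.clone();
    }

    // Counts distinct profileIds among the matching documents using only
    // array lookups, with no stored-field access.
    int countUnique(int[] matchingDocIds) {
        Set<Long> unique = new HashSet<>();
        for (int docId : matchingDocIds) {
            unique.add(profileIdByDoc[docId]);
        }
        return unique.size();
    }
}
```

Note that this counts profiles that match at least one document; requiring every attribute still needs the per-profile bookkeeping described earlier.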

If changing the index format is acceptable, this solution can be improved by making profileIds unique. Your index would then have the following format:

doc1: profileId: [1], profileAttribute: [55, 57]
doc2: profileId: [2], profileAttribute: [55]

The difference is that profileIds are unique and profileAttribute is now a multi-valued field. To count the number of profileIds for a given set of profileAttribute values, you now only need to query for that list of values (requiring all of them, e.g. profileAttribute:(55 AND 57), when a profile must have every attribute) and use a TotalHitCountCollector.
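The effect of this layout can be modelled in plain Java, as a sketch rather than real Lucene code: each profile is one entry holding all of its attributes, and the answer is simply the number of entries that contain every required attribute, which is what a hit count over an AND query would report. MultiValuedIndex and countMatching are hypothetical names used only for this illustration.

```java
import java.util.Map;
import java.util.Set;

class MultiValuedIndex {
    // docs maps each unique profileId to its multi-valued profileAttribute field.
    // Returns how many profiles contain every required attribute, i.e. the
    // total hit count of a query that requires all of them.
    static long countMatching(Map<Long, Set<Long>> docs, Set<Long> required) {
        long hits = 0;
        for (Set<Long> attrs : docs.values()) {
            if (attrs.containsAll(required)) {
                hits++;
            }
        }
        return hits;
    }
}
```

Because each profile is a single document, no deduplication of profileIds is needed; the count of matching documents is already the count of profiles.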
