Lucene group by

Posted on 2024-12-03 12:27:56

Hi, I have an index of simple documents, each with 2 fields:

1. profileId as long

2. profileAttribute as long

I need to know how many profileIds have a certain set of attributes.

For example, I index:

doc1: profileId:1 , profileAttribute = 55
doc2: profileId:1 , profileAttribute = 57
doc3: profileId:2 , profileAttribute = 55

and I want to know how many profiles have both attributes 55 and 57.
In this example the answer is 1, because only profile ID 1 has both attributes.

Thanks in advance for your help.

单身情人 2024-12-10 12:27:56

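You can search for profileAttribute:(57 OR 55), then iterate over the results and group them by profileId. Putting the profileIds in a set counts the unique profiles that have at least one of the attributes; to count only the profiles that have both, also record which attributes matched for each profileId and keep the profiles for which every requested attribute was seen.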
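
A minimal sketch of that grouping step, in plain Python with no Lucene involved. The `hits` list is a hypothetical stand-in for the (profileId, profileAttribute) values read from the documents matched by the OR query, and `count_profiles_with_all` is an illustrative name, not a Lucene API. It tracks which attributes were seen per profile and keeps only the profiles that have all of them:

```python
from collections import defaultdict

def count_profiles_with_all(hits, required):
    """Count profileIds whose matched documents cover every attribute in `required`.

    `hits` is an iterable of (profile_id, profile_attribute) pairs, standing in
    for the field values of the documents returned by the OR query.
    """
    required = set(required)
    seen = defaultdict(set)  # profile_id -> required attributes seen so far
    for profile_id, attribute in hits:
        if attribute in required:
            seen[profile_id].add(attribute)
    # Keep only the profiles for which every required attribute was seen.
    return sum(1 for attrs in seen.values() if attrs == required)

# The three documents from the question.
hits = [(1, 55), (1, 57), (2, 55)]
print(count_profiles_with_all(hits, {55, 57}))  # 1
```

The per-profile sets are what make this memory-hungry on large result sets, which is the cost described next.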

But you need to know that Lucene will perform poorly at this compared to, say, an RDBMS. This is because Lucene is an inverted index, meaning it is very good at retrieving the top documents that match a query. It is, however, not very good at iterating over the stored fields of a large number of documents.

However, if profileId is single-valued and indexed, you can get its values using Lucene's FieldCache, which avoids costly disk accesses. The drawback is that the FieldCache will use a lot of memory (depending on the size of your index) and takes time to load every time you (re-)open your index.

If changing the index format is acceptable, this solution can be improved by making profileIds unique. Your index would then have the following format:

doc1: profileId: [1], profileAttribute: [55, 57]
doc2: profileId: [2], profileAttribute: [55]

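The difference is that profileIds are unique and profileAttribute is now a multi-valued field. To count the number of profileIds for a given set of attributes, you now only need to query for the list of attributes and use a TotalHitCountCollector. Since each profile is a single document, the hit count is the profile count. Note that to count the profiles that have every attribute in the set, the query must require all of them, for example profileAttribute:(55 AND 57) rather than the OR query used previously.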
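
To illustrate why one document per profile makes counting cheap, here is a rough model in plain Python rather than Lucene code. `build_postings` and `count_hits_all` are hypothetical names: the first builds a toy inverted index from attribute to document IDs, and the second intersects the posting sets, which plays the role of a query requiring every attribute. The final length corresponds roughly to what a hit-counting collector would report, with no stored fields read at all:

```python
from collections import defaultdict

def build_postings(docs):
    """Map each attribute value to the set of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for attribute in doc["profileAttribute"]:
            postings[attribute].add(doc_id)
    return postings

def count_hits_all(postings, required):
    """Count documents that contain every attribute in `required` (an AND query)."""
    required = list(required)
    if not required:
        return 0
    matching = set(postings.get(required[0], set()))
    for attribute in required[1:]:
        matching &= postings.get(attribute, set())
    return len(matching)

# The restructured index: one document per profile, multi-valued attribute field.
docs = [
    {"profileId": 1, "profileAttribute": [55, 57]},
    {"profileId": 2, "profileAttribute": [55]},
]
postings = build_postings(docs)
print(count_hits_all(postings, [55, 57]))  # 1
```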
