Lucene group by
Hi, I have an index of simple documents with 2 fields:
1. profileId as long
2. profileAttribute as long
I need to know how many profileIds have a certain set of attributes.
For example, I index:
doc1: profileId:1 , profileAttribute = 55
doc2: profileId:1 , profileAttribute = 57
doc3: profileId:2 , profileAttribute = 55
and I want to know how many profiles have both attributes 55 and 57.
In this example the answer is 1, because only profile id 1 has both attributes.
Thanks in advance for your help.
Comments (1)
You can search for profileAttribute:(57 OR 55) and then iterate over the results, collecting for each profileId the attributes that matched. The profileIds that collected every requested attribute are the ones to count; putting the ids into a plain set would count the profiles that have at least one of the attributes, not both.

But you need to know that Lucene will perform poorly at this compared to, say, an RDBMS. This is because Lucene is an inverted index, meaning it is very good at retrieving the top documents that match a query. It is, however, not very good at iterating over the stored fields of a large number of documents.
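The iterate-and-group step above could look roughly like the following. This is only a plain-Java sketch, not Lucene code: Hit is a hypothetical stand-in for the stored fields of one matching document, and ProfileCounter and countProfilesWithAll are names made up for illustration. Since the question asks for profiles that have both attributes, the sketch tracks the matched attributes per profileId instead of keeping only a set of ids.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java model of "run the OR query, then group the hits by profileId".
class ProfileCounter {

    // Hypothetical stand-in for the stored fields of one matching document.
    static final class Hit {
        final long profileId;
        final long profileAttribute;

        Hit(long profileId, long profileAttribute) {
            this.profileId = profileId;
            this.profileAttribute = profileAttribute;
        }
    }

    // Counts the profileIds whose hits cover every attribute in `wanted`.
    // `hits` is assumed to be the result of the OR query over `wanted`.
    static int countProfilesWithAll(List<Hit> hits, Set<Long> wanted) {
        Map<Long, Set<Long>> seen = new HashMap<>();
        for (Hit h : hits) {
            // Ignore attributes that were not asked for.
            if (wanted.contains(h.profileAttribute)) {
                seen.computeIfAbsent(h.profileId, k -> new HashSet<>())
                    .add(h.profileAttribute);
            }
        }
        int count = 0;
        for (Set<Long> attrs : seen.values()) {
            // attrs is a subset of wanted, so equal size means "has all".
            if (attrs.size() == wanted.size()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The three documents from the question.
        List<Hit> hits = Arrays.asList(
            new Hit(1, 55), new Hit(1, 57), new Hit(2, 55));
        Set<Long> wanted = new HashSet<>(Arrays.asList(55L, 57L));
        System.out.println(countProfilesWithAll(hits, wanted)); // prints 1
    }
}
```

Note that the map grows with the number of distinct profileIds in the result, which is part of why this approach gets expensive on large result sets.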
However, if profileId is single-valued and indexed, you can get its values using Lucene's fieldCache, which avoids costly disk accesses. The drawback is that the fieldCache will use a lot of memory (depending on the size of your index) and takes time to load every time you (re-)open your index.

If changing the index format is acceptable, this solution can be improved by making profileIds unique. Your index would then have the following format:
doc1: profileId:1 , profileAttribute = 55, 57
doc2: profileId:2 , profileAttribute = 55
The difference is that profileIds are unique and profileAttribute is now a multi-valued field. To count the number of profileIds for a given set of profileAttribute values, you now only need to query for the list of attributes (as previously, but requiring all of them, for example profileAttribute:(55 AND 57)) and use a TotalHitCountCollector.
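To illustrate why the one-document-per-profile layout makes counting cheap, here is a rough plain-Java model of an inverted index over a multi-valued profileAttribute field. It is not Lucene code: MultiValuedIndexModel and its methods are hypothetical names, and it assumes the query requires all of the requested attributes, since the question asks for profiles having both. The point is that the count falls out of the matching document ids alone, without reading any stored field, which is the same kind of work a hit-counting collector does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: one document per profile, many attributes per document.
class MultiValuedIndexModel {

    // attribute value -> ids of the documents that contain it
    private final Map<Long, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Adds one document for one profile with all of its attributes.
    void addProfile(long... attributes) {
        int docId = nextDocId++;
        for (long a : attributes) {
            postings.computeIfAbsent(a, k -> new TreeSet<>()).add(docId);
        }
    }

    // Counts the documents that contain every attribute in `wanted`
    // by intersecting the posting sets (the equivalent of an AND query).
    int countProfilesWithAll(long... wanted) {
        if (wanted.length == 0) {
            return 0;
        }
        Set<Integer> result = null;
        for (long a : wanted) {
            Set<Integer> docs = postings.getOrDefault(a, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs); // copy so postings stay intact
            } else {
                result.retainAll(docs);
            }
        }
        return result.size();
    }

    public static void main(String[] args) {
        MultiValuedIndexModel index = new MultiValuedIndexModel();
        index.addProfile(55L, 57L); // profileId 1
        index.addProfile(55L);      // profileId 2
        System.out.println(index.countProfilesWithAll(55L, 57L)); // prints 1
        System.out.println(index.countProfilesWithAll(55L));      // prints 2
    }
}
```

Because each profile is exactly one document here, the number of hits is already the number of profiles, so no grouping step is needed afterwards.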