Efficient matching algorithm for set-based triplets
I am looking for an efficient way to solve the following problem.
List 1 is a list of records that are identified by a primitive triplet:
X | Y | Z
List 2 is a list of records that are identified by three sets: one of Xs, one of Ys, and one of Zs. The X, Y and Z values are of the same 'type' as those in List 1, so they are directly comparable with one another.
Set(X) | Set(Y) | Set(Z)
For an item in list 1 I need to find all the items in list 2 where the X, Y, Z from list 1 all occur in their corresponding sets in list 2. This is best demonstrated by an example:
List 1:
X1, Y1, Z1
List 2:
(X1, X2) | (Y1) | (Z1, Z3)
(X1) | (Y1, Y2) | (Z1, Z2, Z3)
(X3) | (Y1, Y3) | (Z2, Z3)
In the above, the item in list 1 would match the first two items in list 2. The third item would not be matched as X1 does not occur in the X set, and Z1 does not occur in the Z set.
I have written a functionally correct version of the algorithm but am concerned about performance on larger data sets. Both lists are very large so iterating over list 1 and then performing an iteration over list 2 per item is going to be very inefficient.
I tried to build an index by de-normalizing each item in list 2 into a map, but the number of index entries in the index per item is proportional to the size of the item's subsets. As such this uses a very high level of memory and also requires some significant resource to build.
Can anyone suggest an optimal way of solving this? I'm happy to consider both memory-optimal and CPU-optimal solutions, but striking a balance would be nice!
Comments (6)
If the total size of the sets is not too large, you could try to model List 2 as bitfields. The structure will probably be quite fragmented, though. Maybe the structures referenced in the Wikipedia article on bit arrays (Judy arrays, tries, Bloom filters) can help address the memory problems of your denormalization approach.
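As a rough illustration of the bitfield idea, here is a minimal Python sketch. It assumes the universe of X, Y and Z values is known up front and small enough to give each value a bit position; the helper names (`bit_positions`, `to_bitfield`, `encode_list2`, `find_matches`) are hypothetical and not taken from any library. It still scans List 2, but each record shrinks to three integers and each membership test becomes a bitwise AND.

```python
def bit_positions(values):
    """Assign each distinct value its own bit position (hypothetical helper)."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


def to_bitfield(items, positions):
    """Encode a set of values as an integer with one bit set per member."""
    mask = 0
    for item in items:
        mask |= 1 << positions[item]
    return mask


def encode_list2(list2, x_pos, y_pos, z_pos):
    """Turn every (set_x, set_y, set_z) record into three integer bitfields."""
    return [
        (to_bitfield(xs, x_pos), to_bitfield(ys, y_pos), to_bitfield(zs, z_pos))
        for xs, ys, zs in list2
    ]


def find_matches(triplet, encoded, x_pos, y_pos, z_pos):
    """Return positions of encoded List 2 records that contain the triplet."""
    x, y, z = triplet
    # A value that never occurs in List 2 cannot match any record.
    if x not in x_pos or y not in y_pos or z not in z_pos:
        return []
    x_bit, y_bit, z_bit = 1 << x_pos[x], 1 << y_pos[y], 1 << z_pos[z]
    return [
        i
        for i, (x_mask, y_mask, z_mask) in enumerate(encoded)
        if x_mask & x_bit and y_mask & y_bit and z_mask & z_bit
    ]
```

With many distinct values the integers become wide and mostly zero, which is the fragmentation concern raised above.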
You could build a tree out of List2; the first level of the tree is the first of (X1..Xn) that appears in set X. The second level is the values for the second item, plus a leaf node containing the set of lists which contain only X1. The next level contains the next possible value, and so on.
This is expensive in memory consumption (N^2 log K, I think? where N=values for X, K=lines in List2) but results in fast retrieval times. If the number of possible Xs is large then this approach will break down...
Obviously you could build this index for all 3 parts of the tuple, and then AND together the results from searching each tree.
There's a fairly efficient way to do this with a single pass over list2. You start by building an index of the items in list1.
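A minimal Python sketch of that idea follows. It assumes list1 is indexed by its X value and that list2 can be consumed as a stream of `(set_x, set_y, set_z)` records; the function names are hypothetical and this is not the answer's original code.

```python
from collections import defaultdict


def index_list1(list1):
    """Group list1 triplets by X value: x -> [(position, y, z), ...]."""
    by_x = defaultdict(list)
    for pos, (x, y, z) in enumerate(list1):
        by_x[x].append((pos, y, z))
    return by_x


def match_streaming(list1, list2_records):
    """Yield (list1 position, list2 position) pairs in one pass over list2.

    list2_records may be any iterable, for example a generator that reads
    records from disk, so list2 never has to be held in memory as a whole.
    """
    by_x = index_list1(list1)
    for rec_pos, (xs, ys, zs) in enumerate(list2_records):
        for x in xs:
            # Only list1 items sharing this X can match; check their Y and Z.
            for pos, y, z in by_x.get(x, ()):
                if y in ys and z in zs:
                    yield pos, rec_pos
```

Only the index of list1 stays resident; each list2 record is examined once and then dropped.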
The extra bookkeeping here will probably make it slower than indexing list2. But since in your case list1 is typically much smaller than list2, this will use much less memory. If you're reading list2 from disk, with this algorithm you never need to keep any part of it in memory.
Memory access can be a big deal, so I can't say for sure which will be faster in practice. You'd have to measure. The worst-case time complexity in both cases, barring hash table malfunctions, is O(len(list1)*len(list2)).
How about using a HashSet (or HashSets) for List 2? This way you will only need to iterate over List 1.
If you use Guava, there is a high-level way to do this that is not necessarily optimal but doesn't do anything crazy:
But it's not that hard to check this "longhand" either.
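For reference, the longhand check could look like the following Python sketch. It assumes each List 2 record is a tuple of three hash sets, and `record_matches` and `find_matches` are hypothetical names. It is the straightforward scan, with no indexing.

```python
def record_matches(triplet, record):
    """True if each element of the triplet is in the corresponding set."""
    x, y, z = triplet
    xs, ys, zs = record
    return x in xs and y in ys and z in zs


def find_matches(triplet, list2):
    """Return every List 2 record whose three sets contain the triplet."""
    return [record for record in list2 if record_matches(triplet, record)]
```

Each membership test is a constant-time hash lookup on average, so one query costs time proportional to the length of List 2.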
There are going to be a lot of ways to approach this. Which is right depends on the data and how much memory is available.
One simple technique is to build a table from list2, to accelerate the queries coming from list1.
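One way such a table could be built is sketched below in Python, assuming one lookup table per position (X, Y, Z) that maps a value to the positions of the list2 records containing it. The names `build_tables` and `query` are hypothetical. This is close to the denormalized index described in the question, so its memory use grows with the total size of all the sets.

```python
from collections import defaultdict


def build_tables(list2):
    """Build three tables (for X, Y, Z): value -> set of list2 positions."""
    tables = (defaultdict(set), defaultdict(set), defaultdict(set))
    for rec_pos, record in enumerate(list2):
        for table, values in zip(tables, record):
            for value in values:
                table[value].add(rec_pos)
    return tables


def query(tables, triplet):
    """Return the positions of list2 records that contain the whole triplet."""
    # .get avoids inserting empty entries into the defaultdicts on a miss.
    candidates = [table.get(value, set()) for table, value in zip(tables, triplet)]
    # Start the intersection from the smallest candidate set.
    candidates.sort(key=len)
    return set.intersection(*candidates)
```

Each query then costs three lookups plus a set intersection, instead of a scan over all of list2.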