I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups.
To simplify my example and get to the essence of it, let's assume that all groups contain 20 user IDs.
I want to find all pairs of integer sets that have an intersection of 15 or greater.
Should I compare every pair of sets? (If I keep a data structure that maps user IDs to set membership, this would not be necessary.) What is the quickest way to do this? That is, what should my underlying data structure be for representing the integer sets? Sorted sets, unsorted---can hashing somehow help? And what algorithm should I use to compute set intersections? I prefer answers that relate to C/C++ (especially the STL), but any more general algorithmic insights are also welcome.
Update
Also, note that I will be running this in parallel in a shared memory environment, so ideas that cleanly extend to a parallel solution are preferred.
Also, note that the vast majority of set-pairs will have an intersection size of 0---meaning that it might be advantageous to use a data structure that maps user IDs to sets, to avoid calculating the intersection of every pair of sets.
Comments (4)
I would do exactly what you propose: map users to their group. That is, I would keep a list of group IDs for every user. Then I would use the following algorithm:
Given you have G groups each containing U users on average, and given that these users belong to g groups on average, this will run in O(G*U*g). Which, given your problem, is probably much faster than the naive pairwise comparison of groups, which runs in O(G*G*U).
If the vast majority of intersections are 0, that means the number of non-empty intersections is relatively small. Give this a try: keep a map<pair<userset, userset>, int>. For each user, increment the counter for every pair of sets that user belongs to---that is, n*(n-1)/2 entries of that map per user, where n is the number of sets to which the user belongs.

It will use more memory than the simple approach of computing every intersection. In fact it will run up against what's feasible: if each set on average intersects with just 10 others, perhaps in very small intersections, then the map needs 50M entries, which is starting to be a lot of RAM. It's also woefully cache-unfriendly.
It might be faster than doing all the set-intersections, because the O(n^2) terms relate to the number of non-empty intersections and the number of groups to which each user belongs, rather than to the number of sets.
Parallelizing isn't trivial, because of the contention on the giant map. However, you can shard that into a map for each thread, and periodically give one thread a new, empty, map and add the results-so-far into the total results. The different threads then run completely independently most of the time, each given a list of users to process.
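The per-thread sharding idea can be sketched as follows, assuming a prebuilt user → groups index; the names and the interleaved user partitioning are my own choices, not part of the answer:

```cpp
#include <cassert>
#include <map>
#include <thread>
#include <utility>
#include <vector>

using PairCounts = std::map<std::pair<int, int>, long>;

// user_groups[u] = sorted list of group IDs user u belongs to.
// Each thread counts group pairs for its own slice of users in a
// private map, so there is no contention on a shared structure;
// the shard maps are merged into the total at the end.
PairCounts count_shared_users(const std::vector<std::vector<int>>& user_groups,
                              unsigned num_threads) {
    std::vector<PairCounts> shards(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Interleaved partition: thread t handles users t, t+T, t+2T, ...
            for (size_t u = t; u < user_groups.size(); u += num_threads) {
                const auto& gs = user_groups[u];
                for (size_t i = 0; i < gs.size(); ++i)
                    for (size_t j = i + 1; j < gs.size(); ++j)
                        ++shards[t][{gs[i], gs[j]}];  // private map: no locking
            }
        });
    }
    for (auto& w : workers) w.join();

    PairCounts total;  // single-threaded merge of the shards
    for (const auto& shard : shards)
        for (const auto& [p, c] : shard) total[p] += c;
    return total;
}
```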
In order to count the degree of intersection, you still need to visit the other groups the user has, which is still cubic. You could use a hash table or other sparse array to count, but at best it still requires an increment per user for each pair of groups that user is in. If you have N users in G groups, with an average of S users per group and T groups per user, you've got G*G*S/2 work for comparing each pair of groups, versus N*T*T if you have an index from user to group. T = G*S/N, so N*T*T = G*G*S*S/N; for S = 20 and N in the millions there should be an advantage. Unfortunately, you also need at least G*G storage for the intersection counts (25 TB or so for a 4-bit non-sparse counter), and you have to ensure the structure can be incremented in parallel.
For a million users in 10 million groups of 20, very approximately the probability that a user is in a given group is 2e-6, and the probability that two groups share a user is 40e-6, so the 25 TB comes down to about 1 GB of data---not impossible for a sparse array on an ordinary-sized computer.
However, comparing sets of 20 elements for 15 elements in common has more obvious optimisations.
There is also the option of a hybrid approach, using the user->group map to prune the set of group-to-group comparisons to be made. This has the advantage of not requiring increments to a shared data structure:
Using merge sort, this is very easy to parallelise into pure streaming units. You would be sorting around 20 * 200 * 10 million / 2 = 20 billion group-ID pairs (each group of 20 users, times the number of groups each user is in, divided by 2).
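A single-threaded sketch of this sort-based variant: emit one group-ID pair per user per pair of that user's groups, sort the records so identical pairs become runs, then keep the runs that reach the threshold. The function name and signature here are illustrative only:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// user_groups[u] = sorted group IDs of user u.  Returns the group
// pairs that co-occur for at least `threshold` users.
std::vector<std::pair<int, int>>
pairs_by_sorting(const std::vector<std::vector<int>>& user_groups,
                 int threshold) {
    // Emit one (a, b) record per user per pair of that user's groups.
    std::vector<std::pair<int, int>> records;
    for (const auto& gs : user_groups)
        for (size_t i = 0; i < gs.size(); ++i)
            for (size_t j = i + 1; j < gs.size(); ++j)
                records.emplace_back(gs[i], gs[j]);

    // Sorting brings identical pairs together; a run of length k
    // means the two groups share k users.
    std::sort(records.begin(), records.end());

    std::vector<std::pair<int, int>> result;
    for (size_t i = 0; i < records.size();) {
        size_t j = i;
        while (j < records.size() && records[j] == records[i]) ++j;
        if ((int)(j - i) >= threshold) result.push_back(records[i]);
        i = j;
    }
    return result;
}
```

std::sort stands in here for the merge sort mentioned above; an external or parallel merge sort would slot into the same place for data that does not fit in memory.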
One way is to see your problem as a metric-space radius search, where the distance function is the number of non-matching entries and the radius is r = max(number of elements in the sets) - number of equal elements. Filtering of the found elements is then needed to check that enough values are in each set. So unless someone comes up with a metric function that can be used directly, this solution has too many constraints.

One of the data structures for metric search is the BK-tree, which can be used for string similarity searches.
Candidates for your problem are VP-trees and M-trees.
The worst case for a metric tree is O(n^2), when you're searching for a distance > m (the maximum number of elements in the sets): you build the tree in O(n log n) and search in O(n^2).

Apart from that, the actual runtime complexity depends on the ability to prune subtrees of the metric tree while executing the search. In a metric tree, a subtree can be skipped if the distance from the pivot element to the search element is larger than the radius of the pivot element (which is at least the maximum distance from an ancestor to the pivot element). If your entry sets are rather disjoint, the overall runtime will be dominated by the build-up time of the metric tree, O(n log n).
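With the fixed group size of 20 from the question, the distance described above works out to d(A, B) = 20 - |A ∩ B|, so pairs with an intersection of 15 or more lie within radius r = 20 - 15 = 5. A small sketch of that distance function on sorted sets (the equal-size assumption and the function name are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Distance for the metric-tree search: number of elements of the
// larger set that have no match in the other.  For equal-sized sets
// this is |A| - |A ∩ B|.  Both inputs must be sorted ascending, as
// std::set_intersection requires.
int set_distance(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return (int)std::max(a.size(), b.size()) - (int)common.size();
}
```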