Fastest algorithm for finding sets with high intersection

Posted 2024-08-30 04:01:50

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups.

To simplify my example and get to the essence of it, let's assume that all groups contain 20 user IDs.

I want to find all pairs of integer sets that have an intersection of 15 or greater.

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.) What is the quickest way to do this? That is, what should my underlying data structure be for representing the integer sets? Sorted sets, unsorted---can hashing somehow help? And what algorithm should I use to compute set intersections? I prefer answers that relate to C/C++ (especially STL), but any more general algorithmic insights are also welcome.

Update
Also, note that I will be running this in parallel in a shared memory environment, so ideas that cleanly extend to a parallel solution are preferred.

Also, note that the vast majority of set-pairs will have an intersection size of 0---meaning that it might be advantageous to use a data structure that maps user IDs to sets, to avoid calculating the intersection of every pair of sets.

Answers (4)

谜泪 2024-09-06 04:01:50

I would do exactly what you propose: map users to their group. That is, I would keep a list of group ids for every user. Then I would use the following algorithm:

foreach group:
  map = new Map<Group, int>  // maps groups to count
  foreach user in group:
    foreach userGroup in user.groups:
      map[userGroup]++
      if( map[userGroup] == 15 && userGroup.id > group.id )
        largeIntersection( group, userGroup )

Given you have G groups each containing U users on average, and given that these users belong to g groups on average, then this will run in O( G*U*g ). Which, given your problem, is probably much faster than the naive pairwise comparison of groups which runs in O(G*G*U).
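
For concreteness, here is a minimal C++/STL sketch of this counting algorithm (my illustration, not code from the answer). It assumes groups are stored as vectors of user IDs, that the user-to-groups index has already been built, and that find_large_intersections and the container layouts are placeholder choices:

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using UserId  = std::uint32_t;
using GroupId = std::uint32_t;

// groups[g]      : the user IDs belonging to group g
// user_groups[u] : the inverted index, listing the groups user u is in
void find_large_intersections(
    const std::vector<std::vector<UserId>>& groups,
    const std::unordered_map<UserId, std::vector<GroupId>>& user_groups,
    int threshold,  // 15 in the question
    std::vector<std::pair<GroupId, GroupId>>& out)
{
    for (GroupId g = 0; g < groups.size(); ++g) {
        std::unordered_map<GroupId, int> counts;   // overlap counts for g
        for (UserId u : groups[g]) {
            for (GroupId other : user_groups.at(u)) {
                if (other <= g) continue;           // report each pair once
                if (++counts[other] == threshold)   // fires exactly once per pair
                    out.emplace_back(g, other);
            }
        }
    }
}

The per-group counts map only ever holds groups that actually share a user with g, which is what makes this cheaper than the pairwise scan. Since the outer loop writes nothing shared except out, it also parallelises cleanly: give each thread a range of group IDs and a private output buffer.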

清泪尽 2024-09-06 04:01:50

If the vast majority of intersections are 0, that means the number of non-empty intersections is relatively small. Give this a try:

  • Throw away all sets of size <15 before you start
  • Calculate your lookup from userid -> list of sets to which it belongs
  • Create a map<pair<userset, userset>, int>
  • For each user, increment (creating them first if necessary) the n*(n-1)/2 entries of that map, where n is the number of sets to which the user belongs.
  • When that's finished, scan the map for entries whose value is at least 15 (a sketch follows this list).
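
As a rough C++ sketch of this pair-counting idea (my illustration; the names are hypothetical), assuming each user's group list is sorted ascending so every pair comes out in a canonical (smaller, larger) order, and has already been pre-filtered to sets of size 15 or more:

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using GroupId = std::uint32_t;

// user_groups[u] lists the groups user u belongs to, sorted ascending.
std::map<std::pair<GroupId, GroupId>, int>
count_pair_overlaps(const std::vector<std::vector<GroupId>>& user_groups)
{
    std::map<std::pair<GroupId, GroupId>, int> overlap;
    for (const auto& gs : user_groups)
        for (std::size_t i = 0; i < gs.size(); ++i)      // n*(n-1)/2 pairs
            for (std::size_t j = i + 1; j < gs.size(); ++j)
                ++overlap[{gs[i], gs[j]}];               // created on first use
    return overlap;
}

A final pass over overlap then keeps the entries whose count is at least 15.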

It will use more memory than the simple approach of computing every intersection. In fact it will run up against what's feasible: if each set on average intersects with just 10 others, perhaps in very small intersections, then the map needs 50M entries, which is starting to be a lot of RAM. It's also woefully cache-unfriendly.

It might be faster than doing all the set-intersections, because the O(n^2) terms relate to the number of non-empty intersections and the number of groups to which each user belongs, rather than to the number of sets.

Parallelizing isn't trivial, because of the contention on the giant map. However, you can shard that into a map for each thread, and periodically give one thread a new, empty, map and add the results-so-far into the total results. The different threads then run completely independently most of the time, each given a list of users to process.

穿透光 2024-09-06 04:01:50

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.)

In order to count the degree of intersection, you still need to visit the other groups each user belongs to, which is still cubic. You could use a hash table or other sparse array to count, but at best it would still require an increment for each user, for each pair of groups that user is in. If you have N users in G groups, with an average of S users per group and each user in T groups on average, then you've got GGS/2 increments for comparing each pair of groups, versus NTT if you have an index from user to group. T = GS/N, so NTT = GGSS/N; for S = 20 and N in the millions there should be an advantage. Unfortunately, you also need at least G*G storage for the intersection counts (25 TB or so for a 4-bit non-sparse counter), and the structure has to be incrementable in parallel.

For a million users in 10 million groups of 20, very approximately the probability that a user is in a given group is 2e-6, and the probability that two groups will share users will be 40e-6, so the 25TB comes down to 1GB data, so not impossible for a sparse array on an ordinary sized computer.

However, comparing sets of 20 elements for 15 elements in common admits more obvious optimisations:

  • If the groups are sorted, you require no working storage, just output the degree of difference between the input groups directly.
  • Most of the memory accesses will be linear in contiguous memory areas, and the results are dependent only on the two sets being compared, rather than relying on summing across the whole data set. Accessing main memory randomly is significantly slower than accessing it linearly. Altering main memory randomly with bus locks is orders of magnitude slower than accessing cache without the need to lock the bus (though if you have a couple of GB per core you can use the user->group approach without having to do any synchronisation).
  • You only need to count the elements that are distinct between the sets, and can stop as soon as there are more than 5 (a merge with this early exit is sketched after this list); if the data is random then most sets will be disjoint, so the average number of elements visited is small.
  • You can quickly discount certain groups by treating the difference as a distance (if A differs from B in 11 places, and C differs from B in 5, then C differs from A in between 6 and 16 places, so A and C can be discounted without comparing them directly). Since most of the sets are completely disjoint, though, this won't gain you much.
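
Here is one possible shape for that early-exit comparison (my sketch, not the answerer's code), assuming each group is a sorted std::vector<UserId>. With 20-element sets and a threshold of 15, a pair can be rejected as soon as either side accumulates more than 5 unmatched elements:

#include <cstdint>
#include <vector>

using UserId = std::uint32_t;

// Returns true if two sorted groups share at least `threshold` user IDs,
// bailing out as soon as that becomes impossible.
bool intersects_enough(const std::vector<UserId>& a,
                       const std::vector<UserId>& b,
                       int threshold = 15)
{
    const int max_miss_a = static_cast<int>(a.size()) - threshold;
    const int max_miss_b = static_cast<int>(b.size()) - threshold;
    int miss_a = 0, miss_b = 0;            // elements seen only on one side
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j])      { ++i; ++j; }
        else if (a[i] < b[j])  { if (++miss_a > max_miss_a) return false; ++i; }
        else                   { if (++miss_b > max_miss_b) return false; ++j; }
    }
    // Whatever is left over at the end of either list is also unmatched.
    miss_a += static_cast<int>(a.size() - i);
    miss_b += static_cast<int>(b.size() - j);
    return miss_a <= max_miss_a && miss_b <= max_miss_b;
}

The memory access is the linear, cache-friendly pattern described above, and each call depends only on the two sets being compared.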

There is also the option of a hybrid approach, using the user->group map to prune the set of group to group comparisons to be made. This has the advantage of not requiring incrementing of a shared data structure:

  • for each pair of groups that a user is in, add that pair to the list to investigate.
  • sort the list of pairs of groups with at least one user in common.
  • the number of times each pair occurs in the list is the number of users they have in common.

Using merge sort, this is very easy to parallelise into pure streaming units. You would be sorting around 20*200*10 million/2 = 20 billion group ID pairs (each group of 20 users, times the number of groups each user is in, divided by 2).
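
A serial sketch of that hybrid pipeline (illustrative only; a parallel version would merge-sort per-thread pair lists as described above). Each user's group list is assumed sorted, so pairs come out in canonical order:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using GroupId = std::uint32_t;
using Pair = std::pair<GroupId, GroupId>;

// user_groups[u] lists the groups user u belongs to, sorted ascending.
// Returns every group pair with at least `threshold` users in common.
std::vector<Pair> hybrid_pairs(
    const std::vector<std::vector<GroupId>>& user_groups, int threshold = 15)
{
    std::vector<Pair> pairs;
    for (const auto& gs : user_groups)               // one pair per shared user
        for (std::size_t i = 0; i < gs.size(); ++i)
            for (std::size_t j = i + 1; j < gs.size(); ++j)
                pairs.emplace_back(gs[i], gs[j]);

    std::sort(pairs.begin(), pairs.end());           // identical pairs adjacent

    std::vector<Pair> result;
    for (std::size_t k = 0; k < pairs.size(); ) {
        std::size_t run = k;
        while (run < pairs.size() && pairs[run] == pairs[k]) ++run;
        if (run - k >= static_cast<std::size_t>(threshold))
            result.push_back(pairs[k]);              // run length = |A ∩ B|
        k = run;
    }
    return result;
}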

忆离笙 2024-09-06 04:01:50

One way is to view your problem as a metric-space radius search, where the distance function is the number of non-matching entries and the radius is r = max(number of elements in the sets) - required number of equal elements. The found elements then need to be filtered to check that enough values really are in common. So unless someone comes up with a metric function that can be used directly, this solution has too many constraints.

One of the data structures for metric search is the BK-Tree that can be used for string similarity searches.

Candidates for your problem are VP-trees and M-trees.

The worst case for a metric tree is O(n^2), when you're searching for a distance > m (the maximum number of elements in the sets): you build the tree in O(n log n) and search in O(n^2).

Apart from that, the actual runtime complexity depends on the ability to prune subtrees of the metric tree while executing the search. In a metric tree, a subtree can be skipped if the distance of the pivot element to the search element is larger than the radius of the pivot element (which is at least the maximum distance of an ancestor to the pivot element). If your entry sets are largely disjoint, the overall runtime will be dominated by the build-up time of the metric tree, O(n log n).
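
For illustration, the distance function this answer describes could look like the following (my sketch; the VP-tree or M-tree itself would be a separate implementation or library). For sorted 20-element groups, requiring an intersection of at least 15 corresponds to a search radius of r = 20 - 15 = 5:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using UserId = std::uint32_t;

// Metric-search distance: the number of entries of `a` that do not appear
// in `b`. For equal-size sets this equals |a| - |a ∩ b|. Inputs must be sorted.
int set_distance(const std::vector<UserId>& a, const std::vector<UserId>& b)
{
    std::vector<UserId> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return static_cast<int>(a.size() - common.size());
}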
