Fastest algorithm for finding sets with high intersection

Posted 2024-08-30 04:01:50

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups.

To simplify my example and get to the essence of it, let's assume that all groups contain 20 user IDs.

I want to find all pairs of integer sets that have an intersection of 15 or greater.

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.) What is the quickest way to do this? That is, what should my underlying data structure be for representing the integer sets? Sorted sets, unsorted---can hashing somehow help? And what algorithm should I use to compute set intersections? I prefer answers that relate to C/C++ (especially STL), but any more general algorithmic insights are also welcome.

Update
Also, note that I will be running this in parallel in a shared memory environment, so ideas that cleanly extend to a parallel solution are preferred.

Also, note that the vast majority of set-pairs will have an intersection size of 0---meaning that it might be advantageous to use a data structure that maps user IDs to sets, to avoid calculating the intersection of every pair of sets.

Answers (4)

谜泪 2024-09-06 04:01:50

I would do exactly what you propose: map users to their group. That is, I would keep a list of group ids for every user. Then I would use the following algorithm:

foreach group:
  map = new Map<Group, int>  // maps groups to count
  foreach user in group:
    foreach userGroup in user.groups:
      map[userGroup]++
      if( map[userGroup] == 15 && userGroup.id > group.id )
        largeIntersection( group, userGroup )

Given you have G groups each containing U users on average, and given that these users belong to g groups on average, then this will run in O( G*U*g ). Which, given your problem, is probably much faster than the naive pairwise comparison of groups which runs in O(G*G*U).
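
For concreteness, here is a minimal C++/STL sketch of this counting algorithm (my illustration, not code from the answer). It assumes groups are stored as vectors of user IDs, that the user-to-groups index has already been built, and that find_large_intersections and the container layouts are placeholder choices:

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using UserId  = std::uint32_t;
using GroupId = std::uint32_t;

// groups[g]      : the user IDs belonging to group g
// user_groups[u] : the inverted index, listing the groups user u is in
void find_large_intersections(
    const std::vector<std::vector<UserId>>& groups,
    const std::unordered_map<UserId, std::vector<GroupId>>& user_groups,
    int threshold,  // 15 in the question
    std::vector<std::pair<GroupId, GroupId>>& out)
{
    for (GroupId g = 0; g < groups.size(); ++g) {
        std::unordered_map<GroupId, int> counts;   // overlap counts for g
        for (UserId u : groups[g]) {
            for (GroupId other : user_groups.at(u)) {
                if (other <= g) continue;           // report each pair once
                if (++counts[other] == threshold)   // fires exactly once per pair
                    out.emplace_back(g, other);
            }
        }
    }
}

The per-group counts map only ever holds groups that actually share a user with g, which is what makes this cheaper than the pairwise scan. Since the outer loop writes nothing shared except out, it also parallelises cleanly: give each thread a range of group IDs and a private output buffer.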

清泪尽 2024-09-06 04:01:50

If the vast majority of intersections are 0, that means the number of non-empty intersections is relatively small. Give this a try:

  • Throw away all sets of size <15 before you start
  • Calculate your lookup from userid -> list of sets to which it belongs
  • Create a map<pair<userset, userset>, int>
  • For each user, increment (creating them first if necessary) the n*(n-1)/2 entries of that map, where n is the number of sets to which the user belongs.
  • When that's finished, scan the map for entries whose value is at least 15 (a sketch follows this list).
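
As a rough C++ sketch of this pair-counting idea (my illustration; the names are hypothetical), assuming each user's group list is sorted ascending so every pair comes out in a canonical (smaller, larger) order, and has already been pre-filtered to sets of size 15 or more:

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using GroupId = std::uint32_t;

// user_groups[u] lists the groups user u belongs to, sorted ascending.
std::map<std::pair<GroupId, GroupId>, int>
count_pair_overlaps(const std::vector<std::vector<GroupId>>& user_groups)
{
    std::map<std::pair<GroupId, GroupId>, int> overlap;
    for (const auto& gs : user_groups)
        for (std::size_t i = 0; i < gs.size(); ++i)      // n*(n-1)/2 pairs
            for (std::size_t j = i + 1; j < gs.size(); ++j)
                ++overlap[{gs[i], gs[j]}];               // created on first use
    return overlap;
}

A final pass over overlap then keeps the entries whose count is at least 15.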

It will use more memory than the simple approach of computing every intersection. In fact it will run up against what's feasible: if each set on average intersects with just 10 others, perhaps in very small intersections, then the map needs 50M entries, which is starting to be a lot of RAM. It's also woefully cache-unfriendly.

It might be faster than doing all the set-intersections, because the O(n^2) terms relate to the number of non-empty intersections and the number of groups to which each user belongs, rather than to the number of sets.

Parallelizing isn't trivial, because of the contention on the giant map. However, you can shard that into a map for each thread, and periodically give one thread a new, empty, map and add the results-so-far into the total results. The different threads then run completely independently most of the time, each given a list of users to process.

穿透光 2024-09-06 04:01:50

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.)

In order to count the degree of intersection, you still need to visit the other groups each user belongs to, which is still cubic. You could use a hash table or other sparse array to count, but at best it would still require an increment for each user, for each pair of groups that user is in. If you have N users in G groups, with an average of S users per group and each user in T groups on average, then you've got GGS/2 increments for comparing each pair of groups, versus NTT if you have an index from user to group. T = GS/N, so NTT = GGSS/N; for S = 20 and N in the millions there should be an advantage. Unfortunately, you also need at least G*G storage for the intersection counts (25 TB or so for a 4-bit non-sparse counter), and the structure has to be incrementable in parallel.

For a million users in 10 million groups of 20, very approximately the probability that a user is in a given group is 2e-6, and the probability that two groups will share users will be 40e-6, so the 25TB comes down to 1GB data, so not impossible for a sparse array on an ordinary sized computer.

However, comparing sets of 20 elements for 15 elements in common admits more obvious optimisations:

  • If the groups are sorted, you require no working storage, just output the degree of difference between the input groups directly.
  • Most of the memory accesses will be linear in contiguous memory areas, and the results are dependent only on the two sets being compared, rather than relying on summing across the whole data set. Accessing main memory randomly is significantly slower than accessing it linearly. Altering main memory randomly with bus locks is orders of magnitude slower than accessing cache without the need to lock the bus (though if you have a couple of GB per core you can use the user->group approach without having to do any synchronisation).
  • You only need to count the elements that are distinct between the sets, and can stop as soon as there are more than 5 (a merge with this early exit is sketched after this list); if the data is random then most sets will be disjoint, so the average number of elements visited is small.
  • You can quickly discount certain groups by treating the difference as a distance (if A differs from B in 11 places, and C differs from B in 5, then C differs from A in between 6 and 16 places, so A and C can be discounted without comparing them directly). Since most of the sets are completely disjoint, though, this won't gain you much.
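
Here is one possible shape for that early-exit comparison (my sketch, not the answerer's code), assuming each group is a sorted std::vector<UserId>. With 20-element sets and a threshold of 15, a pair can be rejected as soon as either side accumulates more than 5 unmatched elements:

#include <cstdint>
#include <vector>

using UserId = std::uint32_t;

// Returns true if two sorted groups share at least `threshold` user IDs,
// bailing out as soon as that becomes impossible.
bool intersects_enough(const std::vector<UserId>& a,
                       const std::vector<UserId>& b,
                       int threshold = 15)
{
    const int max_miss_a = static_cast<int>(a.size()) - threshold;
    const int max_miss_b = static_cast<int>(b.size()) - threshold;
    int miss_a = 0, miss_b = 0;            // elements seen only on one side
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j])      { ++i; ++j; }
        else if (a[i] < b[j])  { if (++miss_a > max_miss_a) return false; ++i; }
        else                   { if (++miss_b > max_miss_b) return false; ++j; }
    }
    // Whatever is left over at the end of either list is also unmatched.
    miss_a += static_cast<int>(a.size() - i);
    miss_b += static_cast<int>(b.size() - j);
    return miss_a <= max_miss_a && miss_b <= max_miss_b;
}

The memory access is the linear, cache-friendly pattern described above, and each call depends only on the two sets being compared.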

There is also the option of a hybrid approach, using the user->group map to prune the set of group to group comparisons to be made. This has the advantage of not requiring incrementing of a shared data structure:

  • for each pair of groups that a user is in, add that pair to the list to investigate.
  • sort the list of pairs of groups with at least one user in common.
  • the number of times each pair occurs in the list is the number of users they have in common.

Using merge sort, this is very easy to parallelise into pure streaming units. You would be sorting around 20*200*10 million/2 = 20 billion group ID pairs (each group of 20 users, times the number of groups each user is in, divided by 2).
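
A serial sketch of that hybrid pipeline (illustrative only; a parallel version would merge-sort per-thread pair lists as described above). Each user's group list is assumed sorted, so pairs come out in canonical order:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using GroupId = std::uint32_t;
using Pair = std::pair<GroupId, GroupId>;

// user_groups[u] lists the groups user u belongs to, sorted ascending.
// Returns every group pair with at least `threshold` users in common.
std::vector<Pair> hybrid_pairs(
    const std::vector<std::vector<GroupId>>& user_groups, int threshold = 15)
{
    std::vector<Pair> pairs;
    for (const auto& gs : user_groups)               // one pair per shared user
        for (std::size_t i = 0; i < gs.size(); ++i)
            for (std::size_t j = i + 1; j < gs.size(); ++j)
                pairs.emplace_back(gs[i], gs[j]);

    std::sort(pairs.begin(), pairs.end());           // identical pairs adjacent

    std::vector<Pair> result;
    for (std::size_t k = 0; k < pairs.size(); ) {
        std::size_t run = k;
        while (run < pairs.size() && pairs[run] == pairs[k]) ++run;
        if (run - k >= static_cast<std::size_t>(threshold))
            result.push_back(pairs[k]);              // run length = |A ∩ B|
        k = run;
    }
    return result;
}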

忆离笙 2024-09-06 04:01:50

One way is to view your problem as a metric-space radius search, where the distance function is the number of non-matching entries and the radius is r = max(number of elements in the sets) - required number of equal elements. The found elements then need to be filtered to check that enough values really are in common. So unless someone comes up with a metric function that can be used directly, this solution has too many constraints.

One of the data structures for metric search is the BK-Tree that can be used for string similarity searches.

Candidates for your problem are VP-trees and M-trees.

The worst case for a metric tree is O(n^2), when you're searching for a distance > m (the maximum number of elements in the sets): you build the tree in O(n log n) and search in O(n^2).

Apart from that, the actual runtime complexity depends on the ability to prune subtrees of the metric tree while executing the search. In a metric tree, a subtree can be skipped if the distance of the pivot element to the search element is larger than the radius of the pivot element (which is at least the maximum distance of an ancestor to the pivot element). If your entry sets are largely disjoint, the overall runtime will be dominated by the build-up time of the metric tree, O(n log n).
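
For illustration, the distance function this answer describes could look like the following (my sketch; the VP-tree or M-tree itself would be a separate implementation or library). For sorted 20-element groups, requiring an intersection of at least 15 corresponds to a search radius of r = 20 - 15 = 5:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using UserId = std::uint32_t;

// Metric-search distance: the number of entries of `a` that do not appear
// in `b`. For equal-size sets this equals |a| - |a ∩ b|. Inputs must be sorted.
int set_distance(const std::vector<UserId>& a, const std::vector<UserId>& b)
{
    std::vector<UserId> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return static_cast<int>(a.size() - common.size());
}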
