Python: Fast extraction of intersections among all possible 2-combinations in a large number of lists

Posted 2024-08-11 21:16:10

I have a dataset of ca. 9K lists of variable length (1 to 100K elements). I need to calculate the length of the intersection of every possible 2-list combination in this dataset. Note that the elements in each list are unique, so the lists can be stored as sets in Python.

What is the most efficient way to do this in Python?

Edit: I forgot to specify that I need to be able to match each intersection value to the corresponding pair of lists. Thanks everybody for the prompt response and apologies for the confusion!

Comments (3)

吻泪 2024-08-18 21:16:10

If your sets are stored in s, for example:

s = [set([1, 2]), set([1, 3]), set([1, 2, 3]), set([2, 4])]

Then you can use itertools.combinations to take them two by two and calculate the intersection (note that, as Alex pointed out, combinations is only available since version 2.6). Here with a list comprehension (just for the sake of the example):

from itertools import combinations
[i[0] & i[1] for i in combinations(s, 2)]

Or, in a loop, which is probably what you need:

for i in combinations(s, 2):
    inter = i[0] & i[1]
    # processes the intersection set result "inter"

So, to have the length of each one of them, that "processing" would be:

    l = len(inter)

This is memory-efficient, since combinations is an iterator: it produces each pair on demand and never builds the full list of pairs in advance.
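Putting the pieces above together, a minimal sketch for the example list `s` could look like this. It uses tuple unpacking (`a, b`) instead of `i[0]` and `i[1]`, which is only a stylistic choice, and set literals, which assume Python 2.7 or later:

```python
from itertools import combinations

s = [{1, 2}, {1, 3}, {1, 2, 3}, {2, 4}]

# combinations() yields each unordered pair exactly once,
# in the order (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
lengths = [len(a & b) for a, b in combinations(s, 2)]
print(lengths)
```

For 9K sets the list of lengths has about 40 million entries, so writing each value out inside a loop may be preferable to collecting them all in a list.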


Edit: Note that with this method, "s" does not have to be a list: combinations accepts any iterable, for example a generator that yields sets. Be aware, though, that combinations reads the whole iterable into an internal tuple before it produces the first pair, so a generator does not reduce peak memory here, and it can be slower depending on how you generate the elements (not that memory should be a problem in your case).

For example, if each set is made from a function gen:

def gen(parameter):
    while more_sets():
        # ... some code to generate the next set 'x'
        yield x

with open("results", "wt") as f_results:
    for i in combinations(gen("data"), 2):
        inter = i[0] & i[1]
        f_results.write("%d\n" % len(inter))

Edit 2: How to collect indices (following redrat's comment).

Besides the quick solution I gave in a comment, a more efficient way to collect the set indices is to work with a list of (index, set) tuples instead of a list of sets.

Example with new format:

s = [(0, set([1, 2])), (1, set([1, 3])), (2, set([1, 2, 3]))]

If you are building this list to calculate the combinations anyway, it should be simple to adapt to your new requirements. The main loop becomes:

with open("results", "wt") as f_results:
    for i in combinations(s, 2):
        inter = i[0][1] & i[1][1]
        f_results.write("length of %d & %d: %d\n" % (i[0][0], i[1][0], len(inter)))

In the loop, i[0] and i[1] are each a tuple (index, set), so i[0][1] is the first set and i[0][0] is its index.
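If the sets are already in a plain list, one possible shortcut is to let `enumerate` produce the (index, set) tuples on the fly. The sketch below is an illustration rather than part of the original answer. It collects the sizes in a dict keyed by the index pair, and the name `sizes` is hypothetical:

```python
from itertools import combinations

s = [{1, 2}, {1, 3}, {1, 2, 3}]

# enumerate() attaches a position to each set, so no separate
# (index, set) list has to be built by hand
sizes = {
    (i, j): len(a & b)
    for (i, a), (j, b) in combinations(enumerate(s), 2)
}
print(sizes[(0, 2)])  # size of the intersection of s[0] and s[2]
```

A dict makes it easy to look up any pair afterwards. For the full 9K-list dataset it would hold roughly 40 million entries, so streaming to a file as in the loop above may be the more practical choice.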

雄赳赳气昂昂 2024-08-18 21:16:10

As you need to produce a (N by N/2) matrix of results, i.e., O(N squared) outputs, no approach can be less than O(N squared) -- in any language, of course. (N is "about 9K" in your question). So, I see nothing intrinsically faster than (a) making the N sets you need, and (b) iterating over them to produce the output -- i.e., the simplest approach. IOW:

def lotsofintersections(manylists):
  manysets = [set(x) for x in manylists]
  moresets = list(manysets)
  for s in reversed(manysets):
    # drop s itself, so moresets holds only the sets that come before it
    moresets.pop()
    for z in moresets:
      yield s & z

This code's already trying to add some minor optimization (e.g. by avoiding slicing or popping off the front of lists, which might add other O(N squared) factors).
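To put the O(N squared) bound in concrete terms, the number of pairs can be computed directly. This is a small worked example, assuming N is exactly 9000 (the question only says "ca. 9K") and Python 3.8 or later for `math.comb`:

```python
from math import comb

n_lists = 9000  # assumption: "ca. 9K" rounded for illustration
n_pairs = comb(n_lists, 2)  # same as n_lists * (n_lists - 1) // 2
print(n_pairs)
```

So any approach has to produce on the order of 40 million intersection sizes, whatever the language.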

If you have many cores and/or nodes available and are looking for parallel algorithms, it's a different case of course -- if that's your case, can you mention the kind of cluster you have, its size, how nodes and cores can best communicate, and so forth?

Edit: as the OP has casually mentioned in a comment (!) that they actually need the numbers of the sets being intersected (really, why omit such crucial parts of the specs?! at least edit the question to clarify them...), this would only require changing this to:

  L = len(manysets)
  for i, s in enumerate(reversed(manysets)):
    moresets.pop()
    for j, z in enumerate(moresets):
      yield L - i, j + 1, s & z

(if you need to "count from 1" for the progressive identifiers -- otherwise obvious change).

But if that's part of the specs you might as well use simpler code -- forget moresets, and:

  L = len(manysets)
  for i in range(L):
    s = manysets[i]
    for j in range(i+1, L):
      yield i, j, s & manysets[j]

this time assuming you want to "count from 0" instead, just for variety;-)
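As a self-contained sketch of this last variant, the loop can be wrapped in a generator function. The function name `pairwise_intersection_sizes` is hypothetical, and the sketch yields the size of each intersection rather than the set itself, since the question asks for lengths:

```python
def pairwise_intersection_sizes(manylists):
    """Yield (i, j, size) for every pair i < j, counting from 0."""
    manysets = [set(x) for x in manylists]
    n = len(manysets)
    for i in range(n):
        s = manysets[i]
        for j in range(i + 1, n):
            yield i, j, len(s & manysets[j])


results = list(pairwise_intersection_sizes([[1, 2], [1, 3], [1, 2, 3], [2, 4]]))
for i, j, size in results:
    print(i, j, size)
```

On the real dataset you would iterate over the generator directly instead of calling `list()` on it, so that only one result is held at a time.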

枉心 2024-08-18 21:16:10

Try this:

from functools import reduce  # reduce is not a builtin in Python 3

_lists = [[1, 2, 3, 7], [1, 3], [1, 2, 3], [1, 3, 4, 7]]
_sets = [set(_l) for _l in _lists]
_intersection = reduce(set.intersection, _sets)

And to obtain the indexes:

_idxs = [[_i.index(_x) for _x in _intersection] for _i in _lists]

Cheers,

José María García

PS: Sorry I misunderstood the question
