测试列表是否共享 python 中的任何项目

发布于 2024-09-07 21:21:43 字数 563 浏览 5 评论 0原文

我想检查一个列表中的任何项目是否存在于另一个列表中。我可以简单地使用下面的代码来完成此操作,但我怀疑可能有一个库函数可以执行此操作。如果没有,是否有一种更Pythonic的方法可以达到相同的结果。

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....: 

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....: 

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more pythonic method of achieving the same result.

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....: 

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....: 

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

半衬遮猫 2024-09-14 21:21:43

简短回答:使用not set(a).isdisjoint(b),它通常是最快的。

有四种常见方法可以测试两个列表 ab 是否共享任何项目。第一个选项是将两者都转换为集合并检查它们的交集,如下所示:

bool(set(a) & set(b))

因为集合在 Python 中使用哈希表存储,所以搜索它们的时间为 O(1) (有关 Python 中运算符复杂性的更多信息,请参阅此处。理论上,对于列表 anm 对象来说,平均时间为 O(n+m)代码>b。但

  1. 它必须首先从列表中创建集合,这可能需要花费不可忽略的时间,并且
  2. 它假设数据中的哈希冲突很少。

第二种方法是使用生成器表达式在列表上执行迭代,例如:

any(i in a for i in b)

这允许就地搜索,因此不会为中间变量分配新内存。它还对第一个发现进行了救助。 但是 in 运算符在列表上始终是O(n)(请参阅此处)。

另一个建议的选项是混合遍历列表中的一个,转换集合中的另一个列表并测试该集合的成员资格,如下所示:

a = set(a); any(i in a for i in b)

第四种方法是利用 isdisjoint() 的优势(冻结)集的方法(请参阅此处),示例:

not set(a).isdisjoint(b)

如果您搜索的元素靠近数组的开头(例如,它已排序),则倾向于使用生成器表达式,因为集合交集方法必须为中间变量分配新的内存:

from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974

这是一个执行时间的图表此示例是列表大小的函数:

开始共享时元素共享测试执行时间

注意,两个轴都是对数。这代表了生成器表达式的最佳情况。可以看出,isdisjoint() 方法更适合非常小的列表大小,而生成器表达式更适合较大的列表大小。

另一方面,当搜索从混合表达式和生成器表达式的开头开始时,如果共享元素系统地位于数组的末尾(或者两个列表不共享任何值),则不相交和集合交集方法将是比生成器表达式和混合方法快得多。

>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668

元素共享测试最后共享时的执行时间

有趣的是,对于较大的列表大小,生成器表达式要慢得多。这仅适用于 1000 次重复,而不是上图的 100000 次。当没有共享元素时,此设置也可以很好地近似,并且是不相交和集合交集方法的最佳情况。

以下是使用随机数的两个分析(而不是操纵设置以支持一种或另一种技术):

具有高共享机会的随机生成数据的元素共享测试执行时间
元素共享测试具有高共享机会的随机生成数据的执行时间

高共享机会:元素是从[1, 2*len(a)]中随机获取的。共享机会低:元素是从 [1, 1000*len(a)] 中随机取出的。

到目前为止,该分析假设两个列表的大小相同。如果两个列表大小不同,例如 a 要小得多,isdisjoint() 总是更快:

在开始时共享时,元素在两个不同大小的列表上共享测试执行时间
元素共享测试最后共享时两个不同大小的列表的执行时间

确保 a 列表较小,否则性能会降低。在此实验中,a 列表大小常量设置为 5

总之:

  • 如果列表非常小(< 10 个元素),not set(a).isdisjoint(b) 始终是最快的。
  • 如果列表中的元素已排序或具有可以利用的常规结构,则生成器表达式 any(i in a for i in b) 在大型列表大小上是最快的;
  • 使用 not set(a).isdisjoint(b) 测试集合交集,它总是比 bool(set(a) & set(b)) 更快。
  • 混合“遍历列表,测试集合”a = set(a); any(i in a for i in b) 通常比其他方法慢。
  • 当涉及到不共享元素的列表时,生成器表达式和混合方法比其他两种方法慢得多。

在大多数情况下,使用 isdisjoint() 方法是最好的方法,因为生成器表达式的执行时间会更长,而且在没有共享元素时效率非常低。

Short answer: use not set(a).isdisjoint(b), it's generally the fastest.

There are four common ways to test if two lists a and b share any items. The first option is to convert both to sets and check their intersection, as such:

bool(set(a) & set(b))

Because sets are stored using a hash table in Python, searching them is O(1) (see here for more information about complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But

  1. it must first create sets out of the lists, which can take a non-negligible amount of time, and
  2. it supposes that hashing collisions are sparse among your data.

The second way to do it is using a generator expression performing iteration on the lists, such as:

any(i in a for i in b)

This allows to search in-place, so no new memory is allocated for intermediary variables. It also bails out on the first find. But the in operator is always O(n) on lists (see here).

Another proposed option is an hybridto iterate through one of the list, convert the other one in a set and test for membership on this set, like so:

a = set(a); any(i in a for i in b)

A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:

not set(a).isdisjoint(b)

If the elements you search are near the beginning of an array (e.g. it is sorted), the generator expression is favored, as the sets intersection method have to allocate new memory for the intermediary variables:

from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974

Here's a graph of the execution time for this example in function of list size:

Element sharing test execution time when shared at the beginning

Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.

On the other hand, as the search begins with the beginning for the hybrid and generator expression, if the shared element are systematically at the end of the array (or both lists does not share any values), the disjoint and set intersection approaches are then way faster than the generator expression and the hybrid approach.

>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668

Element sharing test execution time when shared at the end

It is interesting to note that the generator expression is way slower for bigger list sizes. This is only for 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well when when no elements are shared, and is the best case for the disjoint and set intersection approaches.

Here are two analysis using random numbers (instead of rigging the setup to favor one technique or another):

Element sharing test execution time for randomly generated data with high chance of sharing
Element sharing test execution time for randomly generated data with high chance of sharing

High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].

Up to now, this analysis supposed both lists are of the same size. In case of two lists of different sizes, for example a is much smaller, isdisjoint() is always faster:

Element sharing test execution time on two differently sized lists when shared at the beginning
Element sharing test execution time on two differently sized lists when shared at the end

Make sure that the a list is the smaller, otherwise the performance decreases. In this experiment, the a list size was set constant to 5.

In summary:

  • If the lists are very small (< 10 elements), not set(a).isdisjoint(b) is always the fastest.
  • If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest on large list sizes;
  • Test the set intersection with not set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
  • The hybrid "iterate through list, test on set" a = set(a); any(i in a for i in b) is generally slower than other methods.
  • The generator expression and the hybrid are much slower than the two other approaches when it comes to lists without sharing elements.

In most cases, using the isdisjoint() method is the best approach as the generator expression will take much longer to execute, as it is very inefficient when no elements are shared.

﹂绝世的画 2024-09-14 21:21:43
def lists_overlap3(a, b):
    return bool(set(a) & set(b))

注意:上面假设您想​​要一个布尔值作为答案。如果您需要的只是在 if 语句中使用的表达式,只需使用 if set(a) &设置(b):

def lists_overlap3(a, b):
    return bool(set(a) & set(b))

Note: the above assumes that you want a boolean as the answer. If all you need is an expression to use in an if statement, just use if set(a) & set(b):

梦里南柯 2024-09-14 21:21:43
def lists_overlap(a, b):
  sb = set(b)
  return any(el in sb for el in a)

这是渐近最优的(最坏情况 O(n + m)),并且由于 any 的短路,可能比交集方法更好。

例如:

lists_overlap([3,4,5], [1,2,3])

一旦到达 3 in sb 就会返回 True

编辑:另一个变体(感谢 Dave Kirby):

def lists_overlap(a, b):
  sb = set(b)
  return any(itertools.imap(sb.__contains__, a))

这依赖于 imap 的迭代器,它是用 C 实现,而不是生成器理解。它还使用 sb.__contains__ 作为映射函数。我不知道这会带来多大的性能差异。还是会短路。

def lists_overlap(a, b):
  sb = set(b)
  return any(el in sb for el in a)

This is asymptotically optimal (worst case O(n + m)), and might be better than the intersection approach due to any's short-circuiting.

E.g.:

lists_overlap([3,4,5], [1,2,3])

will return True as soon as it gets to 3 in sb

EDIT: Another variation (with thanks to Dave Kirby):

def lists_overlap(a, b):
  sb = set(b)
  return any(itertools.imap(sb.__contains__, a))

This relies on imap's iterator, which is implemented in C, rather than a generator comprehension. It also uses sb.__contains__ as the mapping function. I don't know how much performance difference this makes. It will still short-circuit.

梦里°也失望 2024-09-14 21:21:43

您还可以将 any 与列表理解一起使用:

any([item in a for item in b])

You could also use any with list comprehension:

any([item in a for item in b])
飘然心甜 2024-09-14 21:21:43

在 python 2.6 或更高版本中你可以这样做:

return not frozenset(a).isdisjoint(frozenset(b))

In python 2.6 or later you can do:

return not frozenset(a).isdisjoint(frozenset(b))
谁人与我共长歌 2024-09-14 21:21:43

您可以使用任何内置函数 /wa 生成器表达式:

def list_overlap(a,b): 
     return any(i for i in a if i in b)

正如 John 和 Lie 所指出的,当两个列表共享的每个 i bool(i) == False 时,这会给出错误的结果。应该是:

return any(i in b for i in a)

You can use the any built in function /w a generator expression:

def list_overlap(a,b): 
     return any(i for i in a if i in b)

As John and Lie have pointed out this gives incorrect results when for every i shared by the two lists bool(i) == False. It should be:

return any(i in b for i in a)
余厌 2024-09-14 21:21:43

这个问题已经很老了,但我注意到,当人们争论集合与列表时,没有人想到将它们一起使用。按照 Soravux 的示例,

列表的最坏情况:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234

列表的最佳情况:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707

因此,比迭代两个列表更快的是迭代列表以查看它是否在集合中,这是有意义的,因为检查数字是否在集合中需要通过迭代列表进行检查时的恒定时间所花费的时间与列表的长度成正比。

因此,我的结论是迭代列表,并检查它是否在集合中

This question is pretty old, but I noticed that while people were arguing sets vs. lists, that no one thought of using them together. Following Soravux's example,

Worst case for lists:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234

And the best case for lists:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707

So even faster than iterating through two lists is iterating though a list to see if it's in a set, which makes sense since checking if a number is in a set takes constant time while checking by iterating through a list takes time proportional to the length of the list.

Thus, my conclusion is that iterate through a list, and check if it's in a set.

晌融 2024-09-14 21:21:43

如果您不关心重叠元素是什么,您可以简单地检查组合列表的 len 与组合为集合的列表的 len 。如果存在重叠元素,则集合会更短:

如果没有重叠,len(set(a+b+c))==len(a+b+c) 返回 True。

if you don't care what the overlapping element might be, you can simply check the len of the combined list vs. the lists combined as a set. If there are overlapping elements, the set will be shorter:

len(set(a+b+c))==len(a+b+c) returns True, if there is no overlap.

夜灵血窟げ 2024-09-14 21:21:43

我将用函数式编程风格引入另一个:

any(map(lambda x: x in a, b))

解释:

map(lambda x: x in a, b)

返回一个布尔值列表,其中 b 的元素可在 a 中找到。然后将该列表传递给 any,如果任何元素为 True,则返回 True

I'll throw another one in with a functional programming style:

any(map(lambda x: x in a, b))

Explanation:

map(lambda x: x in a, b)

returns a list of booleans where elements of b are found in a. That list is then passed to any, which simply returns True if any elements are True.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文