如何有效地找到列表中的哪些元素?

发布于 2025-01-23 13:26:00 字数 423 浏览 0 评论 0原文

我想知道list_1的哪些元素在list_2中。我需要输出作为布尔人的有序列表。但是我想避免使用循环,因为这两个列表都有超过200万个元素。

这是我拥有的,它的工作原理,但是它太慢了:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = []
for i in list_1:
   booleans.append(i in list_2)

# booleans = [False, False, True, True, False, False]

我可以拆分列表并使用多线程,但是如果可能的话,我更喜欢更简单的解决方案。我知道一些功能,例如sum()使用向量操作。我正在寻找类似的东西。

如何使我的代码更加高效?

I want to know which elements of list_1 are in list_2. I need the output as an ordered list of booleans. But I want to avoid for loops, because both lists have over 2 million elements.

This is what I have and it works, but it's too slow:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = []
for i in list_1:
   booleans.append(i in list_2)

# booleans = [False, False, True, True, False, False]

I could split the list and use multithreading, but I would prefer a simpler solution if possible. I know some functions like sum() use vector operations. I am looking for something similar.

How can I make my code more efficient?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

So要识趣 2025-01-30 13:26:00

我认为实际上在较大的样本输入上介绍的一些解决方案实际上是有用的。对于此输入和我的计算机,我发现CardStdani的方法是最快的,其次是numpy isin()方法。

设置1

import random

list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(1, 10_000) for i in range(100_000)]

设置2

list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(10_001, 20_000) for i in range(100_000)]

时机 - 从最快到最慢的订购(设置1)。

cardStdani-方法1


我建议将卡斯塔尼的方法转换为 list consension (请参阅这个问题 为什么列表综合速度更快)

s = set(list_2)
booleans = [i in s for i in list_1]

# setup 1
6.01 ms ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.19 ms ± 27.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

没有列表理解

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)

# setup 1
7.28 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
5.87 ms ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

cardstdani-方法2 (在Timus的协助下)


common = set(list_1) & set(list_2)
booleans = [item in common for item in list_1]

# setup 1
8.3 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
6.01 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

使用集合交叉点方法

common = set(list_1).intersection(list_2)
booleans = [item in common for item in list_1]

# setup 1
10.1 ms ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.82 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

numpy方法(crissal)


a1 = np.array(list_1)
a2 = np.array(list_2)

a = np.isin(a1, a2)

# setup 1
18.6 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
18.2 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2 (assuming list_1, list_2 already numpy arrays)
10.3 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

list classension


l = [i in list_2 for i in list_1]

# setup 1
4.85 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.6 s ± 823 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

sharim-方法1


booleans = list(map(lambda e: e in list_2, list_1))

# setup 1
4.88 s ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48 s ± 389 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

使用__包含__方法

booleans = list(map(list_2.__contains__, list_1))

# setup 1
4.87 s ± 5.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.2 s ± 486 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

sharim-方法2


set_2 = set(list_2)
booleans = list(map(lambda e: set_2 != set_2 - {e}, list_1))

# setup 1
5.46 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
11.1 s ± 75.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

更改输入的长度


使用以下设置

import random 

list_1 = [random.randint(1, n) for i in range(n)]
list_2 = [random.randint(1, n) for i in range(n)]

并变化n[2 ** k for k in range(18)]中:

“在此处输入映像说明”

中使用以下设置

import random 

list_1 = [random.randint(1, n ** 2) for i in range(n)]
list_2 = [random.randint(1, n ** 2) for i in range(n)]

和变化n [2 * * * k对于K范围(18)] ,我们获得了相似的结果:

”在此处输入图像说明“

使用以下设置

list_1 = list(range(n))
list_2 = list(range(n, 2 * n))

并变化n in [2 ** k对于K范围(18)]

import random 

list_1 = [random.randint(1, n) for i in range(10 * n)]
list_2 = [random.randint(1, n) for i in range(10 * n)]

​] :

“在此处输入图像描述”

I thought it would be useful to actually time some of the solutions presented here on a larger sample input. For this input and on my machine, I find Cardstdani's approach to be the fastest, followed by the numpy isin() approach.

Setup 1

import random

list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(1, 10_000) for i in range(100_000)]

Setup 2

list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(10_001, 20_000) for i in range(100_000)]

Timings - ordered from fastest to slowest (setup 1).

Cardstdani - approach 1


I recommend converting Cardstdani's approach into a list comprehension (see this question for why list comprehensions are faster)

s = set(list_2)
booleans = [i in s for i in list_1]

# setup 1
6.01 ms ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.19 ms ± 27.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

No list comprehension

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)

# setup 1
7.28 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
5.87 ms ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Cardstdani - approach 2 (with an assist from Timus)


common = set(list_1) & set(list_2)
booleans = [item in common for item in list_1]

# setup 1
8.3 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
6.01 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Using the set intersection method

common = set(list_1).intersection(list_2)
booleans = [item in common for item in list_1]

# setup 1
10.1 ms ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.82 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

numpy approach (crissal)


a1 = np.array(list_1)
a2 = np.array(list_2)

a = np.isin(a1, a2)

# setup 1
18.6 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
18.2 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2 (assuming list_1, list_2 already numpy arrays)
10.3 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

list comprehension


l = [i in list_2 for i in list_1]

# setup 1
4.85 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.6 s ± 823 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sharim - approach 1


booleans = list(map(lambda e: e in list_2, list_1))

# setup 1
4.88 s ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48 s ± 389 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using the __contains__ method

booleans = list(map(list_2.__contains__, list_1))

# setup 1
4.87 s ± 5.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.2 s ± 486 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sharim - approach 2


set_2 = set(list_2)
booleans = list(map(lambda e: set_2 != set_2 - {e}, list_1))

# setup 1
5.46 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
11.1 s ± 75.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Varying the length of the input


Employing the following setup

import random 

list_1 = [random.randint(1, n) for i in range(n)]
list_2 = [random.randint(1, n) for i in range(n)]

and varying n in [2 ** k for k in range(18)]:

enter image description here

Employing the following setup

import random 

list_1 = [random.randint(1, n ** 2) for i in range(n)]
list_2 = [random.randint(1, n ** 2) for i in range(n)]

and varying n in [2 ** k for k in range(18)], we obtain similar results:

enter image description here

Employing the following setup

list_1 = list(range(n))
list_2 = list(range(n, 2 * n))

and varying n in [2 ** k for k in range(18)]:

enter image description here

Employing the following setup

import random 

list_1 = [random.randint(1, n) for i in range(10 * n)]
list_2 = [random.randint(1, n) for i in range(10 * n)]

and varying n in [2 ** k for k in range(18)]:

enter image description here

乄_柒ぐ汐 2025-01-30 13:26:00

您可以利用o(1)在操作员复杂性中的set()函数以使您的循环效率更高,因此您的最终算法将在o(n)时间,而不是o(n*n)

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)
print(booleans)

它甚至更快地作为列表理解:

s = set(list_2)
booleans = [i in s for i in list_1]

如果您只想知道元素,则可以使用十字路口在这样的集合中,由于使用set()函数,这将是一个有效的解决方案,该功能已经由其他Python工程师优化:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

print(set(list_1).intersection(set(list_2)))

输出:

{1, 2}

另外,要提供列表格式输出,您可以转动您的将结果设置为带有list()函数的列表:

print(list(set(list_1).intersection(set(list_2))))

You can take advantage of the O(1) in operator complexity for the set() function to make your for loop more efficient, so your final algorithm would run in O(n) time, instead of O(n*n):

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)
print(booleans)

It is even faster as a list comprehension:

s = set(list_2)
booleans = [i in s for i in list_1]

If you only want to know the elements, you can use an intersection of sets like that, which will be an efficient solution due to the use of set() function, already optimized by other Python engineers:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

print(set(list_1).intersection(set(list_2)))

Output:

{1, 2}

Also, to provide a list format output, you can turn your resulting set into a list with list() function:

print(list(set(list_1).intersection(set(list_2))))
贪恋 2025-01-30 13:26:00

如果要使用向量方法,也可以使用numpy isin。这不是最快的方法,如 ODA的出色帖子,但这绝对是考虑的选择。

import numpy as np

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

a1 = np.array(list_1)
a2 = np.array(list_2)

np.isin(a1, a2)
# array([False, False,  True,  True, False, False])

If you want to use a vector approach you can also use Numpy isin. It's not the fastest method, as demonstrated by oda's excellent post, but it's definitely an alternative to consider.

import numpy as np

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

a1 = np.array(list_1)
a2 = np.array(list_2)

np.isin(a1, a2)
# array([False, False,  True,  True, False, False])
你在我安 2025-01-30 13:26:00

您可以使用MAP函数。

内部地图我使用lambda函数。如果您不熟悉 lambda> lambda 函数,那么您可以检查一下。

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = list(map(lambda e:e in list_2,iter(list_1)))
print(booleans)

但是

[False, False, True, True, False, False]

,如果您想要唯一的不相同的元素,那么而不是map函数,则可以使用相同的代码使用filter函数。

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

new_lst = list(filter(lambda e:e in list_2,iter(list_1)))# edited instead of map use filter.
print(new_lst)

输出

[1, 2]

编辑

我正在从代码中删除语句中的,因为中的也充当循环。我正在使用timeit模块检查。

您可以将此代码用于包含truefalse的列表。

这样,最快的方式比一个高于一个。

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

booleans = list(map(lambda e:set_2!=set_2-{e},iter(list_1)))
print(booleans)

输出

[False, False, True, True, False, False]

此内容是包含元素的列表。

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

booleans = list(filter(lambda e:set_2!=set_2-{e},iter(list_1))) # edited instead of map use filter
print(booleans)

输出

[1,2]

因为OP不想使用lambda函数,因此。

list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
set_2 = set(list_2)
def func():
    return set_2!=set_2-{e}

booleans = list(map(func,iter(list_1)))

我知道我的方式不是解决这个问题的最佳方法,因为我从不使用numpy

You can use the map function.

Inside map I use the lambda function. If you are not familiar with the lambda function then you can check this out.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = list(map(lambda e:e in list_2,iter(list_1)))
print(booleans)

output

[False, False, True, True, False, False]

However, if you want the only elements which are not the same then instead of a map function you can use the filter function with the same code.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

new_lst = list(filter(lambda e:e in list_2,iter(list_1)))# edited instead of map use filter.
print(new_lst)

output

[1, 2]

Edited

I am removing the in statement from the code because in also acts as a loop. I am checking using the timeit module.

you can use this code for the list containing True and False.

This way is fastest then above one.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

booleans = list(map(lambda e:set_2!=set_2-{e},iter(list_1)))
print(booleans)

output

[False, False, True, True, False, False]

This one is for the list containing the elements.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

booleans = list(filter(lambda e:set_2!=set_2-{e},iter(list_1))) # edited instead of map use filter
print(booleans)

output

[1,2]

Because OP don't want to use lambda function then this.

list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
set_2 = set(list_2)
def func():
    return set_2!=set_2-{e}

booleans = list(map(func,iter(list_1)))

I know my way isn't the best way to this answer this because I am never using NumPy much.

桃扇骨 2025-01-30 13:26:00

仅使用内置集交点方法可能更简单,但是如果您要比较很多列表,则对列表进行排序可能更快。对列表进行排序是n ln n,但是一旦将它们排序后,您可以通过检查元素是否匹配,并在线性时间内进行比较,如果不匹配,则可以转到当前元素较小的列表中的下一个项目。

It's probably simpler to just use the built-in set intersection method, but if you have lots of lists that you're comparing, it might be faster to sort the lists. Sorting the list is n ln n, but once you have them sorted, you can compare them in linear time by checking whether the elements match, and if they don't, advance to the next item in the list whose current element is smaller.

凉薄对峙 2025-01-30 13:26:00

使用set()获取每个列表中唯一项目的列表

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = []

set_1 = set(list_1)
set_2 = set(list_2)

if(set_1 & set_2):
  print(set_1 & set_2)
else:
  print("No common elements")

{1, 2}

Use set() to get a list of unique items in each list

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = []

set_1 = set(list_1)
set_2 = set(list_2)

if(set_1 & set_2):
  print(set_1 & set_2)
else:
  print("No common elements")

Output:

{1, 2}
人海汹涌 2025-01-30 13:26:00

如果您知道该值是非负的,并且最大值比列表的长度小得多,那么使用Numpy的Bincount可能是使用集合的好选择。

np.bincount(list_1).astype(bool)[list_2]

如果list_1List_2恰好是numpy数组,那么这甚至比Set + List-Compolhension解决方案要快得多。 (在我的测试263 µs vs 7.37 ms中;但是,如果它们是Python列表,则比设定解决方案稍慢,使用8.07毫秒)

If you know the values are non-negative and the maximum value is much smaller than the length of the list, then using numpy's bincount might be a good alternative for using a set.

np.bincount(list_1).astype(bool)[list_2]

If list_1 and list_2 happen to be numpy arrays, this can even be a lot faster than the set + list-comprehension solution. (In my test 263 µs vs 7.37 ms; but if they're python lists, it's slightly slower than the set solution, with 8.07 ms)

旧时浪漫 2025-01-30 13:26:00

Spybug96的方法将效果最好,最快。如果要获得带有两个集合的共同元素的凹痕对象,则可以在最终集合上使用tuple()函数:

a = set(range(1, 6))
b = set(range(3, 9))
c = a & b
print(tuple(c))

Spybug96's method will work best and fastest. If you want to get an indented object with the common elements of the two sets you can use the tuple() function on the final set:

a = set(range(1, 6))
b = set(range(3, 9))
c = a & b
print(tuple(c))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文