如何有效地找到列表中的哪些元素?
我想知道list_1
的哪些元素在list_2
中。我需要输出作为布尔人的有序列表。但是我想避免使用循环,因为这两个列表都有超过200万个元素。
这是我拥有的,它的工作原理,但是它太慢了:
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
booleans = []
for i in list_1:
booleans.append(i in list_2)
# booleans = [False, False, True, True, False, False]
我可以拆分列表并使用多线程,但是如果可能的话,我更喜欢更简单的解决方案。我知道一些功能,例如sum()使用向量操作。我正在寻找类似的东西。
如何使我的代码更加高效?
I want to know which elements of list_1
are in list_2
. I need the output as an ordered list of booleans. But I want to avoid for
loops, because both lists have over 2 million elements.
This is what I have and it works, but it's too slow:
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
booleans = []
for i in list_1:
booleans.append(i in list_2)
# booleans = [False, False, True, True, False, False]
I could split the list and use multithreading, but I would prefer a simpler solution if possible. I know some functions like sum() use vector operations. I am looking for something similar.
How can I make my code more efficient?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
我认为实际上在较大的样本输入上介绍的一些解决方案实际上是有用的。对于此输入和我的计算机,我发现CardStdani的方法是最快的,其次是
numpy
isin()
方法。设置1
设置2
时机 - 从最快到最慢的订购(设置1)。
cardStdani-方法1
我建议将卡斯塔尼的方法转换为 list consension (请参阅这个问题 为什么列表综合速度更快)
没有列表理解
cardstdani-方法2 (在Timus的协助下)
使用集合
交叉点
方法numpy方法(crissal)
list classension
sharim-方法1
使用
__包含__
方法sharim-方法2
更改输入的长度
使用以下设置
并变化
n
在[2 ** k for k in range(18)]
中:在
中使用以下设置
和变化
n
[2 * * * k对于K范围(18)] ,我们获得了相似的结果:使用以下设置
并变化
n
in[2 ** k对于K范围(18)]
:
] :
I thought it would be useful to actually time some of the solutions presented here on a larger sample input. For this input and on my machine, I find Cardstdani's approach to be the fastest, followed by the
numpy
isin()
approach.Setup 1
Setup 2
Timings - ordered from fastest to slowest (setup 1).
Cardstdani - approach 1
I recommend converting Cardstdani's approach into a list comprehension (see this question for why list comprehensions are faster)
No list comprehension
Cardstdani - approach 2 (with an assist from Timus)
Using the set
intersection
methodnumpy approach (crissal)
list comprehension
Sharim - approach 1
Using the
__contains__
methodSharim - approach 2
Varying the length of the input
Employing the following setup
and varying
n
in[2 ** k for k in range(18)]
:Employing the following setup
and varying
n
in[2 ** k for k in range(18)]
, we obtain similar results:Employing the following setup
and varying
n
in[2 ** k for k in range(18)]
:Employing the following setup
and varying
n
in[2 ** k for k in range(18)]
:您可以利用
o(1)
在操作员复杂性中的set()
函数以使您的循环效率更高,因此您的最终算法将在o(n)时间,而不是
o(n*n)
:它甚至更快地作为列表理解:
如果您只想知道元素,则可以使用十字路口在这样的集合中,由于使用
set()
函数,这将是一个有效的解决方案,该功能已经由其他Python工程师优化:输出:
另外,要提供列表格式输出,您可以转动您的将结果设置为带有
list()
函数的列表:You can take advantage of the
O(1)
in operator complexity for theset()
function to make your for loop more efficient, so your final algorithm would run inO(n)
time, instead ofO(n*n)
:It is even faster as a list comprehension:
If you only want to know the elements, you can use an intersection of sets like that, which will be an efficient solution due to the use of
set()
function, already optimized by other Python engineers:Output:
Also, to provide a list format output, you can turn your resulting set into a list with
list()
function:如果要使用向量方法,也可以使用numpy isin。这不是最快的方法,如 ODA的出色帖子,但这绝对是考虑的选择。
If you want to use a vector approach you can also use Numpy isin. It's not the fastest method, as demonstrated by oda's excellent post, but it's definitely an alternative to consider.
您可以使用
MAP
函数。内部
地图
我使用lambda函数。如果您不熟悉 lambda> lambda 函数,那么您可以检查一下。但是
,如果您想要唯一的不相同的元素,那么而不是
map
函数,则可以使用相同的代码使用filter
函数。输出
编辑
我正在从代码中删除语句中的
,因为
中的也充当循环。我正在使用
timeit
模块检查。您可以将此代码用于包含
true
和false
的列表。这样,最快的方式比一个高于一个。
输出
此内容是包含元素的列表。
输出
因为OP不想使用lambda函数,因此。
我知道我的方式不是解决这个问题的最佳方法,因为我从不使用
numpy
。You can use the
map
function.Inside
map
I use the lambda function. If you are not familiar with the lambda function then you can check this out.output
However, if you want the only elements which are not the same then instead of a
map
function you can use thefilter
function with the same code.output
Edited
I am removing the
in
statement from the code becausein
also acts as a loop. I am checking using thetimeit
module.you can use this code for the list containing
True
andFalse
.This way is fastest then above one.
output
This one is for the list containing the elements.
output
Because OP don't want to use lambda function then this.
I know my way isn't the best way to this answer this because I am never using
NumPy
much.仅使用内置集交点方法可能更简单,但是如果您要比较很多列表,则对列表进行排序可能更快。对列表进行排序是n ln n,但是一旦将它们排序后,您可以通过检查元素是否匹配,并在线性时间内进行比较,如果不匹配,则可以转到当前元素较小的列表中的下一个项目。
It's probably simpler to just use the built-in set intersection method, but if you have lots of lists that you're comparing, it might be faster to sort the lists. Sorting the list is n ln n, but once you have them sorted, you can compare them in linear time by checking whether the elements match, and if they don't, advance to the next item in the list whose current element is smaller.
使用
set()
获取每个列表中唯一项目的列表:
Use
set()
to get a list of unique items in each listOutput:
如果您知道该值是非负的,并且最大值比列表的长度小得多,那么使用Numpy的Bincount可能是使用集合的好选择。
如果
list_1
和List_2
恰好是numpy数组,那么这甚至比Set + List-Compolhension解决方案要快得多。 (在我的测试263 µs vs 7.37 ms中;但是,如果它们是Python列表,则比设定解决方案稍慢,使用8.07毫秒)If you know the values are non-negative and the maximum value is much smaller than the length of the list, then using numpy's bincount might be a good alternative for using a set.
If
list_1
andlist_2
happen to be numpy arrays, this can even be a lot faster than the set + list-comprehension solution. (In my test 263 µs vs 7.37 ms; but if they're python lists, it's slightly slower than the set solution, with 8.07 ms)Spybug96的方法将效果最好,最快。如果要获得带有两个集合的共同元素的凹痕对象,则可以在最终集合上使用
tuple()
函数:Spybug96's method will work best and fastest. If you want to get an indented object with the common elements of the two sets you can use the
tuple()
function on the final set: