检查列表中是否存在值的最快方法
检查某个值是否存在于非常大的列表(具有数百万个值)及其索引是什么的最快方法是什么?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
检查某个值是否存在于非常大的列表(具有数百万个值)及其索引是什么的最快方法是什么?
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
接受
或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
发布评论
评论(12)
最清晰、最快的方法。
您还可以考虑使用
集合
,但从列表构建该集合可能需要比更快的成员资格测试节省的时间更多的时间。唯一确定的方法就是进行良好的基准测试。 (这也取决于你需要什么操作)Clearest and fastest way to do it.
You can also consider using a
set
, but constructing that set from your list may take more time than faster membership testing will save. The only way to be certain is to benchmark well. (this also depends on what operations you require)正如其他人所说,对于大型列表,
in
可能会非常慢。以下是in
、set
和bisect
的一些性能比较。请注意时间(以秒为单位)采用对数刻度。测试代码:
As stated by others,
in
can be very slow for large lists. Here are some comparisons of the performances forin
,set
andbisect
. Note the time (in second) is in log scale.Code for testing:
您可以将您的项目放入
set
。集合查找非常有效。尝试:
编辑 在评论中,您说您想要获取元素的索引。不幸的是,集合没有元素位置的概念。另一种方法是预先对列表进行排序,然后在每次需要查找时使用二分搜索一个元素。
You could put your items into a
set
. Set lookups are very efficient.Try:
edit In a comment you say that you'd like to get the index of the element. Unfortunately, sets have no notion of element position. An alternative is to pre-sort your list and then use binary search every time you need to find an element.
原来的问题是:
因此,需要查找两件事:
为此,我修改了 @xslittlegrass 代码来计算所有情况下的索引,并添加了一个额外的方法。
结果
方法是:
注意 @xslittlegrass 的 mod,他返回排序 b 中的索引,
而不是原来的b)
d[x] 提供 x 的索引。
结果表明方法 5 是最快的。
有趣的是,try 和 set 方法在时间上是相等的。
测试代码
The original question was:
Thus there are two things to find:
Towards this, I modified @xslittlegrass code to compute indexes in all cases, and added an additional method.
Results
Methods are:
Note mod from @xslittlegrass who returns the index in the sorted b,
rather than the original b)
d[x] provides the index of x.
Results show that method 5 is the fastest.
Interestingly the try and the set methods are equivalent in time.
Test Code
用法
我相信这是了解所选值是否在数组中的最快方法。
Usage
I believe this is the fastest way to know if a chosen value is in an array.
只有当 a 不改变时,这才是一个好主意,因此我们可以执行一次 dict() 部分,然后重复使用它。如果确实发生变化,请提供有关您正在做什么的更多详细信息。
This will only be a good idea if a doesn't change and thus we can do the dict() part once and then use it repeatedly. If a does change, please provide more detail on what you are doing.
如果您只想检查列表中一个元素是否存在,
这是最快的解决方案。请注意,尽管这
是一个近乎免费的操作,与集合的大小无关!从大型列表创建集合比
in
慢 300 到 400 倍,因此如果您需要检查许多元素,首先创建集合会更快。使用 perfplot 创建的图:
If you only want to check the existence of one element in a list,
is the fastest solution. Note though that
is a near-free operation, independently of the size of the set! Creating a set from a large list is 300 to 400 times slower than
in
, so if you need to check for many elements, creating a set first is faster.Plot created with perfplot:
请注意,
in
运算符不仅测试相等性 (==
),还测试同一性 (is
),in
>list
的逻辑大致相当于以下内容(它实际上是用 C 编写的,而不是不过Python,至少在CPython):在大多数情况下,这个细节是无关紧要的,但在某些情况下,它可能会让 Python 新手感到惊讶,例如,
numpy.NAN
具有 不等于自身:要区分这些不寻常的情况,您可以使用
any()
喜欢:注意带有
any()
的list
的in
逻辑是:但是,我应该强调这是一种边缘情况,对于绝大多数人来说在大多数情况下,
in
运算符经过高度优化,当然正是您想要的(使用list
或使用set
)。Be aware that the
in
operator tests not only equality (==
) but also identity (is
), thein
logic forlist
s is roughly equivalent to the following (it's actually written in C and not Python though, at least in CPython):In most circumstances this detail is irrelevant, but in some circumstances it might leave a Python novice surprised, for example,
numpy.NAN
has the unusual property of being not being equal to itself:To distinguish between these unusual cases you could use
any()
like:Note the
in
logic forlist
s withany()
would be:However, I should emphasize that this is an edge case, and for the vast majority of cases the
in
operator is highly optimised and exactly what you want of course (either with alist
or with aset
).听起来您的应用程序可能会从使用布隆过滤器数据结构中获得优势。
简而言之,布隆过滤器查找可以非常快速地告诉您某个值是否绝对不存在于集合中。否则,您可以进行较慢的查找来获取列表中可能存在的值的索引。因此,如果您的应用程序往往比“找到”结果更频繁地获得“未找到”结果,那么通过添加布隆过滤器您可能会看到速度的提高。
有关详细信息,维基百科提供了布隆过滤器如何工作的良好概述,并且在网络上搜索“python 布隆过滤器库”将提供至少一些有用的实现。
It sounds like your application might gain advantage from the use of a Bloom Filter data structure.
In short, a bloom filter look-up can tell you very quickly if a value is DEFINITELY NOT present in a set. Otherwise, you can do a slower look-up to get the index of a value that POSSIBLY MIGHT BE in the list. So if your application tends to get the "not found" result much more often then the "found" result, you might see a speed up by adding a Bloom Filter.
For details, Wikipedia provides a good overview of how Bloom Filters work, and a web search for "python bloom filter library" will provide at least a couple useful implementations.
检查列表中是否存在值
xslittlegrass的答案表明,当检查列表中是否存在多个值时,将列表转换为首先设置并在集合上使用
in
运算符比在列表上使用in
运算符要快得多。另一方面,Nico的回答表明,在检查列表中是否存在单个值时,将列表转换为集合首先不值得,因为转换成一套本身就很昂贵。总之,这些答案意味着有一些值转换为集合并检查这些值是否存在于集合中比检查它们是否存在于列表中更快。事实证明,这个数字非常小。下图显示了要检查的不同数量的值时,集合上的
in
和列表上的in
之间的运行时差异。如图所示,平均而言,如果您需要检查列表中是否存在 5 个(或更多)值,则首先将该列表转换为集合并检查该集合会更快。1< a href="https://i.sstatic.net/uSZ7U.png" rel="nofollow noreferrer">
如果列表中存在值,则获取其索引
另一方面,如果您想检查列表中是否存在值并返回存在值的索引,则无论长度如何对于列表中的少量值,直接在 try- except 块中使用
list.index()
搜索它是最快的方法。特别是,如果您想查找单个值的索引,这是最快的选择。但是,平均而言,如果要搜索的值超过 10 个,则构建索引查找字典(如 DarrylG 的回答) 是最快的选项。21 用于生成第一个图形的代码。
2 用于生成第二个数字的代码。
Check if values exist in a list
xslittlegrass's answer shows that when checking if multiple values exist in a list, converting the list into a set first and using the
in
operator on the set is much faster than using thein
operator on lists. On the other hand, Nico's answer shows that when checking if a single value exists in a list, converting the list into a set first is not worth it, as converting to a set itself is costly to begin with. Together, these answers imply that there is some number of values where converting to a set and checking if those values exist in the set becomes faster than checking if they exist in the list.It turns out, that number is very small. The figure below shows the runtime difference between
in
on sets andin
on lists for different numbers of values to check. As it shows, on average, if you need to check whether 5 (or more) values exist in a list, it's faster to convert that list into a set first and check on the set instead.1Get their indices if values exist in a list
On the other hand, if you want to check if values exist in a list and return the indices of the values that do, then regardless of the length of the list, for small number of values, directly searching for it using
list.index()
in a try-except block is the fastest way to do it. In particular, if you want to find the index of a single value, this is the fastest option. However, on average, if there are more than 10 values to search for, then constructing an index lookup dictionary (as in DarrylG's answer) is the fastest option.21 Code used to produce the first figure.
2 Code used to produce the second figure.
这不是代码,而是非常快速搜索的算法。
如果您的列表和您要查找的值都是数字,那么这非常简单。如果是字符串:查看底部:
如果您还需要原始元素您的号码的位置,在第二个索引列中查找它。
如果您的列表不是由数字组成,该方法仍然有效并且速度最快,但您可能需要定义一个可以比较/排序字符串的函数。
当然,这需要对sorted()方法进行投资,但是如果您继续重复使用相同的列表进行检查,那么可能是值得的。
This is not the code, but the algorithm for very fast searching.
If your list and the value you are looking for are all numbers, this is pretty straightforward. If strings: look at the bottom:
If you also need the original position of your number, look for it in the second, index column.
If your list is not made of numbers, the method still works and will be fastest, but you may need to define a function which can compare/order strings.
Of course, this needs the investment of the sorted() method, but if you keep reusing the same list for checking, it may be worth it.
空间数据的边缘情况
可能有更快的算法来处理空间数据(例如,重构以使用 kd 树),但检查向量是否在数组中的特殊情况很有用:
在这种情况下,我有兴趣知道由两个点定义的(无向)边是否在(无向)边的集合中,这样,其中
pair
构成两个任意长度的向量(即形状(2,N)
)。如果这些向量之间的距离有意义,则测试可以用浮点不等式来表示,例如
“列表中存在值”就是
any(test_result)
。下面是整数对和 R3 向量对的示例代码和虚拟测试集生成器。
Edge case for spatial data
There are probably faster algorithms for handling spatial data (e.g. refactoring to use a k-d tree), but the special case of checking if a vector is in an array is useful:
In this case, I was interested in knowing if an (undirected) edge defined by two points was in a collection of (undirected) edges, such that
where
pair
constitutes two vectors of arbitrary length (i.e. shape(2,N)
).If the distance between these vectors is meaningful, then the test can be expressed by a floating point inequality like
and "Value exists in List" is simply
any(test_result)
.Example code and dummy test set generators for integer pairs and R3 vector pairs are below.