检查列表中是否存在值的最快方法

发布于 2024-12-07 08:47:39 字数 48 浏览 2 评论 0 原文

检查某个值是否存在于非常大的列表(具有数百万个值)及其索引是什么的最快方法是什么?

What is the fastest way to check if a value exists in a very large list (with millions of values) and what its index is?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(12

浪漫人生路 2024-12-14 08:47:39
7 in a

最清晰、最快的方法。

您还可以考虑使用集合,但从列表构建该集合可能需要比更快的成员资格测试节省的时间更多的时间。唯一确定的方法就是进行良好的基准测试。 (这也取决于你需要什么操作)

7 in a

Clearest and fastest way to do it.

You can also consider using a set, but constructing that set from your list may take more time than faster membership testing will save. The only way to be certain is to benchmark well. (this also depends on what operations you require)

回忆那么伤 2024-12-14 08:47:39

正如其他人所说,对于大型列表, in 可能会非常慢。以下是 insetbisect 的一些性能比较。请注意时间(以秒为单位)采用对数刻度。

输入图像描述这里

测试代码:

import random
import bisect
import matplotlib.pyplot as plt
import math
import time


def method_in(a, b, c):
    start_time = time.time()
    for i, x in enumerate(a):
        if x in b:
            c[i] = 1
    return time.time() - start_time


def method_set_in(a, b, c):
    start_time = time.time()
    s = set(b)
    for i, x in enumerate(a):
        if x in s:
            c[i] = 1
    return time.time() - start_time


def method_bisect(a, b, c):
    start_time = time.time()
    b.sort()
    for i, x in enumerate(a):
        index = bisect.bisect_left(b, x)
        if index < len(a):
            if x == b[index]:
                c[i] = 1
    return time.time() - start_time


def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []

    # adjust range down if runtime is too long or up if there are too many zero entries in any of the time_method lists
    Nls = [x for x in range(10000, 30000, 1000)]
    for N in Nls:
        a = [x for x in range(0, N)]
        random.shuffle(a)
        b = [x for x in range(0, N)]
        random.shuffle(b)
        c = [0 for x in range(0, N)]

        time_method_in.append(method_in(a, b, c))
        time_method_set_in.append(method_set_in(a, b, c))
        time_method_bisect.append(method_bisect(a, b, c))

    plt.plot(Nls, time_method_in, marker='o', color='r', linestyle='-', label='in')
    plt.plot(Nls, time_method_set_in, marker='o', color='b', linestyle='-', label='set')
    plt.plot(Nls, time_method_bisect, marker='o', color='g', linestyle='-', label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc='upper left')
    plt.yscale('log')
    plt.show()


profile()

As stated by others, in can be very slow for large lists. Here are some comparisons of the performances for in, set and bisect. Note the time (in second) is in log scale.

enter image description here

Code for testing:

import random
import bisect
import matplotlib.pyplot as plt
import math
import time


def method_in(a, b, c):
    start_time = time.time()
    for i, x in enumerate(a):
        if x in b:
            c[i] = 1
    return time.time() - start_time


def method_set_in(a, b, c):
    start_time = time.time()
    s = set(b)
    for i, x in enumerate(a):
        if x in s:
            c[i] = 1
    return time.time() - start_time


def method_bisect(a, b, c):
    start_time = time.time()
    b.sort()
    for i, x in enumerate(a):
        index = bisect.bisect_left(b, x)
        if index < len(a):
            if x == b[index]:
                c[i] = 1
    return time.time() - start_time


def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []

    # adjust range down if runtime is too long or up if there are too many zero entries in any of the time_method lists
    Nls = [x for x in range(10000, 30000, 1000)]
    for N in Nls:
        a = [x for x in range(0, N)]
        random.shuffle(a)
        b = [x for x in range(0, N)]
        random.shuffle(b)
        c = [0 for x in range(0, N)]

        time_method_in.append(method_in(a, b, c))
        time_method_set_in.append(method_set_in(a, b, c))
        time_method_bisect.append(method_bisect(a, b, c))

    plt.plot(Nls, time_method_in, marker='o', color='r', linestyle='-', label='in')
    plt.plot(Nls, time_method_set_in, marker='o', color='b', linestyle='-', label='set')
    plt.plot(Nls, time_method_bisect, marker='o', color='g', linestyle='-', label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc='upper left')
    plt.yscale('log')
    plt.show()


profile()
祁梦 2024-12-14 08:47:39

您可以将您的项目放入 set 。集合查找非常有效。

尝试:

s = set(a)
if 7 in s:
  # do stuff

编辑 在评论中,您说您想要获取元素的索引。不幸的是,集合没有元素位置的概念。另一种方法是预先对列表进行排序,然后在每次需要查找时使用二分搜索一个元素。

You could put your items into a set. Set lookups are very efficient.

Try:

s = set(a)
if 7 in s:
  # do stuff

edit In a comment you say that you'd like to get the index of the element. Unfortunately, sets have no notion of element position. An alternative is to pre-sort your list and then use binary search every time you need to find an element.

嘿嘿嘿 2024-12-14 08:47:39

原来的问题是:

知道列表中是否存在某个值的最快方法是什么(列表
其中有数百万个值)以及它的索引是什么?

因此,需要查找两件事:

  1. 列表中的项目,以及
  2. 索引是什么(如果在列表中)。

为此,我修改了 @xslittlegrass 代码来计算所有情况下的索引,并添加了一个额外的方法。

结果

在此处输入图像描述

方法是:

  1. in--基本上如果 x in b: return b.index(x)
  2. try--try/catch on b.索引(x)(跳过必须检查if x in b)
  3. set - 基本上 if x in set(b): return b.index(x)
  4. bisect - 用它的索引对 b 进行排序,在排序(b)中对 x 进行二分查找。
    注意 @xslittlegrass 的 mod,他返回排序 b 中的索引,
    而不是原来的b)
  5. reverse——为b形成一个反向查找字典d;然后
    d[x] 提供 x 的索引。

结果表明方法 5 是最快的。

有趣的是,tryset 方法在时间上是相等的。


测试代码

import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    " Use to produced 0 argument function for call it"
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1

def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_bisect(a,b,c):
    " Finds indexes using bisection "

    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])

    for i,x in enumerate(a):
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        if index < len(a):
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array

    return c

def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]

    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))

    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))

    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))

    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

profile()

The original question was:

What is the fastest way to know if a value exists in a list (a list
with millions of values in it) and what its index is?

Thus there are two things to find:

  1. is an item in the list, and
  2. what is the index (if in the list).

Towards this, I modified @xslittlegrass code to compute indexes in all cases, and added an additional method.

Results

enter image description here

Methods are:

  1. in--basically if x in b: return b.index(x)
  2. try--try/catch on b.index(x) (skips having to check if x in b)
  3. set--basically if x in set(b): return b.index(x)
  4. bisect--sort b with its index, binary search for x in sorted(b).
    Note mod from @xslittlegrass who returns the index in the sorted b,
    rather than the original b)
  5. reverse--form a reverse lookup dictionary d for b; then
    d[x] provides the index of x.

Results show that method 5 is the fastest.

Interestingly the try and the set methods are equivalent in time.


Test Code

import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    " Use to produced 0 argument function for call it"
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1

def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_bisect(a,b,c):
    " Finds indexes using bisection "

    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])

    for i,x in enumerate(a):
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        if index < len(a):
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array

    return c

def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]

    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))

    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))

    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))

    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

profile()
心碎无痕… 2024-12-14 08:47:39
def check_availability(element, collection: iter):
    return element in collection

用法

check_availability('a', [1,2,3,4,'a','b','c'])

我相信这是了解所选值是否在数组中的最快方法。

def check_availability(element, collection: iter):
    return element in collection

Usage

check_availability('a', [1,2,3,4,'a','b','c'])

I believe this is the fastest way to know if a chosen value is in an array.

裂开嘴轻声笑有多痛 2024-12-14 08:47:39
a = [4,2,3,1,5,6]

index = dict((y,x) for x,y in enumerate(a))
try:
   a_index = index[7]
except KeyError:
   print "Not found"
else:
   print "found"

只有当 a 不改变时,这才是一个好主意,因此我们可以执行一次 dict() 部分,然后重复使用它。如果确实发生变化,请提供有关您正在做什么的更多详细信息。

a = [4,2,3,1,5,6]

index = dict((y,x) for x,y in enumerate(a))
try:
   a_index = index[7]
except KeyError:
   print "Not found"
else:
   print "found"

This will only be a good idea if a doesn't change and thus we can do the dict() part once and then use it repeatedly. If a does change, please provide more detail on what you are doing.

妥活 2024-12-14 08:47:39

如果您只想检查列表中一个元素是否存在,

7 in list_data

这是最快的解决方案。请注意,尽管这

7 in set_data

是一个近乎免费的操作,与集合的大小无关!从大型列表创建集合比 in 慢 300 到 400 倍,因此如果您需要检查许多元素,首先创建集合会更快。

输入图像描述这里

使用 perfplot 创建的图:

import perfplot
import numpy as np


def setup(n):
    data = np.arange(n)
    np.random.shuffle(data)
    return data, set(data)


def list_in(data):
    return 7 in data[0]


def create_set_from_list(data):
    return set(data[0])


def set_in(data):
    return 7 in data[1]


b = perfplot.bench(
    setup=setup,
    kernels=[list_in, set_in, create_set_from_list],
    n_range=[2 ** k for k in range(24)],
    xlabel="len(data)",
    equality_check=None,
)
b.save("out.png")
b.show()

If you only want to check the existence of one element in a list,

7 in list_data

is the fastest solution. Note though that

7 in set_data

is a near-free operation, independently of the size of the set! Creating a set from a large list is 300 to 400 times slower than in, so if you need to check for many elements, creating a set first is faster.

enter image description here

Plot created with perfplot:

import perfplot
import numpy as np


def setup(n):
    data = np.arange(n)
    np.random.shuffle(data)
    return data, set(data)


def list_in(data):
    return 7 in data[0]


def create_set_from_list(data):
    return set(data[0])


def set_in(data):
    return 7 in data[1]


b = perfplot.bench(
    setup=setup,
    kernels=[list_in, set_in, create_set_from_list],
    n_range=[2 ** k for k in range(24)],
    xlabel="len(data)",
    equality_check=None,
)
b.save("out.png")
b.show()
吻风 2024-12-14 08:47:39

请注意,in 运算符不仅测试相等性 (==),还测试同一性 (is),in > list 的逻辑大致相当于以下内容(它实际上是用 C 编写的,而不是不过Python,至少在CPython):

对于 s 中的元素:
    如果元素是目标:
        # 快速检查身份意味着平等
        返回真
    如果元素==目标:
        # 较慢地检查实际相等性
        返回真
返回错误

在大多数情况下,这个细节是无关紧要的,但在某些情况下,它可能会让 Python 新手感到惊讶,例如,numpy.NAN 具有 不等于自身

>>> import numpy
>>> numpy.NAN == numpy.NAN
False
>>> numpy.NAN is numpy.NAN
True
>>> numpy.NAN in [numpy.NAN]
True

要区分这些不寻常的情况,您可以使用 any()喜欢:

>>> lst = [numpy.NAN, 1 , 2]
>>> any(element == numpy.NAN for element in lst)
False
>>> any(element is numpy.NAN for element in lst)
True 

注意带有 any()listin 逻辑是:

any(element is target or element == target for element in lst)

但是,我应该强调这是一种边缘情况,对于绝大多数人来说在大多数情况下,in 运算符经过高度优化,当然正是您想要的(使用 list 或使用 set)。

Be aware that the in operator tests not only equality (==) but also identity (is), the in logic for lists is roughly equivalent to the following (it's actually written in C and not Python though, at least in CPython):

for element in s:
    if element is target:
        # fast check for identity implies equality
        return True
    if element == target:
        # slower check for actual equality
        return True
return False

In most circumstances this detail is irrelevant, but in some circumstances it might leave a Python novice surprised, for example, numpy.NAN has the unusual property of being not being equal to itself:

>>> import numpy
>>> numpy.NAN == numpy.NAN
False
>>> numpy.NAN is numpy.NAN
True
>>> numpy.NAN in [numpy.NAN]
True

To distinguish between these unusual cases you could use any() like:

>>> lst = [numpy.NAN, 1 , 2]
>>> any(element == numpy.NAN for element in lst)
False
>>> any(element is numpy.NAN for element in lst)
True 

Note the in logic for lists with any() would be:

any(element is target or element == target for element in lst)

However, I should emphasize that this is an edge case, and for the vast majority of cases the in operator is highly optimised and exactly what you want of course (either with a list or with a set).

仙女山的月亮 2024-12-14 08:47:39

听起来您的应用程序可能会从使用布隆过滤器数据结构中获得优势。

简而言之,布隆过滤器查找可以非常快速地告诉您某个值是否绝对不存在于集合中。否则,您可以进行较慢的查找来获取列表中可能存在的值的索引。因此,如果您的应用程序往往比“找到”结果更频繁地获得“未找到”结果,那么通过添加布隆过滤器您可能会看到速度的提高。

有关详细信息,维基百科提供了布隆过滤器如何工作的良好概述,并且在网络上搜索“python 布隆过滤器库”将提供至少一些有用的实现。

It sounds like your application might gain advantage from the use of a Bloom Filter data structure.

In short, a bloom filter look-up can tell you very quickly if a value is DEFINITELY NOT present in a set. Otherwise, you can do a slower look-up to get the index of a value that POSSIBLY MIGHT BE in the list. So if your application tends to get the "not found" result much more often then the "found" result, you might see a speed up by adding a Bloom Filter.

For details, Wikipedia provides a good overview of how Bloom Filters work, and a web search for "python bloom filter library" will provide at least a couple useful implementations.

无悔心 2024-12-14 08:47:39

检查列表中是否存在值

xslittlegrass的答案表明,当检查列表中是否存在多个值时,将列表转换为首先设置并在集合上使用 in 运算符比在列表上使用 in 运算符要快得多。另一方面,Nico的回答表明,在检查列表中是否存在单个值时,将列表转换为集合首先不值得,因为转换成一套本身就很昂贵。总之,这些答案意味着有一些值转换为集合并检查这些值是否存在于集合中比检查它们是否存在于列表中更快。

事实证明,这个数字非常小。下图显示了要检查的不同数量的值时,集合上的 in 和列表上的 in 之间的运行时差异。如图所示,平均而言,如果您需要检查列表中是否存在 5 个(或更多)值,则首先将该列表转换为集合并检查该集合会更快。1

< a href="https://i.sstatic.net/uSZ7U.png" rel="nofollow noreferrer">result


如果列表中存在值,则获取其索引

另一方面,如果您想检查列表中是否存在值并返回存在值的索引,则无论长度如何对于列表中的少量值,直接在 try- except 块中使用 list.index() 搜索它是最快的方法。特别是,如果您想查找单个值的索引,这是最快的选择。但是,平均而言,如果要搜索的值超过 10 个,则构建索引查找字典(如 DarrylG 的回答) 是最快的选项。2

result2


1 用于生成第一个图形的代码。

from random import sample
from timeit import repeat
from functools import partial
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

def list_in(a, b):
    return [x in b for x in a]


def set_in(a, b):
    s = set(b)
    return [x in s for x in a]


def profile(methods):
    sizes = range(1, 31, 2)
    colors = ['r', 'b', 'g']
    Ns = [100, 1000000]
    fig, axs = plt.subplots(1, len(Ns), figsize=(10, 4), facecolor='white')
    for N, ax in zip(Ns, axs):
        b = sample(range(N), k=N)
        times = {f.__name__: [] for f in methods}
        for size in sizes:
            a = sample(range(len(b)*3//2), k=size)
            for label, f in zip(times, methods):
                func = partial(f, a, b)
                times[label].append(min(repeat(func, number=10))/10)
        for (k, ts), c in zip(times.items(), colors):
            ax.plot(sizes, ts, f'{c}o-', label=k)
        ax.set_title(f'List size = {N:,d}', fontsize=18)
        ax.set_xlabel('Number of values to check', fontsize=14)
        ax.set_ylabel('Runtime', fontsize=14)
        ax.xaxis.set_major_locator(MultipleLocator(5))
        ax.legend()
    return fig


methods = [list_in, set_in]
fig = profile(methods)
fig.tight_layout();

2 用于生成第二个数字的代码。

def try_index(a, b):
    c = []
    for x in a:
        try:
            c.append(b.index(x))
        except ValueError:
            c.append(-1)
    return c


def set_in(a, b):
    s = set(b)
    return [b.index(x) if x in s else -1 for x in a]


def dict_lookup(a, b):
    # for faster lookups, convert dict to a function beforehand
    reverse_lookup = {x:i for i, x in enumerate(b)}.get
    return [reverse_lookup(x, -1) for x in a]


methods = [try_index, set_in, dict_lookup]
fig = profile(methods)
fig.suptitle('Get index of values that exist in a list', fontsize=20)
fig.tight_layout();

Check if values exist in a list

xslittlegrass's answer shows that when checking if multiple values exist in a list, converting the list into a set first and using the in operator on the set is much faster than using the in operator on lists. On the other hand, Nico's answer shows that when checking if a single value exists in a list, converting the list into a set first is not worth it, as converting to a set itself is costly to begin with. Together, these answers imply that there is some number of values where converting to a set and checking if those values exist in the set becomes faster than checking if they exist in the list.

It turns out, that number is very small. The figure below shows the runtime difference between in on sets and in on lists for different numbers of values to check. As it shows, on average, if you need to check whether 5 (or more) values exist in a list, it's faster to convert that list into a set first and check on the set instead.1

result


Get their indices if values exist in a list

On the other hand, if you want to check if values exist in a list and return the indices of the values that do, then regardless of the length of the list, for small number of values, directly searching for it using list.index() in a try-except block is the fastest way to do it. In particular, if you want to find the index of a single value, this is the fastest option. However, on average, if there are more than 10 values to search for, then constructing an index lookup dictionary (as in DarrylG's answer) is the fastest option.2

result2


1 Code used to produce the first figure.

from random import sample
from timeit import repeat
from functools import partial
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

def list_in(a, b):
    return [x in b for x in a]


def set_in(a, b):
    s = set(b)
    return [x in s for x in a]


def profile(methods):
    sizes = range(1, 31, 2)
    colors = ['r', 'b', 'g']
    Ns = [100, 1000000]
    fig, axs = plt.subplots(1, len(Ns), figsize=(10, 4), facecolor='white')
    for N, ax in zip(Ns, axs):
        b = sample(range(N), k=N)
        times = {f.__name__: [] for f in methods}
        for size in sizes:
            a = sample(range(len(b)*3//2), k=size)
            for label, f in zip(times, methods):
                func = partial(f, a, b)
                times[label].append(min(repeat(func, number=10))/10)
        for (k, ts), c in zip(times.items(), colors):
            ax.plot(sizes, ts, f'{c}o-', label=k)
        ax.set_title(f'List size = {N:,d}', fontsize=18)
        ax.set_xlabel('Number of values to check', fontsize=14)
        ax.set_ylabel('Runtime', fontsize=14)
        ax.xaxis.set_major_locator(MultipleLocator(5))
        ax.legend()
    return fig


methods = [list_in, set_in]
fig = profile(methods)
fig.tight_layout();

2 Code used to produce the second figure.

def try_index(a, b):
    c = []
    for x in a:
        try:
            c.append(b.index(x))
        except ValueError:
            c.append(-1)
    return c


def set_in(a, b):
    s = set(b)
    return [b.index(x) if x in s else -1 for x in a]


def dict_lookup(a, b):
    # for faster lookups, convert dict to a function beforehand
    reverse_lookup = {x:i for i, x in enumerate(b)}.get
    return [reverse_lookup(x, -1) for x in a]


methods = [try_index, set_in, dict_lookup]
fig = profile(methods)
fig.suptitle('Get index of values that exist in a list', fontsize=20)
fig.tight_layout();
烟酉 2024-12-14 08:47:39

这不是代码,而是非常快速搜索的算法。

如果您的列表和您要查找的值都是数字,那么这非常简单。如果是字符串:查看底部:

  • -让“n”为列表的长度
  • -可选步骤:如果您需要元素的索引:将第二列添加到列表中,其中包含元素的当前索引(0 到 n-1 ) - 请参阅稍后
  • 订购您的列表或其副本 (.sort())
  • 循环:
    • 将您的数字与列表的第 n/2 个元素进行比较
      • 如果更大,则在索引 n/2-n 之间再次循环
      • 如果较小,则在索引 0-n/2 之间再次循环
      • 如果相同:您找到了
  • 继续缩小列表,直到找到它或只有 2 个数字(在您要查找的元素的下方和上方)
  • 这将在对于 1.000.000 的列表最多 19 个步骤中找到任何元素(准确地说是 log(2)n)

如果您还需要原始元素您的号码的位置,在第二个索引列中查找它。

如果您的列表不是由数字组成,该方法仍然有效并且速度最快,但您可能需要定义一个可以比较/排序字符串的函数。

当然,这需要对sorted()方法进行投资,但是如果您继续重复使用相同的列表进行检查,那么可能是值得的。

This is not the code, but the algorithm for very fast searching.

If your list and the value you are looking for are all numbers, this is pretty straightforward. If strings: look at the bottom:

  • -Let "n" be the length of your list
  • -Optional step: if you need the index of the element: add a second column to the list with current index of elements (0 to n-1) - see later
  • Order your list or a copy of it (.sort())
  • Loop through:
    • Compare your number to the n/2th element of the list
      • If larger, loop again between indexes n/2-n
      • If smaller, loop again between indexes 0-n/2
      • If the same: you found it
  • Keep narrowing the list until you have found it or only have 2 numbers (below and above the one you are looking for)
  • This will find any element in at most 19 steps for a list of 1.000.000 (log(2)n to be precise)

If you also need the original position of your number, look for it in the second, index column.

If your list is not made of numbers, the method still works and will be fastest, but you may need to define a function which can compare/order strings.

Of course, this needs the investment of the sorted() method, but if you keep reusing the same list for checking, it may be worth it.

桃扇骨 2024-12-14 08:47:39

空间数据的边缘情况

可能有更快的算法来处理空间数据(例如,重构以使用 kd 树),但检查向量是否在数组中的特殊情况很有用:

  • 如果您有空间数据(即笛卡尔坐标)
  • 如果你有整数掩码(即数组过滤)

在这种情况下,我有兴趣知道由两个点定义的(无向)边是否在(无向)边的集合中,这样,其中

(pair in unique_pairs) | (pair[::-1] in unique_pairs) for pair in pairs

pair构成两个任意长度的向量(即形状(2,N))。

如果这些向量之间的距离有意义,则测试可以用浮点不等式来表示,例如

test_result = Norm(v1 - v2) < Tol

“列表中存在值”就是 any(test_result)

下面是整数对和 R3 向量对的示例代码和虚拟测试集生成器。

# 3rd party
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

# optional
try:
    from tqdm import tqdm
except ModuleNotFoundError:
    def tqdm(X, *args, **kwargs):
        return X
    print('tqdm not found. tqdm is a handy progress bar module.')
    

def get_float_r3_pairs(size):
    """ generate dummy vector pairs in R3  (i.e. case of spatial data) """
    coordinates = np.random.random(size=(size, 3))
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs
    
        
def get_int_pairs(size):
    """ generate dummy integer pairs (i.e. case of array masking) """
    coordinates = np.random.randint(0, size, size)
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs


def float_tol_pair_in_pairs(pair:np.ndarray, pairs:np.ndarray) -> np.ndarray:
    """
    True if abs(a0 - b0) <= tol & abs(a1 - b1) <= tol for (ai1, aj2), (bi1, bj2)
    in [(a01, a02), ... (aik, ajl)]
    
    NB this is expected to be called in iteration so no sanitization is performed.

    Parameters
    ----------
    pair : np.ndarray
        pair of vectors with shape (2, M)
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        (pair in pairs) | (pair[::-1] in pairs).
    """
    m1 = np.sum( abs(LA.norm(pairs - pair, axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    m2 = np.sum( abs(LA.norm(pairs - pair[::-1], axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    return m1 | m2


def get_unique_pairs(pairs:np.ndarray) -> np.ndarray:
    """
    apply float_tol_pair_in_pairs for pair in pairs
    
    Parameters
    ----------
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        pair if not ((pair in rv) | (pair[::-1] in rv)) for pair in pairs

    """
    pairs = np.asarray(pairs).reshape((len(pairs), 2, -1))
    rv = [pairs[0]]
    for pair in tqdm(pairs[1:], desc='finding unique pairs...'):
        if not any(float_tol_pair_in_pairs(pair, rv)):
            rv.append(pair)
    return np.array(rv)

计时结果

Edge case for spatial data

There are probably faster algorithms for handling spatial data (e.g. refactoring to use a k-d tree), but the special case of checking if a vector is in an array is useful:

  • If you have spatial data (i.e. cartesian coordinates)
  • If you have integer masks (i.e. array filtering)

In this case, I was interested in knowing if an (undirected) edge defined by two points was in a collection of (undirected) edges, such that

(pair in unique_pairs) | (pair[::-1] in unique_pairs) for pair in pairs

where pair constitutes two vectors of arbitrary length (i.e. shape (2,N)).

If the distance between these vectors is meaningful, then the test can be expressed by a floating point inequality like

test_result = Norm(v1 - v2) < Tol

and "Value exists in List" is simply any(test_result).

Example code and dummy test set generators for integer pairs and R3 vector pairs are below.

# 3rd party
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

# optional
try:
    from tqdm import tqdm
except ModuleNotFoundError:
    def tqdm(X, *args, **kwargs):
        return X
    print('tqdm not found. tqdm is a handy progress bar module.')
    

def get_float_r3_pairs(size):
    """ generate dummy vector pairs in R3  (i.e. case of spatial data) """
    coordinates = np.random.random(size=(size, 3))
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs
    
        
def get_int_pairs(size):
    """ generate dummy integer pairs (i.e. case of array masking) """
    coordinates = np.random.randint(0, size, size)
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs


def float_tol_pair_in_pairs(pair:np.ndarray, pairs:np.ndarray) -> np.ndarray:
    """
    True if abs(a0 - b0) <= tol & abs(a1 - b1) <= tol for (ai1, aj2), (bi1, bj2)
    in [(a01, a02), ... (aik, ajl)]
    
    NB this is expected to be called in iteration so no sanitization is performed.

    Parameters
    ----------
    pair : np.ndarray
        pair of vectors with shape (2, M)
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        (pair in pairs) | (pair[::-1] in pairs).
    """
    m1 = np.sum( abs(LA.norm(pairs - pair, axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    m2 = np.sum( abs(LA.norm(pairs - pair[::-1], axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    return m1 | m2


def get_unique_pairs(pairs:np.ndarray) -> np.ndarray:
    """
    apply float_tol_pair_in_pairs for pair in pairs
    
    Parameters
    ----------
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        pair if not ((pair in rv) | (pair[::-1] in rv)) for pair in pairs

    """
    pairs = np.asarray(pairs).reshape((len(pairs), 2, -1))
    rv = [pairs[0]]
    for pair in tqdm(pairs[1:], desc='finding unique pairs...'):
        if not any(float_tol_pair_in_pairs(pair, rv)):
            rv.append(pair)
    return np.array(rv)

timing results

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文