python中对象的高效数据结构,用于基于任何对象成员变量进行查找

发布于 2024-11-08 17:06:50 字数 572 浏览 0 评论 0原文

我需要存储具有多个(> 2)整数成员变量的对象,并使用任何成员变量作为搜索键进行快速查找。

为了更容易说明,假设对象是 3 个整数的元组,我需要使用元组的任何元素作为此类元组列表中的键进行快速查找:

collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]

查找将类似于:

collection.lookup_by_first_element(1) # Return (1, 200, 9)
collection.lookup_by_second_element(300) # Return (2, 300, 8)
collection.lookup_by_third_element(250) # Return None

我需要查找快速/高效。到目前为止,我最好的选择是使用内存中的 sqlite 数据库,其中包含三个元组元素的三列,并将索引放在所有三列上。

搜索树也可以,但是它们有一个用于查找的键,我不知道如何基于多个键进行查找。你会怎么做?

I need to store objects that have a number (>2) of integer member variables and do quick look-ups using any member variables as a search key.

For easier illustration let's say the objects are tuples of 3 integers and I need to do quick look-ups using any element of a tuple as a key in a list of such tuples:

collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]

Look-ups would be like:

collection.lookup_by_first_element(1) # Return (1, 200, 9)
collection.lookup_by_second_element(300) # Return (2, 300, 8)
collection.lookup_by_third_element(250) # Return None

I need the lookups to be fast/efficient. My best bet so far is to use an in memory sqlite database with three columns for the three tuple elements and put indexes on all three columns.

A search tree would also do, but those have one key for lookup and I don't see how could I do lookups based on more than one key. How would you do it?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

花辞树 2024-11-15 17:06:51

这是一个简单的解决方案。您可以轻松地将其放入类中并提供更简洁的界面。

>>> from collections import defaultdict
>>> collection = [(1, 200, 9),
...               (2, 300, 8),
...               (3, 400, 7)]
>>> keyed_dict = defaultdict(list)
>>> for tup in collection:
...     for i, e in enumerate(tup):
...         keyed_dict[(i, e)].append(tup)
... 
>>> keyed_dict[(1, 300)]
[(2, 300, 8)]

更新

就其价值而言,上述方法比查找的numpy解决方案

from timeit import timeit

setup_code = '''
import numpy

clen = {0}  # use .format() to fill this value
collection = [(n, (n + 1) * 100, clen - n) for n in xrange(clen)]

npcollection = numpy.array(collection)

def numpy_lookup(collection, column, value):
    if numpy.any(collection[:, column] == value): return collection[collection[:, column] == value]
    return 'None'

keyed_dict = dict()
for tup in collection:
    for i, e in enumerate(tup):
        keyed_dict[i, e] = tup
'''

for e in range(5):
    setup = setup_code.format(str(10 ** e))
    kd_stmt = '[keyed_dict[0, n] for n in range({0})]'.format(str(10 ** e))
    np_stmt = '[numpy_lookup(npcollection, 0, n) for n in range({0})]'.format(str(10 ** e))
    print 'using keyed_dict: ',
    print timeit(stmt=kd_stmt, setup=setup, number=10)
    print 'using numpy_lookup: ',
    print timeit(stmt=np_stmt.format(str(10 ** e)), setup=setup, number=10)

输出:

using keyed_dict:  1.09672546387e-05
using numpy_lookup:  0.000250101089478
using keyed_dict:  3.00407409668e-05
using numpy_lookup:  0.00193691253662
using keyed_dict:  0.000190019607544
using numpy_lookup:  0.0199580192566
using keyed_dict:  0.00195384025574
using numpy_lookup:  0.317503929138
using keyed_dict:  0.0319399833679
using numpy_lookup:  15.0127439499

This is a simple solution. You could easily put this in a class and provide a neater interface.

>>> from collections import defaultdict
>>> collection = [(1, 200, 9),
...               (2, 300, 8),
...               (3, 400, 7)]
>>> keyed_dict = defaultdict(list)
>>> for tup in collection:
...     for i, e in enumerate(tup):
...         keyed_dict[(i, e)].append(tup)
... 
>>> keyed_dict[(1, 300)]
[(2, 300, 8)]

Update:

For what it's worth, the above is way faster than the numpy solution for lookup:

from timeit import timeit

setup_code = '''
import numpy

clen = {0}  # use .format() to fill this value
collection = [(n, (n + 1) * 100, clen - n) for n in xrange(clen)]

npcollection = numpy.array(collection)

def numpy_lookup(collection, column, value):
    if numpy.any(collection[:, column] == value): return collection[collection[:, column] == value]
    return 'None'

keyed_dict = dict()
for tup in collection:
    for i, e in enumerate(tup):
        keyed_dict[i, e] = tup
'''

for e in range(5):
    setup = setup_code.format(str(10 ** e))
    kd_stmt = '[keyed_dict[0, n] for n in range({0})]'.format(str(10 ** e))
    np_stmt = '[numpy_lookup(npcollection, 0, n) for n in range({0})]'.format(str(10 ** e))
    print 'using keyed_dict: ',
    print timeit(stmt=kd_stmt, setup=setup, number=10)
    print 'using numpy_lookup: ',
    print timeit(stmt=np_stmt.format(str(10 ** e)), setup=setup, number=10)

output:

using keyed_dict:  1.09672546387e-05
using numpy_lookup:  0.000250101089478
using keyed_dict:  3.00407409668e-05
using numpy_lookup:  0.00193691253662
using keyed_dict:  0.000190019607544
using numpy_lookup:  0.0199580192566
using keyed_dict:  0.00195384025574
using numpy_lookup:  0.317503929138
using keyed_dict:  0.0319399833679
using numpy_lookup:  15.0127439499
呆橘 2024-11-15 17:06:51

senderle 是对的(我投了赞成票),但我想详细说明(不仅仅是在评论中)。

字典查找的时间复杂度为 O(1),而且速度非常快(基本上,您的密钥会变成一个散列,该散列可以寻址内存中的特定位置,以便立即从中检索数据)。

相比之下,通过搜索数组来查找值的速度至少为 O(N),因此对于较大的数组,找到正确的值将花费更长的时间。 (例如,您必须筛选所有 N 个键,找到正确的一个,然后可以返回元组。)

因此,如果您的键如您所说是唯一的,那么创建一个基于每个键进行索引的大型字典是有意义的您可以用来查找的键。当然,您必须在字典中表示 m 项(在您的情况下为 m=3)的每个元组,m 次,而使用单个数组时,只需在数组中表示一次。

所以你想定义一个Collection类

class Collection(object):
    def __init__(self, collection):
        self.collection_dict = dict()
        for tup in collection:
            for i, v in enumerate(tup):
                self.collection_dict[(i, v)] = tup
    def lookup_by_first_element(self, e):
        return self.collection_dict.get((0, e), None)
    def lookup_by_second_element(self, e):
        return self.collection_dict.get((1, e), None)
    def lookup_by_third_element(self, e):
        return self.collection_dict.get((2, e), None)


collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]
c = Collection(collection)

内部c.collection_dict是:

{(0, 1): (1, 200, 9),
 (0, 2): (2, 300, 8),
 (0, 3): (3, 400, 7),
 (1, 200): (1, 200, 9),
 (1, 300): (2, 300, 8),
 (1, 400): (3, 400, 7),
 (2, 7): (3, 400, 7),
 (2, 8): (2, 300, 8),
 (2, 9): (1, 200, 9)}

并且你的查找按照你的要求工作

>>> c.lookup_by_first_element(1) == (1, 200, 9)
True
>>> c.lookup_by_second_element(300) == (2, 300, 8)
True
>>> c.lookup_by_third_element(250) is None
True

senderle is right (I upvoted), but I want to elaborate (and more than just in a comment).

Dictionary lookups are O(1) and really fast (basically your key gets turned into a hash that addresses a specific location in memory to immediately retrieve your data from).

In contrast looking for a value by searching through an array is slow at least O(N), so for larger arrays it will take longer to find the right value. (E.g., you have to sift through all N keys, find the right one, and then can return the tuple.)

So if your keys are unique as you say, it makes sense to just create a large dictionary that is indexed based on every key you may use to lookup. Granted is you will have to represent each tuple of m items (m=3 in your case), m times in the dictionary while with a single array it only had to be represented once in the array.

So you want to define a Collection class

class Collection(object):
    def __init__(self, collection):
        self.collection_dict = dict()
        for tup in collection:
            for i, v in enumerate(tup):
                self.collection_dict[(i, v)] = tup
    def lookup_by_first_element(self, e):
        return self.collection_dict.get((0, e), None)
    def lookup_by_second_element(self, e):
        return self.collection_dict.get((1, e), None)
    def lookup_by_third_element(self, e):
        return self.collection_dict.get((2, e), None)


collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]
c = Collection(collection)

The internal c.collection_dict is:

{(0, 1): (1, 200, 9),
 (0, 2): (2, 300, 8),
 (0, 3): (3, 400, 7),
 (1, 200): (1, 200, 9),
 (1, 300): (2, 300, 8),
 (1, 400): (3, 400, 7),
 (2, 7): (3, 400, 7),
 (2, 8): (2, 300, 8),
 (2, 9): (1, 200, 9)}

And your lookups work as you requested

>>> c.lookup_by_first_element(1) == (1, 200, 9)
True
>>> c.lookup_by_second_element(300) == (2, 300, 8)
True
>>> c.lookup_by_third_element(250) is None
True
惯饮孤独 2024-11-15 17:06:51

使用numpy:

>>> import numpy as np
>>> collection = [(1, 200, 9),
...               (2, 300, 8),
...               (3, 400, 7)]
>>> collection = np.array(collection)
>>> def f(d, c, v):
...     # d: collection, c: column, v: value
...     if np.any(d[:, c]==v): return d[d[:, c]==v]
...     return 'None'
...
>>> f(collection, 0, 1)
array([[  1, 200,   9]])
>>> f(collection, 1, 300)
array([[  2, 300,   8]])
>>> f(collection, 2, 250)
'None'

Using numpy:

>>> import numpy as np
>>> collection = [(1, 200, 9),
...               (2, 300, 8),
...               (3, 400, 7)]
>>> collection = np.array(collection)
>>> def f(d, c, v):
...     # d: collection, c: column, v: value
...     if np.any(d[:, c]==v): return d[d[:, c]==v]
...     return 'None'
...
>>> f(collection, 0, 1)
array([[  1, 200,   9]])
>>> f(collection, 1, 300)
array([[  2, 300,   8]])
>>> f(collection, 2, 250)
'None'
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文