python中是否有相当于非唯一集的数据结构?

发布于 2024-11-01 14:48:27 字数 220 浏览 4 评论 0原文

我有一个非常大的整数列表列表,我想对其进行“hash()”以提高搜索速度。每个嵌套列表的结果哈希值需要独立于整数的顺序,并且仅依赖于列表中的值。这表明(冻结)集合作为适合哈希的数据结构。但是,我需要保留每个整数值,无论是否重复,这对于集合来说是一个阻碍。

因此,这让我对列表进行排序,转换为元组并进行散列,这非常慢,我想有更好的策略。

如果有任何关于如何更有效地做到这一点的建议,我将不胜感激。

I have a very large list of lists of integers which I would like to "hash()" to improve search speed. The resulting hashed value for each of the nested lists needs to be independent of the order of the integers and only on the values in the list. This suggested a (frozen) set as a suitable data structure to hash. However, I need to keep every integer value, whether or not duplicated, which is a show-stopper for sets.

So, that left me ordering the list, converting to a tuple and hashing which is quite slow and I imagine there are better strategies.

I'd appreciate any suggestions on how to do this more efficiently.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

冷默言语 2024-11-08 14:48:27

字典是哈希值。

>>> def bag_random(d, n):
...     x = random.randint(0, n)
...     if x in d:
...             d[x] += 1
...     else:
...             d[x] = 1
... 
>>> a = {}
>>> for i in xrange(10**6):
...     bag_random(a, 100)
... 
>>> a
{0: 9856, 1: 9787, 2: 9944, 3: 9822, 4: 9978, 5: 9915, 6: 9915, 7: 9860, 8: 9926, 9: 9756, 10: 9914, 11: 10030, 12: 10009, 13: 9803, 14: 9918, 15: 10136, 16: 9818, 17: 9868, 18: 9893, 19: 9971, 20: 9998, 21: 9982, 22: 9884, 23: 9806, 24: 9998, 25: 9926, 26: 9977, 27: 10011, 28: 10030, 29: 9899, 30: 9808, 31: 9825, 32: 9980, 33: 9812, 34: 9928, 35: 9827, 36: 9934, 37: 9883, 38: 9913, 39: 9893, 40: 9822, 41: 9714, 42: 9871, 43: 9954, 44: 9989, 45: 9694, 46: 9878, 47: 9984, 48: 9893, 49: 9928, 50: 10093, 51: 9881, 52: 9828, 53: 9660, 54: 9884, 55: 9745, 56: 10048, 57: 9845, 58: 9916, 59: 9933, 60: 9944, 61: 9979, 62: 9992, 63: 9635, 64: 9811, 65: 9900, 66: 9950, 67: 9744, 68: 9829, 69: 10037, 70: 9929, 71: 9811, 72: 9830, 73: 10056, 74: 9957, 75: 9992, 76: 9777, 77: 9942, 78: 9809, 79: 9734, 80: 9855, 81: 10021, 82: 9914, 83: 10009, 84: 10018, 85: 9961, 86: 10036, 87: 9849, 88: 9951, 89: 9770, 90: 9795, 91: 9899, 92: 9975, 93: 9935, 94: 10037, 95: 9992, 96: 9868, 97: 10014, 98: 9689, 99: 9883, 100: 9878}

在速度不太快的桌面上花了一秒钟左右的时间。

Dictionaries are hashes.

>>> def bag_random(d, n):
...     x = random.randint(0, n)
...     if x in d:
...             d[x] += 1
...     else:
...             d[x] = 1
... 
>>> a = {}
>>> for i in xrange(10**6):
...     bag_random(a, 100)
... 
>>> a
{0: 9856, 1: 9787, 2: 9944, 3: 9822, 4: 9978, 5: 9915, 6: 9915, 7: 9860, 8: 9926, 9: 9756, 10: 9914, 11: 10030, 12: 10009, 13: 9803, 14: 9918, 15: 10136, 16: 9818, 17: 9868, 18: 9893, 19: 9971, 20: 9998, 21: 9982, 22: 9884, 23: 9806, 24: 9998, 25: 9926, 26: 9977, 27: 10011, 28: 10030, 29: 9899, 30: 9808, 31: 9825, 32: 9980, 33: 9812, 34: 9928, 35: 9827, 36: 9934, 37: 9883, 38: 9913, 39: 9893, 40: 9822, 41: 9714, 42: 9871, 43: 9954, 44: 9989, 45: 9694, 46: 9878, 47: 9984, 48: 9893, 49: 9928, 50: 10093, 51: 9881, 52: 9828, 53: 9660, 54: 9884, 55: 9745, 56: 10048, 57: 9845, 58: 9916, 59: 9933, 60: 9944, 61: 9979, 62: 9992, 63: 9635, 64: 9811, 65: 9900, 66: 9950, 67: 9744, 68: 9829, 69: 10037, 70: 9929, 71: 9811, 72: 9830, 73: 10056, 74: 9957, 75: 9992, 76: 9777, 77: 9942, 78: 9809, 79: 9734, 80: 9855, 81: 10021, 82: 9914, 83: 10009, 84: 10018, 85: 9961, 86: 10036, 87: 9849, 88: 9951, 89: 9770, 90: 9795, 91: 9899, 92: 9975, 93: 9935, 94: 10037, 95: 9992, 96: 9868, 97: 10014, 98: 9689, 99: 9883, 100: 9878}

Took a second or so on a not wildly fast desktop.

牵你手 2024-11-08 14:48:27

您的数据结构很好,您需要的是一种计算满足您要求的散列的方法:整数的顺序并不重要,但需要尊重重复项,并且需要很快。

计算数字的乘积怎么样?得到的数字可以很好地作为哈希值。如果您想避免陷入会减慢速度的长整型,则可以将其保留在 32 位整数内。唯一的问题是零,但你可以跳过它们,它不会破坏散列,只会使其不那么具有区分性。

LIMIT = 999999937 # largest 9-digit prime
def list_hash(l):
    h = 1
    for i in l:
        if i:
            h *= i
            h %= LIMIT
    return h

Your data structure is fine, what you need is a way to compute a hash over it that meets your requirements: order of integers doesn't matter, but duplicates need to be respected, and it needs to be fast.

How about computing the product of the numbers? The resulting number will work well as a hash. You can keep it within a 32-bit integer if you want to avoid slipping into longs that will slow you down. The only problem is zeros, but you can skip them, it won't break the hash, only make it less discriminating.

LIMIT = 999999937 # largest 9-digit prime
def list_hash(l):
    h = 1
    for i in l:
        if i:
            h *= i
            h %= LIMIT
    return h
与之呼应 2024-11-08 14:48:27

创建一个包含计数的字典。 (如果你的Python版本足够新,你可以使用Counter类来代替。从项目列表中构造一个集合并散列它,

counter = collections.defaultdict(int)
for item in items:
     counter[item] += 1
return hash(frozenset(counter.items()))

但我不知道它会比你已经完成的更有效。

因为这个是一个散列,它不需要表示某些数字重复的事实,因此您可以使用:

return hash(frozenset(items))

Create a dictionary with the counts. (If your version of python is new enough you can use the Counter class instead. Construct a set out of the items list and hash that

counter = collections.defaultdict(int)
for item in items:
     counter[item] += 1
return hash(frozenset(counter.items()))

But I don't know that it will be any more efficient that you've already done.

Since this is a hash, it doesn't need to represent the fact that some of your numbers are duplicated. So you could just use:

return hash(frozenset(items))
故事还在继续 2024-11-08 14:48:27

按照 Winston Ewert 建议使用计数器还有其他好处。您不仅可以自然地拥有您所描述的数据结构(Counter 继承自 dict,因此是散列的),而且您还可以做一些巧妙的事情,例如算术可能对你的情况有用。

from collections import Counter
set1 = Counter("hello")
set2 = Counter("hell")

print(set1)
>> Counter({'l': 2, 'h': 1, 'e': 1, 'o': 1})

print(set1)
>> Counter({'l': 2, 'h': 1, 'e': 1})

set1 - set2
>> Counter({'o': 1})

Using Counters as Winston Ewert suggested has other perks. Not only can you naturally have a data structure that you described (Counter inherits from dict and is therefore hashed), but you can also do some neat stuff, like arithmetic that would probably be useful for your case.

from collections import Counter
set1 = Counter("hello")
set2 = Counter("hell")

print(set1)
>> Counter({'l': 2, 'h': 1, 'e': 1, 'o': 1})

print(set1)
>> Counter({'l': 2, 'h': 1, 'e': 1})

set1 - set2
>> Counter({'o': 1})
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文