python中是否有相当于非唯一集的数据结构?
我有一个非常大的整数列表列表,我想对其进行“hash()”以提高搜索速度。每个嵌套列表的结果哈希值需要独立于整数的顺序,并且仅依赖于列表中的值。这表明(冻结)集合作为适合哈希的数据结构。但是,我需要保留每个整数值,无论是否重复,这对于集合来说是一个阻碍。
因此,这让我对列表进行排序,转换为元组并进行散列,这非常慢,我想有更好的策略。
如果有任何关于如何更有效地做到这一点的建议,我将不胜感激。
I have a very large list of lists of integers which I would like to "hash()" to improve search speed. The resulting hashed value for each of the nested lists needs to be independent of the order of the integers and only on the values in the list. This suggested a (frozen) set as a suitable data structure to hash. However, I need to keep every integer value, whether or not duplicated, which is a show-stopper for sets.
So, that left me ordering the list, converting to a tuple and hashing which is quite slow and I imagine there are better strategies.
I'd appreciate any suggestions on how to do this more efficiently.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
字典是哈希值。
在速度不太快的桌面上花了一秒钟左右的时间。
Dictionaries are hashes.
Took a second or so on a not wildly fast desktop.
您的数据结构很好,您需要的是一种计算满足您要求的散列的方法:整数的顺序并不重要,但需要尊重重复项,并且需要很快。
计算数字的乘积怎么样?得到的数字可以很好地作为哈希值。如果您想避免陷入会减慢速度的长整型,则可以将其保留在 32 位整数内。唯一的问题是零,但你可以跳过它们,它不会破坏散列,只会使其不那么具有区分性。
Your data structure is fine, what you need is a way to compute a hash over it that meets your requirements: order of integers doesn't matter, but duplicates need to be respected, and it needs to be fast.
How about computing the product of the numbers? The resulting number will work well as a hash. You can keep it within a 32-bit integer if you want to avoid slipping into longs that will slow you down. The only problem is zeros, but you can skip them, it won't break the hash, only make it less discriminating.
创建一个包含计数的字典。 (如果你的Python版本足够新,你可以使用Counter类来代替。从项目列表中构造一个集合并散列它,
但我不知道它会比你已经完成的更有效。
因为这个是一个散列,它不需要表示某些数字重复的事实,因此您可以使用:
Create a dictionary with the counts. (If your version of python is new enough you can use the Counter class instead. Construct a set out of the items list and hash that
But I don't know that it will be any more efficient that you've already done.
Since this is a hash, it doesn't need to represent the fact that some of your numbers are duplicated. So you could just use:
按照 Winston Ewert 建议使用计数器还有其他好处。您不仅可以自然地拥有您所描述的数据结构(
Counter
继承自dict
,因此是散列的),而且您还可以做一些巧妙的事情,例如算术可能对你的情况有用。Using
Counter
s as Winston Ewert suggested has other perks. Not only can you naturally have a data structure that you described (Counter
inherits fromdict
and is therefore hashed), but you can also do some neat stuff, like arithmetic that would probably be useful for your case.