Python: fast dictionary with big integer keys
I have got a list of more than 10,000 int items. The values of the items can be very high, up to 10^27. Now I want to create all pairs of the items and calculate their sums. Then I want to look for different pairs with the same sum.
For example:
l[0] = 4
l[1] = 3
l[2] = 6
l[3] = 1
...
pairs[10] = [(0,2)] # 10 is the sum of the values of l[0] and l[2]
pairs[7] = [(0,1), (2,3)] # 7 is the sum of the values of l[0] and l[1] or l[2] and l[3]
pairs[5] = [(0,3)]
pairs[9] = [(1,2)]
...
The contents of pairs[7] is what I am looking for. It gives me two pairs with the same value sum.
I have implemented it as follows - and I wonder if it can be done faster. Currently, for 10,000 items it takes more than 6 hours on a fast machine. (As I said, the values of l, and so the keys of pairs, are ints up to 10^27.)
l = [4, 3, 6, 1]
pairs = {}
for i in range(len(l)):
    for j in range(i + 1, len(l)):
        s = l[i] + l[j]
        if s not in pairs:
            pairs[s] = []
        pairs[s].append((i, j))
# pairs = {9: [(1, 2)], 10: [(0, 2)], 4: [(1, 3)], 5: [(0, 3)], 7: [(0, 1), (2, 3)]}
Edit: I want to add some background, as asked by Simon Stelling.
The goal is to find Formal Analogies like
lays : laid :: says : said
within a list of words like
[ lays, lay, laid, says, said, foo, bar ... ]
I already have a function analogy(a,b,c,d) giving True if a : b :: c : d.
. However, I would need to check all possible quadruples created from the list, which would be a complexity of around O((n^4)/2).
As a pre-filter, I want to use the char-count property. It says that every char has the same count in (a,d) and in (b,c). For instance, in "layssaid" we have got 2 a's, and so we do in "laidsays".
So the idea until now was:
- for every word, create a "char count vector" and represent it as an integer (the items in the list l)
- create all pairings in pairs and see if there are "pair clusters", i.e. more than one pair for a particular char count vector sum.
And it works, it's just slow. The complexity is down to around O((n^2)/2), but this is still a lot, and especially the dictionary lookups and inserts are done that often.
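The char-count-vector encoding described above can be sketched as follows; the helper name char_count_int and the base-64 packing are my own assumptions, not the asker's actual code (any base larger than the maximum total count of one letter works, and a wide base explains keys up to 10^27):

```python
def char_count_int(word):
    # hypothetical helper: pack a word's letter counts into one integer,
    # one base-64 "digit" per letter so sums of two words never carry over
    n = 0
    for ch in word:
        n += 64 ** (ord(ch) - ord('a'))
    return n

# the pre-filter: the letter counts of a+d must equal those of b+c, i.e.
# char_count_int(a) + char_count_int(d) == char_count_int(b) + char_count_int(c)
assert char_count_int("lays") + char_count_int("said") == \
       char_count_int("laid") + char_count_int("says")
```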
There are the trivial optimizations like caching constant values in a local variable and using xrange instead of range. However, it is probably far more wise to not pre-calculate the list and instead optimize the method on a concept level. What is the intrinsic goal you want to achieve? Do you really just want to calculate what you do? Or are you going to use that result for something else? What is that something else?
Just a hint. Have a look at itertools.combinations.
This is not exactly what you are looking for (because it stores pairs of values, not of indexes), but it can be a starting point:
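A minimal sketch of that idea (the defaultdict is my own addition to avoid the membership test; note the keys map to value pairs such as (4, 3), not index pairs):

```python
from itertools import combinations
from collections import defaultdict

l = [4, 3, 6, 1]
pairs = defaultdict(list)
for a, b in combinations(l, 2):   # all unordered value pairs, in list order
    pairs[a + b].append((a, b))
# pairs[7] now holds the value pairs (4, 3) and (6, 1)
```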
The above comment from SimonStelling is correct; generating all possible pairs is just fundamentally slow, and there's nothing you can do about it aside from altering your algorithm. The correct function to use from itertools is product; and you can get some minor improvements from not creating extra variables or doing unnecessary list indexes, but underneath the hood these are still all O(n^2). Here's how I would do it:
Finally, I came up with my own solution, which takes only half of the calculation time on average.
The basic idea: instead of reading and writing into the growing dictionary n^2 times, I first collect all the sums in a list. Then I sort the list. Within the sorted list, I then look for identical neighbouring items.
This is the code:
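The snippet itself did not survive here, so this is a sketch of that sort-based idea under my own naming (collect (sum, i, j) triples, sort so equal sums become neighbours, then scan for runs):

```python
l = [4, 3, 6, 1]

# collect (sum, i, j) triples instead of writing into a dict n^2 times
sums = []
for i in range(len(l)):
    for j in range(i + 1, len(l)):
        sums.append((l[i] + l[j], i, j))
sums.sort()  # equal sums are now adjacent

# scan the sorted list for runs of identical sums
clusters = []
k = 0
while k < len(sums):
    run = [sums[k]]
    while k + 1 < len(sums) and sums[k + 1][0] == sums[k][0]:
        k += 1
        run.append(sums[k])
    if len(run) > 1:  # only sums produced by more than one pair
        clusters.append([(i, j) for _, i, j in run])
    k += 1
# for the example list, the only cluster is the index pairs summing to 7
```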