Python: fast dictionary of big int keys

Published 2024-11-05 14:56:43


I have a list of more than 10,000 int items. The values of the items can be very large, up to 10^27. Now I want to create all pairs of the items and calculate their sums. Then I want to look for different pairs with the same sum.

For example:

l[0] = 4
l[1] = 3
l[2] = 6
l[3] = 1
...

pairs[10] = [(0,2)] # 10 is the sum of the values of l[0] and l[2]
pairs[7] = [(0,1), (2,3)] # 7 is the sum of the values of l[0] and l[1] or l[2] and l[3]
pairs[5] = [(0,3)]
pairs[9] = [(1,2)]
...

The content of pairs[7] is what I am looking for: it gives me two pairs with the same value sum.

I have implemented it as follows, and I wonder whether it can be done faster. Currently, for 10,000 items it takes more than 6 hours on a fast machine. (As I said, the values of l, and hence the keys of pairs, are ints up to 10^27.)

l = [4, 3, 6, 1]
pairs = {}
for i in range(len(l)):
    for j in range(i + 1, len(l)):
        s = l[i] + l[j]
        if s not in pairs:
            pairs[s] = []
        pairs[s].append((i, j))

# pairs = {9: [(1, 2)], 10: [(0, 2)], 4: [(1, 3)], 5: [(0, 3)], 7: [(0, 1), (2, 3)]}

Edit: I want to add some background, as asked by Simon Stelling.

The goal is to find Formal Analogies like

lays : laid :: says : said

within a list of words like

[ lays, lay, laid, says, said, foo, bar ... ]

I already have a function analogy(a,b,c,d) giving True if a : b :: c : d. However, I would need to check all possible quadruples created from the list, which would have a complexity of around O((n^4)/2).

As a pre-filter, I want to use the char-count property. It says that every char has the same count in (a,d) as in (b,c). For instance, in "layssaid" we have got 2 a's, and so we do in "laidsays".

So the idea until now was

  • for every word to create a "char count vector" and represent it as an integer (the items in the list l)
  • create all pairings in pairs and see if there are "pair clusters", i.e. more than one pair for a particular char count vector sum.

And it works, it's just slow. The complexity is down to around O((n^2)/2), but that is still a lot, and in particular the dictionary lookups and inserts are done that often.
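As an illustration of the pre-filter idea (this is my reading of the question, not code from it): one plausible way to pack a word's character counts into a single integer is to give each letter a fixed-width bit field, so that adding two words' integers adds their letter counts field by field. The exact packing the question uses is not shown; this is just one possible encoding.

```python
def char_count_vector(word, width=8):
    # 26 fields of `width` bits each, one per letter a-z;
    # `width` must be wide enough that summed counts never overflow a field
    v = 0
    for ch in word:
        v += 1 << (width * (ord(ch) - ord('a')))
    return v

# "lays"+"said" and "laid"+"says" contain the same multiset of letters,
# so their vector sums collide -- exactly what the pre-filter exploits
assert char_count_vector("lays") + char_count_vector("said") == \
       char_count_vector("laid") + char_count_vector("says")
```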


Comments (4)

尤怨 2024-11-12 14:56:43


There are trivial optimizations, like caching constant values in local variables and using xrange instead of range:

pairs = {}
len_l = len(l)
for i in xrange(len_l):
    for j in xrange(i+1, len_l):
        s = l[i] + l[j]
        res = pairs.setdefault(s, [])
        res.append((i,j))

However, it is probably far wiser to not pre-calculate the list and instead optimize the method at a conceptual level. What is the intrinsic goal you want to achieve? Do you really just want to calculate what you do? Or are you going to use that result for something else? What is that something else?
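A related micro-optimization the answer does not mention (my addition, not from the original): collections.defaultdict avoids constructing the throwaway empty list that setdefault(s, []) builds on every call, including hits:

```python
from collections import defaultdict

l = [4, 3, 6, 1]
pairs = defaultdict(list)  # missing keys get a fresh list automatically
for i in range(len(l)):
    for j in range(i + 1, len(l)):
        pairs[l[i] + l[j]].append((i, j))

# dict(pairs) == {7: [(0, 1), (2, 3)], 10: [(0, 2)], 5: [(0, 3)],
#                 9: [(1, 2)], 4: [(1, 3)]}
```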

往日 2024-11-12 14:56:43


Just a hint: have a look at itertools.combinations.

This is not exactly what you are looking for (because it stores pairs of values, not of indexes), but it can be starting code:

from itertools import combinations
for (a, b) in combinations(l, 2):
    pairs.setdefault(a + b, []).append((a, b))
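To keep indexes as the question asks, the same idiom works over enumerate(l); this is a small extension of the answer's snippet, not part of the original:

```python
from itertools import combinations

l = [4, 3, 6, 1]
pairs = {}
# pairing the (index, value) tuples keeps the original positions
for (i, a), (j, b) in combinations(enumerate(l), 2):
    pairs.setdefault(a + b, []).append((i, j))

# pairs[7] == [(0, 1), (2, 3)]
```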
戏剧牡丹亭 2024-11-12 14:56:43


The above comment from SimonStelling is correct; generating all possible pairs is just fundamentally slow, and there's nothing you can do about that aside from altering your algorithm. The correct function to use from itertools is product; you can get some minor improvements from not creating extra variables or doing unnecessary list indexing, but under the hood these are all still O(n^2). Here's how I would do it:

from itertools import product
l = [4, 3, 6, 1]
pairs = {}
for (m, n) in product(l, repeat=2):
    pairs.setdefault(m + n, []).append((m, n))
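Whichever variant fills pairs, only the sums produced by more than one pair matter for the analogy search; a final filter (my addition, not part of this answer) extracts those "pair clusters":

```python
l = [4, 3, 6, 1]
pairs = {}
for i in range(len(l)):
    for j in range(i + 1, len(l)):
        pairs.setdefault(l[i] + l[j], []).append((i, j))

# keep only "pair clusters": sums reached by more than one pair
clusters = {s: p for s, p in pairs.items() if len(p) > 1}
# clusters == {7: [(0, 1), (2, 3)]}
```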
饮湿 2024-11-12 14:56:43


Finally, I have come up with my own solution, which takes only half the calculation time on average.

The basic idea: instead of reading from and writing to the growing dictionary n^2 times, I first collect all the sums in a list. Then I sort the list and look for identical neighbouring items in the sorted list.

This is the code:

from operator import itemgetter

def getPairClusters( l ):

    # first, we just store all possible pairs sequentially
    # clustering will happen later
    pairs = []

    for i in xrange(len(l)):
        for j in xrange(i + 1, len(l)):
            pairs.append((l[i] + l[j], i, j))
    pairs.sort(key=itemgetter(0))

    # pairs = [(4, 1, 3), (5, 0, 3), (7, 0, 1), (7, 2, 3), (9, 1, 2), (10, 0, 2)]
    # each item of pairs is now a tuple (like (4, 1, 3)) with
    # * the sum of two l items: 4
    # * the indexes of the two l items: 1, 3

    # now clustering starts
    # we want to find neighbouring items such as
    # (7, 0, 1), (7, 2, 3)
    # (since 7 == 7)

    pairClusters = []

    # flag: are we within a cluster while iterating over the pairs list?
    withinCluster = False

    # iterate over pair list
    for i in xrange(len(pairs) - 1):
        if not withinCluster:
            if pairs[i][0] == pairs[i + 1][0]:
                # not within a cluster, and found 2 neighbouring equal sums:
                # init a new cluster
                pairCluster = [(pairs[i][1], pairs[i][2])]
                withinCluster = True
        else:
            # still within a cluster: this item belongs to it
            pairCluster.append((pairs[i][1], pairs[i][2]))
            if pairs[i][0] != pairs[i + 1][0]:
                # cluster has ended
                # (next neighbouring item has a different sum)
                pairClusters.append(pairCluster)
                withinCluster = False

    # close a cluster that reaches the end of the list
    # (otherwise its last member and the cluster itself would be dropped)
    if withinCluster:
        pairCluster.append((pairs[-1][1], pairs[-1][2]))
        pairClusters.append(pairCluster)

    return pairClusters

l = [4, 3, 6, 1]

print getPairClusters(l)
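For comparison (not part of the original answer): the same sort-then-scan idea can be written with itertools.groupby, which naturally handles a cluster that runs to the end of the list. This sketch keeps only sums shared by more than one pair:

```python
from itertools import groupby
from operator import itemgetter

def get_pair_clusters(l):
    # collect (sum, i, j) triples, then sort so equal sums are adjacent
    sums = sorted((l[i] + l[j], i, j)
                  for i in range(len(l))
                  for j in range(i + 1, len(l)))
    clusters = []
    for _, group in groupby(sums, key=itemgetter(0)):
        members = [(i, j) for _, i, j in group]
        if len(members) > 1:  # a real cluster needs at least two pairs
            clusters.append(members)
    return clusters

# get_pair_clusters([4, 3, 6, 1]) -> [[(0, 1), (2, 3)]]
```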