如何在 Python 中优化大型（75,000 个项目）布尔值集的操作？

发布于 2024-09-28 17:11:55 字数 622 浏览 5 评论 0原文

有一个名为 svnmerge.py 我正在尝试进行一些调整和优化。不过我对 Python 完全陌生，所以这并不容易。

当前的问题似乎与脚本中名为 RevisionSet 的类有关。本质上，它的作用是创建一个包含整数键布尔值的大型哈希表（？）。在最坏的情况下 - 我们的 SVN 存储库中的每个修订版都有一个，目前已接近 75,000 个。

之后，它对如此巨大的数组执行集合运算 - 加法、减法、交集等等。该实现是最简单的 O(n) 实现，在如此大的集合上自然会变得相当慢。由于连续值的跨度很长，因此可以优化整个数据结构。例如，从 1 到 74,000 的所有键都可能包含 true。此外，该脚本是为 Python 2.2 编写的，这是一个相当旧的版本，无论如何我们都在使用 2.6，因此也可以从中获得一些东西。

我可以尝试自己将其拼凑在一起，但这会很困难并且需要花费很多时间 - 更不用说它可能已经在某个地方实现了。虽然我想要学习经历，但现在结果更重要。你建议我做什么？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

小兔几 2024-10-05 17:11:55

您可以尝试使用 numpy 而不是普通的 python 来完成此操作。我发现对于此类操作来说它非常快。

例如：

# Create 1000000 numbers between 0 and 1000, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)

# Get all items that are larger than 500, takes 2.58ms
y = x > 500

# Add 10 to those items, takes 26.1ms
x[y] += 10

由于行数更多，我认为 75000 也不应该成为问题:)

You could try doing it with numpy instead of plain python. I found it to be very fast for operations like these.

For example:

# Create 1000000 numbers between 0 and 1000, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)

# Get all items that are larger than 500, takes 2.58ms
y = x > 500

# Add 10 to those items, takes 26.1ms
x[y] += 10

Since that's with a lot more rows, I think that 75000 should not be a problem either :)

回复收藏 0 原文

陈年往事 2024-10-05 17:11:55

这是 RevisionSet 的快速替换，可使其成为一个集合。它应该快得多。我没有完全测试它，但它适用于我所做的所有测试。毫无疑问还有其他方法可以加快速度，但我认为这确实有帮助，因为它实际上利用了集合的快速实现，而不是在 Python 中执行循环（原始代码在 __sub__ 等函数中执行循环）和__and__。唯一的问题是迭代器没有排序。您可能需要更改一些代码才能解决此问题。我确信还有其他方法可以改进这一点，但希望它会给您一个良好的开始。

class RevisionSet(set):
    """
    A set of revisions, held in dictionary form for easy manipulation. If we
    were to rewrite this script for Python 2.3+, we would subclass this from
    set (or UserSet).  As this class does not include branch
    information, it's assumed that one instance will be used per
    branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a dictionary whose keys are the revisions. Raises ValueError if the
        input string is invalid."""


        revision_range_split_re = re.compile('[-:]')

        if isinstance(parm, set):
            print "1"
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                         int(rev_or_revs[1])+1)))
                    else:
                        raise ValueError, 'Ill formatted revision range: ' + R

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start,end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e+1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s,e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)

添加：
顺便说一句，我比较了原始 RevisionSet 和上面我的 RevisionSet 的并集、交集和减法，当对两个包含 75000 个元素的 RevisionSet 进行操作时，上述代码的速度提高了 3 倍到 7 倍。我知道其他人说 numpy 是正确的选择，但如果您对 Python 不太有经验，正如您的评论所示，那么您可能不想走这条路，因为它将涉及更多的更改。我建议尝试我的代码，看看它是否有效，如果有效，然后看看它对您来说是否足够快。如果不是，那么我会尝试进行分析以查看需要改进的地方。只有那时我才会考虑使用 numpy （这是我经常使用的一个很棒的包）。

Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.

class RevisionSet(set):
    """
    A set of revisions, held in dictionary form for easy manipulation. If we
    were to rewrite this script for Python 2.3+, we would subclass this from
    set (or UserSet).  As this class does not include branch
    information, it's assumed that one instance will be used per
    branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a dictionary whose keys are the revisions. Raises ValueError if the
        input string is invalid."""


        revision_range_split_re = re.compile('[-:]')

        if isinstance(parm, set):
            print "1"
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                         int(rev_or_revs[1])+1)))
                    else:
                        raise ValueError, 'Ill formatted revision range: ' + R

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start,end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e+1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s,e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)

Addition:
By the way, I compared doing unions, intersections and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works and if it does, then see if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).

回复收藏 0 原文