How do I optimize operations on a large (75,000-item) set of booleans in Python?
There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python, though, so it's not easy.

The current problem seems to be related to a class called RevisionSet in the script. In essence, what it does is create a large hashtable(?) of integer-keyed boolean values. In the worst case there is one for each revision in our SVN repository, which is near 75,000 now.

After that it performs set operations on such huge arrays: addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) implementation, which, naturally, gets pretty slow on such large sets. The whole data structure could be optimized, because there are long spans of continuous values. For example, all keys from 1 to 74,000 might contain true. Also, the script was written for Python 2.2, which is a pretty old version; we're using 2.6 anyway, so there could be something to gain there too.
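For context, here is a minimal sketch of the kind of structure being described; this is a simplified illustration, not svnmerge.py's actual code:

```python
# Hypothetical simplification of the structure described above,
# not svnmerge.py's actual code: one True entry per revision,
# with set operations done as O(n) loops in pure Python.
class NaiveRevisionSet:
    def __init__(self, revs):
        self.revs = dict((r, True) for r in revs)

    def __sub__(self, other):
        # The slow part: a Python-level pass over every key.
        result = NaiveRevisionSet([])
        for r in self.revs:
            if r not in other.revs:
                result.revs[r] = True
        return result
```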
I could try to cobble this together myself, but it would be difficult and take a lot of time, not to mention that it might already be implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?
5 Answers
You could try doing it with numpy instead of plain Python. I found it to be very fast for operations like these.

For example:
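A minimal sketch of that approach, assuming each revision number is used as an index into a boolean array (the sizes and spans below are made up for illustration):

```python
import numpy as np

# One boolean slot per revision; the index is the revision number.
# 75,000 entries is tiny by numpy standards.
a = np.zeros(75001, dtype=bool)
b = np.zeros(75001, dtype=bool)
a[1:74001] = True       # a long continuous span of revisions
b[50000:] = True

union = a | b           # set addition
intersection = a & b    # set intersection
difference = a & ~b     # set subtraction

print(intersection.sum())   # how many revisions are in both sets
```

These element-wise operations are implemented in C, so each one is a single pass over a contiguous block of memory rather than a Python-level loop.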
Since numpy stays fast with far more rows than that, I think that 75,000 should not be a problem either :)
Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python, which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.
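A minimal sketch of the idea, assuming revisions are plain integers; this illustrates the approach rather than reproducing the exact replacement class:

```python
# Back RevisionSet with Python's built-in set, whose union,
# intersection and difference are implemented in C, so no
# Python-level loops are needed for the set algebra.
class RevisionSet(object):
    def __init__(self, revs=()):
        self._revs = set(revs)

    def __or__(self, other):       # addition / union
        return RevisionSet(self._revs | other._revs)

    def __and__(self, other):      # intersection
        return RevisionSet(self._revs & other._revs)

    def __sub__(self, other):      # subtraction / difference
        return RevisionSet(self._revs - other._revs)

    def __contains__(self, rev):
        return rev in self._revs

    def __len__(self):
        return len(self._revs)

    def __iter__(self):
        # Caveat from above: iteration order is arbitrary, not
        # sorted; use sorted(self._revs) where order matters.
        return iter(self._revs)

# Two overlapping sets of about 75,000 revisions each:
a = RevisionSet(range(1, 75001))
b = RevisionSet(range(25000, 100000))
print(len(a & b), len(a | b), len(a - b))
```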
Addition: By the way, I compared doing unions, intersections, and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75,000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route, because it will involve a lot more changes. I'd recommend trying my code, seeing if it works, and if it does, then seeing if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).
Why not work on a subset? Just 74001 to the end.
Pruning 74/75th of your data is far easier than trying to write an algorithm more clever than O(n).
You should rewrite RevisionSet to have a set of revisions. I think the internal representation for a revision should be an integer, and revision ranges should be created as needed (see the sketch below).

There is no compelling reason to use code that supports Python 2.3 and earlier.
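One way "revision ranges created as needed" could look, using itertools.groupby to collapse a sorted set of integer revisions into contiguous ranges; the helper below is a hypothetical illustration, not code from svnmerge.py:

```python
from itertools import groupby

def to_ranges(revs):
    """Collapse a set of integer revisions into (start, end) ranges.

    Hypothetical helper: the canonical representation stays a plain
    set of ints, and ranges are derived only when needed for output.
    """
    ranges = []
    # Within a run of consecutive integers, rev - index is constant,
    # so grouping by that difference splits the sorted revisions
    # into contiguous runs.
    for _, run in groupby(enumerate(sorted(revs)), lambda p: p[1] - p[0]):
        run = [rev for _, rev in run]
        ranges.append((run[0], run[-1]))
    return ranges

print(to_ranges({1, 2, 3, 7, 8, 10}))   # [(1, 3), (7, 8), (10, 10)]
```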
Just a thought. I used to do this kind of thing using run-coding in binary image manipulation. That is, store each set as a series of numbers: number of bits off, number of bits on, number of bits off, etc.
Then you can do all sorts of boolean operations on them as decorations on a simple merge algorithm.
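A sketch of that run-coding idea (an illustration of the technique, not the commenter's code): each set is stored as a list of run lengths, alternating off/on and starting with an off-run, and a boolean operation is a single merge pass over the two run lists. Intersection, for example:

```python
def runs_and(a, b):
    """Intersect two run-coded bit sequences of equal total length.

    Each argument is a list of run lengths, alternating off, on,
    off, on, ... and starting with an off-run (which may be 0).
    For 75,000 mostly-contiguous revisions this representation
    collapses to a handful of numbers, and the merge costs
    O(number of runs) rather than O(number of revisions).
    """
    out = [0]               # result starts with an off-run of length 0
    i = j = 0               # current run index in a and b
    ra, rb = a[0], b[0]     # remaining length of the current runs
    while i < len(a) and j < len(b):
        step = min(ra, rb)
        on = (i % 2 == 1) and (j % 2 == 1)    # both current runs "on"?
        if on == ((len(out) - 1) % 2 == 1):   # same value as result tail
            out[-1] += step                   # extend the current run
        else:
            out.append(step)                  # start a new run
        ra -= step
        rb -= step
        if ra == 0:
            i += 1
            if i < len(a):
                ra = a[i]
        if rb == 0:
            j += 1
            if j < len(b):
                rb = b[j]
    return out

# 000111100 & 011111100 == 000111100
print(runs_and([3, 4, 2], [1, 6, 2]))   # [3, 4, 2]
```

Union and subtraction are the same merge loop with the combining rule changed from and to or, or to and-not.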