Does Python optimize function calls in loops?
Say I have some code that calls a function millions of times from a loop, and I want the code to be fast:
def outer_function(file):
    for line in file:
        inner_function(line)

def inner_function(line):
    # do something
    pass
It's not necessarily file processing; it could be, for example, a point-drawing function called from a line-drawing function. The idea is that logically the two have to be separate, but from a performance point of view they should act together as fast as possible.
Does Python detect and optimize such things automatically? If not, is there a way to give it a hint to do so, perhaps with some additional external optimizer?
Python does not inline function calls, because of its dynamic nature. Theoretically, inner_function can do something that re-binds the name inner_function to something else, and Python has no way of knowing at compile time that this might happen.
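For example, a minimal sketch of the kind of rebinding meant (the answer's original snippet isn't reproduced here, so the names and printed strings are illustrative):

def inner_function(line):
    global inner_function
    print("original:", line)
    # Re-bind the module-level name at runtime; outer_function looks
    # the name up again on every iteration, so it picks this up.
    inner_function = lambda line: print("rebound:", line)

def outer_function(lines):
    for line in lines:
        inner_function(line)

outer_function(["a", "b", "c"])

Prints:

original: a
rebound: b
rebound: c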
You may think this is horrible. But think again: Python's functional and dynamic nature is one of its most appealing features. A lot of what Python allows comes at the cost of performance, and in most cases this is acceptable.
That said, you could probably hack something together using a tool like byteplay or similar: disassemble the inner function into bytecode, splice it into the outer function, and reassemble the result. On second thought, if your code is performance-critical enough to warrant such hacks, just rewrite it in C; Python has great options for FFI.
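To see the per-iteration call machinery this refers to, the standard-library dis module (read-only, unlike byteplay) can show the opcodes; the exact opcode names are version-dependent:

import dis

def inner_function(line):
    pass

def outer_function(file):
    for line in file:
        inner_function(line)

# Every loop iteration performs a fresh name lookup plus a full
# call sequence (e.g. LOAD_GLOBAL and CALL in recent CPython).
dis.dis(outer_function)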
This all applies to the official CPython implementation. A runtime-JITting interpreter (like PyPy, or the sadly defunct Unladen Swallow) can in theory detect the common case and perform inlining. Alas, I'm not familiar enough with PyPy to say whether it does this, but it certainly could.
Which Python? PyPy's JIT compiler will, after a few dozen to a few hundred iterations (depending on how many opcodes each iteration executes), start tracing execution, forget about Python function calls along the way, and compile the gathered information into a piece of optimized machine code that likely carries no remnant of the logic that made the function call happen. Traces are linear; the JIT's backend doesn't even know there was a function call, it just sees the instructions from both functions mixed together in the order they were executed. (This is the perfect case, i.e. when there is no branching in the loop, or when all iterations take the same branch. Some code is unsuited to this kind of JIT compilation and invalidates the traces quickly, before they yield much speedup, though that is rather rare.)
Now, CPython (which is what many people mean when they speak of "Python" or the Python interpreter) isn't that clever. It's a straightforward bytecode VM and will dutifully execute the logic associated with calling a function again and again on each iteration. But then again, why are you using an interpreter at all if performance is that important? If keeping such overhead as low as humanly possible matters that much, consider writing the hot loop in native code, e.g. as a C extension or in Cython.
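As one way to picture "writing the hot loop in native code", here is a hedged sketch using the stdlib ctypes FFI; hotloop.so and its process_lines function are hypothetical, something you would write and compile yourself:

import ctypes

# Hypothetical shared library, built e.g. with:
#   cc -O2 -shared -fPIC hotloop.c -o hotloop.so
lib = ctypes.CDLL("./hotloop.so")
lib.process_lines.argtypes = [ctypes.POINTER(ctypes.c_char_p), ctypes.c_size_t]

def outer_function(lines):
    # Marshal once, then run the million-iteration loop entirely in C.
    arr = (ctypes.c_char_p * len(lines))(*[s.encode() for s in lines])
    lib.process_lines(arr, len(lines))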
Unless you're doing only a tiny bit of number crunching per iteration, though, you won't see large improvements either way.
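To gauge whether the call overhead even matters for your workload, a quick micro-benchmark along these lines (the bare addition here is an assumed stand-in for your inner function) compares a called version against a manually inlined one:

import timeit

def inner(x):
    return x + 1

def with_call(n):
    total = 0
    for _ in range(n):
        total = inner(total)   # function call on every iteration
    return total

def manually_inlined(n):
    total = 0
    for _ in range(n):
        total = total + 1      # same work, no call overhead
    return total

n = 1_000_000
print("with call:       ", timeit.timeit(lambda: with_call(n), number=10))
print("manually inlined:", timeit.timeit(lambda: manually_inlined(n), number=10))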
If by "Python" you mean CPython, the generally used implementation, no.
If by "Python" you happened to mean any implementation of the Python language, yes. PyPy can optimise a lot and I believe its method JIT should take care of cases like this.
CPython (the "standard" Python implementation) doesn't do this kind of optimization.
Note, however, that if you are counting the CPU cycles of function calls, then CPython is probably not the right tool for your problem. If you are 100% sure that the algorithm you are going to use is already the best one (this is the most important thing), and that your computation really is CPU-bound, then your options are, for example, PyPy, Cython, or a C extension, as the other answers describe.
Calling a function just to execute a pass statement obviously carries a fairly high (∞) relative overhead. Whether your real program suffers undue overhead depends on the size of the inner function. If it really is just setting a pixel, then I'd suggest a different approach: use drawing primitives coded in a native language like C or C++.

There are (somewhat experimental) JIT compilers for Python that will optimise function calls, but mainstream Python won't do this.