长时间运行的数据处理Python脚本中的程序结构
对于我当前的工作,我正在编写一些长时间运行(考虑几个小时到几天)的脚本来执行 CPU 密集型数据处理。程序流程非常简单 - 它进入主循环,完成主循环,保存输出并终止: 我的程序的基本结构往往是这样的:
<import statements>
<constant declarations>
<misc function declarations>
def main():
for blah in blahs():
<lots of local variables>
<lots of tightly coupled computation>
for something in somethings():
<lots more local variables>
<lots more computation>
<etc., etc.>
<save results>
if __name__ == "__main__":
main()
这很快就会变得难以管理,所以我想将其重构为更多内容易于管理。我想让它更易于维护,而不牺牲执行速度。
然而,每个代码块都依赖于大量变量,因此将部分计算重构为函数将使参数列表很快失控。我应该将这种代码放入Python类中,并将局部变量更改为类变量吗?从概念上讲,将程序转换为类并没有多大意义,因为该类永远不会被重用,并且每个实例只会创建一个实例。
此类计划的最佳实践结构是什么?我正在使用 python,但问题相对与语言无关,假设现代面向对象的语言功能。
For my current job I am writing some long-running (think hours to days) scripts that do CPU intensive data-processing. The program flow is very simple - it proceeds into the main loop, completes the main loop, saves output and terminates: The basic structure of my programs tends to be like so:
<import statements>
<constant declarations>
<misc function declarations>
def main():
for blah in blahs():
<lots of local variables>
<lots of tightly coupled computation>
for something in somethings():
<lots more local variables>
<lots more computation>
<etc., etc.>
<save results>
if __name__ == "__main__":
main()
This gets unmanageable quickly, so I want to refactor it into something more manageable. I want to make this more maintainable, without sacrificing execution speed.
Each chuck of code relies on a large number of variables however, so refactoring parts of the computation out to functions would make parameters list grow out of hand very quickly. Should I put this sort of code into a python class, and change the local variables into class variables? It doesn't make a great deal of sense tp me conceptually to turn the program into a class, as the class would never be reused, and only one instance would ever be created per instance.
What is the best practice structure for this kind of program? I am using python but the question is relatively language-agnostic, assuming a modern object-oriented language features.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
首先,如果您的程序要运行几个小时/几天,那么切换到使用类/方法而不是将所有内容都放在一个巨大的主程序中的开销几乎不存在。
此外,从长远来看,重构(即使它确实涉及传递大量变量)应该可以帮助您提高速度。对设计良好的应用程序进行分析要容易得多,因为您可以查明缓慢的部分并在那里进行优化。也许会出现一个针对您的计算进行了高度优化的新库...设计良好的程序将让您立即将其插入并进行测试。或者您可能决定编写 C 模块扩展来提高计算子集的速度,设计良好的应用程序也将使这变得容易。
如果没有看到
<大量紧密耦合的计算>
和<大量更多的计算>
,很难给出具体的建议。但是,我将从让每个for
块成为它自己的方法开始,然后从那里开始。First off, if your program is going to be running for hours/days then the overhead of switching to using classes/methods instead of putting everything in a giant main is pretty much non-existent.
Additionally, refactoring (even if it does involve passing a lot of variables) should help you improve speed in the long run. Profiling an application which is designed well is much easier because you can pin-point the slow parts and optimize there. Maybe a new library comes along that's highly optimized for your calculations... a well designed program will let you plug it in and test right away. Or perhaps you decide to write a C Module extension to improve the speed of a subset of your calculations, a well designed application will make that easy too.
It's hard to give concrete advice without seeing
<lots of tightly coupled computation>
and<lots more computation>
. But, I would start with making everyfor
block it's own method and go from there.不太干净,但在小项目中效果很好...
您可以开始使用模块,就好像它们是单例实例一样,并且仅当您觉得模块的复杂性或计算证明它们合理时才创建真正的类。
如果你这样做,你会想要使用“导入模块”而不是“从模块导入东西”——它更干净,如果可以重新分配“东西”,效果会更好。此外,Google 指南中也推荐这样做。
Not too clean, but works well in little projects...
You can start using modules as if they were singleton instances, and create real classes only when you feel the complexity of the module or the computation justifies them.
If you do that, you would want to use "import module" and not "from module import stuff" -- it's cleaner and will work better if "stuff" can be reassigned. Besides, it's recommended in the Google guidelines.
使用一个或多个类可以帮助您组织代码。
形式的简单性(例如通过使用类属性和方法)很重要,因为它可以帮助您了解算法,并可以帮助您更轻松地对各个部分进行单元测试。
在我看来,这些好处远远超过了使用 OOP 可能带来的轻微速度损失。
Using a class (or classes) can help you organize your code.
Simplicity of form (such as through use of class attributes and methods) is important because it helps you see your algorithm, and can help you more easily unit test the parts.
IMO, these benefits far outweigh the slight loss of speed that may come with using OOP.