Finding the CPU-hogging plugin in multithreaded Python
I have a system written in Python that processes large amounts of data using plugins written by several developers with varying levels of experience.
Basically, the application starts several worker threads, then feeds them data. Each thread determines the plugin to use for an item and asks it to process the item. A plugin is just a Python module with a specific function defined. The processing usually involves regular expressions, and should not take more than a second or so.
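For illustration, a stripped-down plugin might look something like this (names simplified; the real modules just define a similar entry point):

```python
# plugins/example.py - a hypothetical, stripped-down plugin
import re

# Plugins are usually regex-heavy; a careless pattern like this one
# can backtrack catastrophically on the wrong input.
PATTERN = re.compile(r"(a+)+b")

def process_item(item):
    """The function the framework calls for each data item."""
    match = PATTERN.search(item)
    return match.group(0) if match else None
```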
Occasionally, one of the plugins will take minutes to complete, pegging the CPU at 100% the whole time. This is usually caused by a sub-optimal regular expression paired with a data item that exposes that inefficiency.
This is where things get tricky. If I have a suspicion of who the culprit is, I can examine its code and find the problem. However, sometimes I'm not so lucky.
- I can't go single threaded. It would probably take weeks to reproduce the problem if I do.
- Putting a timer on the plugin doesn't help, because when it freezes it takes the GIL with it, and all the other plugins also take minutes to complete.
- (In case you were wondering, the SRE engine doesn't release the GIL).
- As far as I can tell, profiling is pretty useless when multithreading.
Short of rewriting the whole architecture into multiprocessing, any way I can find out who is eating all my CPU?
ADDED: In answer to some of the comments:
Profiling multithreaded code in Python is not useful because the profiler measures total function time rather than active CPU time. Try cProfile.run('time.sleep(3)') to see what I mean. (credit to rog [last comment])
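To see the effect, profile a call that merely sleeps:

```python
import cProfile
import time

# The profiler attributes roughly 3 seconds to time.sleep even though
# the CPU is idle the whole time: it measures elapsed wall time per
# call, not active CPU time, so a GIL-starved thread looks just as
# "slow" as the thread actually burning the CPU.
cProfile.run('time.sleep(3)')
```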
The reason that going single threaded is tricky is because only 1 item in 20,000 is causing the problem, and I don't know which one it is. Running multithreaded allows me to go through 20,000 items in about an hour, while single threaded can take much longer (there's a lot of network latency involved). There are some more complications that I'd rather not get into right now.
That said, it's not a bad idea to try to serialize the specific code that calls the plugins, so that timing of one will not affect the timing of the others. I'll try that and report back.
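Something like this is what I have in mind (a sketch only; process_item stands in for whatever entry point the real plugins define):

```python
import threading
import time

_plugin_lock = threading.Lock()  # only one plugin runs at a time

def run_plugin(plugin, item):
    # While a plugin holds the lock, no other plugin can run, so the
    # elapsed time below reflects that plugin alone rather than being
    # inflated by whichever other thread is hogging the GIL.
    with _plugin_lock:
        start = time.monotonic()
        result = plugin.process_item(item)  # hypothetical entry point
        elapsed = time.monotonic() - start
    if elapsed > 1.0:  # normal items finish in about a second
        print(f"SLOW: {plugin.__name__} took {elapsed:.1f}s on {item!r}")
    return result
```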
Comments (4)
You apparently don't need multithreading, only concurrency, because your threads don't share any state:
Try multiprocessing instead of multithreading
Single thread / N subprocesses.
There you can time each request, since the GIL is not held (see the sketch below).
Another possibility is to get rid of multiple execution threads and use event-based network programming (i.e. use Twisted).
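A minimal sketch of the subprocess approach; the demo_plugin class and the item list are stand-ins for your real plugin lookup and data feed:

```python
import multiprocessing
import re
import time

# Demo stand-ins for the real framework: the actual app would pick a
# plugin module per item and feed items from the network.
class demo_plugin:
    PATTERN = re.compile(r"\d+")
    @staticmethod
    def process_item(item):
        return demo_plugin.PATTERN.search(item)

def find_plugin(item):
    return demo_plugin

def dispatch(item):
    plugin = find_plugin(item)
    start = time.monotonic()
    plugin.process_item(item)
    # Each worker process has its own GIL, so this elapsed time
    # belongs to this plugin call alone.
    return plugin.__name__, item, time.monotonic() - start

if __name__ == "__main__":
    items = [f"item-{i}" for i in range(100)]  # stand-in data feed
    with multiprocessing.Pool() as pool:
        for name, item, elapsed in pool.imap_unordered(dispatch, items):
            if elapsed > 1.0:
                print(f"{name} took {elapsed:.1f}s on {item!r}")
```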
As you said, because of the GIL it is impossible within the same process.
I recommend starting a second monitor process that listens for heartbeats from a thread in your original app. Once the heartbeat has been missing for a specified amount of time, the monitor can kill your app and restart it.
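A rough sketch of that idea; the file path, the timings, and the app.py entry point are all assumptions:

```python
# watchdog.py - restart the app when its heartbeat goes stale.
import os
import subprocess
import time

HEARTBEAT = "/tmp/app.heartbeat"  # the app's heartbeat thread touches this
STALL_LIMIT = 30                  # seconds of silence before restarting

def start_app():
    open(HEARTBEAT, "w").close()                   # reset the beat
    return subprocess.Popen(["python", "app.py"])  # hypothetical entry point

if __name__ == "__main__":
    proc = start_app()
    while True:
        time.sleep(5)
        exited = proc.poll() is not None
        stalled = time.time() - os.path.getmtime(HEARTBEAT) > STALL_LIMIT
        if exited or stalled:
            if not exited:      # a GIL-hogging plugin starves the heartbeat
                proc.kill()     # thread too, so a stale file means "stuck"
                proc.wait()
            proc = start_app()

# Inside app.py, a daemon thread keeps the beat, e.g.:
#     def beat():
#         while True:
#             open(HEARTBEAT, "w").close()
#             time.sleep(5)
#     threading.Thread(target=beat, daemon=True).start()
```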
Since you have control over the framework, I would suggest disabling plugins selectively and seeing what happens.
Basically, if you have plugins P1, P2, ..., Pn,
run N processes and disable P1 in the first, P2 in the second, and so on.
It would be much faster than your multithreaded run, since there is no GIL blocking, and you will find out sooner which plugin is the culprit.
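Scripted, that could look like the sketch below; the DISABLED_PLUGIN environment variable is a hypothetical switch your framework would need to honor:

```python
import os
import subprocess

PLUGINS = ["P1", "P2", "P3"]  # your plugin module names

procs = []
for name in PLUGINS:
    env = dict(os.environ, DISABLED_PLUGIN=name)  # hypothetical switch
    procs.append((name, subprocess.Popen(["python", "app.py"], env=env)))

# The run that finishes without stalling is the one missing the culprit.
for name, proc in procs:
    proc.wait()
    print(f"run with {name} disabled has finished")
```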
I'd still look at nosklo's suggestion. You could profile on a single thread to find the item, get the dump from that very long run, and possibly see the culprit. Yeah, I know it's 20,000 items and will take a long time, but sometimes you just have to suck it up and find the darn thing to convince yourself the problem is caught and taken care of. Run the script, go work on something else constructive, then come back and analyze the results. That's what separates the men from the boys sometimes ;-)
Or/and, add logging that tracks how long each plugin takes to process each item. Look at the log data at the end of the run and see which one took an awfully long time compared to the others.
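A small sketch of the logging idea, with process_item again standing in for the real plugin entry point:

```python
import logging
import time

logging.basicConfig(filename="plugin_times.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def timed_call(plugin, item_id, item):
    start = time.monotonic()
    result = plugin.process_item(item)   # stand-in entry point
    logging.info("plugin=%s item=%s elapsed=%.3f",
                 plugin.__name__, item_id, time.monotonic() - start)
    return result
```

The caveat from the question still applies: while one plugin hogs the GIL, the elapsed times of every other in-flight item are inflated too, so look for the first entry that blows up rather than trusting the absolute numbers, or combine this with the serialization idea from the question.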