Debugging livelock in Django/PostgreSQL

Posted 2024-08-17 04:00:56

I run a moderately popular web app on Django with Apache2, mod_python, and PostgreSQL 8.3 with the postgresql_psycopg2 database backend. I'm experiencing occasional livelock, identifiable when an apache2 process continually consumes 99% of CPU for several minutes or more.

I ran strace -p <pid> on the apache2 process, and found that it was continually repeating these system calls:

sendto(25, "Q\0\0\0SSELECT (1) AS \"a\" FROM \"account_profile\" WHERE \"account_profile\".\"id\" = 66201 \0", 84, 0, NULL, 0) = 84
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=25, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(25, "E\0\0\0\210SERROR\0C25P02\0Mcurrent transaction is aborted, commands ignored until end of transaction block\0Fpostgres.c\0L906\0Rexec_simple_query\0\0Z\0\0\0\5E", 16384, 0, NULL, NULL) = 143

This exact fragment repeats continually in the trace, and had been running for over 10 minutes before I finally killed the apache2 process. (Note: I edited this to replace my previous strace fragment with a new one that shows the full string contents rather than truncated ones.)

My interpretation of the above is that Django is attempting an existence check on my account_profile table, but at some earlier point (before I started the trace) something went wrong (SQL parse error? referential integrity or uniqueness constraint violation? who knows?), and now PostgreSQL answers every query with the error "current transaction is aborted". For some reason, instead of raising an exception and giving up, the process just keeps retrying.

One possibility is that this is being triggered in a call to Profile.objects.get_or_create. This is the model class that maps to the account_profile table. Perhaps there is something in get_or_create that is designed to catch too broad a set of exceptions and retry? From the web server logs, it appears that this livelock might have occurred as a result of a double-click on the POST button in my site's registration form.

This condition has occurred a couple of times over the past few days on the live site, and results in a significant slowdown until I intervene, so pretty much anything other than an infinite livelock would be an improvement! :)

Comments (2)

§对你不离不弃 2024-08-24 04:00:56

This turned out to be entirely my fault. I found the spot where the select (1) as 'a' statement seemed to originate (in django/models/base.py) and hacked it to log a traceback, which pointed clearly at my code.

I had some code that makes up a unique email "key" for each Profile. These keys are randomly generated, so because there is some possibility of overlap, I run it in a try/except within a while loop. My assumption was that the database's unique constraint would cause the save to fail if the key was not unique, and I'd be able to try again.

Unfortunately, in PostgreSQL you cannot simply try again after an integrity error. The failed statement leaves the current transaction in an aborted state, and every later command is rejected with "current transaction is aborted" until you issue a COMMIT or ROLLBACK (even if you're in autocommit mode, apparently). So I had an infinite loop of failing save attempts where I was ignoring the error message.
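To make the failure mode concrete, here is a toy model of it in plain Python. It is only a sketch: ToyConnection, UniqueViolation, TransactionAborted and naive_retry are hypothetical names invented for illustration, not psycopg2 or Django APIs, and the model ignores the fact that a real rollback also discards earlier work in the transaction.

```python
class TransactionAborted(Exception):
    """Stand-in for PostgreSQL error 25P02 seen in the strace output."""


class UniqueViolation(Exception):
    """Stand-in for a unique-constraint failure."""


class ToyConnection:
    """Toy model of one database session; not a real driver."""

    def __init__(self):
        self.keys = set()     # values already stored in the unique column
        self.aborted = False  # becomes True once any statement fails

    def insert(self, key):
        if self.aborted:
            # Mirrors "current transaction is aborted, commands ignored ..."
            raise TransactionAborted("commands ignored until end of transaction block")
        if key in self.keys:
            self.aborted = True
            raise UniqueViolation(key)
        self.keys.add(key)

    def rollback(self):
        # Ends the failed transaction so new statements are accepted again.
        self.aborted = False


def naive_retry(conn, candidates):
    """Retry with a broad except and no rollback; return the number of failures."""
    failures = 0
    for key in candidates:  # bounded here; the loop described above was a `while`
        try:
            conn.insert(key)
            return failures
        except Exception:   # too broad: hides that the error has changed
            failures += 1
    return failures
```

In this model, once the first insert collides, every later candidate fails as well, even a perfectly unique one, because the session stays aborted until rollback() is called. With an unbounded loop that is exactly a livelock.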

Now I look for a more specific exception (django.db.IntegrityError) and run a limited number of attempts so that the loop is not infinite.
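A minimal sketch of that fix, under stated assumptions: the IntegrityError class below is a local stand-in for django.db.IntegrityError so the snippet runs without Django, and save_with_unique_key, save, rollback and make_key are hypothetical names. In a real project the rollback callable would wrap whatever transaction handling your Django version provides.

```python
import secrets


class IntegrityError(Exception):
    """Stand-in for django.db.IntegrityError (assumption, so this runs standalone)."""


def save_with_unique_key(save, rollback, make_key=None, max_attempts=5):
    """Try up to max_attempts random keys, rolling back after each collision."""
    if make_key is None:
        make_key = lambda: secrets.token_hex(8)  # random 16-character hex key
    for _ in range(max_attempts):
        key = make_key()
        try:
            save(key)      # expected to raise IntegrityError on a duplicate key
            return key
        except IntegrityError:
            rollback()     # clear the aborted transaction before retrying
    raise RuntimeError("no unique key found after %d attempts" % max_attempts)
```

Catching only the integrity error lets any other database failure propagate instead of being retried, and the attempt limit turns a possible livelock into a visible exception.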

Thanks to everyone for viewing/answering.

生生漫 2024-08-24 04:00:56

Your analysis sounds pretty good. Clearly it's not picking up the fact that the transaction is aborted. I suggest you report this as a bug to the django project...
