Debugging livelock in Django/PostgreSQL

Posted 2024-08-17 04:00:56

I run a moderately popular web app on Django with Apache2, mod_python, and PostgreSQL 8.3 with the postgresql_psycopg2 database backend. I'm experiencing occasional livelock, identifiable when an apache2 process continually consumes 99% of CPU for several minutes or more.

I ran strace -p <pid> on the apache2 process and found that it was continually repeating these system calls:

sendto(25, "Q\0\0\0SSELECT (1) AS \"a\" FROM \"account_profile\" WHERE \"account_profile\".\"id\" = 66201 \0", 84, 0, NULL, 0) = 84
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=25, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(25, "E\0\0\0\210SERROR\0C25P02\0Mcurrent transaction is aborted, commands ignored until end of transaction block\0Fpostgres.c\0L906\0Rexec_simple_query\0\0Z\0\0\0\5E", 16384, 0, NULL, NULL) = 143

This exact fragment repeats continually in the trace, and it had been running for over 10 minutes before I finally killed the apache2 process. (Note: I edited this to replace my previous strace fragment with a new one that shows the full string contents rather than truncated ones.)
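
A minimal sketch of decoding that reply by hand, assuming the PostgreSQL v3 frontend/backend message framing: one type byte, a 4-byte big-endian length that counts itself but not the type byte, then the payload. The helper names split_backend_messages and parse_error_fields are made up for illustration and are not part of psycopg2 or Django.

```python
import struct

# Reply bytes copied from the recvfrom() call in the strace output above.
REPLY = (
    b"E\x00\x00\x00\x88SERROR\x00C25P02\x00"
    b"Mcurrent transaction is aborted, commands ignored "
    b"until end of transaction block\x00"
    b"Fpostgres.c\x00L906\x00Rexec_simple_query\x00\x00"
    b"Z\x00\x00\x00\x05E"
)


def split_backend_messages(data):
    """Split a buffer into (type_byte, payload) pairs."""
    messages = []
    offset = 0
    while offset < len(data):
        msg_type = data[offset:offset + 1]
        # The length field covers itself (4 bytes) plus the payload.
        (length,) = struct.unpack("!I", data[offset + 1:offset + 5])
        payload = data[offset + 5:offset + 1 + length]
        messages.append((msg_type, payload))
        offset += 1 + length
    return messages


def parse_error_fields(payload):
    """Map each one-letter field code of an ErrorResponse to its text."""
    fields = {}
    for part in payload.split(b"\x00"):
        if part:  # skip the empty pieces left by the terminators
            fields[chr(part[0])] = part[1:].decode("utf-8", "replace")
    return fields


messages = split_backend_messages(REPLY)
error_fields = parse_error_fields(messages[0][1])
print(error_fields["C"], error_fields["M"])
print("transaction status:", messages[1][1])
```

The buffer holds two messages: an ErrorResponse (E) with SQLSTATE 25P02, followed by ReadyForQuery (Z) whose status byte is E, which the protocol defines as "in a failed transaction block". In other words, the server is idle and waiting for the client to end the transaction; the loop is on the client side.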

My interpretation of the above is that Django is attempting an existence check on my table account_profile, but at some earlier point (before I started the trace) a statement in the same transaction failed (SQL parse error? referential integrity or uniqueness constraint violation? who knows?), and now PostgreSQL answers every further command with the error "current transaction is aborted". For some reason, instead of raising an exception and giving up, the process just keeps resending the same query.
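
Once a transaction is in this state, resending the query cannot succeed; the client has to roll back first. A minimal sketch of that rule, assuming only a DB-API 2.0 (PEP 249) connection such as the one psycopg2 provides. The helper name execute_or_rollback is hypothetical, and this is not how Django itself manages transactions.

```python
def execute_or_rollback(conn, sql, params=None):
    """Run one statement; if it fails, roll back before re-raising.

    Rolling back ends the aborted transaction, so later statements on
    the same connection are accepted again instead of being rejected.
    """
    cursor = conn.cursor()
    try:
        if params is None:
            cursor.execute(sql)
        else:
            cursor.execute(sql, params)
        # description is None for statements that return no rows.
        if cursor.description is None:
            return None
        return cursor.fetchall()
    except Exception:
        conn.rollback()
        raise  # surface the error once; never retry blindly
    finally:
        cursor.close()
```

The point of the design is that the failure is reported exactly once to the caller, and the connection is usable afterwards. Any retry policy belongs above this helper and should be bounded.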

One possibility is that this is being triggered in a call to Profile.objects.get_or_create. Profile is the model class that maps to the account_profile table. Perhaps there is something in get_or_create that catches too broad a set of exceptions and retries? From the web server logs, it appears that this livelock might have occurred as a result of a double-click on the POST button in my site's registration form.
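
For a double submit, the usual race is that both requests miss on the lookup and both try to insert. A rough sketch of one way to handle that without poisoning the transaction: wrap the INSERT in a savepoint and roll back only to that savepoint on a uniqueness conflict. This uses the stdlib sqlite3 module as a stand-in so it runs without a server; with psycopg2 the placeholder would be %s and the exception psycopg2.IntegrityError. The function name get_or_create_profile is hypothetical, and this is not presented as Django's implementation of get_or_create.

```python
import sqlite3

SELECT_SQL = "SELECT id FROM account_profile WHERE id = ?"
INSERT_SQL = "INSERT INTO account_profile (id) VALUES (?)"


def get_or_create_profile(conn, profile_id):
    """Return (profile_id, created), trying the INSERT at most once."""
    cur = conn.cursor()
    cur.execute(SELECT_SQL, (profile_id,))
    if cur.fetchone() is not None:
        return profile_id, False

    cur.execute("SAVEPOINT profile_create")
    try:
        cur.execute(INSERT_SQL, (profile_id,))
    except sqlite3.IntegrityError:
        # Another request inserted the row first. Undo only the failed
        # INSERT so the enclosing transaction stays usable.
        cur.execute("ROLLBACK TO SAVEPOINT profile_create")
        cur.execute("RELEASE SAVEPOINT profile_create")
        cur.execute(SELECT_SQL, (profile_id,))
        if cur.fetchone() is None:
            raise  # not a race after all: give up instead of looping
        return profile_id, False
    cur.execute("RELEASE SAVEPOINT profile_create")
    return profile_id, True
```

There is no loop anywhere in this function: one lookup, at most one insert, at most one second lookup. Whatever the cause of the conflict, the worst case is a raised exception rather than an endless retry.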

This condition has occurred a couple of times over the past few days on the live site, and it causes a significant slowdown until I intervene, so pretty much anything other than an endless livelock would be an improvement! :)
