celery .delay hangs (recent, not an auth problem)

Posted on 2024-11-15 12:08:57

I am running Celery 2.2.4/djCelery 2.2.4, using RabbitMQ 2.1.1 as a backend. I recently brought online two new celery servers -- I had been running 2 workers across two machines with a total of ~18 threads and on my new souped up boxes (36g RAM + dual hyper-threaded quad-core), I am running 10 workers with 8 threads each, for a total of 180 threads -- my tasks are all pretty small so this should be fine.

The nodes have been running fine for the last few days, but today I noticed that .delay() is hanging. When I interrupt it, I see a traceback that points here:

File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 324, in delay
    return self.apply_async(args, kwargs)
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 449, in apply_async
    publish.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/kombu/compat.py", line 108, in close
    self.backend.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/channel.py", line 194, in close
    (20, 41),    # Channel.close_ok
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
    self.channel_id, allowed_methods)
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
    self.method_reader.read_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 212, in read_method
    self._next_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 127, in _next_method
    frame_type, channel, payload = self.source.read_frame()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 109, in read_frame
    frame_type, channel, size = unpack('>BHI', self._read(7))
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 200, in _read
    s = self.sock.recv(65536)
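
The traceback above ends in a blocking sock.recv: the client has sent Channel.close and is waiting for a Channel.close_ok reply that never arrives. As a rough illustration only (this is plain standard-library Python, not amqplib or Celery code, and recv_with_timeout is a hypothetical helper name), a read timeout turns that kind of indefinite wait into something the caller can detect:

```python
import socket


def recv_with_timeout(sock, nbytes, timeout_s):
    """Read up to nbytes from sock, or return None if nothing arrives in time."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The peer accepted the connection but stayed silent.
        return None


if __name__ == "__main__":
    # A connected pair of local sockets stands in for client and broker.
    client, broker = socket.socketpair()
    try:
        # The "broker" says nothing, as in the hang described above.
        print(recv_with_timeout(client, 7, 0.2))
        # Once the other side answers, the read returns the data.
        broker.sendall(b"\x01\x00\x01")
        print(recv_with_timeout(client, 7, 0.2))
    finally:
        client.close()
        broker.close()
```

Without a timeout, the first read would block until interrupted, which matches the symptom of .delay() hanging.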

I've checked the Rabbit logs, and I can see the process connecting:

=INFO REPORT==== 12-Jun-2011::22:58:12 ===
accepted TCP connection on 0.0.0.0:5672 from x.x.x.x:48569

I have my Celery log level set to INFO, but I don't see anything particularly interesting in the Celery logs EXCEPT that 2 of the workers can't connect to the broker:

[2011-06-12 22:41:08,033: ERROR/MainProcess] Consumer: Connection to broker lost. Trying to re-establish connection...

All of the other nodes are able to connect without issue.

I know that there was a posting ( RabbitMQ / Celery with Django hangs on delay/ready/etc - No useful log info ) last year of a similar nature, but I'm pretty certain that this is different. Could it be that the sheer number of workers is creating some sort of race condition in amqplib? I found a thread that seems to indicate that amqplib is not thread-safe, but I'm not sure whether this matters for Celery.

EDIT: I've tried celeryctl purge on both nodes -- on one it succeeds, but on the other it fails with the following AMQP error:

AMQPConnectionException(reply_code, reply_text, (class_id, method_id))
    amqplib.client_0_8.exceptions.AMQPConnectionException: 
    (530, u"NOT_ALLOWED - cannot redeclare exchange 'XXXXX' in vhost 'XXXXX' 
     with different type, durable or autodelete   value", (40, 10), 'Channel.exchange_declare')

On both nodes, inspect stats hangs with the "can't close connection" traceback above. I'm at a loss here.

EDIT2: I was able to delete the offending exchange using exchange.delete from camqadm and now the second node hangs too :(.

EDIT3: One thing that also recently changed is that I added an additional vhost to rabbitmq, which my staging node connects to.


Comments (2)

戒ㄋ 2024-11-22 12:08:57

Hopefully this will save somebody a lot of time...though it certainly does not save me any embarrassment:

/var was full on the server that was running rabbit. With all of the nodes that I added, rabbit was doing a lot more logging and it filled up /var -- I couldn't write to /var/lib/rabbitmq, and so no messages were going through.
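
For anyone who wants to rule this cause out quickly, here is a minimal sketch of a free-space check using only the Python standard library. The path /var/lib/rabbitmq comes from the answer above; the function names and the 5% threshold are assumptions chosen for illustration, not anything RabbitMQ itself uses.

```python
import os
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def is_nearly_full(path, min_free=0.05):
    """True if less than min_free of the filesystem is free (threshold is arbitrary)."""
    return free_fraction(path) < min_free


if __name__ == "__main__":
    # Fall back to the current directory if the broker's data directory
    # does not exist on this machine.
    path = "/var/lib/rabbitmq"
    if not os.path.exists(path):
        path = "."
    print("%s: %.1f%% free" % (path, 100 * free_fraction(path)))
    if is_nearly_full(path):
        print("WARNING: filesystem is nearly full; the broker may be unable to write")
```

Running something like this on the broker host (or simply running df -h) before digging into client-side tracebacks would have pointed at the full /var right away.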

注定孤独终老 2024-11-22 12:08:57

I had the same symptoms, but not the same cause. For anyone else who stumbles upon this, mine was solved by https://stackoverflow.com/a/63591450/284164 -- I wasn't importing the celery app at the project level, and .delay() was hanging until I added that.
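
For reference, "importing the celery app at the project level" usually refers to the pattern from the Celery documentation for Django projects, shown below as a configuration fragment. It assumes a project package named proj that contains a celery.py module defining app; both names are placeholders. It applies to newer Celery releases, not the 2.2.4 setup in the question.

```python
# proj/proj/__init__.py  (hypothetical project package name)

# Import the Celery app whenever Django starts, so that tasks declared
# with @shared_task bind to this app and its broker settings.
from .celery import app as celery_app

__all__ = ('celery_app',)
```

Without this import, tasks may bind to a default app with no broker configured, and .delay() can then block while trying to reach a broker that is not there.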
