Should I add a sleep to my ZeroMQ program's initialization to avoid heisenbugs?

Posted 2025-01-03 12:33:49

I'm working on a ZeroMQ proof of concept that involves a master process which publishes control commands and also pushes data to, and pulls data from, any number of worker processes.

It seems that on initialization the master and workers (separate processes) sometimes get out of sync if I start them from a shell script. However, I've never seen this when I start them manually, in any order, in separate console windows. I'm beginning to consider adding a sleep() after each process binds or connects its sockets to avoid this apparent heisenbug, but I'm also wondering if I'm just being stupid. Any advice?

Here is the shell script that occasionally fails. The master talks to the workers using both a PUB and a PUSH socket, and gets info back on a PULL socket. I suspect the heisenbug occurs when one of the workers misses a PUB message from the master.

echo "starting worker A in background"
python pWorkerA.py > /tmp/A.out &
echo "starting worker B in background"
python pWorkerB.py > /tmp/B.out &
echo "starting master"
python abMaster.py

I feel like I'm cheating if I use sleep().


Comments (1)

很快妥协 2025-01-10 12:33:49

You have to assume that a message sent on a PUB socket will not arrive at a SUB socket that has not yet established its connection. Establishing a connection takes a finite, if very small, amount of time, so any message published in that window never reaches the SUBs that haven't connected yet. An easy way to avoid this is, as you suggest, to add a sleep in the master after binding. This is not perfectly reliable: a worker could be very slow to connect, or could be started after the master, and the master gets no actual signal when a connection succeeds.
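For illustration, here is a minimal sketch of where that sleep would go, assuming the master is written with pyzmq. The function name `publish_after_delay`, the `settle_seconds` parameter, and its default value are hypothetical and not taken from the question's scripts.

```python
import time

import zmq


def publish_after_delay(ctx, pub_addr, commands, settle_seconds=0.5):
    """Bind a PUB socket, wait a fixed grace period, then publish.

    Hypothetical helper: the name and the default delay are assumptions.
    """
    pub = ctx.socket(zmq.PUB)
    pub.bind(pub_addr)

    # Fixed grace period so subscribers have a chance to finish connecting.
    # This is a guess, not a guarantee: a worker that connects later than
    # this still misses everything published before it arrived.
    time.sleep(settle_seconds)

    for cmd in commands:
        pub.send(cmd)
    pub.close()
```

The delay only makes the race less likely. How long is "long enough" depends on the machine and on how the workers are launched, which is why it works as a stopgap rather than a fix.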

A more reliable approach, if you do need to confirm that workers have connected, is a handshake: each worker sends a small "Hi, I'm ready" message to the master (on a different channel) after connecting. The master then starts publishing only after it has received the necessary number of handshakes (what counts as "necessary" depends on your application's logic).
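A rough sketch of such a handshake with pyzmq is below. It adds one detail beyond the description above: because connect() in ZeroMQ returns before the connection is actually up, the master publishes a PING until everyone has checked in, and each worker reports ready only after it has received something on its SUB socket. The PING and STOP messages, the function names, and the inproc addresses are all assumptions made for this example. Threads inside one process stand in for the separate worker processes, and connecting before the master binds assumes libzmq 4.0 or later.

```python
import threading

import zmq

PING = b"PING"  # hypothetical message published until all workers are ready
STOP = b"STOP"  # hypothetical sentinel telling workers to exit


def master(ctx, pub_addr, sync_addr, n_workers, commands):
    pub = ctx.socket(zmq.PUB)
    pub.bind(pub_addr)
    sync = ctx.socket(zmq.PULL)  # the "different channel" for ready messages
    sync.bind(sync_addr)

    # Keep pinging until n_workers distinct workers have reported ready.
    ready = set()
    while len(ready) < n_workers:
        pub.send(PING)
        if sync.poll(timeout=50):  # wait up to 50 ms for a ready message
            ready.add(sync.recv())

    # Every worker has seen at least one PING, so its subscription is live.
    for cmd in commands:
        pub.send(cmd)
    pub.send(STOP)
    pub.close()
    sync.close()


def worker(ctx, pub_addr, sync_addr, name):
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything
    sub.connect(pub_addr)
    sync = ctx.socket(zmq.PUSH)
    sync.connect(sync_addr)

    sub.recv()       # first message is a PING: the SUB side really works
    sync.send(name)  # "Hi, I'm ready"

    received = []
    while True:
        msg = sub.recv()
        if msg == STOP:
            break
        if msg != PING:  # ignore pings sent while other workers catch up
            received.append(msg)
    sub.close()
    sync.close()
    return received


def run_demo():
    ctx = zmq.Context()  # inproc endpoints require one shared context
    pub_addr, sync_addr = "inproc://control", "inproc://ready"
    results = {}

    def run_worker(name):
        results[name] = worker(ctx, pub_addr, sync_addr, name)

    # Start the workers first, as the shell script in the question does.
    threads = [threading.Thread(target=run_worker, args=(n,)) for n in (b"A", b"B")]
    for t in threads:
        t.start()
    master(ctx, pub_addr, sync_addr, n_workers=2, commands=[b"cmd-1", b"cmd-2"])
    for t in threads:
        t.join()
    ctx.term()
    return results


if __name__ == "__main__":
    print(run_demo())
```

With separate processes, the same functions could be pointed at tcp:// addresses instead. The question's master already has a PULL socket, so the ready messages could arrive on that one rather than on a dedicated socket, as long as the master can tell them apart from data.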
