Coordinating distributed Python processes with a queue or a REST web service

Posted 2024-12-18 15:35:25


Server A has a process that exports n database tables as flat files. Server B contains a utility that loads the flat files into a DW appliance database.

A process runs on Server A that exports and compresses about 50-75 tables. Each time a table is exported and a file produced, a .flag file is also generated.

Server B has a bash process that repeatedly checks for each .flag file produced by Server A. It does this by connecting to A and checking whether the file exists. If the flag file exists, Server B will scp the file from Server A, uncompress it, and load it into an analytics database. If the file doesn't yet exist, it sleeps for n seconds and tries again. This is repeated for each table/file that Server B expects to find on Server A. The process executes serially, handling a single file at a time.
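
For reference, a rough Python sketch of what that bash loop does; the host name, remote directory, table list, and the load_into_dw command are all placeholders:

    # Rough Python equivalent of the existing polling loop on Server B.
    import subprocess
    import time

    TABLES = ["customers", "orders"]        # 50-75 table names in practice
    REMOTE = "server-a"                     # placeholder host
    REMOTE_DIR = "/data/exports"            # placeholder directory
    POLL_SECONDS = 30

    for table in TABLES:
        flag = "%s/%s.flag" % (REMOTE_DIR, table)
        # Wait until Server A has written the flag file for this table.
        while subprocess.call(["ssh", REMOTE, "test", "-f", flag]) != 0:
            time.sleep(POLL_SECONDS)

        # Copy, uncompress, and load, one file at a time, serially.
        subprocess.check_call(["scp", "%s:%s/%s.csv.gz" % (REMOTE, REMOTE_DIR, table), "."])
        subprocess.check_call(["gunzip", "-f", "%s.csv.gz" % table])
        subprocess.check_call(["load_into_dw", "%s.csv" % table])   # placeholder loader utility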

Additionally: The process that runs on Server A cannot 'push' the file to Server B. Because of file-size and geographic concerns, Server A cannot load the flat file into the DW Appliance.

I find this process cumbersome, and it just so happens to be up for a rewrite/revamp. I'm proposing a messaging-based solution. I initially thought this would be a good candidate for RabbitMQ (or the like), where

  • Server A would write a file, compress it and then produce a message for a queue.

  • Server B would subscribe to the queue and would process files named in the message body.

I feel that a messaging-based approach would not only save time as it would eliminate the check-wait-repeat cycle for each table, but also permit us to run processes in parallel (as there are no dependencies).
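
A minimal sketch of those two pieces using the pika client; the broker host, queue name, example file name, and the transfer_and_load helper are all assumptions:

    # Server A: publish one message per exported file.
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="broker-host"))
    channel = conn.channel()
    channel.queue_declare(queue="table_exports", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="table_exports",
        body="customers.csv.gz",                           # the file just written and compressed
        properties=pika.BasicProperties(delivery_mode=2),  # mark the message persistent
    )
    conn.close()

And on Server B, a consumer that acknowledges each message only after the file has been loaded, so a crash just causes redelivery; running several copies of this process gives the parallelism mentioned above:

    import pika

    def transfer_and_load(filename):
        pass                                            # placeholder: scp + gunzip + DW load

    def handle(ch, method, properties, body):
        filename = body.decode()
        transfer_and_load(filename)
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after a successful load

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="broker-host"))
    channel = conn.channel()
    channel.queue_declare(queue="table_exports", durable=True)
    channel.basic_qos(prefetch_count=1)                 # one in-flight message per consumer
    channel.basic_consume(queue="table_exports", on_message_callback=handle)
    channel.start_consuming()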

I showed my team a proof of concept using RabbitMQ and they were all receptive to using messaging. A number of them quickly identified other opportunities where we would benefit from message-based processing. One such area is populating our DW dimensions in real time rather than through batch loads.

It then occurred to me that an MQ-based solution might be overkill given the low volume (50-75 tasks), especially since our operations team would have to install RabbitMQ (and its dependencies, including Erlang), and it would introduce new administration headaches.

I then realized this could be made simpler with a REST-based solution. Server A could produce a file and then make an HTTP call to a simple (web.py) web service on Server B. Server B could then initiate the transfer-and-load process based on the URL that is called. Given the time it takes to transfer, uncompress, and load each file, I would likely use Python's multiprocessing to create a subprocess that loads each file.
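
A rough sketch of what that web.py service could look like, with multiprocessing handing the long-running work off to a child process; the URL layout, port, and the transfer_and_load helper are assumptions, not an existing implementation:

    # Server B: minimal web.py service. After producing and compressing a file,
    # Server A would call e.g. http://server-b:8080/load/customers.csv.gz
    import web
    from multiprocessing import Process

    urls = ("/load/(.+)", "Load")

    def transfer_and_load(filename):
        pass  # placeholder: scp the file from Server A, uncompress it, run the DW loader

    class Load:
        def GET(self, filename):
            # Hand the transfer/uncompress/load off to a child process
            # so the HTTP call returns right away.
            Process(target=transfer_and_load, args=(filename,)).start()
            return "queued " + filename

    if __name__ == "__main__":
        app = web.application(urls, globals())
        app.run()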

I'm thinking the REST-based solution is ideal given that it's simpler. In my opinion, an MQ would be more appropriate for higher-volume tasks, but we're only talking (for now) about 50-75 operations, with potentially more to come.

Would a REST-based approach be a good solution given my requirements and volume? Are there other frameworks or OSS products that already do this? I'm looking to add messaging without creating other administration and development headaches.


Comments (2)

压抑⊿情绪 2024-12-25 15:35:25


Message brokers such as Rabbit contain practical solutions for a number of problems:

  • multiple producers and consumers are supported without risk of duplication of messages
  • atomicity and unit-of-work logic provide transactional integrity, preventing duplication and loss of messages in the event of failure
  • horizontal scaling: most mature brokers can be clustered so that a single queue exists on multiple machines
  • no-rendezvous messaging: it is not necessary for sender and receiver to be running at the same time, so one can be brought down for maintenance without affecting the other
  • preservation of FIFO order

Depending on the particular web service platform you are considering, you may find that you need some of these features and must implement them yourself if not using a broker. The web service protocols and formats such as HTTP, SOAP, JSON, etc. do not solve these problems for you.
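
For example, here is a sketch of the retry logic the sending side would have to hand-roll over plain HTTP, something a broker otherwise covers by holding and redelivering the message; the URL and back-off values are assumptions:

    import time
    import urllib.error
    import urllib.request

    def notify(url, attempts=5, backoff=30):
        for i in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=10):
                    return True
            except (urllib.error.URLError, OSError):
                pass                       # Server B down or unreachable; nothing holds the message
            time.sleep(backoff * (i + 1))  # crude linear back-off
        return False                       # caller must decide what to do with a lost notification

    # e.g. notify("http://server-b:8080/load/customers.csv.gz")
    # Server B, in turn, has to ignore a URL it has already processed, since retries
    # mean the same notification can arrive more than once.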

At my previous job, project management passed on using message brokers early on, but the team later ended up implementing quick-and-dirty logic in our web service architecture to solve some of the same issues listed above. We had less time to deliver business value because we were fixing so many concurrency and error-recovery issues.

So while a message broker may seem on its face like a heavyweight solution, and may actually be more than you need right now, it does have a lot of benefits that you may need later without yet realizing it.

南风几经秋 2024-12-25 15:35:25


As wberry alluded to, a REST or web-hook based solution can work, but it will not be very tolerant of failure. Paying the operations price for messaging up front will pay long-term dividends, as you will find additional problems that are a natural fit for the messaging model.

Regarding other OSS options: if you are considering stream-based processing in addition to this specific use case, I would recommend taking a look at Apache Kafka. Kafka provides messaging semantics similar to RabbitMQ's, but is tightly focused on processing message streams (not to mention that it has been battle-tested in production at LinkedIn).
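
For instance, a minimal sketch of the same file-notification pattern with the kafka-python client; the broker address, topic name, and consumer group are assumptions:

    from kafka import KafkaProducer, KafkaConsumer

    # Server A: one record per exported file.
    producer = KafkaProducer(bootstrap_servers="kafka-host:9092")
    producer.send("table_exports", b"customers.csv.gz")
    producer.flush()

    # Server B: consumers sharing a group split the topic's partitions between them,
    # and the committed offset records which files have already been handled.
    consumer = KafkaConsumer("table_exports",
                             bootstrap_servers="kafka-host:9092",
                             group_id="dw-loaders")
    for record in consumer:
        filename = record.value.decode()
        print("would load", filename)      # placeholder for scp + uncompress + load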
