Coordinating distributed Python processes with a queue or a REST web service

Posted 2024-12-18 15:35:25


Server A has a process that exports n database tables as flat files. Server B contains a utility that loads the flat files into a DW appliance database.

A process runs on Server A that exports and compresses about 50-75 tables. Each time a table is exported and a file produced, a .flag file is also generated.

Server B has a bash process that repeatedly checks for each .flag file produced by Server A. It does this by connecting to A and checking whether the file exists. If the flag file exists, Server B will scp the file from Server A, uncompress it, and load it into an analytics database. If the file doesn't yet exist, it sleeps for n seconds and tries again. This is repeated for each table/file that Server B expects to find on Server A. The process executes serially, handling a single file at a time.
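
For reference, a rough Python sketch of what that bash loop does; the host name, remote directory, table list, and the load_into_dw command are all placeholders:

    # Rough Python equivalent of the existing polling loop on Server B.
    import subprocess
    import time

    TABLES = ["customers", "orders"]        # 50-75 table names in practice
    REMOTE = "server-a"                     # placeholder host
    REMOTE_DIR = "/data/exports"            # placeholder directory
    POLL_SECONDS = 30

    for table in TABLES:
        flag = "%s/%s.flag" % (REMOTE_DIR, table)
        # Wait until Server A has written the flag file for this table.
        while subprocess.call(["ssh", REMOTE, "test", "-f", flag]) != 0:
            time.sleep(POLL_SECONDS)

        # Copy, uncompress, and load, one file at a time, serially.
        subprocess.check_call(["scp", "%s:%s/%s.csv.gz" % (REMOTE, REMOTE_DIR, table), "."])
        subprocess.check_call(["gunzip", "-f", "%s.csv.gz" % table])
        subprocess.check_call(["load_into_dw", "%s.csv" % table])   # placeholder loader utility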

Additionally: The process that runs on Server A cannot 'push' the file to Server B. Because of file-size and geographic concerns, Server A cannot load the flat file into the DW Appliance.

I find this process cumbersome, and it just so happens to be up for a rewrite/revamp. I'm proposing a messaging-based solution. I initially thought this would be a good candidate for RabbitMQ (or the like), where

  • Server A would write a file, compress it and then produce a message for a queue.

  • Server B would subscribe to the queue and would process files named in the message body.

I feel that a messaging-based approach would not only save time as it would eliminate the check-wait-repeat cycle for each table, but also permit us to run processes in parallel (as there are no dependencies).
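
A minimal sketch of those two pieces using the pika client; the broker host, queue name, example file name, and the transfer_and_load helper are all assumptions:

    # Server A: publish one message per exported file.
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="broker-host"))
    channel = conn.channel()
    channel.queue_declare(queue="table_exports", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="table_exports",
        body="customers.csv.gz",                           # the file just written and compressed
        properties=pika.BasicProperties(delivery_mode=2),  # mark the message persistent
    )
    conn.close()

And on Server B, a consumer that acknowledges each message only after the file has been loaded, so a crash just causes redelivery; running several copies of this process gives the parallelism mentioned above:

    import pika

    def transfer_and_load(filename):
        pass                                            # placeholder: scp + gunzip + DW load

    def handle(ch, method, properties, body):
        filename = body.decode()
        transfer_and_load(filename)
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after a successful load

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="broker-host"))
    channel = conn.channel()
    channel.queue_declare(queue="table_exports", durable=True)
    channel.basic_qos(prefetch_count=1)                 # one in-flight message per consumer
    channel.basic_consume(queue="table_exports", on_message_callback=handle)
    channel.start_consuming()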

I showed my team a proof of concept using RabbitMQ and they were all receptive to using messaging. A number of them quickly identified other opportunities where we would benefit from message-based processing. One such area is populating our DW dimensions in real time rather than through batch loads.

It then occurred to me that an MQ-based solution might be overkill given the low volume (50-75 tasks), especially since our operations team would have to install RabbitMQ (and its dependencies, including Erlang), and it would introduce new administration headaches.

I then realized this could be made simpler with a REST-based solution. Server A could produce a file and then make an HTTP call to a simple (web.py) web service on Server B. Server B could then initiate the transfer-and-load process based on the URL that is called. Given the time it takes to transfer, uncompress, and load each file, I would likely use Python's multiprocessing to create a subprocess that loads each file.
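
A rough sketch of what that web.py service could look like, with multiprocessing handing the long-running work off to a child process; the URL layout, port, and the transfer_and_load helper are assumptions, not an existing implementation:

    # Server B: minimal web.py service. After producing and compressing a file,
    # Server A would call e.g. http://server-b:8080/load/customers.csv.gz
    import web
    from multiprocessing import Process

    urls = ("/load/(.+)", "Load")

    def transfer_and_load(filename):
        pass  # placeholder: scp the file from Server A, uncompress it, run the DW loader

    class Load:
        def GET(self, filename):
            # Hand the transfer/uncompress/load off to a child process
            # so the HTTP call returns right away.
            Process(target=transfer_and_load, args=(filename,)).start()
            return "queued " + filename

    if __name__ == "__main__":
        app = web.application(urls, globals())
        app.run()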

I'm thinking the REST-based solution is ideal given that it's simpler. In my opinion, an MQ would be more appropriate for higher-volume tasks, but we're only talking (for now) about 50-75 operations, with potentially more to come.

Would a REST-based approach be a good solution given my requirements and volume? Are there other frameworks or OSS products that already do this? I'm looking to add messaging without creating other administration and development headaches.


Comments (2)

压抑⊿情绪 2024-12-25 15:35:25


Message brokers such as Rabbit contain practical solutions for a number of problems:

  • multiple producers and consumers are supported without risk of duplication of messages
  • atomicity and unit-of-work logic provide transactional integrity, preventing duplication and loss of messages in the event of failure
  • horizontal scaling: most mature brokers can be clustered so that a single queue exists on multiple machines
  • no-rendezvous messaging: it is not necessary for sender and receiver to be running at the same time, so one can be brought down for maintenance without affecting the other
  • preservation of FIFO order

Depending on the particular web service platform you are considering, you may find that you need some of these features and must implement them yourself if not using a broker. The web service protocols and formats such as HTTP, SOAP, JSON, etc. do not solve these problems for you.
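
For example, here is a sketch of the retry logic the sending side would have to hand-roll over plain HTTP, something a broker otherwise covers by holding and redelivering the message; the URL and back-off values are assumptions:

    import time
    import urllib.error
    import urllib.request

    def notify(url, attempts=5, backoff=30):
        for i in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=10):
                    return True
            except (urllib.error.URLError, OSError):
                pass                       # Server B down or unreachable; nothing holds the message
            time.sleep(backoff * (i + 1))  # crude linear back-off
        return False                       # caller must decide what to do with a lost notification

    # e.g. notify("http://server-b:8080/load/customers.csv.gz")
    # Server B, in turn, has to ignore a URL it has already processed, since retries
    # mean the same notification can arrive more than once.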

At my previous job, project management passed on using message brokers early on, but the team later ended up implementing quick-and-dirty logic in our web service architecture to solve some of the same issues listed above. We had less time to deliver business value because we were fixing so many concurrency and error-recovery issues.

So while a message broker may seem on its face like a heavyweight solution, and may actually be more than you need right now, it does have a lot of benefits that you may need later without yet realizing it.

南风几经秋 2024-12-25 15:35:25


As wberry alluded to, a REST or web-hook based solution can work, but it will not be very tolerant of failure. Paying the operations price for messaging up front will pay long-term dividends, as you will find additional problems that are a natural fit for the messaging model.

Regarding other OSS options: if you are considering stream-based processing in addition to this specific use case, I would recommend taking a look at Apache Kafka. Kafka provides messaging semantics similar to RabbitMQ's, but is tightly focused on processing message streams (not to mention that it has been battle-tested in production at LinkedIn).
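
For instance, a minimal sketch of the same file-notification pattern with the kafka-python client; the broker address, topic name, and consumer group are assumptions:

    from kafka import KafkaProducer, KafkaConsumer

    # Server A: one record per exported file.
    producer = KafkaProducer(bootstrap_servers="kafka-host:9092")
    producer.send("table_exports", b"customers.csv.gz")
    producer.flush()

    # Server B: consumers sharing a group split the topic's partitions between them,
    # and the committed offset records which files have already been handled.
    consumer = KafkaConsumer("table_exports",
                             bootstrap_servers="kafka-host:9092",
                             group_id="dw-loaders")
    for record in consumer:
        filename = record.value.decode()
        print("would load", filename)      # placeholder for scp + uncompress + load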
