Suggestions for distributing python data/code across worker nodes?

Posted on 2025-01-06 08:23:54


I'm starting to venture into distributed code and am having trouble figuring out which solution fits my needs based on all the options out there. Basically I have a Python list of data that I need to process with a single function. The function has a few nested for loops, but it doesn't take too long (about a minute) per item on the list. My problem is that the list is very large (3000+ items). I'm looking at multiprocessing, but I think I want to experiment with multi-server processing (because ideally, if the data gets larger, I want the option of adding more servers during the job to make it run quicker).
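
For reference, the single-machine multiprocessing route mentioned above is roughly the pattern below; process_item is only a placeholder for the real per-item function, which isn't shown here.

from multiprocessing import Pool

# Placeholder for the real per-item function (a few nested loops,
# roughly a minute of work per item).
def process_item(item):
    return item

if __name__ == '__main__':
    data = list(range(3000))      # the large list of items
    with Pool() as pool:          # one worker process per CPU core by default
        results = pool.map(process_item, data)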

I'm basically looking for something I can distribute this data list through (not strictly needed, but it would also be nice if I could distribute my code base through it).

So my question is: what package can I use to achieve this? My database is HBase, so I already have Hadoop running (I've never actually used Hadoop, though, just the database on top of it). I looked at Celery and Twisted as well, but I'm confused about which will fit my needs.

Any suggestions?


Comments (2)

眼趣 2025-01-13 08:23:54


I would highly recommend celery. You can define a task that operates on a single item of your list:

from celery import Celery

# A Celery app bound to a message broker; the broker URL below is only an
# example and should point at your own RabbitMQ (or Redis) instance.
app = Celery('tasks', broker='amqp://localhost//', backend='rpc://')

@app.task
def process(i):
    # do something with the item here
    i += 1
    # return a result
    return i

You can easily parallelize a list like this:

results = []
todo = [1, 2, 3, 4, 5]
for arg in todo:
    # queue one task per item; note the trailing comma so args is a tuple
    res = process.apply_async(args=(arg,))
    results.append(res)

# .get() blocks until each task has finished, then returns its result
all_results = [res.get() for res in results]

This is easily scalable by just adding more celery workers.
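
For example, assuming the task module above is saved as tasks.py, each machine would run a worker with something like celery -A tasks worker --loglevel=info pointed at the shared broker, and adding servers mid-job just means starting more of these worker processes.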

长安忆 2025-01-13 08:23:54


Check out RabbitMQ. Python bindings are available through pika. Start with a simple work queue and run a few RPC calls.
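
A producer for such a work queue might look roughly like the sketch below; the host and queue name are placeholder assumptions, and a separate consumer script would pull items off the queue and run the processing function on each one.

import json
import pika

# Connect to a RabbitMQ broker assumed to be running on localhost.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

# Durable queue so queued items survive a broker restart.
channel.queue_declare(queue='task_queue', durable=True)

for item in [1, 2, 3, 4, 5]:
    channel.basic_publish(
        exchange='',
        routing_key='task_queue',
        body=json.dumps(item),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

connection.close()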

It may look troublesome to experiment with distributed computing in Python using an external engine like RabbitMQ (there's a small learning curve to installing and configuring the broker), but you may find it even more useful later.

... and Celery can work hand-in-hand with RabbitMQ; check out Robert Pogorzelski's tutorial and Simple distributed tasks with Celery and RabbitMQ.
