Suggestions for distributing Python data/code across worker nodes?
I'm starting to venture into distributed code and am having trouble figuring out which solution fits my needs based on all the stuff out there. Basically I have a Python list of data that I need to process with a single function. The function has a few nested for loops but doesn't take too long per item (about a minute). My problem is that the list is very large (3000+ items). I'm looking at multiprocessing, but I want to experiment with multi-server processing, because ideally, if the data gets larger, I'd like the option of adding more servers during the job to make it run faster.
I'm basically looking for something I can distribute this data list through (not strictly needed, but it would also be nice if I could distribute my code base through it).
So my question is, what package can I use to achieve this? My database is HBase, so I already have Hadoop running (I've never used Hadoop directly, though; it's just there for the database). I've also looked at Celery and Twisted, but I'm confused about which will fit my needs.
Any suggestions?
Comments (2)
I would highly recommend Celery. You can define a task that operates on a single item of your list:
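A minimal sketch of such a task, assuming a local RabbitMQ broker, an RPC result backend, and a hypothetical tasks.py module in which process_item stands in for your real function:

    # tasks.py -- hypothetical module; process_item stands in for your real function
    from celery import Celery

    # Broker and backend URLs are assumptions; point them at whatever broker you run.
    app = Celery("tasks", broker="amqp://guest@localhost//", backend="rpc://")

    @app.task
    def process_item(item):
        # Your nested for loops would go here; this placeholder just returns the item.
        return item

Each server then runs a worker process (celery -A tasks worker).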
You can easily parallelize a list like this:
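One hedged way to do the fan-out, reusing the process_item task sketched above (data_list is just example data standing in for your real 3000+ item list):

    # run.py -- hypothetical driver script
    from celery import group

    from tasks import process_item  # the task sketched above

    data_list = ["item-%d" % i for i in range(3000)]  # stand-in for your real list

    # Send every item to whichever workers are connected to the broker.
    job = group(process_item.s(item) for item in data_list)
    result = job.apply_async()

    # Blocks until all workers have finished; results come back in list order.
    processed = result.get()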
This is easily scalable by just adding more Celery workers.
Check out RabbitMQ. Python bindings are available through pika. Start with a simple work_queue and run a few RPC calls.
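A bare-bones work queue along those lines might look like this (queue name, host, and the example items are all assumptions; the worker callback is where your processing function would go):

    # queue_sketch.py -- hypothetical example: producer plus worker callback for one queue
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="task_queue", durable=True)

    # Producer side: push each item of your list onto the queue.
    for item in ["item-1", "item-2", "item-3"]:  # stand-in for your real data list
        channel.basic_publish(
            exchange="",
            routing_key="task_queue",
            body=item,
            properties=pika.BasicProperties(delivery_mode=2),  # persist messages to disk
        )

    # Worker side: each server runs a consumer like this.
    def callback(ch, method, properties, body):
        # Your processing function would run on body here.
        print("processed %r" % body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_qos(prefetch_count=1)  # hand each worker one message at a time
    channel.basic_consume(queue="task_queue", on_message_callback=callback)
    channel.start_consuming()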
It may look troublesome to experiment with distributed computing in Python using an external engine like RabbitMQ (there's a small learning curve for installing and configuring it), but you may find it even more useful later.
... and Celery can work hand in hand with RabbitMQ; check out Robert Pogorzelski's tutorial and Simple distributed tasks with Celery and RabbitMQ.