Parallel/async download of S3 data into EC2 in Python?
I have large data files stored in S3 that I need to analyze. Each batch consists of ~50 files, each of which can be analyzed independently.
I'd like to set up parallel downloads of the S3 data into the EC2 instance, and set up triggers that start the analysis process on each file as soon as it finishes downloading.
Are there any libraries that handle an async-download, trigger-on-completion model?
If not, I'm thinking of setting up multiple download processes with pyprocessing, each of which will download and analyze a single file. Does that sound reasonable, or are there better alternatives?
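For illustration, here is a minimal sketch of the download-then-trigger pattern described above, using only the standard library. The `download` and `analyze` callables are hypothetical placeholders (not from the original post): `download` is assumed to take an S3 key and return a local path, and in practice it might wrap whatever S3 client is in use. Downloads run in a thread pool, and each analysis is triggered as soon as its download completes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def download_and_analyze(
    keys: Iterable[str],
    download: Callable[[str], str],   # hypothetical: S3 key -> local file path
    analyze: Callable[[str], object], # hypothetical: local file path -> result
    max_workers: int = 8,
) -> Dict[str, object]:
    """Download all keys in parallel; analyze each file as it arrives."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Start every download up front; the pool limits concurrency.
        future_to_key = {pool.submit(download, key): key for key in keys}
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(future_to_key):
            key = future_to_key[future]
            local_path = future.result()  # re-raises any download error
            results[key] = analyze(local_path)
    return results
```

In this sketch the analysis runs in the main thread while the remaining downloads continue in the background. If the analysis is CPU-heavy, handing `analyze` to a process pool instead would be a natural variation.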
Comments (3)
Answering my own question: I ended up writing a simple modification to the Amazon S3 Python library that lets you download a file in chunks or read it line by line. Available here.
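As a rough idea of what chunked and line-by-line reading can look like, here is a generic sketch over any binary file-like object. This is not the modification mentioned above, whose code is not shown here; the function names `iter_chunks` and `iter_lines` are made up for this example, and the stream is assumed to expose a standard `read(size)` method.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive fixed-size chunks until the stream is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def iter_lines(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield newline-separated lines without loading the whole stream."""
    pending = b""
    for chunk in iter_chunks(stream, chunk_size):
        pending += chunk
        # Everything before the last newline is complete; keep the remainder.
        *complete, pending = pending.split(b"\n")
        yield from complete
    if pending:
        yield pending  # final line with no trailing newline


lines = list(iter_lines(io.BytesIO(b"a\nbb\nccc"), chunk_size=2))
```

Reading this way keeps memory use bounded by the chunk size plus the longest line, which matters for large data files.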
It sounds like you're looking for Twisted:
"Twisted is an event-driven networking engine written in Python and licensed under the MIT license."
http://twistedmatrix.com/trac/
I've used Twisted for quite a few asynchronous projects involving communication both over the Internet and with subprocesses.
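To show the event-driven shape without depending on Twisted, here is a small sketch that swaps in `asyncio` from the standard library. It is not Twisted code; Twisted would express the same flow with Deferreds and callbacks. The `fetch` coroutine and `analyze` function are hypothetical placeholders, and `limit` is an assumed cap on concurrent downloads.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def run_batch(
    keys: Iterable[str],
    fetch: Callable[[str], Awaitable[object]],  # hypothetical async download
    analyze: Callable[[object], object],        # hypothetical analysis step
    limit: int = 8,
) -> Dict[str, object]:
    """Fetch all keys concurrently; analyze each one once its fetch finishes."""
    semaphore = asyncio.Semaphore(limit)  # cap the number of in-flight fetches

    async def fetch_then_analyze(key: str):
        async with semaphore:
            data = await fetch(key)
        # Runs as soon as this key's fetch completes, independent of the others.
        return key, analyze(data)

    pairs = await asyncio.gather(*(fetch_then_analyze(k) for k in keys))
    return dict(pairs)
```

Note that `analyze` runs on the event loop here, so a slow analysis would block other fetches from progressing; offloading it to an executor is one way around that.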
I don't know of anything that already does exactly what you're looking for, but even so, it should be reasonably easy to put together in Python. For a threaded approach, you might take a look at this Python recipe, which does multi-threaded HTTP downloads for testing download mirrors.
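The threaded approach can be sketched roughly as follows. This is not the linked recipe, just a minimal illustration with explicit worker threads pulling from a shared queue; `fetch` is a hypothetical callable standing in for the actual HTTP or S3 download.

```python
import queue
import threading
from typing import Callable, Dict, Iterable


def threaded_fetch(
    keys: Iterable[str],
    fetch: Callable[[str], object],  # hypothetical: performs one download
    num_threads: int = 4,
) -> Dict[str, object]:
    """Run fetch(key) for every key using a fixed number of worker threads."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for key in keys:
        tasks.put(key)

    results: Dict[str, object] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                key = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, let the thread exit
            try:
                value = fetch(key)
            except Exception as exc:  # record the failure instead of losing it
                value = exc
            with lock:
                results[key] = value

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Failed downloads are stored as exception objects under their key, so the caller can decide whether to retry them.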
EDIT: A few packages I found that might do most of the work for you and be what you're looking for