Downloading a large number of files from S3


What's the fastest way to get a large number of files (relatively small, 10-50 kB each) from Amazon S3 using Python? (On the order of 200,000 to a million files.)

At the moment I am using boto to generate signed URLs and PyCURL to fetch the files one by one.

Would some kind of concurrency help? A PyCurl.CurlMulti object?

I am open to all suggestions. Thanks!
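
For reference, a minimal sketch of the one-by-one setup described above, assuming boto's generate_url and a plain PyCURL loop; the bucket name, key names, and output filenames are placeholders:

```python
# Roughly the sequential approach described in the question: boto signs a URL
# per key, PyCURL fetches each object in turn. All names are placeholders.
import boto
import pycurl

conn = boto.connect_s3()
keys = ["prefix/object-%06d" % i for i in range(200000)]  # placeholder keys

for i, key in enumerate(keys):
    url = conn.generate_url(3600, "GET", bucket="my-bucket", key=key)
    with open("file-%06d" % i, "wb") as fh:
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, fh)   # stream the response body to the file
        c.perform()
        c.close()
```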

6 Answers

慕烟庭风 2024-08-01 09:36:42

I've been using txaws with Twisted for S3 work, though what you'd probably want is just to get the authenticated URL and use twisted.web.client.downloadPage (by default it will happily stream straight to a file without much interaction).

Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000, I'd probably make a generator and use a cooperator to set my concurrency, and just let the generator produce every required download request.

If you're not familiar with Twisted, you'll find the model takes a bit of time to get used to, but it's oh so worth it. In this case, I'd expect it to take minimal CPU and memory overhead, but you'd have to worry about file descriptors. It's quite easy to mix in Perspective Broker and farm the work out to multiple machines should you find yourself needing more file descriptors, or if you have multiple connections over which you'd like it to pull down.
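
A minimal sketch of that generator-plus-cooperator pattern, assuming Twisted is installed; downloadPage is the old twisted.web.client helper the answer refers to (since deprecated), and signed_urls plus the output filenames are placeholders:

```python
# One generator produces every download; N cooperative tasks pull from it in
# parallel, so at most `concurrency` downloads are in flight at once.
from twisted.internet import defer, reactor, task
from twisted.web.client import downloadPage


def fetch_all(signed_urls, concurrency=50):
    work = (downloadPage(url.encode("ascii"), "file-%06d" % i)
            for i, url in enumerate(signed_urls))
    coop = task.Cooperator()
    # Each coiterate() call drives the shared generator, waiting on each
    # yielded Deferred before asking for the next one.
    return defer.DeferredList([coop.coiterate(work)
                               for _ in range(concurrency)])


if __name__ == "__main__":
    signed_urls = []  # fill in with pre-signed S3 URLs
    d = fetch_all(signed_urls)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
```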

就像说晚安 2024-08-01 09:36:42

What about threads + a queue? I love this article: Practical threaded programming with Python
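
A quick sketch of that threads-plus-queue approach, assuming the signed URLs have already been generated; urllib.request stands in for PyCURL here, and the worker count and output filenames are placeholders:

```python
# A fixed pool of worker threads pulls (url, destination) pairs off a queue.
import queue
import threading
import urllib.request

NUM_WORKERS = 32  # tune to taste


def worker(q):
    while True:
        item = q.get()
        if item is None:          # sentinel: no more work for this thread
            q.task_done()
            break
        url, dest = item
        try:
            urllib.request.urlretrieve(url, dest)
        finally:
            q.task_done()


def download_all(signed_urls):
    q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q,), daemon=True)
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for i, url in enumerate(signed_urls):
        q.put((url, "file-%06d" % i))
    for _ in threads:
        q.put(None)               # one sentinel per worker
    q.join()
```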

行至春深 2024-08-01 09:36:42

Each job can be done with the appropriate tool :)

You want to use Python for stress-testing S3 :), so I suggest finding a high-volume downloader program and passing the links to it.

On Windows I have experience installing the ReGet program (shareware, from http://reget.com) and creating download tasks via its COM interface.

Of course, other programs with a usable interface may exist as well.

Regards!

ま柒月 2024-08-01 09:36:41

I don't know anything about Python, but in general you would want to break the task down into smaller chunks so that they can be run concurrently. You could break it down by file type, alphabetically, or some other way, and then run a separate script for each portion of the breakdown.
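
For illustration, a tiny sketch of splitting a key list into portions that separate scripts or processes could then work through; the key list itself is a placeholder:

```python
# Split the full key list into n roughly equal portions.
def chunks(keys, n):
    size = max(1, (len(keys) + n - 1) // n)
    return [keys[i:i + size] for i in range(0, len(keys), size)]

keys = ["prefix/object-%06d" % i for i in range(200000)]  # placeholder keys
portions = chunks(keys, 8)  # e.g. eight portions, one per worker script
```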

同尘 2024-08-01 09:36:41

In the case of Python, since this is IO-bound, multiple threads will use the CPU, but they will probably only keep one core busy. If you have multiple cores, you might want to consider the new multiprocessing module. Even then you may want to have each process use multiple threads. You would have to do some tweaking of the number of processes and threads.

If you do use multiple threads, this is a good candidate for the Queue class.
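
A rough sketch of that processes-plus-threads layout, assuming signed_urls is a list of pre-signed S3 URLs; the process count, thread count, and output filenames are exactly the kind of placeholders you would tune:

```python
# Each process runs its own pool of downloader threads fed by a Queue.
import multiprocessing
import os
import queue
import threading
import urllib.request

PROCESSES = 4            # roughly one per core
THREADS_PER_PROCESS = 16


def thread_worker(q):
    while True:
        item = q.get()
        if item is None:      # sentinel: shut this thread down
            break
        url, dest = item
        urllib.request.urlretrieve(url, dest)


def process_worker(urls):
    q = queue.Queue()
    threads = [threading.Thread(target=thread_worker, args=(q,))
               for _ in range(THREADS_PER_PROCESS)]
    for t in threads:
        t.start()
    for i, url in enumerate(urls):
        q.put((url, "file-%d-%06d" % (os.getpid(), i)))
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()


def download_all(signed_urls):
    size = max(1, (len(signed_urls) + PROCESSES - 1) // PROCESSES)
    slices = [signed_urls[i:i + size]
              for i in range(0, len(signed_urls), size)]
    with multiprocessing.Pool(PROCESSES) as pool:
        pool.map(process_worker, slices)


if __name__ == "__main__":    # guard required for multiprocessing on some platforms
    download_all([])          # fill in with pre-signed URLs
```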

执手闯天涯 2024-08-01 09:36:41

You might consider using s3fs, and just running concurrent file system commands from Python.
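
A small sketch of that idea, assuming the bucket has already been mounted with the s3fs FUSE tool (for example: s3fs mybucket /mnt/s3); the mount point, key list, and output directory are placeholders:

```python
# Copy objects out of the s3fs mount with a pool of threads; each read from
# the mount point turns into an S3 GET behind the scenes.
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

MOUNT = "/mnt/s3"         # wherever the bucket is mounted
OUT = "local"


def copy_one(key):
    shutil.copyfile(os.path.join(MOUNT, key),
                    os.path.join(OUT, key.replace("/", "_")))


def download_all(keys, workers=32):
    os.makedirs(OUT, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_one, keys))
```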
