从 S3 存储桶下载 300 万个对象的最快方法

发布于 2024-10-12 15:59:19 字数 475 浏览 8 评论 0原文

我尝试过使用 Python + boto + 多处理、S3cmd 和 J3tset，但都在努力解决。

有什么建议，也许是您一直在使用的现成脚本或我不知道的其他方式？

编辑：

eventlet+boto 是一个有价值的解决方案，如下所述。在这里找到了一篇很好的 eventlet 参考文章 http://web.archive.org/web/20110520140439/http://teddziuba.com/2010/02/eventlet-asynchronous-io-for-g.html

我已经在下面添加了我现在正在使用的 python 脚本。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

木緿 2024-10-19 15:59:19

好的，我根据@Matt Billenstien 的提示找到了一个解决方案。它使用 eventlet 库。这里第一步是最重要的（标准 IO 库的猴子修补）。

使用 nohup 在后台运行此脚本，一切就完成了。

from eventlet import *
patcher.monkey_patch(all=True)

import os, sys, time
from boto.s3.connection import S3Connection
from boto.s3.bucket import Bucket

import logging

logging.basicConfig(filename="s3_download.log", level=logging.INFO)


def download_file(key_name):
    # Its imp to download the key from a new connection
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")
    key = bucket.get_key(key_name)

    try:
        res = key.get_contents_to_filename(key.name)
    except:
        logging.info(key.name+":"+"FAILED")

if __name__ == "__main__":
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")

    logging.info("Fetching bucket list")
    bucket_list = bucket.list(prefix="PREFIX")

    logging.info("Creating a pool")
    pool = GreenPool(size=20)

    logging.info("Saving files in bucket...")
    for key in bucket.list():
        pool.spawn_n(download_file, key.key)
    pool.waitall()

Okay, I figured out a solution based on @Matt Billenstien's hint. It uses eventlet library. The first step is most important here (monkey patching of standard IO libraries).

Run this script in the background with nohup and you're all set.

from eventlet import *
patcher.monkey_patch(all=True)

import os, sys, time
from boto.s3.connection import S3Connection
from boto.s3.bucket import Bucket

import logging

logging.basicConfig(filename="s3_download.log", level=logging.INFO)


def download_file(key_name):
    # Its imp to download the key from a new connection
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")
    key = bucket.get_key(key_name)

    try:
        res = key.get_contents_to_filename(key.name)
    except:
        logging.info(key.name+":"+"FAILED")

if __name__ == "__main__":
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")

    logging.info("Fetching bucket list")
    bucket_list = bucket.list(prefix="PREFIX")

    logging.info("Creating a pool")
    pool = GreenPool(size=20)

    logging.info("Saving files in bucket...")
    for key in bucket.list():
        pool.spawn_n(download_file, key.key)
    pool.waitall()

回复收藏 0 原文