Recursively move files from SFTP to S3, preserving the folder structure

Posted on 2025-01-11 01:05:23

I'm trying to recursively move files from an SFTP server to S3, possibly using boto3. I want to preserve the folder/file structure as well. I was looking to do it this way:

import pysftp

private_key = "/mnt/results/sftpkey"

srv = pysftp.Connection(host="server.com", username="user1", private_key=private_key)

srv.get_r("/mnt/folder", "./output_folder")

Then take those files and upload them to S3 using boto3. However, the folders and files on the server are numerous with deep levels and also large in size. So my machine ends up running out of memory and disk space. I was thinking of a script where I could download single files and upload single files and then delete and repeat.

I know this would take a long time to finish, but I can run this as a job without running out of space and not keep my machine open the entire time. Has anyone done something similar? Any help is appreciated!


Comments (2)

遗弃M 2025-01-18 01:05:23

If you can't (or don't want to) download all of the files at once before sending them to S3, then you need to download them one at a time.

From there, it follows that you'll need to build a list of files to download, then work through that list, transferring each file to your local machine and then sending it to S3.

A very simple version of this would look something like this:

import pysftp
import stat
import boto3
import os
import json

# S3 bucket and prefix to upload to
target_bucket = "example-bucket"
target_prefix = ""
# Root FTP folder to sync
base_path = "./"
# Both base_path and target_prefix should end in a "/"
# Or, for the prefix, be empty for the root of the bucket
srv = pysftp.Connection(
    host="server.com", 
    username="user1", 
    private_key="/mnt/results/sftpkey",
)

if os.path.isfile("all_files.json"):
    # No need to cache files more than once. This lets us restart 
    # on a failure, though really we should be caching files in 
    # something more robust than just a json file
    with open("all_files.json") as f:
        all_files = json.load(f)
else:
    # No local cache, go ahead and get the files
    print("Need to get list of files...")
    todo = [(base_path, target_prefix)]
    all_files = []

    while len(todo):
        cur_dir, cur_prefix = todo.pop(0)
        print("Listing " + cur_dir)
        for cur in srv.listdir_attr(cur_dir):
            if stat.S_ISDIR(cur.st_mode):
                # A directory, so walk into it
                todo.append((cur_dir + cur.filename + "/", cur_prefix + cur.filename + "/"))
            else:
                # A file, just add it to our cache
                all_files.append([cur_dir + cur.filename, cur_prefix + cur.filename])

    # Save the cache out to disk    
    with open("all_files.json", "w") as f:
        json.dump(all_files, f)

# And now, for every file in the cache, download it
# and turn around and upload it to S3
s3 = boto3.client('s3')
while len(all_files):
    ftp_file, s3_name = all_files.pop(0)

    print("Downloading " + ftp_file)
    srv.get(ftp_file, "_temp_")
    print("Uploading " + s3_name)
    s3.upload_file("_temp_", target_bucket, s3_name)

    # Clean up, and update the cache with one less file
    os.unlink("_temp_")
    with open("all_files.json", "w") as f:
        json.dump(all_files, f)

srv.close()

Error checking and speed improvements are obviously possible.
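
For example, one way to add basic error checking (a rough sketch that keeps the temp-file approach above; the transfer_with_retries helper and its retry counts are illustrative assumptions, not part of pysftp or boto3) would be to retry each file a few times with a backoff before giving up:

import os
import time

def transfer_with_retries(srv, s3, ftp_file, target_bucket, s3_name,
                          max_attempts=3, temp_name="_temp_"):
    # Download one file over SFTP, upload it to S3, and retry on failure
    for attempt in range(1, max_attempts + 1):
        try:
            srv.get(ftp_file, temp_name)
            s3.upload_file(temp_name, target_bucket, s3_name)
            return True
        except Exception as exc:  # real code should catch narrower exceptions
            print("Attempt %d failed for %s: %s" % (attempt, ftp_file, exc))
            time.sleep(2 ** attempt)  # simple exponential backoff
        finally:
            # Clean up the temp file whether the transfer succeeded or not
            if os.path.isfile(temp_name):
                os.unlink(temp_name)
    return False

The main loop could then call transfer_with_retries(srv, s3, ftp_file, target_bucket, s3_name) instead of the bare get/upload pair, and only drop the entry from the cache when it returns True.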

檐上三寸雪 2025-01-18 01:05:23

You have to do it file-by-file.

Start with the recursive download code here:
Python pysftp get_r from Linux works fine on Linux but not on Windows

After each sftp.get, do the S3 upload and then remove the local file.
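
A rough sketch of that idea, using pysftp's walktree callbacks rather than the linked get_r variant (the bucket name, key layout, and paths below are placeholder assumptions):

import os
import posixpath
import boto3
import pysftp

bucket = "example-bucket"        # placeholder bucket
remote_root = "/mnt/folder"      # SFTP folder to mirror
temp_name = "_temp_"

s3 = boto3.client("s3")
srv = pysftp.Connection(host="server.com", username="user1",
                        private_key="/mnt/results/sftpkey")

def handle_file(remote_path):
    # Build the S3 key from the path relative to the root, preserving structure
    key = posixpath.relpath(remote_path, remote_root)
    srv.get(remote_path, temp_name)
    s3.upload_file(temp_name, bucket, key)
    os.unlink(temp_name)

# Only regular files need a callback that does work; directories and
# unknown entries are ignored here
srv.walktree(remote_root, fcallback=handle_file,
             dcallback=lambda p: None, ucallback=lambda p: None)
srv.close()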

Actually, you can even copy the file from SFTP to S3 without storing it locally:
Transfer file from SFTP to S3 using Paramiko
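
A minimal sketch of that streaming approach (assuming a single placeholder file and bucket; pysftp's open() returns a Paramiko file-like object that boto3's upload_fileobj can read from directly, so nothing is written to local disk):

import boto3
import pysftp

s3 = boto3.client("s3")
srv = pysftp.Connection(host="server.com", username="user1",
                        private_key="/mnt/results/sftpkey")

remote_fh = srv.open("/mnt/folder/subdir/data.bin", "rb")  # placeholder path
try:
    # upload_fileobj reads the remote file in chunks and streams it to S3
    s3.upload_fileobj(remote_fh, "example-bucket", "subdir/data.bin")
finally:
    remote_fh.close()

srv.close()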
