Can I upload a file to S3 without a Content-Length header?

I'm working on a machine with limited memory, and I'd like to upload a dynamically generated (not-from-disk) file in a streaming manner to S3. In other words, I don't know the file size when I start the upload, but I'll know it by the end. Normally a PUT request has a Content-Length header, but perhaps there is a way around this, such as using multipart or chunked content-type.

S3 can support streaming uploads. For example, see here:

http://blog.odonnell.nu/posts/streaming-uploads-s3-python-and-poster/

My question is, can I accomplish the same thing without having to specify the file length at the start of the upload?

6 Answers

深陷 2024-12-30 08:59:57

You have to upload your file in 5MiB+ chunks via S3's multipart API. Each of those chunks requires a Content-Length but you can avoid loading huge amounts of data (100MiB+) into memory.

  • Initiate S3 Multipart Upload.
  • Gather data into a buffer until that buffer reaches S3's lower chunk-size limit (5MiB). Generate MD5 checksum while building up the buffer.
  • Upload that buffer as a Part, store the ETag (read the docs on that one).
  • Once you reach EOF of your data, upload the last chunk (which can be smaller than 5MiB).
  • Finalize the Multipart Upload.

S3 allows up to 10,000 parts. So by choosing a part-size of 5MiB you will be able to upload dynamic files of up to 50GiB. Should be enough for most use-cases.
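
As a rough illustration of the steps above, here is a minimal Python sketch using boto3's low-level multipart calls. boto3, the data_chunks generator and the bucket/key names are my own assumptions rather than part of the answer, and the MD5 bookkeeping from step 2 is omitted for brevity:

import boto3

PART_SIZE = 5 * 1024 * 1024  # S3's minimum part size (only the last part may be smaller)

def stream_to_s3(data_chunks, bucket, key):
    """Upload an iterable of byte chunks of unknown total length to S3."""
    s3 = boto3.client("s3")
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []            # ETag/PartNumber pairs needed to complete the upload
    buffer = bytearray()
    part_number = 1

    def flush(buf):
        # Upload one buffered part and remember its ETag.
        nonlocal part_number
        response = s3.upload_part(
            Bucket=bucket, Key=key, PartNumber=part_number,
            UploadId=upload_id, Body=bytes(buf),
        )
        parts.append({"ETag": response["ETag"], "PartNumber": part_number})
        part_number += 1

    try:
        for chunk in data_chunks:              # total size is not known up front
            buffer.extend(chunk)
            while len(buffer) >= PART_SIZE:    # ship full 5MiB parts as they fill up
                flush(buffer[:PART_SIZE])
                del buffer[:PART_SIZE]
        if buffer:                             # final part, allowed to be < 5MiB
            flush(buffer)
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so S3 does not keep storing (and billing for) the orphaned parts.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise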

However: If you need more, you have to increase your part-size. Either by using a higher part-size (10MiB for example) or by increasing it during the upload.

First 25 parts:   5MiB (total:  125MiB)
Next 25 parts:   10MiB (total:  375MiB)
Next 25 parts:   25MiB (total:    1GiB)
Next 25 parts:   50MiB (total: 2.25GiB)
After that:     100MiB

This will allow you to upload files of up to 1TB (S3's limit for a single file is 5TB right now) without wasting memory unnecessarily.
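
If you go with the escalating schedule from the table above, one hypothetical way to express it is a small helper that maps the 1-based S3 part number to a part size; the thresholds below simply mirror the table:

def part_size_for(part_number):
    """Return the part size in bytes for a given 1-based part number."""
    MiB = 1024 * 1024
    if part_number <= 25:
        return 5 * MiB      # first 25 parts:  125MiB total
    if part_number <= 50:
        return 10 * MiB     # next 25 parts:   375MiB total
    if part_number <= 75:
        return 25 * MiB     # next 25 parts:   ~1GiB total
    if part_number <= 100:
        return 50 * MiB     # next 25 parts:  2.25GiB total
    return 100 * MiB        # everything after that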


A note on your link to Sean O'Donnell's blog:

His problem is different from yours - he knows and uses the Content-Length before the upload. He wants to improve on this situation: Many libraries handle uploads by loading all data from a file into memory. In pseudo-code that would be something like this:

data = File.read(file_name)
request = new S3::PutFileRequest()
request.setHeader('Content-Length', data.size)
request.setBody(data)
request.send()

His solution does it by getting the Content-Length via the filesystem-API. He then streams the data from disk into the request-stream. In pseudo-code:

upload = new S3::PutFileRequestStream()
upload.writeHeader('Content-Length', File.getSize(file_name))
upload.flushHeader()

input = File.open(file_name, File::READONLY_FLAG)

while (data = input.read())
  upload.write(data)
end

upload.flush()
upload.close()
咿呀咿呀哟 2024-12-30 08:59:57

Putting this answer here for others in case it helps:

If you don't know the length of the data you are streaming up to S3, you can use S3FileInfo and its OpenWrite() method to write arbitrary data into S3.

var fileInfo = new S3FileInfo(amazonS3Client, "MyBucket", "streamed-file.txt");

using (var outputStream = fileInfo.OpenWrite())
{
    using (var streamWriter = new StreamWriter(outputStream))
    {
        streamWriter.WriteLine("Hello world");
        // You can do as many writes as you want here
    }
}
好多鱼好多余 2024-12-30 08:59:57

You can use the gof3r command-line tool to stream straight from Linux pipes:

$ tar -czf - <my_dir/> | gof3r put --bucket <s3_bucket> --key <s3_object>
£烟消云散 2024-12-30 08:59:57

If you are using Node.js you can use a plugin like s3-streaming-upload to accomplish this quite easily.

浅浅 2024-12-30 08:59:57

Read up on HTTP multipart entity requests; you can send a file to the target as chunks of data.

ゞ记忆︶ㄣ 2024-12-30 08:59:57

Reference: https://github.com/aws/aws-cli/pull/903

Here is a synopsis:
For uploading a stream from stdin to s3, use:
aws s3 cp - s3://my-bucket/stream

For downloading an s3 object as a stdout stream, use:
aws s3 cp s3://my-bucket/stream -

So for example, if I had the object s3://my-bucket/stream, I could run this command:
aws s3 cp s3://my-bucket/stream - | aws s3 cp - s3://my-bucket/new-stream

My command:

echo "ccc" | aws --endpoint-url=http://172.22.222.245:80 --no-verify-ssl s3 cp - s3://test-bucket/ccc
