How can I check whether two large files on Amazon S3 are identical?

I need to move large files (>5GB) on Amazon S3 with boto, from and to the same bucket. For this I need to use the multipart API, which does not use MD5 sums for its ETags.

While I think (well, only 98% sure) that my code is correct, I would like to verify that the new copy is not corrupted before deleting the original. However, I could not find any method other than downloading both objects and comparing them locally, which for 5GB+ files is quite a long process.

For the record, below is my code to copy a large file with boto; maybe this can help someone. If there is no good solution to my problem, maybe someone will find a bug and prevent me from corrupting data.

import boto

copy_size = 1000000000  # target part size in bytes (1e9, roughly 1 GB)
bucket_name = 'mybucket'
orig_key_name = 'ABigFile'
dest_key_name = 'ABigFile.clone'

s3 = boto.connect_s3()
mybucket = s3.get_bucket(bucket_name)

key = mybucket.get_key(orig_key_name)  # HEAD request; gives us key.size

# Start a multipart upload for the destination key; each part is filled below
# by a server-side copy of a byte range of the original object.
mp = mybucket.initiate_multipart_upload(dest_key_name)

print 'key size: ', key.size

count = 1   # S3 part numbers start at 1
start = 0
end = -1

# Copy the object in copy_size slices; start/end are inclusive byte offsets.
while end < key.size - 1:
    print 'count: ', count
    start = end + 1
    end = min(key.size - 1, start + copy_size)
    mp.copy_part_from_key(bucket_name, orig_key_name, count, start, end)
    count += 1

mp.complete_upload()

This code only works for original key sizes >= 5368709121 bytes.

2 Answers

凝望流年 2025-01-11 07:44:21

You should be able to compute a SHA-1 hash on a data stream (see this SO thread for C++ code, which could give hints for a Python approach). By redirecting your hashed data stream to the equivalent of /dev/null, you should be able to compare the SHA-1 hashes of the two files without first downloading them locally.
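
A minimal sketch of that idea with boto and hashlib, assuming (as in boto 2) that iterating a Key object streams the contents in buffered chunks: the bytes are hashed as they arrive and are never written to disk, although both objects still have to travel over the network once.

import hashlib
import boto

def stream_sha1(bucket, key_name):
    """SHA-1 of an S3 object, computed on the stream without storing it locally."""
    key = bucket.get_key(key_name)
    digest = hashlib.sha1()
    for chunk in key:  # assumed boto 2 behaviour: iteration yields buffered chunks
        digest.update(chunk)
    return digest.hexdigest()

s3 = boto.connect_s3()
mybucket = s3.get_bucket('mybucket')
if stream_sha1(mybucket, 'ABigFile') == stream_sha1(mybucket, 'ABigFile.clone'):
    print 'objects are identical'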

陌若浮生 2025-01-11 07:44:21

There is no way to do what you want without knowing how AWS calculates the etag on multipart uploads. If you have a local copy of the object, you can calculate the md5 of each part that you are copying on the local object and compare it to the etag in the key that each mp.copy_part_from_key() returns. Sounds like you have no local object though.
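
A sketch of that check for the case where a local copy does exist, assuming part boundaries matching the copy loop above (which copies inclusive byte ranges, i.e. copy_size + 1 bytes per full part). The expected_multipart_etag() helper is a hypothetical name and applies the commonly observed, but not officially documented, convention that a multipart ETag is the MD5 of the concatenated binary part digests followed by a dash and the part count:

import hashlib

def part_md5s(path, part_size=1000000000 + 1, chunk=64 * 1024 * 1024):
    """MD5 digest of each consecutive part of a local file."""
    digests = []
    with open(path, 'rb') as f:
        while True:
            md5 = hashlib.md5()
            remaining = part_size
            while remaining > 0:
                data = f.read(min(chunk, remaining))
                if not data:
                    break
                md5.update(data)
                remaining -= len(data)
            if remaining == part_size:  # nothing read: end of file
                break
            digests.append(md5)
    return digests

def expected_multipart_etag(digests):
    """md5(concatenation of the binary part digests) plus '-<number of parts>'."""
    combined = hashlib.md5(b''.join(d.digest() for d in digests))
    return '%s-%d' % (combined.hexdigest(), len(digests))

# Per-part check against the key returned by each server-side copy, e.g.:
#   part = mp.copy_part_from_key(bucket_name, orig_key_name, count, start, end)
#   assert part.etag.strip('"') == digests[count - 1].hexdigest()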

You also have a small, non-obvious problem hiding in boto that may or may not cause you to lose data in a very rare case. If you look at the boto source code, you'll notice that the mp.complete_upload() function doesn't actually use any of the etags that AWS returned for the parts during the upload. When you call multipart_complete, it performs a completely new part listing itself and gets a fresh list of parts and etags from S3. This is risky because of eventual consistency: that list may or may not be complete. Ideally, multipart_complete() would use the etags and part info returned by each remote copy, to be completely safe. This is what Amazon recommends in its documentation (see the note under Multipart Upload Listings).
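
A hedged sketch of the defensive variant this paragraph suggests: record the etag that each mp.copy_part_from_key() call returns (as noted above, it comes back on the returned key), then compare S3's own part listing against that record before completing the upload. Iterating the MultiPartUpload object to list its parts is assumed boto 2 behaviour here; the check reduces the chance that a truncated listing goes unnoticed.

# During the copy loop, record what S3 acknowledged for every part, e.g.:
#   part = mp.copy_part_from_key(bucket_name, orig_key_name, count, start, end)
#   acked_etags[count] = part.etag.strip('"')

def verify_part_listing(mp, acked_etags):
    """Raise if S3's part listing disagrees with the etags acknowledged during the copy."""
    listed = dict((p.part_number, p.etag.strip('"')) for p in mp)  # assumed: iterating mp lists its parts
    if listed != acked_etags:
        raise RuntimeError('part listing does not match the parts S3 acknowledged')

# verify_part_listing(mp, acked_etags)
# mp.complete_upload()   # only complete once the listing checks out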

That said, it's less likely to be a problem if you confirm that both objects have the same file size. The worst case, I believe, is that a part is missing from the multipart upload listing; a part that is listed should never be incorrect in itself.
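
For that size comparison, a quick sketch with boto; get_key() issues a HEAD request, so no object data is transferred. The ETags are printed too, but they are not generally expected to match, since the clone carries a multipart-style ETag:

import boto

s3 = boto.connect_s3()
mybucket = s3.get_bucket('mybucket')

orig = mybucket.get_key('ABigFile')
clone = mybucket.get_key('ABigFile.clone')

print 'sizes match: ', orig.size == clone.size
print 'etags: ', orig.etag, clone.etag  # usually differ: single-part vs multipart format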
