Dropbox claims that during syncing only the changed portions of a file are transmitted back to the main server. That is obviously a great feature, but how do they apply those changes to files stored in Amazon S3? For example, say a 30-page document on the user's desktop has changes only on page 4. Dropbox syncs the blocks representing those changes, but what happens on the backend if the files are stored in the cloud? Do they have to download the 30-page document from S3 to their own server, replace the blocks representing page 4, and then upload it back to the cloud? I doubt that is the case, because it would be rather inefficient. The other option I can think of is that Amazon S3 supports updating a stored file by byte range: for example, a PUT request to file X for bytes 100-200 that replaces those bytes with the body of the request. So I am curious how companies that build on cloud services such as Amazon implement this type of syncing.
Thanks
Comments (2)
Because S3 and similar storage services don't offer filesystem capabilities, anything that presents files and directories on top of them has to emulate a file system. In doing so, files are often split into pages of a fixed size, and each page is stored as a separate object in the storage. This way a change to one block requires uploading only one page (for example), not the whole file. Note that for files such as office documents this approach breaks down when the file size changes: if you insert a page at the beginning or delete a page, every subsequent page shifts, so the whole file counts as changed and has to be re-uploaded in full. We haven't analyzed how Dropbox in particular does its job; I have only described the common scenario. There are also various "patch algorithms", where a patch is created locally (if Dropbox has an older local copy in its cache) and then applied to one or more blocks on the server.
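To make the paging idea concrete, here is a minimal sketch in Python. It is not how Dropbox works (that isn't known here); the block size, the function names, and the choice of SHA-256 as a block fingerprint are all assumptions made for illustration. It compares the old and new versions page by page and reports which pages would need uploading.

```python
import hashlib
from typing import List

# Assumption: a tiny block size so the effect is visible on short inputs.
# Real systems would use pages of kilobytes or megabytes.
BLOCK_SIZE = 4


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut the file content into fixed-size pages (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Fingerprint every page so pages can be compared without their content."""
    return [hashlib.sha256(b).hexdigest() for b in split_blocks(data, block_size)]


def blocks_to_upload(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the indices of pages in `new` that differ from the page at the same position in `old`."""
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]


old = b"aaaabbbbccccdddd"
in_place = b"aaaaXbbbccccdddd"    # one byte overwritten, size unchanged
inserted = b"Xaaaabbbbccccdddd"   # one byte inserted at the start

print(blocks_to_upload(old, in_place))  # only page 1 changed
print(blocks_to_upload(old, inserted))  # every page shifted, so all pages changed
```

The second call shows the weakness described above: a single inserted byte shifts every page boundary, so a purely position-based comparison sees the whole file as changed.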
Several synchronization tools transfer deltas over the wire, such as rsync, rdiff, and rdiff-backup. For bi-directional synchronization with S3 there are paid services, for example s3rsync. For pure client-side synchronization, consider tools like zsync (which many people use to roll out app updates).
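As a rough illustration of what "transferring deltas" means, the sketch below builds a delta against a block signature of the old file and then rebuilds the new file from it. It is a deliberately simplified, non-rolling variant of the block-matching idea and is not the actual algorithm or wire format of rsync, rdiff, or zsync; the function names and the `("copy", index)` / `("data", bytes)` operations are made up for this example.

```python
import hashlib
from typing import Dict, List, Tuple, Union

Op = Tuple[str, Union[int, bytes]]


def signature(old: bytes, block_size: int) -> Dict[str, int]:
    """Map the hash of each full block of the old file to its block index."""
    sig: Dict[str, int] = {}
    for i in range(0, len(old) - block_size + 1, block_size):
        digest = hashlib.sha256(old[i:i + block_size]).hexdigest()
        sig.setdefault(digest, i // block_size)
    return sig


def make_delta(new: bytes, sig: Dict[str, int], block_size: int) -> List[Op]:
    """Describe `new` as copies of known old blocks plus literal bytes."""
    delta: List[Op] = []
    literal = bytearray()
    pos = 0
    while pos < len(new):
        chunk = new[pos:pos + block_size]
        idx = None
        if len(chunk) == block_size:
            # Hashing at every offset is slow; real tools use a rolling checksum here.
            idx = sig.get(hashlib.sha256(chunk).hexdigest())
        if idx is not None:
            if literal:
                delta.append(("data", bytes(literal)))
                literal.clear()
            delta.append(("copy", idx))
            pos += block_size
        else:
            literal.append(new[pos])
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: List[Op], block_size: int) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for op, value in delta:
        if op == "copy":
            out += old[value * block_size:(value + 1) * block_size]
        else:
            out += value
    return bytes(out)


old = b"aaaabbbbccccdddd"
new = b"Xaaaabbbbccccdddd"
delta = make_delta(new, signature(old, 4), 4)
print(delta)
print(apply_delta(old, delta, 4) == new)
```

Because matching is attempted at every byte offset and not only at block boundaries, an insertion at the start of the file costs a single literal byte in the delta, with the rest expressed as copies of blocks the other side already has.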
An alternative approach is to tarball the directory, generate a delta file (using rdiff or xdelta3), and upload the delta file with a timestamp as part of its key. To sync, all you need to do is perform these 2 checks client-side:
The concern here is the additional space used on the client side, which is at least 100%. On the other hand, this approach lets you revert changes if needed.
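One possible way to put a timestamp into the key is sketched below. The key layout, the `.delta` suffix, and both function names are assumptions for illustration only, and the sketch does not attempt to reproduce the client-side checks mentioned above. It relies on the fact that fixed-width UTC timestamps sort lexicographically in chronological order, so a client can find the deltas it has not applied yet with a plain string comparison.

```python
from datetime import datetime, timezone
from typing import List, Optional


def delta_key(prefix: str, when: datetime) -> str:
    """Build an object key such as 'backups/docs/20240102T030405Z.delta' (hypothetical layout)."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}/{stamp}.delta"


def pending_deltas(all_keys: List[str], last_applied: Optional[str]) -> List[str]:
    """Return the keys newer than the last applied one, oldest first."""
    ordered = sorted(all_keys)
    if last_applied is None:
        return ordered
    return [k for k in ordered if k > last_applied]


k1 = delta_key("backups/docs", datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
k2 = delta_key("backups/docs", datetime(2024, 1, 3, 0, 0, 0, tzinfo=timezone.utc))
k3 = delta_key("backups/docs", datetime(2024, 2, 1, 12, 0, 0, tzinfo=timezone.utc))

print(k1)
print(pending_deltas([k3, k1, k2], last_applied=k1))
```

Keeping every delta under its own key is also what makes reverting possible: the client can stop applying deltas at any earlier timestamp.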