Data structure for synchronizing a file tree
I'm in the process of writing an application that needs to synchronize a file structure between a client and an (HTTP) server.
The file structure is essentially a list of file paths, where each path is a string associated with one or more data-block IDs (256-bit references to the actual data blocks). A data block can be referenced by several files, so there's an n-m relation between paths and IDs. Right now it is just a list of paths with their IDs, but it can easily be converted to the tree structure that the paths represent, if that's necessary for the synchronization.
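To make the description above concrete, here is a minimal Python sketch of the flat list and its conversion to a tree. The names `FileIndex`, `to_tree` and `blocks_to_paths` are made up for illustration, and the short strings like `"b1"` stand in for real 256-bit block IDs.

```python
from typing import Dict, List

# Hypothetical flat representation: path -> ordered list of block IDs.
# Short strings are used here; real IDs would be 256-bit references.
FileIndex = Dict[str, List[str]]


def to_tree(index: FileIndex) -> dict:
    """Convert a flat path -> block-IDs mapping into a nested dict.

    Directories become nested dicts; files become lists of block IDs.
    """
    root: dict = {}
    for path, block_ids in index.items():
        parts = [p for p in path.split("/") if p]
        node = root
        for name in parts[:-1]:
            node = node.setdefault(name, {})
        node[parts[-1]] = list(block_ids)
    return root


def blocks_to_paths(index: FileIndex) -> Dict[str, List[str]]:
    """Invert the mapping to list the paths that reference each block."""
    result: Dict[str, List[str]] = {}
    for path, block_ids in index.items():
        for block_id in block_ids:
            paths = result.setdefault(block_id, [])
            if path not in paths:
                paths.append(path)
    return result


index = {
    "docs/a.txt": ["b1", "b2"],
    "docs/b.txt": ["b2"],
    "img/c.png": ["b3"],
}
tree = to_tree(index)
shared = blocks_to_paths(index)
```

Here block `"b2"` is referenced by two files, which is the n-m relation mentioned above.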
I'm looking for a data structure which allows me to sync this data efficiently. Mainly achieving two goals:
- A change in one file should not force the client to send the entire file structure to the server, only a small subset of it.
- If many files are changed, these changes should be grouped together, e.g. so that 1000 changes don't result in 1000 requests to the server.
As you can see, the goals conflict somewhat, so I'm looking for something that finds a good middle ground between them. The second goal can easily be achieved by grouping several changes into one HTTP request, but then the processing required by the server (to parse all the changes in that request) should be very inexpensive, computationally.
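As an illustration of the batching idea, the sketch below packs many changes into a single request body and applies them on the server with one dictionary operation per entry. The JSON layout, the function names, and the use of `None` to mark a deleted path are all assumptions of this example, not part of any existing protocol.

```python
import json
from typing import Dict, List, Optional


def build_batch(changes: Dict[str, Optional[List[str]]]) -> str:
    """Client side: serialize many path changes into one request body.

    A value of None marks a deleted path (an assumption of this sketch).
    """
    entries = [
        {"path": path, "blocks": blocks}
        for path, blocks in sorted(changes.items())
    ]
    return json.dumps({"changes": entries})


def apply_batch(index: Dict[str, List[str]], body: str) -> None:
    """Server side: one dict update or delete per entry, nothing more."""
    for entry in json.loads(body)["changes"]:
        if entry["blocks"] is None:
            index.pop(entry["path"], None)
        else:
            index[entry["path"]] = entry["blocks"]


server_index = {"a.txt": ["b1"], "old.txt": ["b9"]}
body = build_batch({
    "a.txt": ["b1", "b2"],   # modified file
    "old.txt": None,         # deleted file
    "new.txt": ["b3"],       # new file
})
apply_batch(server_index, body)
```

Only the changed paths travel over the wire, and the server's work grows with the number of changes rather than with the size of the whole structure.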
I should also mention that there could be several clients synchronizing the same structure on the server. It must therefore be easy to detect the changes made by one client and then synchronize them to another client (i.e. it's not just an upload to the server).
I'm certainly not the first one doing something like this, so I assume there are some smart solutions available. For instance, I guess both Dropbox and Subversion have similar requirements when they sync their meta-data. Does anyone happen to know how they have implemented it?
Comments (2)
Any reason not to use rsync? If you need to programmatically control it, there is librsync.
The Subversion source code is open, so you could check that. Also, I know that Mercurial has a pretty smart wire protocol for minimizing traffic.
I've decided to solve this using a transaction log. Each client saves all changes to the tree to a transaction log (in addition to the local DB of the tree, which it also keeps) and periodically syncs that log with the server. The log is just a list of entries, each with a file -> data-block IDs mapping and a timestamp.
When the log has been sent to the server, it is removed from the client. Before uploading its log, the client also asks for the logs written by other clients to the same tree. These logs are then merged into the local tree.
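A rough sketch of how such a merge could look is given below. The answer does not say how conflicting entries are resolved, so this example assumes "newest timestamp wins per path"; the `client_id` field, the empty tuple as a deletion marker, and the `applied` bookkeeping dict are likewise assumptions. Timestamps from different clients are only comparable if their clocks roughly agree.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class LogEntry:
    path: str
    block_ids: Tuple[str, ...]  # empty tuple marks a deletion (assumption)
    timestamp: float            # set by the client that wrote the entry
    client_id: str              # hypothetical field, only used to break ties


def merge_logs(tree: Dict[str, Tuple[str, ...]],
               applied: Dict[str, Tuple[float, str]],
               entries: Iterable[LogEntry]) -> None:
    """Merge log entries into the local tree; newest timestamp wins per path.

    `applied` remembers the (timestamp, client_id) of the last entry applied
    to each path, so older entries that arrive late are ignored.
    """
    for e in sorted(entries, key=lambda e: (e.timestamp, e.client_id)):
        stamp = (e.timestamp, e.client_id)
        if e.path in applied and applied[e.path] >= stamp:
            continue  # the local tree already holds a newer change
        applied[e.path] = stamp
        if e.block_ids:
            tree[e.path] = e.block_ids
        else:
            tree.pop(e.path, None)


tree = {"a.txt": ("b1",)}
applied = {"a.txt": (100.0, "c1")}
remote = [
    LogEntry("a.txt", ("b2",), 90.0, "c2"),   # older than the local change
    LogEntry("b.txt", ("b3",), 120.0, "c2"),  # created ...
    LogEntry("b.txt", (), 130.0, "c2"),       # ... and deleted again
    LogEntry("c.txt", ("b4",), 110.0, "c2"),  # new file
]
merge_logs(tree, applied, remote)
```

After the merge, `a.txt` keeps its local blocks, `b.txt` is absent, and `c.txt` has been added.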
The log itself will be stored on the server using Azure Blob Storage. The server can periodically remove old entries from the log (if it grows too big).
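One possible way to trim the log without losing state is to keep only the newest entry for each path. This is just one interpretation of "remove old entries"; the `compact` function and the tuple layout below are hypothetical and mirror the log entries described in this answer (path, block IDs, timestamp).

```python
from typing import Dict, List, Tuple

# Each entry is (path, block_ids, timestamp).
Entry = Tuple[str, Tuple[str, ...], float]


def compact(log: List[Entry]) -> List[Entry]:
    """Keep only the newest entry for each path, ordered by timestamp."""
    latest: Dict[str, Entry] = {}
    for entry in log:
        path, _, ts = entry
        if path not in latest or ts >= latest[path][2]:
            latest[path] = entry
    return sorted(latest.values(), key=lambda e: e[2])


log = [
    ("a.txt", ("b1",), 1.0),
    ("b.txt", ("b2",), 2.0),
    ("a.txt", ("b3",), 3.0),  # supersedes the first entry
]
compacted = compact(log)
```

A client that replays the compacted log ends up with the same tree as one that replays the full log, as long as entries are applied in timestamp order.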
This way the clients can efficiently communicate their changes to each other, while the server doesn't have to do any expensive processing on each request.