An algorithm for efficiently copying files over a network
I know there are several programs out there that will sync files over the network. None of them do what I have been thinking of. Let me explain what I want to achieve...
In my network, several computers share the same files. For example, the QuickBooks file is accessed by several computers, and it is a large file. The Outlook PST files are large as well. Every night we back up the files that have changed over the network. I think it does not make sense to copy a whole 1 GB file if it had only a minor modification, so I want to come up with an algorithm that compares parts of files.
For example, let's say that the Outlook PST file consists of the bytes:
1, 2, 3, 4, 5, 6, 7, 8, 9
If I receive an email, the bytes will now be, for example:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Now, instead of sending the whole file, it would be easier to send just byte 10.
In reality the file has many more bytes than that, so I will compute a checksum for every megabyte of the file, and my table should look like:
aaa1, aaa2, aaa3, abf8, etc...
If, after receiving an email, the PST file has the table:
aaa1, aaa2, aaa3, 7a8b, etc...
then I know that the first 3 megabytes are the same, and I should send just one megabyte instead of the entire file...
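The per-block checksum table described above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the post: SHA-256 stands in for the unspecified checksum, the helper names `block_digests` and `changed_blocks` are hypothetical, and a 4-byte block replaces the 1 MB block so the nine-byte example fits.

```python
import hashlib

def block_digests(data: bytes, block_size: int) -> list:
    """Return one hex digest per fixed-size block (hypothetical helper)."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

def changed_blocks(old_table: list, new_table: list) -> list:
    """Indices of blocks in the new table that are new or differ from the old table."""
    return [
        i for i, digest in enumerate(new_table)
        if i >= len(old_table) or old_table[i] != digest
    ]

# The post's example: bytes 1..9, then byte 10 appended at the end.
old = bytes(range(1, 10))
new = old + bytes([10])

# With 4-byte blocks only the last block differs, so only it needs sending.
print(changed_blocks(block_digests(old, 4), block_digests(new, 4)))  # [2]
```

Only the blocks whose index is returned would have to cross the network; the receiver keeps the rest of its existing copy.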
I think this algorithm will work great if content is added towards the end of the file, but in reality a byte may be inserted at the beginning of the file, and then my algorithm is not going to work. For example, if one byte is added at the beginning of the file, every block shifts by one byte and all the checksums will change...
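The failure case is easy to reproduce on a toy input. The sketch below again assumes SHA-256 and a 4-byte block purely for illustration; the `digests` helper is hypothetical.

```python
import hashlib

BLOCK = 4  # toy block size; the post proposes 1 MB

def digests(data: bytes, block: int = BLOCK) -> list:
    """One digest per fixed-size block, aligned to the start of the file."""
    return [
        hashlib.sha256(data[i:i + block]).hexdigest()
        for i in range(0, len(data), block)
    ]

old = bytes(range(1, 10))   # 1, 2, ..., 9
shifted = b"\x00" + old     # one byte inserted at the very beginning

# Count the block positions whose digest is unchanged after the insertion.
matching = sum(a == b for a, b in zip(digests(old), digests(shifted)))
print(matching)  # 0
```

Every block boundary now falls one byte later in the data, so no aligned block of the new file has the same content as the block at the same position in the old file.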
How can I make the algorithm more efficient? It would be nice if I could send only the changed parts of the file instead of the whole file.
Comments (2)
The rsync protocol will efficiently synchronise large files with small differences. It is much cleverer than the scheme you envisage, so you should either read Tridgell and Mackerras's write-up before embarking on your own solution or just use rsync. There's a free Windows wrapper here.
You may want to look into rolling checksums and the algorithm rsync uses.
Basically, you compute a hash on each chunk as you describe above, but you also compute a rolling checksum. The rolling checksum can be updated cheaply as the window slides one byte at a time, which lets you efficiently detect that, for example, one byte was inserted at the start of the file.
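A minimal sketch of the idea, not rsync's actual implementation: the weak checksum below follows the two-sum, Adler-32-style form commonly described for rsync, SHA-256 is an assumed stand-in for the strong hash, and the names `weak_sum`, `roll`, and `find_matches` are hypothetical. Unlike rsync, it checks every offset instead of skipping ahead after a match.

```python
import hashlib

M = 1 << 16  # modulus for both running sums

def weak_sum(block: bytes) -> tuple:
    """Weak checksum (a, b): plain sum and position-weighted sum of the bytes."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple:
    """Slide an n-byte window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

def find_matches(old: bytes, new: bytes, n: int) -> dict:
    """Map each offset in `new` where a whole block of `old` reappears to that block's index."""
    table = {}
    for start in range(0, len(old) - n + 1, n):
        block = old[start:start + n]
        strong = hashlib.sha256(block).digest()
        table.setdefault(weak_sum(block), []).append((strong, start // n))
    matches = {}
    if len(new) < n:
        return matches
    a, b = weak_sum(new[:n])
    pos = 0
    while True:
        # The cheap weak checksum filters candidates; the strong hash confirms them.
        for strong, index in table.get((a, b), []):
            if hashlib.sha256(new[pos:pos + n]).digest() == strong:
                matches[pos] = index
                break
        if pos + n >= len(new):
            break
        a, b = roll(a, b, new[pos], new[pos + n], n)
        pos += 1
    return matches

old = bytes(range(1, 13))   # three 4-byte blocks
new = b"\x00" + old         # one byte inserted at the start
print(find_matches(old, new, 4))  # {1: 0, 5: 1, 9: 2}
```

Because the window can start at any byte offset, all three old blocks are still found after the insertion, so only the single new byte would need to be transferred.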