I saw it in a presentation a few weeks back, tried to implement it, failed and forgot about it. But now I wanna know how it works =)
It's a way of efficiently transferring/storing data. It would work in any language. This is what (I think) it does:
You have 1 very big file (eg entire javascript collection of a website).
- Split it into blocks of 48 bytes
- Hash every block of 48 bytes (eg. MD5)
- Split the list of blocks on hashes that end with 0x00
- The big blocks (each spanning >= 1 hash) should now have different sizes. Some very big, some very small.
- Glue the blocks between those hashes (again: very different sizes of actual data)
- Hash those blocks
- Now you have a list of hashes that represent the current version of the big file
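A minimal sketch of those steps (function names and constants are mine, not from the talk; MD5 and the 0x00-suffix rule are as described above):

```python
import hashlib

WINDOW = 48  # fixed block size from the description above

def chunk_file(data: bytes) -> list[bytes]:
    """Split data into variable-size chunks: hash each 48-byte block,
    and cut the stream after any block whose MD5 ends in 0x00."""
    chunks = []
    current = bytearray()
    for i in range(0, len(data), WINDOW):
        block = data[i:i + WINDOW]
        current.extend(block)
        # A boundary falls wherever the block's hash ends in 0x00
        # (on average every 256 blocks, so roughly 12 KB per chunk).
        if hashlib.md5(block).digest()[-1] == 0x00:
            chunks.append(bytes(current))
            current = bytearray()
    if current:
        chunks.append(bytes(current))
    return chunks

def chunk_ids(chunks: list[bytes]) -> list[str]:
    """Hash each glued chunk; this list represents the file version."""
    return [hashlib.md5(c).hexdigest() for c in chunks]
```

Concatenating the chunks always gives back the original file; the list returned by `chunk_ids` is the "list of hashes" the last step talks about.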
The idea is that when a piece of code changes in the big file, only 1 or 2 hashes change. With the new file, you do all the steps above and only upload/download the parts (blocks, identifiable by their hashes) that have actually changed. Depending on how much code was changed and on the size of the blocks surrounding that code, you'll never need to download more than 4 blocks. (Instead of the whole file.) The other end of the communication would then replace the original blocks (same algorithm, same functionality) with the new blocks.
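Given the two hash lists, the transfer step is essentially a set difference; a sketch with hypothetical helper names (not from the Silo paper):

```python
def chunks_to_send(old_ids: list[str], new_ids: list[str]) -> list[str]:
    """Chunk hashes in the new version that the other side doesn't have yet."""
    have = set(old_ids)
    return [h for h in new_ids if h not in have]

def rebuild(new_ids: list[str], store: dict[str, bytes]) -> bytes:
    """Receiver side: reassemble the new file from its local chunkstore
    (the store maps chunk hash -> chunk bytes)."""
    return b"".join(store[h] for h in new_ids)
```

So the sender ships only the missing chunks plus the new hash list, and the receiver splices them in around the chunks it already has.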
Sound familiar? They mentioned a name, but I couldn't find anything on it. When I tried to build it, it just didn't work, because if you don't change exactly 48 bytes [1], ALL the hashes after that change [2] are different...
If someone knows the name: great. If someone could explain it too: perfect!
UPDATE
I found the presentation it was in. It was mentioned (and is used) in a new product "Silo": http://research.microsoft.com/apps/pubs/default.aspx?id=131524 Related: http://channel9.msdn.com/Events/MIX/MIX11/RES04 (So it actually was Microsoft research! Neat!)
From the first link:
A Silo-enabled page uses this local storage as an LBFS-style chunkstore.
In the second link (a video), the good stuff starts at 6:30. Now I've seen it twice... I still don't get it =)
Keywords are Delta encoding and Rabin fingerprints.
Comments (3)
This sounds ... kind of ... like how remote differential compression works;
PDF http://research.microsoft.com/apps/pubs/default.aspx?id=64692
You can solve the "changes which aren't a multiple of the block size" problem using rolling hashes. This is what rsync uses to transfer only changed parts of a file.
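What makes the rolling hash practical is that sliding the window one byte is an O(1) update rather than a full rehash. A sketch of an rsync-style weak checksum (simplified, not rsync's exact constants):

```python
M = 1 << 16  # modulus for the two 16-bit sums

def weak_hash(block: bytes) -> int:
    """Weak rolling checksum: a plain byte sum and a position-weighted sum,
    packed into one integer."""
    a = sum(block) % M
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % M
    return (b << 16) | a

def roll(h: int, out: int, inp: int, size: int) -> int:
    """Slide the window one byte: drop byte `out`, append byte `inp`, O(1)."""
    a, b = h & 0xFFFF, h >> 16
    a = (a - out + inp) % M
    b = (b - size * out + a) % M
    return (b << 16) | a
```

Rolling across a buffer with `roll` produces the same value at every offset as recomputing `weak_hash` from scratch, which is exactly what lets rsync (and Rabin-style chunkers) test a boundary condition at every byte position cheaply.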
That sounds very much like shingling.