这个哈希/缓存/版本控制算法的名称是什么?

发布于 2024-11-03 22:56:33 字数 1293 浏览 1 评论 0 原文

几周前我在一次演示中看到了它,尝试实施它,但失败了并忘记了它。但现在我想知道它是如何工作的 =)

这是一种有效传输/存储数据的方式。它适用于任何语言。这就是(我认为)它的作用:

您有 1 个非常大的文件(例如网站的整个 javascript 集合)。

  1. 将其拆分为 48 字节的块
  2. 对每个 48 字节的块进行哈希处理(例如 MD5)
  3. 将以 0x00 结尾的哈希值上的块列表拆分
  4. 大块(>= 1 哈希值)现在应该具有不同的大小。有些很大,有些很小。
  5. 将这些块粘合在这些散列之间(同样:实际数据大小非常不同)
  6. 对这些块进行散列
  7. 现在您有了代表大文件当前版本的散列列表

这个想法是,当大文件中的一段代码发生更改时,只有 1 或 2 个哈希值发生变化。使用新文件,您可以执行上述所有步骤,并且仅上传/下载实际已更改的部分(块,可通过其哈希识别)。根据更改的代码量以及该代码周围的块的大小,您永远不需要下载超过 4 个块。 (而不是整个文件。)通信的另一端然后将用新块替换原始块(相同的算法,相同的功能)。

听起来很熟悉吗?他们提到了一个名字,但找不到任何内容。当我尝试构建它时,它不起作用,因为如果您不完全更改 48 字节 [1],则更改后的所有哈希值 [2] 都会不同......

如果有人知道这个名字:那就太好了。如果有人也能解释一下:完美!

更新
我找到了它所在的演示文稿。它在新产品“Silo”中被提及(并使用):http://research.microsoft.com/apps/pubs/default.aspx?id=131524 相关:http://channel9.msdn.com/Events/MIX/MIX11/RES04 (所以它实际上是 Microsoft 研究!整洁!)

从第一个链接:

启用筒仓的页面使用此本地 存储为 LBFS 风格的 chunkstore。

在第二个链接(视频)中,精彩内容从 6:30 开始。现在我已经看过两次了...我还是不明白=)

关键字是Delta编码Rabin指纹

I saw it in a presentation a few weeks back, tried to implement it, failed and forgot about it. But now I wanna know how it works =)

It's a way of efficiently transfering/storing data. It would work in any language. This is what (I think) it does:

You have 1 very big file (eg entire javascript collection of a website).

  1. Split it in blocks of 48 bytes
  2. Hash every block of 48 bytes (eg. MD5)
  3. Split the list of blocks on hashes that end with 0x00
  4. The big blocks (>= 1 hash) should now be different sizes. Some very big, some very small.
  5. Glue the blocks between those hashes (again: very different sizes of actual data)
  6. Hash those blocks
  7. Now you have a list of hashes that represent the current version of the big file

The idea is that when a piece of code changes in the big file, only 1 or 2 hashes change. With the new file, you do all those above steps and you only upload/download the parts (blocks, identifieable by its hash) that have actyally changed. Depending on how much code was changed and on the size of the blocks surrounding that code, you'll never need to download more than 4 blocks. (Instead of the whole file.) The other end of the communication would then replace the original blocks (same algorithm, same functionality) with the new blocks.

Sound familiar? They mentioned a name, but couldn't find anything on it. When I tried to build it, it just didn't work, because if you don't change exactly 48 bytes [1], ALL the hashes after that change [2] are different...

If someonw knows the name: great. If someone could explain it also: perfect!

UPDATE
I found the presentation it was in. It was mentioned (and is used) in a new product "Silo": http://research.microsoft.com/apps/pubs/default.aspx?id=131524 Related: http://channel9.msdn.com/Events/MIX/MIX11/RES04 (So it actually was Microsoft research! Neat!)

From the first link:

A Silo-enabled page uses this local
storage as an LBFS-style chunkstore.

In the second link (a video), the good stuff starts at 6:30. Now I've seen it twice... I still don't get it =)

Keywords are Delta encoding and Rabin fingerprints.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

月亮是我掰弯的 2024-11-10 22:56:33

这听起来……有点……就像远程差分压缩的工作原理;

在低带宽文件系统中
(LBFS)[24],使用RDC协议
优化之间的沟通
发送者和接收者
双方将所有的
文件分成块并计算能力强
每个的校验和或签名
块。当客户端需要访问时
或从服务器复制文件,
后者首先传输列表
该文件的签名
客户端,它决定了它的哪一个
旧的块可以用来重建
新文件,并请求
缺少块。这的关键
协议是文件是
在客户端上独立划分
和服务器,通过确定块
数据特征的边界。

PDF http://research.microsoft.com/apps/pubs/default。 aspx?id=64692

This sounds ... kind of ... like how remote differential compression works;

In the Low Bandwidth File System
(LBFS) [24], an RDC protocol is used
to optimize the communication between
a sender and a recipient by having
both sides subdivide all of their
files into chunks and compute strong
checksums, or signatures, for each
chunk. When a client needs to access
or copy a file from the server, the
latter first transmits the list of
signatures for that file to the
client, which determines which of its
old chunks may be used to reconstruct
the new file, and requests the
missing chunks. The key to this
protocol is that the files are
divided independently on the client
and server, by determining chunk
boundaries from data features.

PDF http://research.microsoft.com/apps/pubs/default.aspx?id=64692

等往事风中吹 2024-11-10 22:56:33

您可以使用滚动哈希解决“不是块大小倍数的更改”问题。这就是 rsync 用于仅传输文件已更改部分的方法。

You can solve the "changes which aren't a multiple of the block size" problem using rolling hashes. This is what rsync uses to transfer only changed parts of a file.

蝶…霜飞 2024-11-10 22:56:33

这听起来很像 shingling

That sounds very much like shingling.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文