分布式文件压缩

发布于 2024-08-15 22:08:28 字数 1371 浏览 8 评论 0原文

我一直在考虑数据冗余，只是想在继续这样做之前将所有内容都以书面形式扔掉（并且进一步检查这个想法是否已经付诸实践）。

好吧，就这样吧。

互联网上充满了冗余数据，包括文本、图像、视频等。因此，人们在通过 HTTP 进行 gzip 和 bzip2 即时压缩和解压缩方面投入了大量精力。像 Google 和 Facebook 这样的大型网站有整个团队致力于提高页面加载速度。

我的“问题”涉及这样一个事实：压缩仅在每个文件的基础上完成（gzip file.txt 产生file.txt.gz ）。毫无疑问，散落在互联网上的看似无关的数据之间存在许多共同点。如果您可以存储这些常见块并在客户端或服务器端组合它们以动态生成内容，会怎么样？

为了能够做到这一点，人们必须在互联网上找到最常见的数据“块”。这些块可以是任何大小（这里可能有一个最佳选择），并且组合起来需要能够表达任何可以想象的数据。

出于说明目的，假设我们有以下 5 个公共数据块 - a、b、c、d 和 e。我们有两个文件仅包含这些块。我们有名为 chunk 和 combine 的程序。 chunk 获取数据，通过 bzip2、gzip 或其他压缩算法对其进行压缩，并输出包含所述数据的块（压缩后）。 combine 扩展块并解压缩连接的结果。它们的使用方式如下：

$ cat gettysburg.txt
"Four score and seven years ago...cont'd"
$ cat test.txt
"This is a test"
$ chunk gettysburg.txt test.txt
$ cat gettysburg.txt.ck
abdbdeabcbdbe
$ cat test.txt.ck
abdeacccde
$ combine gettysburg.txt.ck test.txt.ck
$ cat gettysburg.txt
"Four score and seven years ago...cont'd"
$ cat test.txt
"This is a test"

例如，当通过 HTTP 发送文件时，服务器可以将数据分块并将其发送到客户端，然后客户端能够组合 > 分块数据并渲染它。

以前有人尝试过吗？如果不是，我想知道为什么，如果是，请发布您如何使这项工作成功。一个好的第一步是详细说明如何找出这些块是什么。一旦我们弄清楚了如何获取块，那么我们就会弄清楚chunk和combine这两个程序如何工作。

我可能会对此给予赏金（取决于接待情况），因为我认为这是一个非常有趣的问题，具有现实世界的影响。

原文

I've been doing some thinking about data redundancy, and just wanted to throw everything out in writing before I kept going along with this (and furthermore to double check whether or not this idea has already been put into practice).

Alright, so here goes.

The internet is filled with redundant data, including text, images, videos, etc. A lot of effort has gone into gzip and bzip2 on-the-fly compression and decompression over HTTP as a result. Large sites like Google and Facebook have entire teams that devote their time to making their pages load more quickly.

My 'question' relates to the fact that compression is done solely on a per file basis (gzip file.txt yields file.txt.gz). Without a doubt there are many commonalities between seemingly unrelated data scattered around the Internet. What if you could store these common chunks and combine them, either client-side or server-side, to dynamically generate content?

To be able to do this, one would have to find the most common 'chunks' of data on the Internet. These chunks could be any size (there's probably an optimal choice here) and, in combination, would need to be capable of expressing any data imaginable.

For illustrative purposes, let's say we have the following 5 chunks of common data - a, b, c, d, and e. We have two files that only contain these chunks. We have programs called chunk and combine. chunk takes data, compresses it through bzip2, gzip, or some other compression algorithm, and outputs the chunks that comprise said data (after compression). combine expands the chunks and decompresses the concatenated result. Here's how they might be used:

$ cat gettysburg.txt
"Four score and seven years ago...cont'd"
$ cat test.txt
"This is a test"
$ chunk gettysburg.txt test.txt
$ cat gettysburg.txt.ck
abdbdeabcbdbe
$ cat test.txt.ck
abdeacccde
$ combine gettysburg.txt.ck test.txt.ck
$ cat gettysburg.txt
"Four score and seven years ago...cont'd"
$ cat test.txt
"This is a test"

When sending a file through HTTP, for instance, the server could chunk the data and send it to the client, who then has the capability to combine the chunked data and render it.

Has anyone attempted this before? If not I would like to know why, and if so, please post how you might make this work. A nice first step would be to detail how you might figure out what these chunks are. Once we've figured out how to get the chunks, then we figure out how these two programs, chunk and combine, might work.

I'll probably put a bounty on this (depending upon reception) because I think this is a very interesting problem with real-world implications.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

三生路 2024-08-22 22:08:28

您问是否有人以前做过类似的事情以及块大小应该是多少，我想我应该向您指出我想到的两篇论文：

（Google 的一个团队正在尝试加快速度）通过利用文档之间共享的数据来发出 Web 请求。服务器将预先计算的字典传送给客户端，其中包含文档之间通用的数据并在以后的请求中引用。这一次仅适用于单个域，并且目前仅适用于 Google Chrome：共享词典HTTP 压缩
（微软的一个团队）在他们的工作中确定使用远程差分压缩在有限带宽网络上优化文件复制对于文件系统同步的情况，块大小约为2KiB 效果很好。他们使用一定程度的间接性，以便重新创建文件所需的块列表本身被分成多个块——这篇论文读起来很有趣，并且可能会给你关于如何完成事情的新想法。

不确定它是否对您有帮助，但这里是以防万一。 :-)

回复收藏 0 原文

把时间冻结 2024-08-22 22:08:28

您实际上不必分析最常见的块 - 事实上，这种分布式决策可能真的非常困难。是这样的：

让我们以 HTTP 数据传输为例。将每个文件分成 10MiB 块（或您关心的任何大小，我确信每种方式都会对性能产生影响）并计算它们的 SHA-256（或您相当确定应该不会发生冲突的一些哈希值）

例如，您有文件 F1，其中包含块 B1..Bn 和校验和 C1..Cn。现在，HTTP 服务器可以简单地使用列表 C1..Cn 来响应对文件 F1 的请求。

为了使其真正有用，客户端必须保留已知块的注册表 - 如果校验和已经存在，只需在本地获取该块。完毕。如果未知，可以从本地缓存中获取它，也可以从刚刚获取校验和列表的远程 HTTP 服务器中获取块。

如果您从恰好共享一个块的任何服务器（甚至是完全不同的服务器）下载另一个文件，则您已经下载了该文件，并且它与您选择的哈希算法一样安全。

现在，这并没有解决存在偏移的情况（例如，一个文件是一个文件

AAAAAAAA

，另一个文件

BAAAAAAAA

是压缩算法可能可以处理的。但也许如果您压缩块本身，您会发现无论如何您都获得了大部分节省...

想法？

You don't really have to analyze it for the most common chunks - in fact, such distributed decision making could really be quite hard. How's something like this:

Let's take the case of HTTP data transfer. Chunk each file into 10MiB blocks (or whatever size you care to, I'm sure that there are performance implications each way) and compute their SHA-256 (or some hash which you are fairly sure should be safe against collisions)

For example, you have file F1 with blocks B1..Bn and checksums C1..Cn. Now, the HTTP server can respond to a request for file F1 with simply the list C1..Cn

To make this actually useful, the client has to keep a registry of known blocks - if the checksum is already there, just fetch the block locally. Done. If it's not known, either grab it from a local cache or just fetch the blocks from the remote HTTP server you just got the checksum list from.

If you ever download another file from any server (even a totally different one) which happens to share a block, you already have it downloaded and it's as secure as the hash algorithm you chose.

Now this doesn't address the case where there are offsets (e.g. one file is

AAAAAAAA

and the other

BAAAAAAAA

which a compression algorithm probably could deal with. But maybe if you compressed the blocks themselves, you'd find that you get most of the savings anyway...

Thoughts?

回复收藏 0 原文