Distributed file compression
I've been doing some thinking about data redundancy, and just wanted to throw everything out in writing before I kept going along with this (and furthermore to double check whether or not this idea has already been put into practice).
Alright, so here goes.
The internet is filled with redundant data, including text, images, videos, etc. A lot of effort has gone into gzip and bzip2 on-the-fly compression and decompression over HTTP as a result. Large sites like Google and Facebook have entire teams that devote their time to making their pages load more quickly.
My 'question' relates to the fact that compression is done solely on a per-file basis (gzip file.txt yields file.txt.gz). Without a doubt there are many commonalities between seemingly unrelated data scattered around the Internet. What if you could store these common chunks and combine them, either client-side or server-side, to dynamically generate content?
To be able to do this, one would have to find the most common 'chunks' of data on the Internet. These chunks could be any size (there's probably an optimal choice here) and, in combination, would need to be capable of expressing any data imaginable.
For illustrative purposes, let's say we have the following 5 chunks of common data - a, b, c, d, and e. We have two files that only contain these chunks. We have programs called chunk and combine. chunk takes data, compresses it through bzip2, gzip, or some other compression algorithm, and outputs the chunks that comprise said data (after compression). combine expands the chunks and decompresses the concatenated result. Here's how they might be used:
$ cat gettysburg.txt
"Four score and seven years ago...cont'd"
$ cat test.txt
"This is a test"
$ chunk gettysburg.txt test.txt
$ cat gettysburg.txt.ck
abdbdeabcbdbe
$ cat test.txt.ck
abdeacccde
$ combine gettysburg.txt.ck test.txt.ck
$ cat gettysburg.txt
"Four score and seven years ago...cont'd"
$ cat test.txt
"This is a test"
When sending a file through HTTP, for instance, the server could chunk the data and send it to the client, who then has the capability to combine the chunked data and render it.
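
To make this concrete, here is a minimal Python sketch of what chunk and combine could look like if the dictionary of common chunks were already agreed upon. The dictionary contents and the greedy matching are assumptions for illustration only, and the compress-before-chunking step described above is omitted for brevity.

# Hypothetical, pre-agreed dictionary of common chunks (purely illustrative;
# a real dictionary would be discovered from real traffic and be far larger).
CHUNKS = {
    "a": b"Four score and seven years ",
    "b": b"ago",
    "c": b"This is ",
    "d": b"a test",
    "e": b"...cont'd",
}

def chunk(data: bytes) -> str:
    """Greedily cover the data with known chunks and return their ids as a string."""
    ids, i = [], 0
    while i < len(data):
        for cid, blob in CHUNKS.items():
            if data.startswith(blob, i):
                ids.append(cid)
                i += len(blob)
                break
        else:
            raise ValueError("data contains bytes not covered by the dictionary")
    return "".join(ids)

def combine(ids: str) -> bytes:
    """Expand a chunk-id string back into the original bytes."""
    return b"".join(CHUNKS[c] for c in ids)

original = b"Four score and seven years ago...cont'd"
encoded = chunk(original)            # "abe" with this toy dictionary
assert combine(encoded) == original

The hard part the question points at is, of course, not this bookkeeping but discovering a dictionary that covers arbitrary real-world data well.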
Has anyone attempted this before? If not, I would like to know why, and if so, please post how you might make this work. A nice first step would be to detail how you might figure out what these chunks are. Once we've figured out how to get the chunks, then we figure out how these two programs, chunk and combine, might work.
I'll probably put a bounty on this (depending upon reception) because I think this is a very interesting problem with real-world implications.
Comments (4)
You asked if someone had done something similar before and what the chunk size ought to be, and I thought I'd point you to the two papers that came to my mind:
(A team at) Google is trying to speed up web requests by exploiting data that is shared between documents. The server communicates a pre-computed dictionary to the client, which contains data that is common between documents and is referenced on later requests. This only works for a single domain at a time, and -- currently -- only with Google Chrome: Shared Dictionary Compression Over HTTP
(A team at) Microsoft determined in their work Optimizing File Replication over Limited-Bandwidth Networks using Remote Differential Compression that for their case of filesystem synchronization a chunk size of about 2KiB works well. They use a level of indirection, so that the list of chunks needed to recreate a file is itself split into chunks -- the paper is fascinating to read, and might give you new ideas about how things might be done.
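
As a flavour of how variable-size boundaries can be chosen from the data itself (the general family of techniques the RDC paper builds on; this is a generic rolling-hash chunker, not the paper's exact cut-point rule), here is a small Python sketch. The window size, mask, and minimum chunk size are arbitrary choices that happen to target roughly 2 KiB chunks.

import hashlib
import os

WINDOW = 48                             # bytes in the rolling window (arbitrary)
MASK = (1 << 11) - 1                    # cut when the low 11 bits are zero: ~2 KiB average chunks
PRIME = 1000003
POW = pow(PRIME, WINDOW - 1, 1 << 32)   # weight of the byte leaving the window

def chunk_boundaries(data: bytes, min_size: int = 256):
    """Yield (start, end) offsets of content-defined chunks of data."""
    start, h = 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Drop the byte sliding out of the window.
            h = (h - data[i - WINDOW] * POW) & 0xFFFFFFFF
        h = (h * PRIME + b) & 0xFFFFFFFF
        # Declare a boundary when the hash hits the chosen pattern.
        if (h & MASK) == 0 and (i + 1 - start) >= min_size:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

data = os.urandom(256 * 1024)
chunks = list(chunk_boundaries(data))
digests = [hashlib.sha256(data[s:e]).hexdigest() for s, e in chunks]
print(len(chunks), "chunks,", len(set(digests)), "distinct, average size", len(data) // len(chunks))

Because the boundaries depend only on the bytes near them, inserting or deleting data early in a file only disturbs the chunks around the edit, which is what makes schemes like this robust to offsets.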
Not sure if it helps you, but here it is in case it does. :-)
You don't really have to analyze it for the most common chunks - in fact, such distributed decision making could really be quite hard. How about something like this:
Let's take the case of HTTP data transfer. Chunk each file into 10MiB blocks (or whatever size you care to; I'm sure there are performance implications either way) and compute their SHA-256 (or some hash which you are fairly sure is safe against collisions).
For example, you have file F1 with blocks B1..Bn and checksums C1..Cn. Now, the HTTP server can respond to a request for file F1 with simply the list C1..Cn.
To make this actually useful, the client has to keep a registry of known blocks - if the checksum is already there, just fetch the block locally. Done. If it's not known, either grab it from a local cache or just fetch the blocks from the remote HTTP server you just got the checksum list from.
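
A rough Python sketch of that flow (the 10 MiB block size and SHA-256 follow the description above; the fetch_missing callback stands in for whatever HTTP request would retrieve an unknown block, and is an assumption of this sketch):

import hashlib

BLOCK_SIZE = 10 * 1024 * 1024   # 10 MiB blocks, as above

def make_manifest(data: bytes):
    """Server side: split a file into fixed-size blocks and list their checksums."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [hashlib.sha256(b).hexdigest() for b in blocks], blocks

class BlockRegistry:
    """Client side: blocks already seen, keyed by checksum (in memory here; on disk in practice)."""
    def __init__(self):
        self.blocks = {}

    def store(self, block: bytes):
        self.blocks[hashlib.sha256(block).hexdigest()] = block

    def reassemble(self, manifest, fetch_missing):
        """Rebuild a file from its checksum list, fetching only blocks we don't already have."""
        out = []
        for checksum in manifest:
            if checksum not in self.blocks:
                self.store(fetch_missing(checksum))
            out.append(self.blocks[checksum])
        return b"".join(out)

# Toy round trip: the "server" publishes a manifest, the "client" rebuilds the file.
data = b"x" * (25 * 1024 * 1024)
manifest, server_blocks = make_manifest(data)
by_sum = {hashlib.sha256(b).hexdigest(): b for b in server_blocks}
registry = BlockRegistry()
assert registry.reassemble(manifest, fetch_missing=lambda c: by_sum[c]) == data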
If you ever download another file from any server (even a totally different one) which happens to share a block, you already have it downloaded and it's as secure as the hash algorithm you chose.
Now this doesn't address the case where there are offsets (e.g. one file is ... and the other ...), which a compression algorithm probably could deal with. But maybe if you compressed the blocks themselves, you'd find that you get most of the savings anyway...
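
To see why offsets hurt a fixed-block scheme like the one above, here is a quick illustration (1 KiB blocks chosen just to keep the demo small):

import hashlib
import os

def block_hashes(data: bytes, size: int = 1024):
    """Checksums of fixed-size blocks of the data."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

payload = os.urandom(4096)
shifted = b"\x00" + payload          # same content, offset by one byte
# With fixed block boundaries the two versions share (almost certainly) no blocks at all.
print(block_hashes(payload) & block_hashes(shifted))   # set()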
Thoughts?
There is an easier way to deal with textual data. Currently, we store text as streams of letters which represent sounds. However, the unit of language is the word, not the sound. Therefore, if we have a dictionary of all the words and then store "pointers" to those words in files, we can dynamically re-constitute the text by following the pointers and looking up the word list.
This should reduce the size of things by a factor of 3 or 4 right away. In this method, words are the same as the chunks you have in mind. The next step is common word groups such as "this is", "I am", "full moon", "seriously dude", "oh baby", etc.
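
A toy Python sketch of the word-pointer idea (it glosses over punctuation, case, and whitespace, and the 3-4x figure above is the answer's estimate rather than something this demonstrates):

def build_dictionary(corpus):
    """Collect every distinct word in the corpus into an indexed list."""
    words = sorted({w for text in corpus for w in text.split()})
    return words, {w: i for i, w in enumerate(words)}

def encode(text, index):
    """Replace each word with its dictionary index (the 'pointer')."""
    return [index[w] for w in text.split()]

def decode(pointers, words):
    """Reconstitute the text by looking the pointers up in the word list."""
    return " ".join(words[p] for p in pointers)

corpus = ["this is a test", "this is another test"]
words, index = build_dictionary(corpus)
assert decode(encode(corpus[0], index), words) == corpus[0]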
A word list also helps with spell checking and should be implemented by the operating system. Is there any reason why spell checkers are not part of the operating system?
Not exactly related to your question, but you already see this in practice. Microsoft (and others) already provide edge networks to host the jQuery libraries. You can refer to these same URIs and get the benefit of the user having accessed the file from a different site and their browser having cached it.
However, how much content do you refer to that someone else has referred to in the past 20 minutes (an arbitrary number)? You might see some benefit at a large company where lots of employees are sharing an application, but otherwise I think you'd have a hard time DETERMINING the chunks you want, and that difficulty would outweigh any benefit of sharing them.