仅使用校验和传输文件?
是否可以仅使用校验和系统来传输大文件,然后通过计算重建原始文件?
假设您传输文件的 MD5 校验和以及文件的大小。通过创建“虚拟文件”并计算其校验和,尝试每一个位组合,您最终应该“到达”原始文件。但在途中你也会遇到很多“冲突”,其中校验和也匹配。
因此,我们将原始文件的第一个字节更改为某个指定值,再次计算校验和,然后将其发送。如果我们在虚拟文件中进行相同的替换,我们可以测试每个“碰撞”以查看它是否仍然匹配。这应该会缩小范围,我们可以这样做几次。
当然,执行此操作所需的计算能力将是巨大的。但理论上可能吗?传输某些内容(比如 1mb)需要多少个校验和?或者传输校验和所需的数据量可能几乎与文件一样大,从而使其毫无意义?
Would it be possible to transfer large files using only a system of checksums, and then reconstruct the original file by calculations?
Say that you transfer the MD5 checksum of a file and the size of the file. By making a "virtual file" and calculating it's checksum, trying every single bit combination, you should eventually "reach" the original file. But on the way you would also get a lot of "collisions" where the checksum also match.
So we change the first byte of the original file to some specified value, calculate the checksum again, and send this too. If we make the same substitution in the virtual file we can test each "collision" to see if it still matches. This should narrow it down a bit, and we can do this several times.
Of course, the computing power to do this would be enormous. But is it theoretically possible, and how many checksums would you need to transfer something (say 1mb)? Or would perhaps the amount of data needed to transfer the checksums almost as large as the file, making it pointless?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
您需要传输的数据量肯定与文件的大小相同。考虑一下:如果您可以使用
n-1
字节数据传输n
字节文件,则意味着您拥有256^(n-1) 您可能已发送的数据的可能模式,但正在从大小为
256^n
的空间中进行选择。这意味着每 256 个文件中就有一个无法使用此方法来表达 - 这通常称为 皮德贡霍尔原理。现在,即使这不是问题,也不能保证在任何给定数量的校验和之后不会发生冲突。校验和算法旨在避免冲突,但对于大多数校验和/哈希算法来说,没有强有力的证据表明经过 X 次哈希后可以保证 N 字节空间中不会发生冲突。
最后,散列算法至少被设计为难以逆转,因此即使有可能,也需要耗费巨大的 CPU 能力才能做到这一点。
也就是说,对于类似的方法,您可能有兴趣阅读前向纠错代码 - 它们根本不是哈希算法,但我认为您可能会发现它们很有趣。
The amount of data you need to transfer would most certainly be the same size as the file. Consider: If you could communicate a
n
byte file withn-1
bytes of data, that means you've got256^(n-1)
possible patterns of data you may have sent, but are selecting from a space of size256^n
. This means that one out of every 256 files won't be expressible using this method - this is often referred to as the pidegonhole principle.Now, even if that wasn't a problem, there's no guarentee that you won't have a collision after any given amount of checksumming. Checksum algorithms are designed to avoid collisions, but for most checksum/hash algorithms there's no strong proof that after X hashes you can guarantee no collisions in a N-byte space.
Finally, hash algorithms, at least, are designed to be hard to reverse, so even if it were possible it would take an impossible huge amount of CPU power to do so.
That said, for a similar approach, you might be interested in reading about Forward Error Correction codes - they're not at all hash algorithms, but I think you may find them interesting.
这里有一个信息问题。校验和对于特定的数据集不一定是唯一的,事实上,因此它实际上需要有许多位信息作为源。它可以表明接收到的数据并不是生成校验和的确切数据,但在大多数情况下它无法证明这一点。
What you have here is a problem of information. A checksum is not necessarily unique to a particular set of data, in fact to be so it would effectively need to have a many bits of information as the source. What it can indicate is that the data received is not the exact data that the checksum was generated from but in most cases it can't prove it.
简而言之“不”。
举一个假设的例子,考虑一张具有 6 个像素的 24 bpp 照片 - 这六个像素上的每个颜色通道有 2^(24 * 6) (2^144) 种可能的强度组合,因此您可以保证,如果您如果要评估每种可能性,则可以保证 MD5 冲突(因为 MD5 是 128 位数字)。
In short "no".
To take a hypothetical example, consider a 24 bpp photo with 6 pixels -- there are 2^(24 * 6) (2^144) possible combinations of intensities for each colour channel on those six pixels, so you can gaurantee that if you were to evaluate every possibility, you are guaranteed an MD5 collision (as MD5 is a 128 bit number).
简短回答:不以任何有意义的形式。
长答案:
让我们假设一个大小为 1000 字节的任意文件
file.bin
。有2^(8*1000)
不同的组合可能是其实际内容。通过发送例如 1000 位校验和,您仍然有大约
2^(7*1000)
相互冲突的替代方案。通过发送一个额外的位,您可能可以将这些数量减少一半......并且您仍然有
2^6999
冲突。当您消除冲突时,您将发送至少 8000 位,即等于或大于文件大小的数量。理论上可行的唯一方法(注意:我没有说“可行”,更不用说“实用”)是文件并不真正包含随机数据,并且您可以使用这些知识来修剪替代方案。在这种情况下,你最好还是使用压缩。内容感知压缩算法(例如用于音频的FLAC)使用有关音频属性的先验知识输入数据以提高压缩比。
Short answer: Not in any meaningfull form.
Long answer:
Let us assume an arbitrary file
file.bin
with a 1000-byte size. There are2^(8*1000)
different combinations that could be its actual contents. By sending e.g. a 1000-bit checksum,you still have about
2^(7*1000)
colliding alternatives.By sending a single additional bit, you might be able cut those down by half... and you still have
2^6999
collisions. By the time you eliminate the colisions, you will have sent at least 8000 bits i.e. an amount equal or greater to the file size.The only way for this to be theoretically possible (Note: I did not say "feasible", let alone "practical") would be if the file did not really contain random data and you could use that knowledge to prune alternatives. In that case you'd be better off using compression, ayway. Content-aware compression algorithms (e.g. FLAC for audio) use a-priori knowledge on the properties of the input data to improve the compression ratio.
我认为你想到的其实是一个有趣的话题,但你没有找到正确的方法。如果我可以尝试重新表述您的问题,那么您是在问是否有一种方法可以将函数应用于某些数据,传输函数的结果,然后从简洁的函数结果重建原始数据。对于单个 MD5 校验和,答案是否定的,但对于其他函数,只要您愿意发送多个函数结果,这是可能的。一般来说,这个研究领域被称为压缩感知。有时精确重建是可能的,但更常见的是,它被用作图像和其他视觉或声音数据的有损压缩方案。
I think what you are thinking of is in fact an interesting topic, but you haven't hit upon the right method. If I can try and rephrase your question, you are asking if there is a way to apply a function to some data, transmit the result of the function, and then reconstruct the original data from the terser function result. For a single MD5 checksum the answer is no, but with other functions, provided you are willingly to send several function results, it is possible. In general this area of research is called compressed sensing. Sometimes exact reconstruction is possible, but more often it is used as a lossy compression scheme for images and other visual or sound data.