"Incompressible" data sequence
I would like to generate an "incompressible" data sequence of X MB algorithmically. The goal is a program that measures network speed over a VPN connection without the VPN's built-in compression skewing the result.
Can anybody help me? Thanks!
PS. I need an algorithm. I have been using a file compressed to the point that it cannot be compressed any further, but now I need to generate the data sequence from scratch programmatically.
Comments (8)
White noise data is truly random and thus incompressible.
Therefore, you should find an algorithm that generates it (or an approximation).
Try this in Linux:
You might try any kind of random number generation though...
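The Linux command referred to above is not shown here. As a rough, portable stand-in, the sketch below asks the operating system for random bytes through Python's `os.urandom` and writes them to a file in chunks. The function name `write_random_file`, its parameters, and the 1 MiB chunk size are illustrative assumptions, not part of the original answer.

```python
import os

def write_random_file(path, size_mb, chunk_size=1024 * 1024):
    """Write size_mb MiB of OS-supplied random bytes to path (hypothetical helper)."""
    remaining = size_mb * 1024 * 1024
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(os.urandom(n))  # bytes from the kernel's random source
            remaining -= n
    return path
```

Calling `write_random_file("noise.bin", 100)` would produce a 100 MiB file that general-purpose compressors should not be able to shrink.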
One simple approach to creating statistically hard-to-compress data is just to use a random number generator. If you need it to be repeatable, fix the seed. Any reasonably good random number generator will do. Ironically, the result is incredibly compressible if you know the random number generator: the only information present is the seed. However, it will defeat any real compression method.
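A minimal sketch of the fixed-seed idea, assuming Python's standard `random` module (a Mersenne Twister) counts as "reasonably good" for this purpose. The helper name `seeded_noise` and the default seed are made up for illustration.

```python
import random
import zlib

def seeded_noise(size_bytes, seed=12345):
    """Return size_bytes (> 0) of pseudo-random bytes, repeatable for a given seed."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return rng.getrandbits(8 * size_bytes).to_bytes(size_bytes, "little")

if __name__ == "__main__":
    data = seeded_noise(1024 * 1024)
    # Compare the raw size with the zlib-compressed size.
    print(len(data), len(zlib.compress(data, 9)))
```

Because the seed fully determines the output, sender and receiver can regenerate the same stream independently, which is handy for verifying a transfer.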
Other answers have pointed out that random noise is incompressible, and good encryption functions have output that is as close as possible to random noise (unless you know the decryption key). So a good approach could be to just use random number generators or encryption algorithms to generate your incompressible data.
Genuinely incompressible (by any compression algorithm) bitstrings exist (for certain formal definitions of "incompressible"), but even recognising them is computationally undecidable, let alone generating them.
It's worth pointing out, though, that "random data" is only incompressible in the sense that no compression algorithm can achieve a compression ratio better than 1:1 on average over all possible random data. For any particular randomly generated string, however, there may be a particular compression algorithm that does achieve a good compression ratio. After all, any compressible string is a possible output of a random generator, including silly things like all zeroes, however unlikely.
So while the possibility of getting "compressible" data out of a random number generator or an encryption algorithm is probably vanishingly small, I would want to actually test the data before I use it. If you have access to the compression algorithm(s) used in the VPN connection that would be best; just randomly generate data until you get something that won't compress. Otherwise, just running it through a few common compression tools and checking that the size doesn't decrease would probably be sufficient.
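The generate-then-test loop could be sketched as follows. This assumes `zlib`, `bz2` and `lzma` from the Python standard library are acceptable stand-ins for "a few common compression tools"; the VPN's actual compressor is unknown here. The function names and the retry limit are hypothetical.

```python
import bz2
import lzma
import os
import zlib

def smallest_compressed_size(data):
    """Smallest output size across three common compressors."""
    return min(
        len(zlib.compress(data, 9)),
        len(bz2.compress(data, 9)),
        len(lzma.compress(data)),
    )

def is_incompressible(data):
    """True if none of the compressors makes the data smaller."""
    return smallest_compressed_size(data) >= len(data)

def generate_checked(size_bytes, max_attempts=10):
    """Draw random candidates until one passes the compression check."""
    for _ in range(max_attempts):
        candidate = os.urandom(size_bytes)
        if is_incompressible(candidate):
            return candidate
    raise RuntimeError("no incompressible candidate found")
```

In practice the first candidate should almost always pass, so the loop is mostly a safety net.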
You have a couple of options:
1. Use a decent pseudo-random number generator
2. Use an encryption function like AES (implementations found everywhere)
Algo
If done correctly, the data stream you generate will be computationally indistinguishable from random noise.
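The Python standard library has no AES, so the sketch below swaps in SHA-256 run in a counter-style construction: each output block is the hash of a fixed key plus an incrementing counter. This is a stand-in chosen for illustration, not the algorithm from the answer, and the name `keystream` and the 8-byte counter width are assumptions.

```python
import hashlib

def keystream(key: bytes, size_bytes: int) -> bytes:
    """Concatenate SHA-256(key || counter) blocks until size_bytes are available."""
    out = bytearray()
    counter = 0
    while len(out) < size_bytes:
        # Each counter value yields a fresh 32-byte block.
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:size_bytes])
```

With a real AES implementation, the same shape applies: encrypt successive counter values under a fixed key and emit the ciphertext blocks.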
The following program (C/POSIX) produces incompressible data quickly; it should be in the gigabytes-per-second range. I'm sure it's possible to use the general idea to make it even faster (maybe using DJB's ChaCha core with SIMD?).
A very simple solution is to generate a random string and then compress it.
An already compressed file is incompressible.
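A small experiment along these lines, assuming `zlib` as the compressor and a random alphanumeric string as the input (both choices are illustrative). It compresses the text once, then measures how much a second pass gains. A ratio close to or above 1.0 means the second pass found nothing left to remove.

```python
import random
import string
import zlib

def compressed_random_text(n_chars, seed=0):
    """Compress a random alphanumeric string once with zlib."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    text = "".join(rng.choices(alphabet, k=n_chars)).encode("ascii")
    return zlib.compress(text, 9)

def second_pass_ratio(blob):
    """Size after compressing again, divided by the size before."""
    return len(zlib.compress(blob, 9)) / len(blob)
```

It is still worth checking the ratio for the input you actually use, since the output of a compressor is not guaranteed to be free of structure for every input.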
For copy-paste lovers, here is some C# code to generate files with (almost) incompressible content. The heart of the code is the MD5 hashing algorithm, but any cryptographic hash algorithm with well-distributed output (SHA-1, SHA-256, etc.) does the job.
It simply uses the bytes of the file number (a 32-bit little-endian signed integer on my machine) as the hash function's initial input, then rehashes and concatenates the output until the desired file size is reached. The file content is therefore deterministic (the same number always generates the same output) yet looks like randomly distributed "junk" to the compression algorithm under test.
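The C# listing itself is not included here. A minimal Python sketch of the scheme as described, assuming `hashlib.md5` and a 4-byte little-endian signed encoding of the file number, might look like this (the name `hash_chain_bytes` is hypothetical):

```python
import hashlib

def hash_chain_bytes(file_number: int, size_bytes: int) -> bytes:
    """Repeatedly MD5-hash the previous digest and concatenate the results."""
    block = file_number.to_bytes(4, "little", signed=True)  # initial input
    out = bytearray()
    while len(out) < size_bytes:
        block = hashlib.md5(block).digest()  # next 16 bytes of the chain
        out += block
    return bytes(out[:size_bytes])
```

Writing `hash_chain_bytes(7, 10 * 1024 * 1024)` to disk would give a 10 MiB file whose content depends only on the number 7.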
I just created a (very simple and unoptimized) C# console application that creates incompressible files.
It scans a folder for text files (extension .txt) and, for each one, creates a binary file (extension .bin) with the same name and size.
Hope this helps someone.
Here is the C# code:
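The C# listing is not reproduced in this copy. As a stand-in, here is a rough Python sketch of the behaviour described, assuming `os.urandom` as the source of incompressible bytes (the original program's method is not known). The function name `create_bin_twins` is made up for illustration.

```python
import os
from pathlib import Path

def create_bin_twins(folder):
    """For each .txt file in folder, write a same-sized .bin file of random bytes."""
    created = []
    for txt in sorted(Path(folder).glob("*.txt")):
        if not txt.is_file():
            continue
        target = txt.with_suffix(".bin")  # same name, .bin extension
        # Reads the whole size in one call; chunk this for very large files.
        target.write_bytes(os.urandom(txt.stat().st_size))
        created.append(target)
    return created
```

Running `create_bin_twins(".")` in a folder of text files leaves the originals untouched and adds one random `.bin` file next to each.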