Note: I'm not trying to use SuperFastHash expecting it to give the same output values as CRC32.
I'm writing a simple LZSS compression/decompression routine that provides very fast decompression with no memory overhead during decompression. The input data is split into 4096-byte blocks, which are compressed sequentially.
My problem: I want to add some error detection to each compressed block (block size <= 4096 bytes). The time constraints are tight, so the checksum routine must be very fast. I ruled out cryptographic hash algorithms (MD5, SHA-1) because they involve a lot of computation, and chose CRC32 as a simpler, obvious candidate.
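For concreteness, each compressed block would carry its checksum in a small header, something like the sketch below (a hypothetical layout; the field names are illustrative only, not my actual format):

#include <stdint.h>

/* Hypothetical per-block framing: a small header prepended to each
   compressed block; the checksum covers the compressed payload. */
typedef struct
{
    uint16_t compressed_size;   /* payload length, <= 4096 */
    uint32_t checksum;          /* CRC32 (or hash) of the compressed payload */
} block_header_t;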
After some testing, I found CRC32 too slow for my project's constraints. I used enwik9 (a 10^9-byte text dump of Wikipedia) from here. I compressed it with my LZSS routine and obtained a 570 MB file. I measured the following durations (single-threaded, disk IO excluded, all data loaded into memory before processing, average of 10 trials):
| Operation | Time (GCC4.4.5/Linux) | Time (MSVC2010/Win7) |
|-------------------------------+--------------------------+------------------------|
| Decompression | 6.8 seconds | 6.95 seconds |
| CRC32 on decompressed result | 4.9 seconds | 4.62 seconds |
| CRC32 on compressed result | 2.8 seconds | 2.69 seconds |
Then, out of curiosity, I tested SuperFastHash:
| Operation | Time (GCC4.4.5/Linux) | Time (MSVC2010/Win7) |
|-------------------------------+--------------------------+------------------------|
| SFH on decompressed result | 1.1 seconds | 1.33 seconds |
| SFH on compressed result | 0.7 seconds | 0.75 seconds |
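(For reference, a minimal sketch of the kind of timing harness behind such numbers, assuming POSIX clock_gettime; this is not the exact code used for the measurements above:)

#include <stdint.h>
#include <time.h>

/* Time one pass of a checksum function over an in-memory buffer;
   returns elapsed wall-clock seconds (CLOCK_MONOTONIC). */
static double time_one_pass(uint32_t (*fn)(const uint8_t *, size_t),
                            const uint8_t *buf, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile uint32_t sink = fn(buf, len);  /* keep the call from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Average over several trials: single thread, data already in memory. */
static double average_time(uint32_t (*fn)(const uint8_t *, size_t),
                           const uint8_t *buf, size_t len, int trials)
{
    double total = 0.0;
    for (int i = 0; i < trials; i++)
        total += time_one_pass(fn, buf, len);
    return total / trials;
}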
And here is my CRC32 implementation (I followed the description in this document: http://www.ross.net/crc/download/crc_v3.txt):
#include <stdint.h>

// CRC32 lookup table (reflected form of the polynomial 0x04C11DB7, i.e. 0xEDB88320)
static const uint32_t crc32_lookup_table[256] =
{
    0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
    0x076DC419, 0x706AF48F, 0xE963A535, 0x9E6495A3,
    0x0EDB8832, 0x79DCB8A4, 0xE0D5E91E, 0x97D2D988,
    // many lines skipped
    // ...
    0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D
};
uint32_t crc32_hash(const uint8_t *data, size_t len)
{
    uint32_t crc32_register = 0xFFFFFFFF;
    while (len--)
    {
        crc32_register = (crc32_register >> 8)
                       ^ crc32_lookup_table[(crc32_register & 0x000000FF) ^ *data++];
    }
    return crc32_register ^ 0xFFFFFFFF;
}
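(Regarding the skipped lines: the full 256-entry table does not need to be written out by hand; it can be generated once at startup from 0xEDB88320, the bit-reversed form of the polynomial 0x04C11DB7:)

/* Generate the reflected CRC32 lookup table at startup instead of
   hard-coding all 256 entries. */
static void crc32_init_table(uint32_t table[256])
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320 : (c >> 1);
        table[i] = c;
    }
}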
My question is: can I use a hash instead of a cyclic redundancy check to perform error detection on compressed data blocks? As far as I know (and as I remember from my electronics courses), CRC algorithms are designed to be very effective when errors occur in bursts, as happens when data is transmitted over a noisy channel, which is not the typical failure mode for data read from a hard drive. Please correct me if I'm wrong.
Thanks for any advice!
Comments (3)
SuperFastHash has been found to have some problems, as has its fellow fast hash function MurmurHash2. If you're looking for something tuned for larger data blocks with low collisions, you can try the 128-bit variants of Google's CityHash (http://code.google.com/p/cityhash/) or MurmurHash3. There are also some more off-the-wall ones like Crap8 and CrapWow, which claim to provide almost perfect bit avalanching and funneling and thus almost zero collisions; you can read up on them and other non-cryptographic hash functions here: http://www.team5150.com/~andrew/noncryptohashzoo/
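Whichever function you choose, the per-block check has the same shape; here is a minimal sketch where hash32_fn is a stand-in for SuperFastHash, MurmurHash3's 32-bit variant, or any similar function:

#include <stdint.h>
#include <stddef.h>

/* Stand-in signature: any 32-bit non-cryptographic hash fits here. */
typedef uint32_t (*hash32_fn)(const uint8_t *data, size_t len);

/* On compression: store the hash of each compressed block next to it.
   On decompression: recompute and compare before trusting the block. */
static int block_is_intact(hash32_fn hash,
                           const uint8_t *block, size_t len,
                           uint32_t stored_digest)
{
    return hash(block, len) == stored_digest;
}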
Since your problem is not about security, you can use "broken" cryptographic hash functions, which are not secure against a sentient attacker but are still very good at detecting transmission errors. I am thinking of MD4, which has been measured to be faster than CRC32 on some platforms. You may also want to check out RadioGatún and Panama; see this library for open-source implementations in C and Java of various cryptographic hash functions.
If your target architecture is a recent/big enough x86 CPU with the AES-NI instructions, you could build a devilishly fast and very good checksum by simply computing a CBC-MAC with the AES block cipher and a conventional key (e.g. an all-zero key); since this is not for security, you could even use fewer rounds than standard AES (e.g. 5 rounds instead of the standard 10).
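A rough sketch of that idea using the AES-NI intrinsics (compile with -maes). The round "keys" below are arbitrary constants rather than a real AES key schedule, which is acceptable since this is error detection, not security:

#include <stdint.h>
#include <string.h>
#include <wmmintrin.h>  /* AES-NI intrinsics */

/* Reduced-round AES CBC-MAC used as a fast checksum (NOT a secure MAC).
   Five AES rounds per 16-byte block, fixed arbitrary round constants. */
static __m128i aes_block_checksum(const uint8_t *data, size_t len)
{
    const __m128i rk0 = _mm_set_epi32((int)0x243F6A88, (int)0x85A308D3, (int)0x13198A2E, (int)0x03707344);
    const __m128i rk1 = _mm_set_epi32((int)0xA4093822, (int)0x299F31D0, (int)0x082EFA98, (int)0xEC4E6C89);
    const __m128i rk2 = _mm_set_epi32((int)0x452821E6, (int)0x38D01377, (int)0xBE5466CF, (int)0x34E90C6C);
    const __m128i rk3 = _mm_set_epi32((int)0xC0AC29B7, (int)0xC97C50DD, (int)0x3F84D5B5, (int)0xB5470917);
    const __m128i rk4 = _mm_set_epi32((int)0x9216D5D9, (int)0x8979FB1B, (int)0xD1310BA6, (int)0x98DFB5AC);
    __m128i mac = _mm_setzero_si128();

    while (len > 0)
    {
        uint8_t last[16] = { 0 };
        __m128i m;
        if (len >= 16)
        {
            m = _mm_loadu_si128((const __m128i *)data);
            data += 16;
            len -= 16;
        }
        else
        {
            memcpy(last, data, len);       /* zero-padded final block */
            m = _mm_loadu_si128((const __m128i *)last);
            len = 0;
        }
        mac = _mm_xor_si128(mac, m);       /* CBC chaining */
        mac = _mm_aesenc_si128(mac, rk0);
        mac = _mm_aesenc_si128(mac, rk1);
        mac = _mm_aesenc_si128(mac, rk2);
        mac = _mm_aesenc_si128(mac, rk3);
        mac = _mm_aesenclast_si128(mac, rk4);
    }
    return mac;
}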
Hashes are designed so that even a very small alteration of the input produces a large change in the output.
I think SuperFastHash has this property. It might be somewhat more vulnerable to collisions (since it seems to have been less analyzed by the community), but that shouldn't prevent the use you have in mind.
Good luck :)