Note: I'm not trying to use SuperFastHash expecting it to give the same output values as CRC32.
I'm writing a simple LZSS compression/decompression routine that provides very fast decompression with no memory overhead during decompression. The input data is split into 4096-byte blocks, which are compressed sequentially.
My problem: I want to add some error detection to each compressed block (block size <= 4096 bytes). The time constraints are tight, so the checksum routine must be very fast. I ruled out cryptographic hash algorithms (MD5, SHA-1) because they involve a lot of computation, and chose CRC32 as a simpler, obvious candidate.
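For concreteness, each compressed block would carry its checksum in a small header, something like the sketch below (a hypothetical layout; the field names are illustrative only, not my actual format):

#include <stdint.h>

/* Hypothetical per-block framing: a small header prepended to each
   compressed block; the checksum covers the compressed payload. */
typedef struct
{
    uint16_t compressed_size;   /* payload length, <= 4096 */
    uint32_t checksum;          /* CRC32 (or hash) of the compressed payload */
} block_header_t;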
After some testing, I found CRC32 too slow for my project's constraints. I used enwik9 (a 10^9-byte text dump of Wikipedia) from here. I compressed it with my LZSS routine and obtained a 570 MB file. I measured the following durations (single-threaded, disk IO excluded, all data loaded into memory before processing, average of 10 trials):
| Operation | Time (GCC4.4.5/Linux) | Time (MSVC2010/Win7) |
|-------------------------------+--------------------------+------------------------|
| Decompression | 6.8 seconds | 6.95 seconds |
| CRC32 on decompressed result | 4.9 seconds | 4.62 seconds |
| CRC32 on compressed result | 2.8 seconds | 2.69 seconds |
Then, out of curiosity, I tested SuperFastHash:
| Operation | Time (GCC4.4.5/Linux) | Time (MSVC2010/Win7) |
|-------------------------------+--------------------------+------------------------|
| SFH on decompressed result | 1.1 seconds | 1.33 seconds |
| SFH on compressed result | 0.7 seconds | 0.75 seconds |
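(For reference, a minimal sketch of the kind of timing harness behind such numbers, assuming POSIX clock_gettime; this is not the exact code used for the measurements above:)

#include <stdint.h>
#include <time.h>

/* Time one pass of a checksum function over an in-memory buffer;
   returns elapsed wall-clock seconds (CLOCK_MONOTONIC). */
static double time_one_pass(uint32_t (*fn)(const uint8_t *, size_t),
                            const uint8_t *buf, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile uint32_t sink = fn(buf, len);  /* keep the call from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Average over several trials: single thread, data already in memory. */
static double average_time(uint32_t (*fn)(const uint8_t *, size_t),
                           const uint8_t *buf, size_t len, int trials)
{
    double total = 0.0;
    for (int i = 0; i < trials; i++)
        total += time_one_pass(fn, buf, len);
    return total / trials;
}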
And here is my CRC32 implementation (I followed the description in this document: http://www.ross.net/crc/download/crc_v3.txt):
#include <stdint.h>

// CRC32 lookup table (reflected form of the polynomial 0x04C11DB7, i.e. 0xEDB88320)
static const uint32_t crc32_lookup_table[256] =
{
    0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
    0x076DC419, 0x706AF48F, 0xE963A535, 0x9E6495A3,
    0x0EDB8832, 0x79DCB8A4, 0xE0D5E91E, 0x97D2D988,
    // many lines skipped
    // ...
    0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D
};
uint32_t crc32_hash(const uint8_t *data, size_t len)
{
    uint32_t crc32_register = 0xFFFFFFFF;
    while (len--)
    {
        crc32_register = (crc32_register >> 8)
                       ^ crc32_lookup_table[(crc32_register & 0x000000FF) ^ *data++];
    }
    return crc32_register ^ 0xFFFFFFFF;
}
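(Regarding the skipped lines: the full 256-entry table does not need to be written out by hand; it can be generated once at startup from 0xEDB88320, the bit-reversed form of the polynomial 0x04C11DB7:)

/* Generate the reflected CRC32 lookup table at startup instead of
   hard-coding all 256 entries. */
static void crc32_init_table(uint32_t table[256])
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320 : (c >> 1);
        table[i] = c;
    }
}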
My question is: can I use a hash instead of a cyclic redundancy check to perform error detection on compressed data blocks? As far as I know (and as I remember from my electronics courses), CRC algorithms are designed to be very effective when errors occur in bursts, as happens when data is transmitted over a noisy channel, which is not the typical failure mode for data read from a hard drive. Please correct me if I'm wrong.
Thanks for any advice!
Comments (3)
SuperFastHash has been found to have some problems, as has its fellow fast hash function MurmurHash2. If you're looking for something tuned for larger data blocks with low collisions, you can try the 128-bit variants of Google's CityHash (http://code.google.com/p/cityhash/) or MurmurHash3. There are also some more off-the-wall ones like Crap8 and CrapWow, which claim to provide almost perfect bit avalanching and funneling and thus almost zero collisions; you can read up on them and other non-cryptographic hash functions here: http://www.team5150.com/~andrew/noncryptohashzoo/
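Whichever function you choose, the per-block check has the same shape; here is a minimal sketch where hash32_fn is a stand-in for SuperFastHash, MurmurHash3's 32-bit variant, or any similar function:

#include <stdint.h>
#include <stddef.h>

/* Stand-in signature: any 32-bit non-cryptographic hash fits here. */
typedef uint32_t (*hash32_fn)(const uint8_t *data, size_t len);

/* On compression: store the hash of each compressed block next to it.
   On decompression: recompute and compare before trusting the block. */
static int block_is_intact(hash32_fn hash,
                           const uint8_t *block, size_t len,
                           uint32_t stored_digest)
{
    return hash(block, len) == stored_digest;
}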
Since your problem is not about security, you can use "broken" cryptographic hash functions, which are not secure against a sentient attacker but are still very good at detecting transmission errors. I am thinking of MD4, which has been measured to be faster than CRC32 on some platforms. You may also want to check out RadioGatún and Panama; see this library for open-source implementations in C and Java of various cryptographic hash functions.
If your target architecture is a recent/big enough x86 CPU with the AES-NI instructions, you could build a devilishly fast and very good checksum by simply computing a CBC-MAC with the AES block cipher and a conventional key (e.g. an all-zero key); since this is not for security, you could even use fewer rounds than standard AES (e.g. 5 rounds instead of the standard 10).
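A rough sketch of that idea using the AES-NI intrinsics (compile with -maes). The round "keys" below are arbitrary constants rather than a real AES key schedule, which is acceptable since this is error detection, not security:

#include <stdint.h>
#include <string.h>
#include <wmmintrin.h>  /* AES-NI intrinsics */

/* Reduced-round AES CBC-MAC used as a fast checksum (NOT a secure MAC).
   Five AES rounds per 16-byte block, fixed arbitrary round constants. */
static __m128i aes_block_checksum(const uint8_t *data, size_t len)
{
    const __m128i rk0 = _mm_set_epi32((int)0x243F6A88, (int)0x85A308D3, (int)0x13198A2E, (int)0x03707344);
    const __m128i rk1 = _mm_set_epi32((int)0xA4093822, (int)0x299F31D0, (int)0x082EFA98, (int)0xEC4E6C89);
    const __m128i rk2 = _mm_set_epi32((int)0x452821E6, (int)0x38D01377, (int)0xBE5466CF, (int)0x34E90C6C);
    const __m128i rk3 = _mm_set_epi32((int)0xC0AC29B7, (int)0xC97C50DD, (int)0x3F84D5B5, (int)0xB5470917);
    const __m128i rk4 = _mm_set_epi32((int)0x9216D5D9, (int)0x8979FB1B, (int)0xD1310BA6, (int)0x98DFB5AC);
    __m128i mac = _mm_setzero_si128();

    while (len > 0)
    {
        uint8_t last[16] = { 0 };
        __m128i m;
        if (len >= 16)
        {
            m = _mm_loadu_si128((const __m128i *)data);
            data += 16;
            len -= 16;
        }
        else
        {
            memcpy(last, data, len);       /* zero-padded final block */
            m = _mm_loadu_si128((const __m128i *)last);
            len = 0;
        }
        mac = _mm_xor_si128(mac, m);       /* CBC chaining */
        mac = _mm_aesenc_si128(mac, rk0);
        mac = _mm_aesenc_si128(mac, rk1);
        mac = _mm_aesenc_si128(mac, rk2);
        mac = _mm_aesenc_si128(mac, rk3);
        mac = _mm_aesenclast_si128(mac, rk4);
    }
    return mac;
}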
Hashes are designed so that even a very small alteration of the input produces a large change in the output.
I think SuperFastHash has this property. It might be somewhat more vulnerable to collisions (since it seems to have been less analyzed by the community), but that shouldn't prevent the use you have in mind.
Good luck :)