具有 PCLMULQDQ 的快速 CRC 未反映

发布于 2025-01-11 13:38:57 字数 3464 浏览 0 评论 0原文

我正在尝试写一个 PCLMULQDQ 优化的 CRC-32 实现。特定的 CRC-32 变体适用于我不拥有的变体，但我试图以库形式提供支持。在 crcany model 形式中，它具有以下参数：

width=32 poly=0xaf init=0xffffffff refin=false refout=false xorout=0x00000000 check=0xa5fd3138 （省略了我认为是 0x00000000 的残差，但老实说不知道它是什么）

该算法的基本非基于表/按位实现（由 crcany 生成））是：

uint32_t crc32byond_bit(uint32_t crc, void const *mem, size_t len) {
    unsigned char const *data = mem;
    if (data == NULL)
        return 0xffffffff;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 24;
        for (unsigned k = 0; k < 8; k++) {
            crc = crc & 0x80000000 ? (crc << 1) ^ 0xaf : crc << 1;
        }
    }
    return crc;
}

首先，我从根本上不理解描述该算法的论文。我不知道类似 K1 = [x^(512+64) mod P(x)] 的含义。（x是什么？它从哪里来？我不知道。）我是一名专业的软件工程师；我是一名软件工程师。不是学者。坦白说，我根本不理解这个符号，维基百科文章也没有做到对我来说很重要。也许是我在线性代数方面的弱点再次困扰着我。

我知道一些我希望利用的公共实现：

但是：

他们使用位反射算法，我不确定如何创建非反射变体。
他们好像不看报纸？例如，wuffs和crc32fast特别指出他们不使用K6，但论文却让K6显得必要。

我在 soft-crc 中找到了英特尔实现，但是它<一个href="https://github.com/intel/soft-crc/blob/34a84bfd8278ff48e6aa67f0746618242266c8a2/crc.h#L51-L72" rel="nofollow noreferrer">似乎没有使用相同的常量 ( K4-K6？μ？）

/**
 * PCLMULQDQ CRC computation context structure
 */
struct crc_pclmulqdq_ctx {
        /**
         * K1 = reminder X^128 / P(X) : 0-63
         * K2 = reminder X^192 / P(X) : 64-127
         */
        uint64_t k1;
        uint64_t k2;

        /**
         * K3 = reminder X^64 / P(X) : 0-63
         * q  = quotient X^64 / P(X) : 64-127
         */
        uint64_t k3;
        uint64_t q;

        /**
         * p   = polynomial / P(X) : 0-63
         * res = reserved : 64-127
         */
        uint64_t p;
        uint64_t res;
};

我相信我需要的poly 0xAF常量是：（

Px  = 0x1_0000_00AF
k1  = 0x0_5B5A_E0C7
k2  = 0x0_1BD8_1099
k3  = 0x0_1157_936A
k4  = 0x0_1010_1111
k5  = 0x0_0029_5F23
k6  = 0x0_0000_4455
μ   = 0x1_0000_00AF

来源：print-crc32-x86-sse42-magic-numbers.go 与 const px = "10000000000000000000000010101111"，或 0xaf | (1 << 32)< /code>）

我希望在

理解符号和公式方面获得帮助文章中使用的（以及为什么实现似乎与文章不同？），
将现有实现转换为非反射变体，或者可能是
一些伪代码？

原文

I'm trying to write a PCLMULQDQ-optimized CRC-32 implementation. The specific CRC-32 variant is for one that I don't own, but am trying to support in library form. In crcany model form, it has the following parameters:

width=32 poly=0xaf init=0xffffffff refin=false refout=false xorout=0x00000000 check=0xa5fd3138
(Omitted residue which I believe is 0x00000000 but honestly don't know what it is)

A basic non-table-based/bitwise implementation of the algorithm (as generated by crcany) is:

uint32_t crc32byond_bit(uint32_t crc, void const *mem, size_t len) {
    unsigned char const *data = mem;
    if (data == NULL)
        return 0xffffffff;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 24;
        for (unsigned k = 0; k < 8; k++) {
            crc = crc & 0x80000000 ? (crc << 1) ^ 0xaf : crc << 1;
        }
    }
    return crc;
}

First off, I don't understand the paper that describes the algorithm, at a fundamental level. I don't know what something like K1 = [x^(512+64) mod P(x)] means. (What is x? Where did it come from? I don't know.) I'm a professional software engineer; not an academic. Frankly, I don't understand the notation at all, and the Wikipedia article didn't do much for me. Maybe it's my weakness in linear algebra coming back to haunt me.

I know of a few public implementations I hoped to leech off:

But:

They use the bit-reflected algorithm, and I'm not sure how to create the non-reflected variant.
They don't seem to follow the paper? For example, wuffs and crc32fast especially call out that they don't use K6, but the paper makes K6 seem necessary.

I found an Intel implementation in soft-crc but it doesn't appear to use the same constants (K4-K6? μ?)

/**
 * PCLMULQDQ CRC computation context structure
 */
struct crc_pclmulqdq_ctx {
        /**
         * K1 = reminder X^128 / P(X) : 0-63
         * K2 = reminder X^192 / P(X) : 64-127
         */
        uint64_t k1;
        uint64_t k2;

        /**
         * K3 = reminder X^64 / P(X) : 0-63
         * q  = quotient X^64 / P(X) : 64-127
         */
        uint64_t k3;
        uint64_t q;

        /**
         * p   = polynomial / P(X) : 0-63
         * res = reserved : 64-127
         */
        uint64_t p;
        uint64_t res;
};

I believe the constants I need for poly 0xAF are:

Px  = 0x1_0000_00AF
k1  = 0x0_5B5A_E0C7
k2  = 0x0_1BD8_1099
k3  = 0x0_1157_936A
k4  = 0x0_1010_1111
k5  = 0x0_0029_5F23
k6  = 0x0_0000_4455
μ   = 0x1_0000_00AF

(source: print-crc32-x86-sse42-magic-numbers.go with const px = "100000000000000000000000010101111", or 0xaf | (1 << 32))

I'm hoping for help in either

Understanding the notation and formulas used in the article (and why implementations seem to diverge from the article?),
Converting an existing implementation for the non-reflected variant, or maybe
Some pseudo-code?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

鹿! 2025-01-18 13:38:57

我这里有6套16、32、64位crc的代码，非反射和反射。该代码是为 Visual Studio 设置的。注释已添加到英特尔 github 站点上缺少的常量中。

https://github.com/jeffareid/crc

32位非反射在这里：

https://github.com/jeffareid/crc/tree/master/crc32f

您需要更改 crc32fg.cpp 中生成常量的多项式。您想要的多项式实际上是：

0x00000001000000afull

我对 crc32fg.cpp 进行了更改，以在本答案末尾生成 rk.. 常量。

x 是什么？

CRC 使用具有 1 位系数的多项式。例如 0x0B 实际上是 x^3 + x + 1。XMM

寄存器可以读取|写入 16 个字节|一次 128 位，但 PCLMULQDQ 只能对两个 64 位操作数进行无进位乘法以产生 128 位乘积。因此，128 位被分成两个 64 位操作数，然后每个操作数乘以一个常数以向前“折叠”。由于 XMM 寄存器可以在一定程度上并行操作，因此使用 8 个 XMM 寄存器来读取 |写入 128 字节 |一次 1024 位。每个折叠步骤“前进”16 个字节 | 128位数据转发128字节| 1024 位，乘以常数。低 64 位乘以 x^(1024) mod poly 以创建“高级”1024 位的 128 位乘积。高 64 位乘以 x^(1024+64) mod poly 以创建高级字节 1024+64 位的 128 位乘积（需要 +64，因为该乘积基于 128 位的高 64 位）数据）。两个 128 位乘积被异或在一起，然后与数据 128 字节 | 进行异或。 1024 位之后在缓冲区中。

请注意，Intel 文档中的示例使用 4 个 XMM 寄存器将数据前进 64 个字节 |一次 512 位，但是我见过的 github 示例和我在 github 存储库中使用的示例使用 8 个 XMM 寄存器并前进 128 个字节 |一次 1024 位。对于支持AVX512的处理器数量相对较少| ZMM寄存器，有前进256字节的例子 |一次 2048 位。我没有配备 AVX512 的计算机，因此我没有任何代码。

由于 XMM 读|写是小端字节序，因此 PSHUFB 用于在每次读取后反转字节。

该代码主要基于使用 65 位多项式的 64 位 CRC，但对于 32 位 CRC，可以通过将较低 32 位设置为零来处理。对于 32 位 CRC，大多数常量只有 32 位，并左移 32 位以简化 PCLMULQDQ 的使用，并进行调整以补偿移位，因此不是 x^(a) mod poly，而是 (x^( a-32)mod聚)＜＜32。实际 33 位多项式及其“逆”x^64 / 多项式的常量为 33 位，并且不会左移，而使用这些常量创建实际 CRC 的 64 位值会左移 32 位。

折叠过程不会产生 CRC，它只是转换数据以推进数据。处理完所有数据后，8 个 XMM 寄存器将使用常量 rk09、rk10、...、rk19、rk20、rk01、rk02 折叠为一个 128 位值。此时（标签 _128_done:）有 128 位数据，并且由于代码基于 64 位 CRC 逻辑，因此逻辑上附加了 64 位零，因此实际上它是一个 192 位值，其中虚构的低 64 位是全部为零。高 64 位向前折叠 128 位，导致 XMM7 中的 128 位值准备好计算 64 位 CRC，但由于这是 32 位 CRC，因此 XMM7 有 96 位数据左移 32 位。高 32 位向前折叠 64 位（在这种情况下，96 位值和 rk06 都左移 32 位，因此在这种情况下 rk06 折叠 64 位（x^64 mod poly）并左移 32 位结果是 XMM7 中的 64 位值左移 32 位。

64 位数字除以 33 位多项式的商仅需要： 64 位数字的高 32 位，因此左移 64 位值具有高 32 位，可以方便地计算商。除法实际上是乘以 x^64 / 多项式，PCLMULQDQ 将仅指定使用。 XMM7 的高 64 位部分使用左移 64 位数字的高 32 位实际的 CRC 计算基于以下逻辑：

quotient = upper 32 bits of 64 bit value / polynomial
product  = quotient * polynomial
CRC      = 64 bit value XOR product

除法是通过乘以来完成的。逆：x^64 / Poly。由于 Poly 及其逆是 33 位，因此它们不能左移 32 位，因此代码在每次乘法后将乘积左移 4 个字节。CRC 最终位于第 32 到 63 位。 XMM7 和 pextrd eax,xmm7,1 用于获取这 32 位。

我修改了 crc32fg.cpp 以使用 CRC 多项式 0x1000000af，这就是我得到的。对于这个多项式，rk07 == rk08，但是对于其他多项式，它们会不同。

rk01    dq      000295f2300000000h      ; x^(32* 3) mod P(x) << 32
rk02    dq      0fafa517900000000h      ; x^(32* 5) mod P(x) << 32
rk03    dq      05cd86bb500000000h      ; x^(32*31) mod P(x) << 32
rk04    dq      0af6f37a300000000h      ; x^(32*33) mod P(x) << 32
rk05    dq      000295f2300000000h      ; x^(32* 3) mod P(x) << 32
rk06    dq      00000445500000000h      ; x^(32* 2) mod P(x) << 32
rk07    dq      000000001000000afh      ; floor(2^64/P(x))
rk08    dq      000000001000000afh      ; P(x)
rk09    dq      09bd57b5d00000000h      ; x^(32*27) mod P(x) << 32
rk10    dq      0b7a4d76400000000h      ; x^(32*29) mod P(x) << 32
rk11    dq      01ae0004200000000h      ; x^(32*23) mod P(x) << 32
rk12    dq      0e7720be600000000h      ; x^(32*25) mod P(x) << 32
rk13    dq      09c7fc8fe00000000h      ; x^(32*19) mod P(x) << 32
rk14    dq      03885faf800000000h      ; x^(32*21) mod P(x) << 32
rk15    dq      0b477ad7100000000h      ; x^(32*15) mod P(x) << 32
rk16    dq      00ac2ae3d00000000h      ; x^(32*17) mod P(x) << 32
rk17    dq      05eae9dbe00000000h      ; x^(32*11) mod P(x) << 32
rk18    dq      0784a483800000000h      ; x^(32*13) mod P(x) << 32
rk19    dq      07d21bf2000000000h      ; x^(32* 7) mod P(x) << 32
rk20    dq      0faebd3d300000000h      ; x^(32* 9) mod P(x) << 32

I have 6 sets of code for 16, 32, 64 bit crc, non-reflected and reflected here. The code is setup for Visual Studio. Comments have been added to the constants which were missing from Intel's github site.

https://github.com/jeffareid/crc

32 bit non-reflected is here:

https://github.com/jeffareid/crc/tree/master/crc32f

You'll need to change the polynomial in crc32fg.cpp, which generates the constants. The polynomial you want is actually:

0x00000001000000afull

I made this change to crc32fg.cpp to generate the rk.. constants at the end of this answer.

what is x?

A CRC uses polynomials with 1 bit coefficients. For example 0x0B is really x^3 + x + 1.

The XMM registers can read|write 16 bytes | 128 bits at a time, but PCLMULQDQ can only carryless multiply two 64 bit operands to produce a 128 bit product. So the 128 bits are split up into two 64 bit operands, then each operand is multiplied by a constant to "fold" it forwards. Since the XMM registers can operate somewhat in parallel, 8 XMM registers are used to read | write 128 bytes | 1024 bits at a time. Each folding step "advances" 16 bytes | 128 bits of data forwards 128 bytes | 1024 bits, by multiplying it by constants. The lower 64 bits are multiplied by x^(1024) mod poly to create a 128 bit product that is "advanced" by 1024 bits. The upper 64 bits are multiplied by x^(1024+64) mod poly to create a 128 bit product that is advanced byte 1024+64 bits (the +64 is needed because the product is based on the upper 64 bits of 128 bits of data). The two 128 bit products are xor'ed together, and then later xor'ed with data 128 bytes | 1024 bits later in the buffer.

Note that the example in the Intel document uses 4 XMM registers to advance data 64 bytes | 512 bits at a time, but the github examples I've seen and the examples I use in my github repository use 8 XMM registers and advance 128 bytes | 1024 bits at a time. For the relatively small number of processors that support AVX512 | ZMM registers, there are examples that advance 256 bytes | 2048 bits at a time. I don't own a computer with AVX512, so I don't have any code for that.

Since XMM read|writes are little endian, PSHUFB is used to reverse the bytes after each read.

The code is mostly based on 64 bit CRC that uses a 65 bit polynomial, but for a 32 bit CRC, this can be handled by setting the lower 32 bits to zero. For 32 bit CRC, most of the constants are only 32 bits, and shifted left 32 bits to simplify usage with PCLMULQDQ, and are adjusted to compensate for the shift, so instead of x^(a) mod poly, it's (x^(a-32) mod poly)<<32. The constants for the actual 33 bit polynomial and its "inverse" x^64 / polynomial are 33 bits and are not shifted left, and the 64 bit value that uses these constants to create the actual CRC is shifted left 32 bits instead.

The folding process doesn't produce a CRC, it just transforms data to advance it. Once all the data is processed, the 8 XMM registers are folded into one 128 bit value using the constants rk09, rk10, ..., rk19, rk20, rk01, rk02. At this point (label _128_done:) there are 128 bits of data, and since the code is based on 64 bit CRC logic, 64 bit's of zeroes are logically appended, so in effect it's a 192 bit value where the imaginary lower 64 bits are all zero. The upper 64 bits are folded forward by 128 bits, resulting in a 128 bit value in XMM7 ready to calculate a 64 bit CRC, but since this is a 32 bit CRC, XMM7 has 96 bits of data shifted left 32 bits. The upper 32 bits are folded forwards by 64 bits (in this case the 96 bit value and rk06 are both shifted left by 32 bits, so rk06 in this case folds by 64 bits (x^64 mod poly) and shifts left by 32 bits. The result is a 64 bit value shifted left 32 bits in XMM7.

The quotient of a 64 bit number divided by a 33 bit polynomial only needs the upper 32 bits of the 64 bit number, so the left shifted 64 bit value has the upper 32 bits where it will be convenient to calculate a quotient. The divide is really a multiply by x^64 / polynomial, PCLMULQDQ will just specify to use the upper 64 bit part of XMM7 to use the upper 32 bits of the left shifted 64 bit number. The actual CRC calculation is based on this logic:

quotient = upper 32 bits of 64 bit value / polynomial
product  = quotient * polynomial
CRC      = 64 bit value XOR product

The division is done by multiplying by the inverse: x^64 / poly. Since the poly and its inverse are 33 bits, they can't be shifted left 32 bits, so the code shifts the product left 4 bytes after each multiply. The CRC ends up in bits 32 to 63 of XMM7 and pextrd eax,xmm7,1 is used to get those 32 bits.

I modified crc32fg.cpp to use CRC polynomial 0x1000000af, and this is what I get. For this polynomial, rk07 == rk08, but for other polynomials, they will be different.

rk01    dq      000295f2300000000h      ; x^(32* 3) mod P(x) << 32
rk02    dq      0fafa517900000000h      ; x^(32* 5) mod P(x) << 32
rk03    dq      05cd86bb500000000h      ; x^(32*31) mod P(x) << 32
rk04    dq      0af6f37a300000000h      ; x^(32*33) mod P(x) << 32
rk05    dq      000295f2300000000h      ; x^(32* 3) mod P(x) << 32
rk06    dq      00000445500000000h      ; x^(32* 2) mod P(x) << 32
rk07    dq      000000001000000afh      ; floor(2^64/P(x))
rk08    dq      000000001000000afh      ; P(x)
rk09    dq      09bd57b5d00000000h      ; x^(32*27) mod P(x) << 32
rk10    dq      0b7a4d76400000000h      ; x^(32*29) mod P(x) << 32
rk11    dq      01ae0004200000000h      ; x^(32*23) mod P(x) << 32
rk12    dq      0e7720be600000000h      ; x^(32*25) mod P(x) << 32
rk13    dq      09c7fc8fe00000000h      ; x^(32*19) mod P(x) << 32
rk14    dq      03885faf800000000h      ; x^(32*21) mod P(x) << 32
rk15    dq      0b477ad7100000000h      ; x^(32*15) mod P(x) << 32
rk16    dq      00ac2ae3d00000000h      ; x^(32*17) mod P(x) << 32
rk17    dq      05eae9dbe00000000h      ; x^(32*11) mod P(x) << 32
rk18    dq      0784a483800000000h      ; x^(32*13) mod P(x) << 32
rk19    dq      07d21bf2000000000h      ; x^(32* 7) mod P(x) << 32
rk20    dq      0faebd3d300000000h      ; x^(32* 9) mod P(x) << 32

回复收藏 0 原文

~没有更多了~