当前位置：文江博客话题详情

扫描位流中位模式的最快方法

发布于 2024-08-08 01:57:47 字数 179 浏览 3 评论 0原文

我需要扫描比特流中的 16 位字。 不保证在字节或字边界上对齐。

实现这一目标最快的方法是什么？有多种暴力破解方法；使用表格和/或移位，但是是否有任何“位旋转快捷方式”可以通过给出是/否/可能包含每个字节或单词到达时的标志结果来减少计算次数？

C 代码、内在函数、x86 机器代码都会很有趣。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

甲如呢乙后呢 2024-08-15 01:57:47

使用简单的蛮力有时是好的。

我认为预先计算该单词的所有移位值并将它们放入 16 个整数中
所以你得到了一个像这样的数组（假设 int 的宽度是 short 的两倍），

 unsigned short pattern = 1234;
 unsigned int preShifts[16];
 unsigned int masks[16];
 int i;
 for(i=0; i<16; i++)
 {
      preShifts[i] = (unsigned int)(pattern<<i);  //gets promoted to int
      masks[i] = (unsigned int) (0xffff<<i);
 }

然后对于从流中得到的每个无符号短整型，将其转换为 int和前一个 Short 并将该 unsigned int 与 16 个 unsigned int 进行比较。如果其中任何一个匹配，您就得到了一个。

所以基本上是这样的：

  int numMatch(unsigned short curWord, unsigned short prevWord)
  {
       int numHits = 0;
       int combinedWords = (prevWord<<16) + curWord;

       int i=0;
       for(i=0; i<16; i++)
       {
             if((combinedWords & masks[i]) == preShifsts[i]) numHits++;
       }
       return numHits;
  }

请注意，当在同一位上多次检测到模式时，这可能意味着多次命中：

例如，32 位 0 且您要检测的模式是 16 个 0，那么这意味着该模式是检测到16次！

假设它大约按照编写的方式编译，其时间成本是每个输入字 16 次检查。对于每个输入位，这会执行一个 & 和 ==，以及分支或其他条件增量。还有每一位掩码的表查找。

不需要查表；通过右移combined，我们可以显着提高asm的效率，如图所示在另一个答案中，它还展示了如何在 x86 上使用 SIMD 对其进行矢量化。

Using simple brute force is sometimes good.

I think precalc all shifted values of the word and put them in 16 ints
so you got an array like this (assuming int is twice as wide as short)

 unsigned short pattern = 1234;
 unsigned int preShifts[16];
 unsigned int masks[16];
 int i;
 for(i=0; i<16; i++)
 {
      preShifts[i] = (unsigned int)(pattern<<i);  //gets promoted to int
      masks[i] = (unsigned int) (0xffff<<i);
 }

and then for every unsigned short you get out of the stream, make an int of that short and the previous short and compare that unsigned int to the 16 unsigned int's. If any of them match, you got one.

So basically like this:

  int numMatch(unsigned short curWord, unsigned short prevWord)
  {
       int numHits = 0;
       int combinedWords = (prevWord<<16) + curWord;

       int i=0;
       for(i=0; i<16; i++)
       {
             if((combinedWords & masks[i]) == preShifsts[i]) numHits++;
       }
       return numHits;
  }

Do note that this could potentially mean multiple hits when the patterns is detected more than once on the same bits:

e.g. 32 bits of 0's and the pattern you want to detect is 16 0's, then it would mean the pattern is detected 16 times!

The time cost of this, assuming it compiles approximately as written, is 16 checks per input word. Per input bit, this does one & and ==, and branch or other conditional increment. And also a table lookup for the mask for every bit.

The table lookup is unnecessary; by instead right-shifting combined we get significantly more efficient asm, as shown in another answer which also shows how to vectorize this with SIMD on x86.

回复收藏 0 原文

盛夏已如深秋| 2024-08-15 01:57:47

如果两个字符 {0, 1} 的字母表上的 Knuth-Morris-Pratt 算法和 Reinier 的想法都不够快，那么这里有一个技巧可以将搜索速度加快 32 倍。

您可以首先使用包含 256 个条目的表来检查比特流中的每个字节是否包含在您要查找的 16 位字中。然后，您可以使用以下命令

unsigned char table[256];
for (int i=0; i<256; i++)
  table[i] = 0; // initialize with false
for (i=0; i<8; i++)
  table[(word >> i) & 0xff] = 1; // mark contained bytes with true

找到比特流中匹配的可能位置。256

for (i=0; i<length; i++) {
  if (table[bitstream[i]]) {
    // here comes the code which checks if there is really a match
  }
}

个表条目中最多有 8 个不为零，平均而言，您只需仔细查看每 32 个位置。仅对于这个字节（与前一个字节和后一个字节相结合），您必须使用位操作或一些屏蔽技术（如reinier建议的那样）来查看是否存在匹配。

该代码假设您使用小端字节顺序。字节中位的顺序也可能是一个问题（已经实现 CRC32 校验和的每个人都知道）。

Here is a trick to speed up the search by a factor of 32, if neither the Knuth-Morris-Pratt algorithm on the alphabet of two characters {0, 1} nor reinier's idea are fast enough.

You can first use a table with 256 entries to check for each byte in your bit stream if it is contained in the 16-bit word you are looking for. The table you get with

unsigned char table[256];
for (int i=0; i<256; i++)
  table[i] = 0; // initialize with false
for (i=0; i<8; i++)
  table[(word >> i) & 0xff] = 1; // mark contained bytes with true

You can then find possible positions for matches in the bit stream using

for (i=0; i<length; i++) {
  if (table[bitstream[i]]) {
    // here comes the code which checks if there is really a match
  }
}

As at most 8 of the 256 table entries are not zero, in average you have to take a closer look only at every 32th position. Only for this byte (combined with the bytes one before and one after) you have then to use bit operations or some masking techniques as suggested by reinier to see if there is a match.

The code assumes that you use little endian byte order. The order of the bits in a byte can also be an issue (known to everyone who already implemented a CRC32 checksum).

回复收藏 0 原文

黑色毁心梦 2024-08-15 01:57:47

我想建议一个使用 3 个大小为 256 的查找表的解决方案。这对于大比特流来说非常有效。该解决方案需要样本中的 3 个字节进行比较。下图显示了 3 个字节中 16 位数据的所有可能排列。每个字节区域以不同的颜色显示。

替代文本 http://img70.imageshack.us/img70/8711/80541519.jpg< /a>

这里将在第一个样本中检查 1 到 8，在下一个样本中检查 9 到 16，依此类推。现在，当我们搜索一个模式时，我们将找到该模式的所有8种可能的排列方式（如下所示），并将存储在3个查找表中（左、中、中）正确的）。

初始化查找表：

让我们以0111011101110111为例作为要查找的模式。现在考虑第四种安排。左侧部分为XXX01110。用 00010000 填充左查找表中左部分（XXX01110）指向的所有原始数据。 1表示输入Pattern排列的起始位置。因此，左查找表的后续 8 个原始数据将由 16 (00010000) 填充。

排列的中间部分是11101110。中间查找表中该索引 (238) 的原始指向将由 16 (00010000) 填充。

现在排列的右侧部分将是111XXXXX。所有索引为 111XXXXX 的原始数据（32 个原始数据）都将被填充为 16 (00010000)。

填充时我们不应该覆盖查找表中的元素。相反，执行按位或运算来更新已填充的原始数据。在上面的示例中，由第三种排列写入的所有原始数据将由第七种排列更新，如下所示。

因此，左查找表中索引为 XX011101、中查找表中索引为 11101110 以及右查找表中索引为 111XXXXX 的原始数据将更新为 00100010按第七种安排。

搜索模式：

取三个字节的样本。查找Count如下，其中Left是左查找表，Middle是中间查找表，Right是右查找表。

Count = Left[Byte0] & Middle[Byte1] & Right[Byte2];

Count 中的 1 表示所取样本中匹配的 Pattern 的数量。

我可以提供一些经过测试的示例代码。

初始化查找表：

    for( RightShift = 0; RightShift < 8; RightShift++ )
    {
        LeftShift = 8 - RightShift;

        Starting = 128 >> RightShift;

        Byte = MSB >> RightShift;

        Count = 0xFF >> LeftShift;

        for( i = 0; i <= Count; i++ )
        {
            Index = ( i << LeftShift ) | Byte;

            Left[Index] |= Starting;
        }

        Byte = LSB << LeftShift;

        Count = 0xFF >> RightShift;

        for( i = 0; i <= Count; i++ )
        {
            Index = i | Byte;

            Right[Index] |= Starting;
        }

        Index = ( unsigned char )(( Pattern >> RightShift ) & 0xFF );

        Middle[Index] |= Starting;
    }

搜索模式：

Data是流缓冲区，Left是左查找表，Middle 是中间查找表，Right是右查找表。

for( int Index = 1; Index < ( StreamLength - 1); Index++ )
{
    Count = Left[Data[Index - 1]] & Middle[Data[Index]] & Right[Data[Index + 1]];

    if( Count )
    {
        TotalCount += GetNumberOfOnes( Count );
    }
}

限制：

如果将模式放置在流缓冲区的最末尾，则上述循环无法检测到模式。以下代码需要添加 after 循环来克服此限制。

Count = Left[Data[StreamLength - 2]] & Middle[Data[StreamLength - 1]] & 128;

if( Count )
{
    TotalCount += GetNumberOfOnes( Count );
}

优点：

此算法仅需要 N-1 个逻辑步骤即可在 N 字节数组中查找模式。唯一的开销是最初填充查找表，这在所有情况下都是恒定的。所以这对于搜索巨大的字节流非常有效。

I would like to suggest a solution using 3 lookup tables of size 256. This would be efficient for large bit streams. This solution takes 3 bytes in a sample for comparison. Following figure shows all possible arrangements of a 16 bit data in 3 bytes. Each byte region has shown in different color.

alt text http://img70.imageshack.us/img70/8711/80541519.jpg

Here checking for 1 to 8 will be taken care in first sample and 9 to 16 in next sample and so on. Now when we are searching for a Pattern, we will find all the 8 possible arrangements (as below) of this Pattern and will store in 3 lookup tables (Left, Middle and Right).

Initializing Lookup Tables:

Lets take an example 0111011101110111 as a Pattern to find. Now consider 4th arrangement. Left part would be XXX01110. Fill all raws of Left lookup table pointing by left part (XXX01110) with 00010000. 1 indicates starting position of arrangement of input Pattern. Thus following 8 raws of Left look up table would be filled by 16 (00010000).

Middle part of arrangement would be 11101110. Raw pointing by this index (238) in Middle look up table will be filled by 16 (00010000).

Now Right part of arrangement would be 111XXXXX. All raws (32 raws) with index 111XXXXX will be filled by 16 (00010000).

We should not overwrite elements in look up table while filling. Instead do a bitwise OR operation to update an already filled raw. In above example, all raws written by 3rd arrangement would be updated by 7th arrangement as follows.

Thus raws with index XX011101 in Left lookup table and 11101110 in Middle lookup table and 111XXXXX in Right lookup table will be updated to 00100010 by 7th arrangement.

Searching Pattern:

Take a sample of three bytes. Find Count as follows where Left is left lookup table, Middle is middle lookup table and Right is right lookup table.

Count = Left[Byte0] & Middle[Byte1] & Right[Byte2];

Number of 1 in Count gives the number of matching Pattern in taken sample.

I can give some sample code which is tested.

Initializing lookup table:

    for( RightShift = 0; RightShift < 8; RightShift++ )
    {
        LeftShift = 8 - RightShift;

        Starting = 128 >> RightShift;

        Byte = MSB >> RightShift;

        Count = 0xFF >> LeftShift;

        for( i = 0; i <= Count; i++ )
        {
            Index = ( i << LeftShift ) | Byte;

            Left[Index] |= Starting;
        }

        Byte = LSB << LeftShift;

        Count = 0xFF >> RightShift;

        for( i = 0; i <= Count; i++ )
        {
            Index = i | Byte;

            Right[Index] |= Starting;
        }

        Index = ( unsigned char )(( Pattern >> RightShift ) & 0xFF );

        Middle[Index] |= Starting;
    }

Searching Pattern:

Data is stream buffer, Left is left lookup table, Middle is middle lookup table and Right is right lookup table.

for( int Index = 1; Index < ( StreamLength - 1); Index++ )
{
    Count = Left[Data[Index - 1]] & Middle[Data[Index]] & Right[Data[Index + 1]];

    if( Count )
    {
        TotalCount += GetNumberOfOnes( Count );
    }
}

Limitation:

Above loop cannot detect a Pattern if it is placed at the very end of stream buffer. Following code need to add after loop to overcome this limitation.

Count = Left[Data[StreamLength - 2]] & Middle[Data[StreamLength - 1]] & 128;

if( Count )
{
    TotalCount += GetNumberOfOnes( Count );
}

Advantage:

This algorithm takes only N-1 logical steps to find a Pattern in an array of N bytes. Only overhead is to fill the lookup tables initially which is constant in all the cases. So this will be very effective for searching huge byte streams.

回复收藏 0 原文

怂人 2024-08-15 01:57:47

我的钱花在 Knuth-Morris-Pratt 上两个字符的字母表。

回复收藏 0 原文

日暮斜阳 2024-08-15 01:57:47

我将实现一个具有 16 个状态的状态机。

每个状态代表有多少接收到的比特符合该模式。如果下一个接收到的位符合模式的下一个位，则机器进入下一个状态。如果不是这种情况，机器将返回到第一个状态（或者如果模式的开头可以与较少数量的接收比特相匹配，则返回到另一个状态）。

当机器到达最后一个状态时，这表明该模式已在比特流中被识别。

回复收藏 0 原文

魔法少女 2024-08-15 01:57:47

原子的

Knuth-Morris-Pratt

看起来不错，直到我考虑了卢克和 MSalter 要求提供更多有关细节的信息。

事实证明，这些细节可能表明一种比 KMP 更快的方法。 KMP 文章链接到

Boyer Moore

当搜索模式为“AAAAAA”时， '。对于多种模式搜索，

Rabin-Karp

可能最合适。

您可以在此处找到进一步的介绍性讨论。

回复收藏 0 原文

任性一次 2024-08-15 01:57:47

实现 @Toad 检查每个位位置的简单强力算法是将数据移动到位，而不是移动掩码。不需要任何数组，简单得多，只需在循环内右移组合 >>= 1 并比较低 16 位即可。（要么使用固定掩码，要么转换为uint16_t。）

（在多个问题中，我注意到创建掩码往往比仅移出不需要的位效率低。）

（正确处理uint16_t数组的最后一个16位块，或者特别是奇数大小的字节数组的最后一个字节，留给读者作为练习。）

// simple brute-force scalar version, checks every bit position 1 at a time.
long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
{
        uint16_t *bufshort = (uint16_t*)buf;  // maybe unsafe type punning
        len /= 2;
        for (size_t i = 0 ; i<len-1 ; i++) {
                //unsigned short curWord = bufshort[i];
                //unsigned short prevWord = bufshort[i+1];
                //int combinedWords = (prevWord<<16) + curWord;

                uint32_t combined;                                // assumes little-endian
                memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load

                for(int bitpos=0; bitpos<16; bitpos++) {
                        if( (combined&0xFFFF) == pattern)     // compiles more efficiently on e.g. old ARM32 without UBFX than (uint16_t)combined
                                return i*16 + bitpos;
                        combined >>= 1;
                }
        }
        return -1;
}

这编译显着对于大多数 ISA（例如 x86、AArch64 和 ARM），比使用最近的 gcc 和 clang 从数组加载掩码更有效。

编译器将循环完全展开 16，以便它们可以使用带有立即操作数的位域提取指令（例如 ARM ubfx 无符号位域提取或 PowerPC rwlinm 向左旋转 + 立即屏蔽位范围）来提取16 位到 32 或 64 位寄存器的底部，它们可以在其中进行常规比较和分支。实际上并不存在右移 1 的依赖链。

在 x86 上，CPU 可以执行忽略高位的 16 位比较，例如右移 后的 cmp cx,dx edx 中的组合

一些 ISA 的编译器设法在 @Toad 的版本上做得和这个版本一样好，例如 PowerPC 的 clang 设法使用 rlwinm 使用立即数屏蔽 16 位范围的组合，并且它将所有 16 个预移位模式值保留在 16 个寄存器中，因此无论哪种方式，它都只是 rlwinm / 比较 / 分支是否rlwinm 是否具有非零旋转计数。但右移版本不需要设置16个tmp寄存器。 https://godbolt.org/z/8mUaDI

AVX2 暴力破解

有（至少）2实现方法：

广播单个双字并使用变量移位来检查它的所有位位置，然后再继续。可能很容易找出您找到匹配的位置。（如果您想计数所有匹配项，则可能不太好。）
向量加载，并并行迭代多个数据窗口的位位置。也许使用从相邻字（16 位）开始的未对齐负载来重叠奇数/偶数向量，以获得双字（32 位）窗口。否则，您必须在 128 位通道上进行洗牌，最好是 16 位粒度，并且在没有 AVX512 的情况下需要 2 条指令。

使用 64 位元素移位而不是 32 位元素移位，我们可以检查多个相邻的 16 位窗口，而不是总是忽略高 16 位（其中移入零）。但我们在 SIMD 元素边界处仍然存在中断，其中移入了零，而不是来自更高地址的实际数据。（未来的解决方案：AVX512VBMI2双班，如 VPSHRDW，SHRD 的 SIMD 版本。）

也许无论如何，这样做都是值得的，然后返回我们在 __m256i 中每个 64 位元素顶部错过的 4x 16 位元素。也许将多个向量的剩余部分组合起来。

// simple brute force, broadcast 32 bits and then search for a 16-bit match at bit offset 0..15

#ifdef __AVX2__
#include <immintrin.h>
long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
{
    __m256i vpat = _mm256_set1_epi32(pattern);

    len /= 2;
    uint16_t *bufshort = (uint16_t*)buf;
    for (size_t i = 0 ; i<len-1 ; i++) {
        uint32_t combined; // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
        __m256i v = _mm256_set1_epi32(combined);
//      __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0));
//      __m256i vhi = _mm256_srli_epi32(vlo, 8);

        // shift counts set up to match lane ordering for vpacksswb

        // SRLVD cost: Skylake: as fast as other shifts: 1 uop, 2-per-clock
        // * Haswell: 3 uops
        // * Ryzen: 1 uop, but 3c latency and 2c throughput.  Or 4c / 4c for ymm 2 uop version
        // * Excavator: latency worse than PSRLD xmm, imm8 by 1c, same throughput. XMM: 3c latency / 1c tput.  YMM: 3c latency / 2c tput.  (http://users.atw.hu/instlatx64/AuthenticAMD0660F51_K15_BristolRidge_InstLatX64.txt)  Agner's numbers are different.
        __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,    3,2,1,0));
        __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12,  7,6,5,4));

        __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);  // low 16 of every 32-bit element = useful
        __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);

        __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi); // 8-bit elements, preserves sign bit
        unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
        cmpmask &= 0x55555555;  // discard odd bits
        if (cmpmask) {
            return i*16 + __builtin_ctz(cmpmask)/2;
        }
    }
    return -1;
}
#endif

这对于通常快速找到命中的搜索很有用，尤其是在少于前 32 字节的数据中。对于大型搜索来说这还不错（但仍然是纯粹的蛮力，一次只检查 1 个单词），并且在 Skylake 上可能不比并行检查多个窗口的 16 个偏移量差。

这是针对 Skylake 进行调整的，在其他 CPU 上，变量移位效率较低，您可能会考虑仅对偏移量 0..7 进行 1 个变量移位，然后通过移位来创建偏移量 8..15。或者完全是别的东西。

编译效果出奇地好与 gcc/clang （在 Godbolt 上），具有直接从内存中广播的内部循环。（将 memcpy 未对齐加载和 set1() 优化为单个 vpbroadcastd）

Godbolt 链接中还包含一个测试 main 在一个小数组上运行它。（自上次调整以来我可能没有进行过测试，但我之前确实测试过它，并且打包 + 位扫描的东西确实有效。）

## clang8.0  -O3 -march=skylake  inner loop
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpbroadcastd    ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
        vpsrlvd ymm4, ymm3, ymm1
        vpsrlvd ymm3, ymm3, ymm2             # shift 2 ways
        vpcmpeqw        ymm4, ymm4, ymm0
        vpcmpeqw        ymm3, ymm3, ymm0     # compare those results
        vpacksswb       ymm3, ymm4, ymm3     # pack to 8-bit elements
        vpmovmskb       ecx, ymm3            # scalar bitmask
        and     ecx, 1431655765              # see if any even elements matched
        jne     .LBB0_4            # break out of the loop on found, going to a tzcnt / ... epilogue
        add     rdx, 1
        add     r8, 16           # stupid compiler, calculate this with a multiply on a hit.
        cmp     rdx, rsi
        jb      .LBB0_2                    # } while(i<len-1);
        # fall through to not-found.

这是 8 uops 的工作 + 3 uops 的循环开销（假设 and/jne 的宏融合，以及 cmp/jb，我们将在 Haswell/Skylake 上获得）。在 AMD 上，256 位指令是多个 uops，它会更多。

或者当然使用简单的右移立即数将所有元素移动 1，并并行检查多个窗口，而不是在同一窗口中进行多个偏移。

如果没有高效的变量移位（尤其是根本没有 AVX2），这对于大型搜索来说会更好，即使它需要更多的工作来整理第一个命中的位置，以防出现 很受欢迎。 （在除了最低元素之外的其他位置找到命中后，您需要检查所有早期窗口的所有剩余偏移量。）

A simpler way to implement @Toad's simple brute-force algorithm that checks every bit-position is to shift the data into place, instead of shifting a mask. There's no need for any arrays, much simpler is just to right shift combined >>= 1 inside the loop and compare the low 16 bits. (Either use a fixed mask, or cast to uint16_t.)

(Across multiple problems, I've noticed that creating a mask tends to be less efficient than just shifting out the bits you don't want.)

(correctly handling the very last 16-bit chunk of an array of uint16_t, or especially the last byte of an odd-sized byte array, is left as an exercise for the reader.)

// simple brute-force scalar version, checks every bit position 1 at a time.
long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
{
        uint16_t *bufshort = (uint16_t*)buf;  // maybe unsafe type punning
        len /= 2;
        for (size_t i = 0 ; i<len-1 ; i++) {
                //unsigned short curWord = bufshort[i];
                //unsigned short prevWord = bufshort[i+1];
                //int combinedWords = (prevWord<<16) + curWord;

                uint32_t combined;                                // assumes little-endian
                memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load

                for(int bitpos=0; bitpos<16; bitpos++) {
                        if( (combined&0xFFFF) == pattern)     // compiles more efficiently on e.g. old ARM32 without UBFX than (uint16_t)combined
                                return i*16 + bitpos;
                        combined >>= 1;
                }
        }
        return -1;
}

This compiles significantly more efficiently than loading a mask from an array with recent gcc and clang for most ISAs, like x86, AArch64, and ARM.

Compilers fully unroll the loop by 16 so they can use bitfield-extract instructions with immediate operands (like ARM ubfx unsigned bitfield extract or PowerPC rwlinm rotate-left + immediate-mask a bit-range) to extract 16 bits to the bottom of a 32 or 64-bit register where they can do a regular compare-and-branch. There isn't actually a dependency chain of right shifts by 1.

On x86, the CPU can do a 16-bit compare that ignores high bits, e.g. cmp cx,dx after right-shifting combined in edx

Some compilers for some ISAs manage to do as good a job with @Toad's version as with this, e.g. clang for PowerPC manages to optimize away the array of masks, using rlwinm to mask a 16-bit range of combined using immediates, and it keeps all 16 pre-shifted pattern values in 16 registers, so either way it's just rlwinm / compare / branch whether the rlwinm has a non-zero rotate count or not. But the right-shift version doesn't need to set up 16 tmp registers. https://godbolt.org/z/8mUaDI

AVX2 brute-force

There are (at least) 2 ways to do this:

broadcast a single dword and use variable shifts to check all bit-positions of it before moving on. Potentially very easy to figure out what position you found a match. (Maybe less good if if you want to count all matches.)
vector load, and iterate over bit-positions of multiple windows of data in parallel. Maybe do overlapping odd/even vectors using unaligned loads starting at adjacent words (16-bit), to get dword (32-bit) windows. Otherwise you'd have to shuffle across 128-bit lanes, preferably with 16-bit granularity, and that would require 2 instructions without AVX512.

With 64-bit element shifts instead of 32, we could check multiple adjacent 16-bit windows instead of always ignoring the upper 16 (where zeros are shifted in). But we still have a break at SIMD element boundaries where zeros are shifted in, instead of actual data from a higher address. (Future solution: AVX512VBMI2 double-shifts like VPSHRDW, a SIMD version of SHRD.)

Maybe it's worth doing this anyway, then coming back for the 4x 16-bit elements we missed at the top of each 64-bit element in a __m256i. Maybe combining leftovers across multiple vectors.

// simple brute force, broadcast 32 bits and then search for a 16-bit match at bit offset 0..15

#ifdef __AVX2__
#include <immintrin.h>
long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
{
    __m256i vpat = _mm256_set1_epi32(pattern);

    len /= 2;
    uint16_t *bufshort = (uint16_t*)buf;
    for (size_t i = 0 ; i<len-1 ; i++) {
        uint32_t combined; // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
        __m256i v = _mm256_set1_epi32(combined);
//      __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0));
//      __m256i vhi = _mm256_srli_epi32(vlo, 8);

        // shift counts set up to match lane ordering for vpacksswb

        // SRLVD cost: Skylake: as fast as other shifts: 1 uop, 2-per-clock
        // * Haswell: 3 uops
        // * Ryzen: 1 uop, but 3c latency and 2c throughput.  Or 4c / 4c for ymm 2 uop version
        // * Excavator: latency worse than PSRLD xmm, imm8 by 1c, same throughput. XMM: 3c latency / 1c tput.  YMM: 3c latency / 2c tput.  (http://users.atw.hu/instlatx64/AuthenticAMD0660F51_K15_BristolRidge_InstLatX64.txt)  Agner's numbers are different.
        __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,    3,2,1,0));
        __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12,  7,6,5,4));

        __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);  // low 16 of every 32-bit element = useful
        __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);

        __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi); // 8-bit elements, preserves sign bit
        unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
        cmpmask &= 0x55555555;  // discard odd bits
        if (cmpmask) {
            return i*16 + __builtin_ctz(cmpmask)/2;
        }
    }
    return -1;
}
#endif

This is good for searches that normally find a hit quickly, especially in less than the first 32 bytes of data. It's not bad for big searches (but is still pure brute force, only checking 1 word at a time), and on Skylake maybe not worse than checking 16 offsets of multiple windows in parallel.

This is tuned for Skylake, on other CPUs, where variable-shifts are less efficient, you might consider just 1 variable shift for offsets 0..7, and then create offsets 8..15 by shifting that. Or something else entirely.

This compiles surprisingly well with gcc/clang (on Godbolt), with an inner loop that broadcasts straight from memory. (Optimizing the memcpy unaligned load and the set1() into a single vpbroadcastd)

Also included on the Godbolt link is a test main that runs it on a small array. (I may not have tested since the last tweak, but I did test it earlier and the packing + bit-scan stuff does work.)

## clang8.0  -O3 -march=skylake  inner loop
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpbroadcastd    ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
        vpsrlvd ymm4, ymm3, ymm1
        vpsrlvd ymm3, ymm3, ymm2             # shift 2 ways
        vpcmpeqw        ymm4, ymm4, ymm0
        vpcmpeqw        ymm3, ymm3, ymm0     # compare those results
        vpacksswb       ymm3, ymm4, ymm3     # pack to 8-bit elements
        vpmovmskb       ecx, ymm3            # scalar bitmask
        and     ecx, 1431655765              # see if any even elements matched
        jne     .LBB0_4            # break out of the loop on found, going to a tzcnt / ... epilogue
        add     rdx, 1
        add     r8, 16           # stupid compiler, calculate this with a multiply on a hit.
        cmp     rdx, rsi
        jb      .LBB0_2                    # } while(i<len-1);
        # fall through to not-found.

That's 8 uops of work + 3 uops of loop overhead (assuming macro-fusion of and/jne, and of cmp/jb, which we'll get on Haswell/Skylake). On AMD where 256-bit instructions are multiple uops, it'll be more.

Or of course using plain right-shift immediate to shift all elements by 1, and check multiple windows in parallel instead of multiple offsets in the same window.

Without efficient variable-shift (especially without AVX2 at all), that would be better for big searches, even if it requires a bit more work to sort out where the first hit is located in case there is a hit. (After finding a hit somewhere other than the lowest element, you need to check all remaining offsets of all earlier windows.)

回复收藏 0 原文

金橙橙 2024-08-15 01:57:47

SIMD 指令似乎很有用。 SSE2 添加了一堆整数指令，用于同时处理多个整数，但我无法想象有多少解决方案不涉及大量位移，因为您的数据不会对齐。这实际上听起来像是 FPGA 应该做的事情。

回复收藏 0 原文

财迷小姐 2024-08-15 01:57:47

我要做的是创建 16 个前缀和 16 个后缀。然后对于每个 16 位输入块确定最长的后缀匹配。如果下一个块具有长度为 (16-N) 的前缀匹配，那么您就获得了匹配。

后缀匹配实际上并不是 16 次比较。然而，这需要根据模式词进行预先计算。例如，如果模式字是 101010101010101010，您可以首先测试 16 位输入块的最后一位。如果该位为 0，则只需测试 ...10101010 就足够了。如果最后一位是1，则需要测试...1010101是否足够。每个都有 8 个，总共 1+8 次比较。如果模式字是 1111111111110000，您仍然需要测试输入的最后一位后缀匹配。如果该位为 1，则必须执行 12 次后缀匹配（正则表达式：1{1,12}），但如果为 0，则只有 4 个可能的匹配（正则表达式 1111 1111 1111 0{1,4}），同样是平均值共 9 项测试。添加 16-N 前缀匹配，您会发现每个 16 位块只需要 10 次检查。

回复收藏 0 原文

请你别敷衍 2024-08-15 01:57:47

对于通用的非 SIMD 算法，您不可能比这样做得更好：

unsigned int const pattern = pattern to search for
unsigned int accumulator = first three input bytes

do
{
  bool const found = ( ((accumulator   ) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>1) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>2) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>3) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>4) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>5) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>6) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>7) & ((1<<16)-1)) == pattern );
  if( found ) { /* pattern found */ }
  accumulator >>= 8;

  unsigned int const data = next input byte
  accumulator |= (data<<8);
} while( there is input data left );

For a general-purpose, non-SIMD algorithm you are unlikely to be able to do much better than something like this:

unsigned int const pattern = pattern to search for
unsigned int accumulator = first three input bytes

do
{
  bool const found = ( ((accumulator   ) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>1) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>2) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>3) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>4) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>5) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>6) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>7) & ((1<<16)-1)) == pattern );
  if( found ) { /* pattern found */ }
  accumulator >>= 8;

  unsigned int const data = next input byte
  accumulator |= (data<<8);
} while( there is input data left );

回复收藏 0 原文

夏末的微笑 2024-08-15 01:57:47

您可以对极大的输入（n 值）使用快速傅里叶变换，以在 O(n log n ) 时间内找到任何位模式。计算位掩码与输入的互相关。序列 x 和大小分别为 n 和 n' 的掩码 y 的互相关由与

R(m) = sum  _ k = 0 ^ n' x_{k+m} y_k

掩码完全匹配的位模式的出现情况定义，其中 R(m) = Y，其中 Y 是位中一个的总和面具。

因此，如果您尝试匹配位模式

[0 0 1 0 1 0]

，

[ 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1]

则必须使用掩码，

[-1 -1  1 -1  1 -1]

掩码中的 -1 保证这些位置必须为 0。

您可以使用 FFT 在 O(n log n ) 时间内实现互相关。

我认为 KMP 的运行时间为 O(n + k)，所以它胜过这个。

You can use the fast fourier transform for extremely large input (value of n) to find any bit pattern in O(n log n ) time. Compute the cross-correlation of a bit mask with the input. Cross -correlation of a sequence x and a mask y with a size n and n' respectively is defined by

R(m) = sum  _ k = 0 ^ n' x_{k+m} y_k

then occurences of your bit pattern that match the mask exactly where R(m) = Y where Y is the sum of one's in your bit mask.

So if you are trying to match for the bit pattern

[0 0 1 0 1 0]

[ 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1]

then you must use the mask

[-1 -1  1 -1  1 -1]

the -1's in the mask guarantee that those places must be 0.

You can implement cross-correlation, using the FFT in O(n log n ) time.

I think KMP has O(n + k) runtime, so it beats this out.

回复收藏 0 原文

沫离伤花 2024-08-15 01:57:47

也许您应该在向量（vec_str）中流式传输您的比特流，在另一个向量（vec_pattern）中流式传输您的模式，然后执行类似于下面的算法的操作

i=0
while i<vec_pattern.length
    j=0
    while j<vec_str.length
            if (vec_str[j] xor vec_pattern[i])
                i=0
                j++

（希望该算法是正确的）

Maybe you should stream in your bit stream in a vector (vec_str), stream in your pattern in another vector (vec_pattern) and then do something like the algorithm below

i=0
while i<vec_pattern.length
    j=0
    while j<vec_str.length
            if (vec_str[j] xor vec_pattern[i])
                i=0
                j++

(hope the algorithm is correct)

回复收藏 0 原文

杯别 2024-08-15 01:57:47

在大位串中查找匹配的一种快速方法是计算一个查找表，该表显示给定输入字节与模式匹配的位偏移量。然后将三个连续的偏移匹配组合在一起，您可以获得一个位向量，显示哪些偏移与整个模式匹配。例如，如果字节 x 匹配模式的前 3 位，字节 x+1 匹配位 3..11，字节 x+2 匹配位 11..16，则字节 x + 5 位匹配。

下面是执行此操作的一些示例代码，一次累积两个字节的结果：

void find_matches(unsigned char* sequence, int n_sequence, unsigned short pattern) {
    if (n_sequence < 2)
        return; // 0 and 1 byte bitstring can't match a short

    // Calculate a lookup table that shows for each byte at what bit offsets
    // the pattern could match.
    unsigned int match_offsets[256];
    for (unsigned int in_byte = 0; in_byte < 256; in_byte++) {
        match_offsets[in_byte] = 0xFF;
        for (int bit = 0; bit < 24; bit++) {
            match_offsets[in_byte] <<= 1;
            unsigned int mask = (0xFF0000 >> bit) & 0xFFFF;
            unsigned int match_location = (in_byte << 16) >> bit;
            match_offsets[in_byte] |= !((match_location ^ pattern) & mask);
        }
    }

    // Go through the input 2 bytes at a time, looking up where they match and
    // anding together the matches offsetted by one byte. Each bit offset then
    // shows if the input sequence is consistent with the pattern matching at
    // that position. This is anded together with the large offsets of the next
    // result to get a single match over 3 bytes.
    unsigned int curr, next;
    curr = 0;
    for (int pos = 0; pos < n_sequence-1; pos+=2) {
        next = ((match_offsets[sequence[pos]] << 8) | 0xFF) & match_offsets[sequence[pos+1]];
        unsigned short match = curr & (next >> 16);
        if (match)
            output_match(pos, match);
        curr = next;
    }
    // Handle the possible odd byte at the end
    if (n_sequence & 1) {
        next = (match_offsets[sequence[n_sequence-1]] << 8) | 0xFF;
        unsigned short match = curr & (next >> 16);
        if (match)
            output_match(n_sequence-1, match);
    }
}

void output_match(int pos, unsigned short match) {
    for (int bit = 15; bit >= 0; bit--) {
        if (match & 1) {
            printf("Bitstring match at byte %d bit %d\n", (pos-2) + bit/8, bit % 8);
        }
        match >>= 1;
    }
}

该主循环有 18 条指令长，每次迭代处理 2 个字节。如果设置成本不是问题，这应该是尽可能快的。

A fast way to find the matches in big bitstrings would be to calculate a lookup table that shows the bit offsets where a given input byte matches the pattern. Then combining three consecutive offset matches together you can get a bit vector that shows which offsets match the whole pattern. For example if byte x matches first 3 bits of the pattern, byte x+1 matches bits 3..11 and byte x+2 matches bits 11..16, then there is a match at byte x + 5 bits.

Here's some example code that does this, accumulating the results for two bytes at a time:

void find_matches(unsigned char* sequence, int n_sequence, unsigned short pattern) {
    if (n_sequence < 2)
        return; // 0 and 1 byte bitstring can't match a short

    // Calculate a lookup table that shows for each byte at what bit offsets
    // the pattern could match.
    unsigned int match_offsets[256];
    for (unsigned int in_byte = 0; in_byte < 256; in_byte++) {
        match_offsets[in_byte] = 0xFF;
        for (int bit = 0; bit < 24; bit++) {
            match_offsets[in_byte] <<= 1;
            unsigned int mask = (0xFF0000 >> bit) & 0xFFFF;
            unsigned int match_location = (in_byte << 16) >> bit;
            match_offsets[in_byte] |= !((match_location ^ pattern) & mask);
        }
    }

    // Go through the input 2 bytes at a time, looking up where they match and
    // anding together the matches offsetted by one byte. Each bit offset then
    // shows if the input sequence is consistent with the pattern matching at
    // that position. This is anded together with the large offsets of the next
    // result to get a single match over 3 bytes.
    unsigned int curr, next;
    curr = 0;
    for (int pos = 0; pos < n_sequence-1; pos+=2) {
        next = ((match_offsets[sequence[pos]] << 8) | 0xFF) & match_offsets[sequence[pos+1]];
        unsigned short match = curr & (next >> 16);
        if (match)
            output_match(pos, match);
        curr = next;
    }
    // Handle the possible odd byte at the end
    if (n_sequence & 1) {
        next = (match_offsets[sequence[n_sequence-1]] << 8) | 0xFF;
        unsigned short match = curr & (next >> 16);
        if (match)
            output_match(n_sequence-1, match);
    }
}

void output_match(int pos, unsigned short match) {
    for (int bit = 15; bit >= 0; bit--) {
        if (match & 1) {
            printf("Bitstring match at byte %d bit %d\n", (pos-2) + bit/8, bit % 8);
        }
        match >>= 1;
    }
}

The main loop of this is 18 instructions long and processes 2 bytes per iteration. If the setup cost isn't an issue, this should be about as fast as it gets.

回复收藏 0 原文

~没有更多了~