什么是复制未对齐位数组的省时算法?

发布于 2024-09-15 07:40:02 字数 1391 浏览 8 评论 0原文

过去我曾多次这样做过,但我对结果一直不满意。

任何人都可以建议一种将连续位数组从源复制到目标的快速方法,其中源和目标的位置可能无法在方便的处理器边界上对齐(右移)?

如果源和目标都不是' t对齐,问题可以很快变成只有其中一个未对齐的问题(在第一个副本之后)。

作为一个起点,我的代码不可避免地最终看起来像下面这样(未经测试,忽略副作用,这只是一个即兴的例子):(

const char mask[8] = { 1, 3, 7, 15, 31, 63, 127, 255 };
/* Assume:
 * - destination is already zeroed,
 * - offsets are right shifts
 * - bits to copy is big (> 32 say)
 */
int bitarray_copy(char * src, int src_bit_offset, int src_bit_len,
                  char * dst, int dst_bit_offset) {
    if (src_bit_offset == dst_bit_offset) { /* Not very interesting */ 
    } else {
        int bit_diff_offset = src_bit_offset - dst_bit_offset; /* assume positive */
        int loop_count;
        char c;
        char mask_val = mask[bit_diff_offset];

        /* Get started, line up the destination. */
        c  = (*src++ << bit_diff_offset) | ((*src >> (8 - bit_diff_offset)) & mask_val);
        c &= mask[8-dst_bit_offset];

        *dst++ |= c;

        src_bit_len -= 8 - dst_bit_offset;
        loop_count = src_bit_len >> 3;

        while (--loop_count >= 0) 
            * dst ++ = (*src++ << bit_diff_offset) | ((*src >> (8 - bit_diff_offset)) & mask_val);

        /* Trailing tail copy etc ... */
        if (src_bit_len % 8) /* ... */
    }
}

实际上这比我之前做的要好。它看起来还不错)

I've had to do this many times in the past, and I've never been satisfied with the results.

Can anyone suggest a fast way of copying a contiguous bit array from source to destination where both the source and destination's may not be aligned (right shifted) on convenient processor boundaries?

If both the source and destination's aren't aligned , the problem can quickly be changed into one where only either of them aren't aligned (after the first copy say).

As a starting point, my code inevitably ends up looking something like the following (untested, ignore side effects this is just an off the cuff example):

const char mask[8] = { 1, 3, 7, 15, 31, 63, 127, 255 };
/* Assume:
 * - destination is already zeroed,
 * - offsets are right shifts
 * - bits to copy is big (> 32 say)
 */
int bitarray_copy(char * src, int src_bit_offset, int src_bit_len,
                  char * dst, int dst_bit_offset) {
    if (src_bit_offset == dst_bit_offset) { /* Not very interesting */ 
    } else {
        int bit_diff_offset = src_bit_offset - dst_bit_offset; /* assume positive */
        int loop_count;
        char c;
        char mask_val = mask[bit_diff_offset];

        /* Get started, line up the destination. */
        c  = (*src++ << bit_diff_offset) | ((*src >> (8 - bit_diff_offset)) & mask_val);
        c &= mask[8-dst_bit_offset];

        *dst++ |= c;

        src_bit_len -= 8 - dst_bit_offset;
        loop_count = src_bit_len >> 3;

        while (--loop_count >= 0) 
            * dst ++ = (*src++ << bit_diff_offset) | ((*src >> (8 - bit_diff_offset)) & mask_val);

        /* Trailing tail copy etc ... */
        if (src_bit_len % 8) /* ... */
    }
}

(actually this is better than I've done before. It doesn't look too bad)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

千仐 2024-09-22 07:40:02

这就是我最终所做的。 (编辑于 2014 年 8 月 21 日针对一位复制错误进行了更改。)

#include <limits.h>
#include <string.h>
#include <stddef.h>

#define PREPARE_FIRST_COPY()                                      \
    do {                                                          \
    if (src_len >= (CHAR_BIT - dst_offset_modulo)) {              \
        *dst     &= reverse_mask[dst_offset_modulo];              \
        src_len -= CHAR_BIT - dst_offset_modulo;                  \
    } else {                                                      \
        *dst     &= reverse_mask[dst_offset_modulo]               \
              | reverse_mask_xor[dst_offset_modulo + src_len];    \
         c       &= reverse_mask[dst_offset_modulo + src_len];    \
        src_len = 0;                                              \
    } } while (0)


static void
bitarray_copy(const unsigned char *src_org, int src_offset, int src_len,
                    unsigned char *dst_org, int dst_offset)
{
    static const unsigned char mask[] =
        { 0x00, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff };
    static const unsigned char reverse_mask[] =
        { 0x00, 0x80, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe, 0xff };
    static const unsigned char reverse_mask_xor[] =
        { 0xff, 0x7f, 0x3f, 0x1f, 0x0f, 0x07, 0x03, 0x01, 0x00 };

    if (src_len) {
        const unsigned char *src;
              unsigned char *dst;
        int                  src_offset_modulo,
                             dst_offset_modulo;

        src = src_org + (src_offset / CHAR_BIT);
        dst = dst_org + (dst_offset / CHAR_BIT);

        src_offset_modulo = src_offset % CHAR_BIT;
        dst_offset_modulo = dst_offset % CHAR_BIT;

        if (src_offset_modulo == dst_offset_modulo) {
            int              byte_len;
            int              src_len_modulo;
            if (src_offset_modulo) {
                unsigned char   c;

                c = reverse_mask_xor[dst_offset_modulo]     & *src++;

                PREPARE_FIRST_COPY();
                *dst++ |= c;
            }

            byte_len = src_len / CHAR_BIT;
            src_len_modulo = src_len % CHAR_BIT;

            if (byte_len) {
                memcpy(dst, src, byte_len);
                src += byte_len;
                dst += byte_len;
            }
            if (src_len_modulo) {
                *dst     &= reverse_mask_xor[src_len_modulo];
                *dst |= reverse_mask[src_len_modulo]     & *src;
            }
        } else {
            int             bit_diff_ls,
                            bit_diff_rs;
            int             byte_len;
            int             src_len_modulo;
            unsigned char   c;
            /*
             * Begin: Line things up on destination. 
             */
            if (src_offset_modulo > dst_offset_modulo) {
                bit_diff_ls = src_offset_modulo - dst_offset_modulo;
                bit_diff_rs = CHAR_BIT - bit_diff_ls;

                c = *src++ << bit_diff_ls;
                c |= *src >> bit_diff_rs;
                c     &= reverse_mask_xor[dst_offset_modulo];
            } else {
                bit_diff_rs = dst_offset_modulo - src_offset_modulo;
                bit_diff_ls = CHAR_BIT - bit_diff_rs;

                c = *src >> bit_diff_rs     &
                    reverse_mask_xor[dst_offset_modulo];
            }
            PREPARE_FIRST_COPY();
            *dst++ |= c;

            /*
             * Middle: copy with only shifting the source. 
             */
            byte_len = src_len / CHAR_BIT;

            while (--byte_len >= 0) {
                c = *src++ << bit_diff_ls;
                c |= *src >> bit_diff_rs;
                *dst++ = c;
            }

            /*
             * End: copy the remaing bits; 
             */
            src_len_modulo = src_len % CHAR_BIT;
            if (src_len_modulo) {
                c = *src++ << bit_diff_ls;
                c |= *src >> bit_diff_rs;
                c     &= reverse_mask[src_len_modulo];

                *dst     &= reverse_mask_xor[src_len_modulo];
                *dst |= c;
            }
        }
    }
}

This is what I ended up doing. (EDIT Changed on 8/21/2014 for a single bit copy bug.)

#include <limits.h>
#include <string.h>
#include <stddef.h>

#define PREPARE_FIRST_COPY()                                      \
    do {                                                          \
    if (src_len >= (CHAR_BIT - dst_offset_modulo)) {              \
        *dst     &= reverse_mask[dst_offset_modulo];              \
        src_len -= CHAR_BIT - dst_offset_modulo;                  \
    } else {                                                      \
        *dst     &= reverse_mask[dst_offset_modulo]               \
              | reverse_mask_xor[dst_offset_modulo + src_len];    \
         c       &= reverse_mask[dst_offset_modulo + src_len];    \
        src_len = 0;                                              \
    } } while (0)


static void
bitarray_copy(const unsigned char *src_org, int src_offset, int src_len,
                    unsigned char *dst_org, int dst_offset)
{
    static const unsigned char mask[] =
        { 0x00, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff };
    static const unsigned char reverse_mask[] =
        { 0x00, 0x80, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe, 0xff };
    static const unsigned char reverse_mask_xor[] =
        { 0xff, 0x7f, 0x3f, 0x1f, 0x0f, 0x07, 0x03, 0x01, 0x00 };

    if (src_len) {
        const unsigned char *src;
              unsigned char *dst;
        int                  src_offset_modulo,
                             dst_offset_modulo;

        src = src_org + (src_offset / CHAR_BIT);
        dst = dst_org + (dst_offset / CHAR_BIT);

        src_offset_modulo = src_offset % CHAR_BIT;
        dst_offset_modulo = dst_offset % CHAR_BIT;

        if (src_offset_modulo == dst_offset_modulo) {
            int              byte_len;
            int              src_len_modulo;
            if (src_offset_modulo) {
                unsigned char   c;

                c = reverse_mask_xor[dst_offset_modulo]     & *src++;

                PREPARE_FIRST_COPY();
                *dst++ |= c;
            }

            byte_len = src_len / CHAR_BIT;
            src_len_modulo = src_len % CHAR_BIT;

            if (byte_len) {
                memcpy(dst, src, byte_len);
                src += byte_len;
                dst += byte_len;
            }
            if (src_len_modulo) {
                *dst     &= reverse_mask_xor[src_len_modulo];
                *dst |= reverse_mask[src_len_modulo]     & *src;
            }
        } else {
            int             bit_diff_ls,
                            bit_diff_rs;
            int             byte_len;
            int             src_len_modulo;
            unsigned char   c;
            /*
             * Begin: Line things up on destination. 
             */
            if (src_offset_modulo > dst_offset_modulo) {
                bit_diff_ls = src_offset_modulo - dst_offset_modulo;
                bit_diff_rs = CHAR_BIT - bit_diff_ls;

                c = *src++ << bit_diff_ls;
                c |= *src >> bit_diff_rs;
                c     &= reverse_mask_xor[dst_offset_modulo];
            } else {
                bit_diff_rs = dst_offset_modulo - src_offset_modulo;
                bit_diff_ls = CHAR_BIT - bit_diff_rs;

                c = *src >> bit_diff_rs     &
                    reverse_mask_xor[dst_offset_modulo];
            }
            PREPARE_FIRST_COPY();
            *dst++ |= c;

            /*
             * Middle: copy with only shifting the source. 
             */
            byte_len = src_len / CHAR_BIT;

            while (--byte_len >= 0) {
                c = *src++ << bit_diff_ls;
                c |= *src >> bit_diff_rs;
                *dst++ = c;
            }

            /*
             * End: copy the remaing bits; 
             */
            src_len_modulo = src_len % CHAR_BIT;
            if (src_len_modulo) {
                c = *src++ << bit_diff_ls;
                c |= *src >> bit_diff_rs;
                c     &= reverse_mask[src_len_modulo];

                *dst     &= reverse_mask_xor[src_len_modulo];
                *dst |= c;
            }
        }
    }
}
一百个冬季 2024-09-22 07:40:02

您的内部循环采用两个字节并将它们移动到目标字节。这几乎是最佳的。以下是一些其他提示(排名不分先后):

  • 无需将自己限制为一次一个字节。使用您的平台允许您使用的最大整数大小。这当然会让你的开始和结尾逻辑变得复杂。
  • 如果您使用无符号字符或整数,则可能不需要在右移后屏蔽源的第二部分。这取决于您的编译器。
  • 如果您确实需要掩码,请确保您的编译器将表查找移到循环之外。如果不是,请将其复制到临时变量并使用它。

Your inner loop takes pieces of two bytes and moves them to a destination byte. That's almost optimal. Here are a few more hints in no particular order:

  • There's no need to limit yourself to a byte at a time. Use the largest integer size your platform will let you get away with. This of course will complicate your starting and trailing logic.
  • If you use unsigned chars or integers, you may not need to mask the second piece of the source after it's shifted right. This will depend on your compiler.
  • If you do need the mask, make sure your compiler is moving the table lookup outside of the loop. If it isn't, copy it to a temporary variable and use that.
意中人 2024-09-22 07:40:02

最佳方案取决于目标平台。在一些没有桶形移位器的平台上,将整个向量右移或左移一位 n 次(n<3)将是最快的方法(在 PIC18 平台上,一个 8 倍展开的字节循环左移一位将花费 11每八个字节的指令周期)。否则,我喜欢这种模式(注意 src2 必须根据您想要对缓冲区末尾执行的操作进行初始化),

  src1 = *src++;
  src2 = (src1 shl shiftamount1) | (src2 shr shiftamount2);
  *dest++ = src2;
  src2 = *src++;
  src1 = (src2 shl shiftamount1) | (src1 shr shiftamount2);
  *dest++ = src1;

这应该有助于在 ARM 上非常有效地实现(每两个字 8 个指令,如果寄存器可用于src、dest、src1、src2、shiftamount1 和 shiftamount2 可以通过多字加载/存储指令实现更快的操作。处理四个字类似于(每行一条机器指令,但前四行会在一起)。是一条指令,最后四行也是如此):

  src0 = *src++;
  src1 = *src++;
  src2 = *src++;
  src3 = *src++;
  tmp  = src0;
  src0 = src0 shr shiftamount1
  src0 = src0 | src1 shl shiftamount2
  src1 = src1 shr shiftamount1
  src1 = src1 | src2 shl shiftamount2
  src2 = src2 shr shiftamount1
  src2 = src2 | src3 shl shiftamount2
  src3 = src3 shr shiftamount1
  src3 = src3 | tmp shl shiftamount2
  *dest++ = src0;
  *dest++ = src1;
  *dest++ = src2;
  *dest++ = src3;

每 16 字节旋转 11 条指令。

What is optimal will depend upon the target platform. On some platforms without barrel shifters, shifting the whole vector right or left one bit, n times, for n<3, will be the fastest approach (on the PIC18 platform, an 8x-unrolled byte loop to shift left one bit will cost 11 instruction cycles per eight bytes). Otherwise, I like the pattern (note src2 will have to be initialized depending upon what you want done with the end of your buffer)

  src1 = *src++;
  src2 = (src1 shl shiftamount1) | (src2 shr shiftamount2);
  *dest++ = src2;
  src2 = *src++;
  src1 = (src2 shl shiftamount1) | (src1 shr shiftamount2);
  *dest++ = src1;

That should lend itself to very efficient implementation on an ARM (eight instructions every two words, if registers are available for src, dest, src1, src2, shiftamount1, and shiftamount2. Using more registers would allow faster operation via multi-word load/store instructions. Handling four words would be something like (one machine instruction per line, except the first four lines would together be one instruction, as would the last four lines ):

  src0 = *src++;
  src1 = *src++;
  src2 = *src++;
  src3 = *src++;
  tmp  = src0;
  src0 = src0 shr shiftamount1
  src0 = src0 | src1 shl shiftamount2
  src1 = src1 shr shiftamount1
  src1 = src1 | src2 shl shiftamount2
  src2 = src2 shr shiftamount1
  src2 = src2 | src3 shl shiftamount2
  src3 = src3 shr shiftamount1
  src3 = src3 | tmp shl shiftamount2
  *dest++ = src0;
  *dest++ = src1;
  *dest++ = src2;
  *dest++ = src3;

Eleven instructions per 16 bytes rotated.

ヤ经典坏疍 2024-09-22 07:40:02

您的解决方案看起来与我见过的大多数解决方案类似:基本上在开始和结束时执行一些未对齐的工作,中间的主循环使用对齐访问。如果您确实需要效率并在很长的比特流上执行此操作,我建议在主循环中使用特定于体系结构的东西,例如 SSE2。

Your solution looks similar to most I've seen: basically do some unaligned work at the start and end, with the main loop in the middle using aligned accesses. If you really need efficiency and do this on very long bitstreams, I would suggest using something architecture-specific like SSE2 in the main loop.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文