什么是复制未对齐位数组的省时算法?
过去我曾多次这样做过,但我对结果一直不满意。
任何人都可以建议一种将连续位数组从源复制到目标的快速方法,其中源和目标的位置可能无法在方便的处理器边界上对齐(右移)?
如果源和目标都不是' t对齐,问题可以很快变成只有其中一个未对齐的问题(在第一个副本之后)。
作为一个起点,我的代码不可避免地最终看起来像下面这样(未经测试,忽略副作用,这只是一个即兴的例子):(
const char mask[8] = { 1, 3, 7, 15, 31, 63, 127, 255 };
/* Assume:
* - destination is already zeroed,
* - offsets are right shifts
* - bits to copy is big (> 32 say)
*/
int bitarray_copy(char * src, int src_bit_offset, int src_bit_len,
char * dst, int dst_bit_offset) {
if (src_bit_offset == dst_bit_offset) { /* Not very interesting */
} else {
int bit_diff_offset = src_bit_offset - dst_bit_offset; /* assume positive */
int loop_count;
char c;
char mask_val = mask[bit_diff_offset];
/* Get started, line up the destination. */
c = (*src++ << bit_diff_offset) | ((*src >> (8 - bit_diff_offset)) & mask_val);
c &= mask[8-dst_bit_offset];
*dst++ |= c;
src_bit_len -= 8 - dst_bit_offset;
loop_count = src_bit_len >> 3;
while (--loop_count >= 0)
* dst ++ = (*src++ << bit_diff_offset) | ((*src >> (8 - bit_diff_offset)) & mask_val);
/* Trailing tail copy etc ... */
if (src_bit_len % 8) /* ... */
}
}
实际上这比我之前做的要好。它看起来还不错)
I've had to do this many times in the past, and I've never been satisfied with the results.
Can anyone suggest a fast way of copying a contiguous bit array from source to destination where both the source and destination's may not be aligned (right shifted) on convenient processor boundaries?
If both the source and destination's aren't aligned , the problem can quickly be changed into one where only either of them aren't aligned (after the first copy say).
As a starting point, my code inevitably ends up looking something like the following (untested, ignore side effects this is just an off the cuff example):
const char mask[8] = { 1, 3, 7, 15, 31, 63, 127, 255 };
/* Assume:
* - destination is already zeroed,
* - offsets are right shifts
* - bits to copy is big (> 32 say)
*/
int bitarray_copy(char * src, int src_bit_offset, int src_bit_len,
char * dst, int dst_bit_offset) {
if (src_bit_offset == dst_bit_offset) { /* Not very interesting */
} else {
int bit_diff_offset = src_bit_offset - dst_bit_offset; /* assume positive */
int loop_count;
char c;
char mask_val = mask[bit_diff_offset];
/* Get started, line up the destination. */
c = (*src++ << bit_diff_offset) | ((*src >> (8 - bit_diff_offset)) & mask_val);
c &= mask[8-dst_bit_offset];
*dst++ |= c;
src_bit_len -= 8 - dst_bit_offset;
loop_count = src_bit_len >> 3;
while (--loop_count >= 0)
* dst ++ = (*src++ << bit_diff_offset) | ((*src >> (8 - bit_diff_offset)) & mask_val);
/* Trailing tail copy etc ... */
if (src_bit_len % 8) /* ... */
}
}
(actually this is better than I've done before. It doesn't look too bad)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
这就是我最终所做的。 (编辑于 2014 年 8 月 21 日针对一位复制错误进行了更改。)
This is what I ended up doing. (EDIT Changed on 8/21/2014 for a single bit copy bug.)
您的内部循环采用两个字节并将它们移动到目标字节。这几乎是最佳的。以下是一些其他提示(排名不分先后):
Your inner loop takes pieces of two bytes and moves them to a destination byte. That's almost optimal. Here are a few more hints in no particular order:
最佳方案取决于目标平台。在一些没有桶形移位器的平台上,将整个向量右移或左移一位 n 次(n<3)将是最快的方法(在 PIC18 平台上,一个 8 倍展开的字节循环左移一位将花费 11每八个字节的指令周期)。否则,我喜欢这种模式(注意 src2 必须根据您想要对缓冲区末尾执行的操作进行初始化),
这应该有助于在 ARM 上非常有效地实现(每两个字 8 个指令,如果寄存器可用于src、dest、src1、src2、shiftamount1 和 shiftamount2 可以通过多字加载/存储指令实现更快的操作。处理四个字类似于(每行一条机器指令,但前四行会在一起)。是一条指令,最后四行也是如此):
每 16 字节旋转 11 条指令。
What is optimal will depend upon the target platform. On some platforms without barrel shifters, shifting the whole vector right or left one bit, n times, for n<3, will be the fastest approach (on the PIC18 platform, an 8x-unrolled byte loop to shift left one bit will cost 11 instruction cycles per eight bytes). Otherwise, I like the pattern (note src2 will have to be initialized depending upon what you want done with the end of your buffer)
That should lend itself to very efficient implementation on an ARM (eight instructions every two words, if registers are available for src, dest, src1, src2, shiftamount1, and shiftamount2. Using more registers would allow faster operation via multi-word load/store instructions. Handling four words would be something like (one machine instruction per line, except the first four lines would together be one instruction, as would the last four lines ):
Eleven instructions per 16 bytes rotated.
您的解决方案看起来与我见过的大多数解决方案类似:基本上在开始和结束时执行一些未对齐的工作,中间的主循环使用对齐访问。如果您确实需要效率并在很长的比特流上执行此操作,我建议在主循环中使用特定于体系结构的东西,例如 SSE2。
Your solution looks similar to most I've seen: basically do some unaligned work at the start and end, with the main loop in the middle using aligned accesses. If you really need efficiency and do this on very long bitstreams, I would suggest using something architecture-specific like SSE2 in the main loop.