在并行位片代码中实现快速计数器
我正在寻找计数器的优化实现,可能类似于格雷码,这将使我能够快速单步遍历位片数组中的数字。
假设我有这个数组:
_m256 header[640];
我需要不断更改第 608 - 639 位中的计数器。256 位中的每一个都代表一个单独的并行计数器。
一个“增量”运算最多需要 31 次运算:AND 计算进位,XOR 计算值,每个位置重复一次。
格雷码应该只需要异或,但我不知道计算索引的有效方法 - 它似乎需要多达 31 次操作来确定位位置。
理想情况下,我想要一个需要少量 ALU 操作来确定要更改的位的计数器。有人知道有什么有用的吗?
I am looking for an optimized implementation of a counter, likely similar to gray-code, that will allow me to rapidly step through numbers in a bitsliced array.
Assuming I have the array:
_m256 header[640];
I need to keep changing a counter in bits 608 - 639. Each of the 256 bits represents a seperate, parallel counter.
An 'increment' operation takes up to 31 operations: AND to calculate carry, XOR to calculate value, repeated for each position.
Gray-code should only need xor, but I am unaware of an efficient way to calculate the index - it seems to require up to 31 operations to determine a bit position.
Ideally I would like a counter that requires a small number of ALU operations to determine what bit to change. Does anyone know of something that would be helpful?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
LRS 可以使用少量 XOR 生成包含所有非零数字 1..2^n-1 的序列,但在每个阶段将所有位向左移动。 http://www.ee 有一些信息。 unb.ca/cgi-bin/tervo/sequence.pl?binary=11111。 XOR 的数量取决于抽头的数量。 http://www.newwaveinstruments.com/resources/articles/m_sequence_linear_feedback_shift_register_lfsr/32stages.txt。当然,生成的序列是无序的——它显然是随机的。
An LRS can generate a sequence containing all the non-zero numbers 1..2^n-1, using a small number of XORs, but shifting all the bits left at each stage. There is some info at http://www.ee.unb.ca/cgi-bin/tervo/sequence.pl?binary=11111. The number of XORs depends on the number of taps. There is a list of LRS configurations for 32 bits with few taps at http://www.newwaveinstruments.com/resources/articles/m_sequence_linear_feedback_shift_register_lfsr/32stages.txt. Of course the sequence generated is out of order - it is apparently random.
有趣的论文:格雷码< /a>. (这是一个 PDF)
PDF 第 16 页包含一个算法,用于查找要切换的位。一般情况下,需要32次运算才能确定要改变哪一位。如果您可以从计数器中腾出一位(这将使其实际上成为 31 位计数器),则可以使平均增量时间需要更少的操作。
由于每当奇偶校验为偶数时就会切换低位,因此如果您可以保留奇偶校验位,那么每个其他增量操作都将只涉及切换低位。当然,您会在每次操作时切换奇偶校验位。
当奇偶校验为奇数时,您只需执行算法中找到要切换的位的部分即可。但由于您已经知道奇偶校验是奇数,因此您不必检查每一位。当找到满足条件的位时就可以停止。
我对 SIMD 不太熟悉,无法判断是否有任何方法可以优化它。
也就是说,我不清楚这样的事情是否会比“常规”二进制计数器所进行的“最多 31 次操作”有所改进。同样,一半的操作只需要一次迭代。使用格雷码的主要优点似乎是只需要一次写入内存(如果使用上面的奇偶校验位方法,则需要写入两次),而另一种方法可能需要最多 32 次写入内存。
Interesting paper: The Gray Code. (It's a PDF)
Page 16 of the PDF contains an algorithm for finding which bit to toggle. In the general case, it requires 32 operations to determine which bit to change. If you can spare one bit from your counter (which would make it effectively a 31-bit counter), you can make the average increment time take fewer operations.
Since the low bit is toggled whenever the parity is even, if you can keep a parity bit then every other increment operation will involve just toggling the low bit. And, of course, you'd toggle the parity bit with every operation.
When parity is odd, you just have to do the part of the algorithm that finds the bit to switch. But since you already know that the parity is odd, you don't have to examine every bit. You can stop when you find the bit that meets the condition.
I'm not familiar enough with SIMD to say if there's any way you can optimize that.
That said, it's not clear to me that such a thing would be an improvement over the "up to 31 operations" that the "regular" binary counter would take. There again, half of your operations will require just one iteration. It seems that the primary advantage of using Gray Code would be that only requires one write to memory (two, if you use the parity bit approach above), whereas the other method could require up to 32 writes to memory.
您可以通过这种方式实现 256 个并行线性移位反馈寄存器(基数==608)
这将有2^32-1 种不同的状态。如果 2^31-1 足够,请使用抽头 27 和 30 而不是 0,1,21,31。幻数来自 http://www.xilinx.com/support/documentation/ application_notes/xapp052.pdf。
You can implement 256 parallel Linear Shift Feedback Register this way (with base==608)
This will have 2^32-1 different states. If 2^31-1 is sufficient, use taps 27 and 30 instead of 0,1,21,31. The magic numbers are from http://www.xilinx.com/support/documentation/application_notes/xapp052.pdf.