Optimizing array compaction

Posted on 2024-12-11 15:24:55

Let's say I have an array
k = [1 2 0 0 5 4 0]

I can compute a mask as follows
m = k > 0 = [1 1 0 0 1 1 0]

Using only the mask m and the following operations

Shift left / right
And/Or
Add/Subtract/Multiply

I can compact k into the following
[1 2 5 4]

Here's how I currently do it (MATLAB pseudocode):

function out = compact( in )
    d = in
    for i = 1:size(in, 2) %do (# of items in in) passes
        m = d > 0
        %shift left, pad w/ 0 on right
        ml = [m(2:end) 0] % shift
        dl = [d(2:end) 0] % shift

        %if the data originally has a gap, fill it in w/ the 
        %left shifted one
        use = (m == 0) & (ml == 1) %2 comparison  

        d = use .* dl + ~use .* d

        %zero out elements that have been moved to the left
        use_r = [0 use(1:end-1)]
        d = d .* ~use_r
    end

    out = d(1 : size(find(in > 0), 2)) %truncate the end
end

Intuition

In each iteration, we shift the mask left and compare it with the unshifted mask. If an index that was originally void (mask[i] = 0) is valid (mask[i] = 1) after the shift, we set that index to the left-shifted data.

Question

The above algorithm has complexity O(N * (3 shifts + 2 comparisons + AND + add + 3 multiplies)). Is there a way to improve its efficiency?

最后的乘客 2024-12-18 15:24:55

There is not much to optimize in the original pseudocode. I see several small improvements:

the loop may perform one iteration fewer (i.e. size-1),
if 'use' is all zeros, you may break out of the loop early,
use = (m == 0) & (ml == 1) can probably be simplified to use = ~m & ml,
if ~ is counted as a separate operation, it would be better to use the inverted form: use = m | ~ml, d = ~use .* dl + use .* d, use_r = [1 use(1:end-1)], d = d .* use_r
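
The first three improvements can be sketched in C++ as follows. This is only an illustration, not the answerer's code: the function name compact_shift and the use of std::vector<int> are assumptions, and each element-wise MATLAB operation is written as an explicit inner loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical C++ rendering of the shift-based pseudocode, with at most
// size-1 passes, an early exit, and use = ~m & ml.
std::vector<int> compact_shift(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> d(in);

    // Number of valid (> 0) elements, used to truncate the result.
    std::size_t valid = 0;
    for (int v : in) {
        if (v > 0) ++valid;
    }

    for (std::size_t pass = 0; pass + 1 < n; ++pass) {  // at most size-1 passes
        std::vector<int> next(d);
        bool moved = false;
        for (std::size_t i = 0; i + 1 < n; ++i) {
            // use = ~m & ml: slot i is empty and its right neighbour is valid.
            const bool use = !(d[i] > 0) && (d[i + 1] > 0);
            if (use) {
                next[i] = d[i + 1];  // fill the gap with the left-shifted data
                next[i + 1] = 0;     // zero out the element that has moved
                moved = true;
            }
        }
        if (!moved) break;  // 'use' was all zeros: already compacted
        d.swap(next);
    }

    d.resize(valid);
    return d;
}
```

For the question's input {1, 2, 0, 0, 5, 4, 0} the loop stops after the fourth pass instead of running all seven.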

But it is possible to invent better algorithms, and the choice of algorithm depends on the CPU resources used:

Load-Store Unit, i.e. applying the algorithm directly to memory words. Nothing can be done here until chipmakers add a highly parallel SCATTER instruction to their instruction sets.
SSE registers, i.e. algorithms working on the entire 16 bytes of a register. Algorithms like the proposed pseudocode cannot help here, because we already have various shuffle/permute instructions that do the job better. Using various compare instructions with PMOVMSKB, grouping the result by 4 bits, and applying various shuffle instructions under switch/case (as described by LastCoder) is the best we can do.
SSE/AVX registers with the latest instruction sets allow a better approach. We can use the result of PMOVMSKB directly, transforming it into the control register for something like PSHUFB.
Integer registers, i.e. GPR registers, or working simultaneously on several DWORD/QWORD parts of SSE/AVX registers (which allows several independent compactions to be performed). The proposed pseudocode applied to integer registers allows compacting binary subsets of any length (from 2 to 20 bits). Here is my algorithm, which is likely to perform better.

C++, 64 bit, subset width = 8:

typedef unsigned long long ull;

// Compacts the nonzero bytes of x toward the most significant end.
ull compact_bytes(ull x)
{
  const ull h = 0x8080808080808080ULL;
  const ull l = 0x0101010101010101ULL;
  const ull end = 0xffffffffffffffffULL;

  // set hi bit for zero bytes (see D. Knuth, volume 4)
  ull m = h & ~(x | ((x|h) - l));

  // bitmask for nonzero bytes
  m = ~(m | (m - (m>>7)));

  // tail zero bytes need no special treatment
  m |= (m - 1);

  while (m != end)
  {
    ull tailm = m ^ (m + 1); // bytes to be processed
    ull tailx = x & tailm; // get the bytes
    tailm |= (tailm << 8); // shift 1 byte at a time
    m |= tailm; // all processed bytes are masked
    x = (x ^ tailx) | (tailx << 8); // actual byte shift
  }
  return x;
}

// uncompacted bytes; the result is 0x0180238877000000
ull x = compact_bytes(0x0100802300887700ULL);
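
The PSHUFB idea from the third item in the list can be illustrated with a scalar model. The sketch below is not SIMD code and is not taken from the answer: build_shuffle_control and emulate_pshufb are hypothetical helpers, the mask is assumed to have bit i set when byte i should be kept, and the control bytes are built with a plain loop, whereas a real implementation would more likely use a lookup table or bit tricks.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Hypothetical helper: turns a 16-bit keep-mask (bit i set = keep byte i)
// into PSHUFB-style control bytes that pack the kept bytes to the front.
Bytes16 build_shuffle_control(std::uint16_t mask) {
    Bytes16 control;
    control.fill(0x80);  // a control byte with the high bit set yields zero
    std::size_t out = 0;
    for (std::uint8_t i = 0; i < 16; ++i) {
        if (mask & (1u << i)) control[out++] = i;  // i-th kept byte goes next
    }
    return control;
}

// Scalar model of what PSHUFB does on one 128-bit register.
Bytes16 emulate_pshufb(const Bytes16& src, const Bytes16& control) {
    Bytes16 dst{};
    for (std::size_t i = 0; i < 16; ++i) {
        if ((control[i] & 0x80) == 0) dst[i] = src[control[i] & 0x0F];
    }
    return dst;
}
```

With the mask 0x33 (bytes 0, 1, 4 and 5 kept), the control starts with 0, 1, 4, 5 and the remaining bytes are 0x80, so the four kept bytes land at the front and the rest of the register is cleared.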

寄居者 2024-12-18 15:24:55

So you need to figure out whether the extra parallelism and the shifting/shuffling overhead are worth it for such a simple task.

for(int inIdx = 0, outIdx = 0; inIdx < inLength; inIdx++) {
 if(mask[inIdx] == 1) {
  out[outIdx] = in[inIdx];
  outIdx++;
 }
}

If you want to go the parallel SIMD route, your best bet is a switch/case covering all possible permutations of the next 4 bits of the mask. Why not 8? Because the PSHUFD instruction can only shuffle within XMMx (m128) registers, not YMMx (m256).

So you make 16 cases:

[1 1 1 1], [1 1 1 0], [1 1 0 0], [1 0 0 0], [0 0 0 0] don't need any special shift/shuffle; you just copy the input to the output with MOVDQU and increment the output pointer by 4, 3, 2, 1, 0 respectively.
[0 1 1 1], [0 0 1 1], [0 1 1 0], [0 0 0 1], [0 1 0 0], [0 0 1 0] just need PSRLx (shift right logical); increment the output pointer by 3, 2, 2, 1, 1, 1 respectively.
[1 0 0 1], [1 0 1 0], [0 1 0 1], [1 0 1 1], [1 1 0 1] use PSHUFD to pack the input; then increment the output pointer by 2, 2, 2, 3, 3 respectively.

So every case needs only a minimal amount of processing (1 to 2 SIMD instructions and 1 output pointer addition). The loop surrounding the case statements handles the constant input pointer increment (by 4) and the MOVDQA that loads the input.
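
The structure of this approach can be sketched in portable C++ without intrinsics. Note the substitutions: a 16-entry lookup table stands in for the switch/case, and a scalar copy stands in for PSHUFD/PSRLx. The names LaneShuffle, build_lane_table and compact_by_groups_of_4 are hypothetical, and the mask is assumed to hold one 0/1 byte per element, with lane 0 mapped to bit 0.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per 4-bit mask: which source lanes to keep, in order, and how many.
struct LaneShuffle {
    std::array<std::uint8_t, 4> lanes;
    std::uint8_t count;
};

// Builds the 16 cases once; bit i of the case index marks lane i as valid.
std::array<LaneShuffle, 16> build_lane_table() {
    std::array<LaneShuffle, 16> table{};
    for (unsigned bits = 0; bits < 16; ++bits) {
        LaneShuffle entry{};
        for (std::uint8_t lane = 0; lane < 4; ++lane) {
            if (bits & (1u << lane)) entry.lanes[entry.count++] = lane;
        }
        table[bits] = entry;
    }
    return table;
}

std::vector<float> compact_by_groups_of_4(const std::vector<float>& in,
                                          const std::vector<std::uint8_t>& mask) {
    static const std::array<LaneShuffle, 16> table = build_lane_table();
    std::vector<float> out;
    out.reserve(in.size());

    std::size_t i = 0;
    for (; i + 4 <= in.size(); i += 4) {  // constant input advance by 4
        unsigned bits = 0;
        for (unsigned lane = 0; lane < 4; ++lane) {
            bits |= (mask[i + lane] & 1u) << lane;
        }
        const LaneShuffle& entry = table[bits];
        // Stands in for the shuffle plus the output pointer increment.
        for (std::uint8_t k = 0; k < entry.count; ++k) {
            out.push_back(in[i + entry.lanes[k]]);
        }
    }
    for (; i < in.size(); ++i) {  // scalar tail for the last 0 to 3 elements
        if (mask[i] & 1u) out.push_back(in[i]);
    }
    return out;
}
```

Whether this beats the plain scalar loop depends on the data and the target CPU, which is the trade-off the answer opens with.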

活泼老夫 2024-12-18 15:24:55

The original code moves each array element only one step at a time. This can be improved: it is possible to group array elements and shift them 2^k steps at once.

The first part of this algorithm computes how many steps each element should be shifted. The second part moves the elements: first by one step, then by 2, then 4, etc. This works correctly and elements do not get mixed up, because after each shift there is enough space to perform a shift twice as large.

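Matlab, code not tested:

function out = compact( in )
    % s(i) = number of empty (<= 0) elements before position i,
    % i.e. how many steps element i has to move left
    m = in <= 0;
    s = zeros(size(in));
    for i = 1:size(in, 2)-1
        m = [0 m(1:end-1)];
        s = s + m;
    end

    % move elements left by 1, 2, 4, ... steps according to the bits of s
    d = in;
    shift = 1;
    for j = 1:ceil(log2(size(in, 2)))
        s1 = rem(s, 2);
        s = (s - s1) / 2;
        d = (d .* ~s1) + ([d(1+shift:end) zeros(1,shift)] .* [s1(1+shift:end) zeros(1,shift)]);
        shift = shift*2;
    end
    out = d;
end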

The above algorithm's complexity is O(N * (1 shift + 1 add) + log(N) * (1 rem + 2 add + 3 mul + 2 shift)).
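
A C++ sketch of the same two-phase idea is given below, as an illustration rather than a tested port of the MATLAB code. The name compact_log_steps is an assumption. Unlike the MATLAB version, it counts the gaps with a single running sum, clears non-positive inputs up front so that they cannot add into a moved element, and truncates the result to the valid elements.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: compaction in ceil(log2(n)) move steps.
std::vector<int> compact_log_steps(const std::vector<int>& in) {
    const std::size_t n = in.size();

    // Phase 1: s[i] = number of gaps (<= 0) before position i.
    std::vector<std::size_t> s(n, 0);
    std::vector<int> d(n, 0);
    std::size_t gaps = 0;
    for (std::size_t i = 0; i < n; ++i) {
        s[i] = gaps;
        if (in[i] > 0) {
            d[i] = in[i];
        } else {
            ++gaps;  // d[i] stays 0 for a gap
        }
    }

    // Phase 2: move by 1, 2, 4, ... according to the bits of s.
    for (std::size_t shift = 1; shift < n; shift *= 2) {
        std::vector<int> next(n, 0);
        for (std::size_t i = 0; i < n; ++i) {
            if ((s[i] & 1u) == 0) next[i] += d[i];  // d .* ~s1
            if (i + shift < n && (s[i + shift] & 1u) != 0) {
                next[i] += d[i + shift];            // shifted d .* shifted s1
            }
        }
        for (std::size_t i = 0; i < n; ++i) s[i] >>= 1;  // s = (s - s1) / 2
        d.swap(next);
    }

    d.resize(n - gaps);
    return d;
}
```

As in the MATLAB code, s is indexed by position and is not moved along with the data. This works because, after the low k bits have been applied, the count stored at an element's new position agrees with the element's own count in all remaining higher bits.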

彻夜缠绵 2024-12-18 15:24:55

Reading the comments below the original question, in the actual problem the array contains 32-bit floating-point numbers and the mask is (one?) 32-bit integer, so I don't see why shifts etc. should be used for compacting the array. A simple compacting algorithm (in C) would be something like this:

float array[8];
unsigned int mask = ...;
int a = 0, b = 0;
while (mask) {
  if (mask & 1) { array[a++] = array[b]; }
  b++;
  mask >>= 1;
}
/* Size of compacted array is 'a' */
/* Optionally clear the rest: */
while (a < 8) array[a++] = 0.0;

Minor variations would be due to the bit order of the mask, but the only ALU operations needed are index variable updates and shifting and ANDing the mask. Because the original array is at least 256 bits wide, no ordinary CPU can shift the whole array around bitwise.
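
For reference, the loop can be wrapped in a function so that it can be called and checked. This is a sketch under assumptions: the name compact_with_bitmask, the fixed size of 8, and the LSB-first bit order (bit i keeps element i) are all choices made here, and a bounds check is added so that stray high bits in the mask cannot read past the array.

```cpp
#include <array>
#include <cstddef>

// Hypothetical wrapper: compacts `values` in place and returns the new size.
// Bit i of `mask` (least significant bit first) keeps values[i].
std::size_t compact_with_bitmask(std::array<float, 8>& values, unsigned mask) {
    std::size_t write = 0;
    for (std::size_t read = 0; read < values.size() && mask != 0;
         ++read, mask >>= 1) {
        if (mask & 1u) values[write++] = values[read];
    }
    // Clear the rest, as in the optional step of the C code.
    for (std::size_t i = write; i < values.size(); ++i) values[i] = 0.0f;
    return write;
}
```

For an MSB-first mask, the test would change to the top bit and the shift direction would be reversed; the loop structure stays the same.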

一笔一画续写前缘 2024-12-18 15:24:55

Assuming you want to store only the positive integers from an array in a minimum number of steps in C++, here is some sample code:

int j = 0;
// number of elements; k must be a real array here, not a pointer
const int arraysize = sizeof k / sizeof k[0];
int store[arraysize];
for(int i = 0; i<arraysize; i++)
{
    if(k[i] > 0)
    {
        store[j] = k[i];
        j++;
    }
}
// store[0] .. store[j-1] now hold the positive elements

Alternatively, you can use the elements of k[] directly if you do not want to use a for loop.
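
One possible reading of the loop-free variant is the standard algorithm std::copy_if. That is an interpretation made here, not something the answer states, and the function name keep_positive is hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical helper: copies the positive elements of k, preserving order.
std::vector<int> keep_positive(const std::vector<int>& k) {
    std::vector<int> store;
    store.reserve(k.size());
    std::copy_if(k.begin(), k.end(), std::back_inserter(store),
                 [](int value) { return value > 0; });
    return store;
}
```

The work is still linear in the array length; the loop is merely hidden inside the library call.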
