Optimizing array compaction
Let's say I have an array k = [1 2 0 0 5 4 0].
I can compute a mask as follows: m = k > 0 = [1 1 0 0 1 1 0]
Using only the mask m and the following operations:
- Shift left / right
- And / Or
- Add / Subtract / Multiply
I can compact k into the following: [1 2 5 4]
Here's how I currently do it (MATLAB pseudocode):
function out = compact( in )
    d = in;
    for i = 1:size(in, 2)  % do (# of items in in) passes
        m = d > 0;
        % shift left, pad with 0 on the right
        ml = [m(2:end) 0];  % shift
        dl = [d(2:end) 0];  % shift
        % if the data originally has a gap, fill it in with the
        % left-shifted one
        use = (m == 0) & (ml == 1);  % 2 comparisons
        d = use .* dl + ~use .* d;
        % zero out elements that have been moved to the left
        use_r = [0 use(1:end-1)];
        d = d .* ~use_r;
    end
    out = d(1 : size(find(in > 0), 2));  % truncate the end
end
Intuition
Each iteration, we shift the mask left and compare it with the unshifted mask. If, after this shift, an index that was originally void (mask[i] = 0) is now valid (mask[i] = 1), we set that index to the left-shifted data.
Question
The above algorithm has complexity O(N * (3 shifts + 2 comparisons + 1 AND + 1 add + 3 multiplies)). Is there a way to improve its efficiency?
Comments (5)
There is not much to optimize in the original pseudocode. I see several small improvements here:
- use = (m == 0) & (ml == 1) can probably be simplified to use = ~m & ml
- if ~ is counted as a separate operation, it would be better to use the inverted form: use = m | ~ml, d = ~use .* dl + use .* d, use_r = [1 use(1:end-1)], d = d .* use_r
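To check that the inverted form behaves like the original, here is a minimal C++ sketch of one pass of the question's loop written with the inverted mask. The function name compact_pass_inverted and the variable keep (standing in for the inverted use) are made up for illustration, and the data is assumed to hold non-negative integers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One pass of the question's loop using the inverted mask
// keep = m | ~ml, which is the complement of use = ~m & ml.
// Assumption: d holds non-negative integers, 0 marks a gap.
void compact_pass_inverted(std::vector<int>& d) {
    const std::size_t n = d.size();
    std::vector<int> keep(n);
    for (std::size_t i = 0; i < n; ++i) {
        const int m = d[i] > 0;                         // mask
        const int ml = (i + 1 < n) ? (d[i + 1] > 0) : 0; // mask shifted left
        keep[i] = m | (ml ^ 1);                         // m | ~ml on 0/1 values
        assert(keep[i] == !((m ^ 1) & ml));             // De Morgan: same as ~use
    }
    std::vector<int> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        const int dl = (i + 1 < n) ? d[i + 1] : 0;       // data shifted left
        const int keep_r = (i == 0) ? 1 : keep[i - 1];   // [1 keep(1:end-1)]
        // d = ~keep .* dl + keep .* d, then d = d .* keep_r
        next[i] = ((keep[i] ^ 1) * dl + keep[i] * d[i]) * keep_r;
    }
    d = next;
}
```

Calling this function N times on an N-element vector mirrors the outer loop of the question; the zeros end up at the back and still have to be truncated afterwards.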
But it is possible to invent better algorithms, and the choice of algorithm depends on the CPU resources used:
C++, 64 bit, subset width = 8:
So you need to figure out whether the extra parallelism, with its shifting/shuffling overhead, is worth it for such a simple task.
If you want to go the parallel SIMD route, your best bet is a switch statement with one case for each possible combination of the next 4 bits of the mask. Why not 8? Because the PSHUFD instruction shuffles only within a 128-bit XMM register, not across a 256-bit YMM register.
So you make 16 cases:
Every case then does a minimal amount of processing (1 to 2 SIMD instructions and 1 output pointer addition). The loop surrounding the switch handles the constant input pointer increment (by 4) and the MOVDQA that loads the input.
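The 16-way dispatch can be sketched in portable scalar C++. This is not SIMD code: a 16-entry lookup table stands in for the switch, and a plain copy of the selected lanes stands in for the PSHUFD shuffle. The names NibbleCase, build_cases and compact_by_nibbles are hypothetical, the mask is derived from the data as in the question (element > 0), and the input length is assumed to be a multiple of 4.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical table entry: which of the 4 lanes to keep, in order.
struct NibbleCase {
    std::array<std::uint8_t, 4> lanes;  // source lanes to copy
    std::uint8_t count;                 // number of set mask bits
};

// Builds the 16 cases, one per possible 4-bit mask value.
std::array<NibbleCase, 16> build_cases() {
    std::array<NibbleCase, 16> cases{};
    for (unsigned bits = 0; bits < 16; ++bits) {
        for (std::uint8_t lane = 0; lane < 4; ++lane) {
            if (bits & (1u << lane)) {
                cases[bits].lanes[cases[bits].count++] = lane;
            }
        }
    }
    return cases;
}

// Compacts the positive entries of `in`, four elements per iteration.
std::vector<int> compact_by_nibbles(const std::vector<int>& in) {
    assert(in.size() % 4 == 0);  // assumption: no partial group at the end
    static const std::array<NibbleCase, 16> cases = build_cases();
    std::vector<int> out(in.size());
    std::size_t out_pos = 0;
    for (std::size_t base = 0; base < in.size(); base += 4) {
        unsigned bits = 0;  // next 4 bits of the mask
        for (unsigned lane = 0; lane < 4; ++lane) {
            bits |= static_cast<unsigned>(in[base + lane] > 0) << lane;
        }
        const NibbleCase& c = cases[bits];
        for (unsigned j = 0; j < c.count; ++j) {
            out[out_pos + j] = in[base + c.lanes[j]];  // the "shuffle"
        }
        out_pos += c.count;  // the output pointer addition
    }
    out.resize(out_pos);
    return out;
}
```

In a real SSE version each table entry would correspond to one case with a fixed shuffle immediate, since PSHUFD takes its control as a compile-time constant.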
The original code moves array elements only one step at a time. This can be improved: it is possible to group array elements and shift them 2^k steps at once.
The first part of this algorithm computes how many steps each element should be shifted. The second part moves the elements: first by one step, then by 2, then by 4, and so on. This works correctly, and elements are not intermixed, because after each shift there is enough space to perform a shift twice as large.
Matlab, code not tested:
The above algorithm's complexity is O(N * (1 shift + 1 add) + log(N) * (1 rem + 2 add + 3 mul + 2 shift)).
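The two-phase idea can be sketched in C++ as follows. This is a plain element-wise sketch under stated assumptions, not the Matlab code the answer refers to: the name compact_log_steps and the dist array are hypothetical, and an element counts as valid when it is greater than 0, as in the question's mask.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compacts the positive entries of `data` to the front, preserving order,
// by moving elements in steps of 1, 2, 4, ...
std::vector<int> compact_log_steps(std::vector<int> data) {
    const std::size_t n = data.size();

    // Phase 1: dist[i] = number of gaps to the left of valid element i,
    // i.e. how far that element has to travel in total.
    std::vector<std::size_t> dist(n, 0);
    std::size_t gaps = 0, kept = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] > 0) { dist[i] = gaps; ++kept; }
        else             { ++gaps; }
    }

    // Phase 2: for each bit of the distance, lowest first, move every
    // element whose distance has that bit set by `step` positions.
    for (std::size_t step = 1; step < n; step <<= 1) {
        for (std::size_t i = step; i < n; ++i) {
            if (data[i] > 0 && (dist[i] & step)) {
                assert(data[i - step] <= 0);  // target slot must be free
                data[i - step] = data[i];
                dist[i - step] = dist[i];
                data[i] = 0;
                dist[i] = 0;
            }
        }
    }

    data.resize(kept);  // truncate the end
    return data;
}
```

For k = [1 2 0 0 5 4 0], the elements 5 and 4 each have distance 2, so nothing moves in the step of size 1 and both move in the step of size 2.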
Reading the comments below the original question, in the actual problem the array contains 32-bit floating-point numbers and the mask is (one?) 32-bit integer, so I don't see why shifts etc. should be used to compact the array. The simple compacting algorithm (in C) would be something like this:
Minor variations would arise from the bit order of the mask, but the only ALU operations needed are index variable updates and shifting and ANDing the mask. Because the original array is at least 256 bits wide, no ordinary CPU can shift the whole array around bit-wise.
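A minimal sketch of such a loop, written here in C++ rather than C. The function name compact_by_mask and its signature are assumptions, as is the bit order: bit i of the mask, counted from the least significant bit, is taken to belong to element i.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copies in[i] to out for every i whose mask bit is set and returns
// the number of elements written. Assumption: bit 0 belongs to in[0].
std::size_t compact_by_mask(const float* in, std::size_t n,
                            std::uint32_t mask, float* out) {
    assert(n <= 32);  // one mask bit per element
    std::size_t kept = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (mask & 1u) {
            out[kept++] = in[i];  // index update is the only arithmetic
        }
        mask >>= 1;               // move on to the next element's bit
    }
    return kept;
}
```

With the opposite bit order, the test would look at the top bit and the mask would be shifted left instead.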
Assuming what you want is to store only the positive integers from an array in as few steps as possible in C++, here is some sample code:
Or you can use the elements of k[] directly if you don't want to use a for loop.
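One way to express this with the standard library is std::copy_if, which also avoids an explicit for loop. This is a sketch of the idea, not necessarily what the answer's sample code looked like; the function name keep_positive is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Returns the positive elements of k in their original order.
std::vector<int> keep_positive(const std::vector<int>& k) {
    std::vector<int> out;
    out.reserve(k.size());
    std::copy_if(k.begin(), k.end(), std::back_inserter(out),
                 [](int v) { return v > 0; });
    assert(out.size() <= k.size());  // filtering can only shrink the data
    return out;
}
```

Note that this uses a comparison and a conditional copy per element, so it steps outside the shift/and/or/add/multiply restriction stated in the question.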