Clamping short to unsigned char

Posted 2024-11-05 04:34:49

I have a simple C function as follows:

unsigned char clamp(short value){
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return value;
}

Is it possible to rewrite it without using any if / else branching while being efficient?

EDIT:

I basically wish to see whether a clamping implementation based on bitwise arithmetic is possible. The objective is to process images on a GPU (Graphics Processing Unit). This type of code will run on each pixel. I guess that if branches can be avoided, overall throughput on the GPU would be higher.

A solution like (value <0? 0 : ((value > 255) ? 255 : value) ) is simply a rehash of if/else branching with syntactic sugar. So I am not looking for it.

EDIT 2:

I can cut it down to a single if as follows, but I cannot think of anything better:

unsigned char clamp(short value){
    int more = value >> 8;
    if(more){
        int sign = !(more >> 7);
        return sign * 0xff;
    }
    return value;
}

EDIT 3:

Just saw a very nice implementation of this in FFmpeg code:

/**
 * Clip a signed integer value into the 0-255 range.
 * @param a value to clip
 * @return clipped value
 */
static av_always_inline av_const uint8_t av_clip_uint8_c(int a)
{
    if (a&(~0xFF)) return (-a)>>31;
    else           return a;
}

This certainly works and reduces it to one if nicely.
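One way to gain confidence in a one-branch variant like this is to compare it against the straightforward version for every possible short input. The sketch below is a test harness written under assumptions, not FFmpeg code: the names clamp_ref, clip_uint8_sketch and check_all_shorts are hypothetical, the av_always_inline and av_const macros are replaced by plain static inline, and the shift trick assumes a 32-bit int and an arithmetic right shift of negative values (which is implementation-defined in C).

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Reference: the original two-branch clamp from the question. */
static uint8_t clamp_ref(short value)
{
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return (uint8_t)value;
}

/* Same logic as the FFmpeg snippet, without the FFmpeg macros.
 * Assumes 32-bit int and arithmetic right shift of negative ints. */
static inline uint8_t clip_uint8_sketch(int a)
{
    if (a & ~0xFF) return (uint8_t)((-a) >> 31); /* 0 if a < 0, 0xff if a > 255 */
    return (uint8_t)a;
}

/* Compare both versions over the whole range of short. */
static void check_all_shorts(void)
{
    for (int v = SHRT_MIN; v <= SHRT_MAX; ++v)
        assert(clamp_ref((short)v) == clip_uint8_sketch(v));
}
```

Calling check_all_shorts() from main() walks all 65536 inputs, which is cheap enough to run on every build.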

Comments (8)

闻呓 2024-11-12 04:34:49

You write that you want to avoid branching on the GPU. It is true that branching can be very costly in a parallel environment, because either both branches have to be evaluated or synchronization has to be applied. But if the branches are small enough, the code will be faster than most arithmetic. The CUDA C best practices guide describes why:

Sometimes, the compiler may [..]
optimize out if or switch statements
by using branch predication instead.
In these cases, no warp can ever
diverge. [..]

When using branch predication none of
the instructions whose execution
depends on the controlling condition
gets skipped. Instead, each of them is
associated with a per-thread condition
code or predicate that is set to true
or false based on the controlling
condition and although each of these
instructions gets scheduled for
execution, only the instructions with
a true predicate are actually
executed. Instructions with a false
predicate do not write results, and
also do not evaluate addresses or read
operands.

Branch predication is fast. Bloody fast! If you look at the intermediate PTX code generated by the optimizing compiler, you will see that it is superior to even modest arithmetic. So code like that in davmac's answer is probably as fast as it can get.

I know you did not ask specifically about CUDA, but most of the best practices guide also applies to OpenCL and probably to large parts of AMD's GPU programming.

BTW: in virtually every case of GPU code I have ever seen, most of the time is spent on memory access, not on arithmetic. Make sure to profile! http://en.wikipedia.org/wiki/Program_optimization

落叶缤纷 2024-11-12 04:34:49

If you just want to avoid the actual if/else, use the ?: operator:

return value < 0 ? 0 : (value > 0xff ? 0xff : value);

However, in terms of efficiency this shouldn't be any different.

In practice, you shouldn't worry about efficiency with something so trivial as this. Let the compiler do the optimization.

書生途 2024-11-12 04:34:49

You could do a 2D lookup-table:

unsigned char clamp(short value)
{
  static const unsigned char table[256][256] = { ... };

  const unsigned char x = value & 0xff;
  const unsigned char y = (value >> 8) & 0xff;
  return table[y][x];
}

Sure, this looks bizarre (a 64 KB table for this trivial computation). However, considering that you mentioned you wanted to do this on a GPU, I'm thinking the above could be a texture look-up, which I believe is pretty quick on GPUs.

Further, if your GPU uses OpenGL, you could of course just use the clamp builtin directly:

clamp(value, 0, 255);

This won't type-convert (there is no 8-bit integer type in GLSL, it seems), but still.
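The table contents are elided in the snippet above. As a rough sketch of how such a table could be filled at startup rather than written out by hand, assuming a 16-bit two's-complement short and the same table[high byte][low byte] indexing (the names clamp_table, init_clamp_table, clamp_lut and check_clamp_lut are made up for this illustration):

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical 64 KB table indexed as clamp_table[high byte][low byte]. */
static unsigned char clamp_table[256][256];

/* Fill the table once, before the first lookup. */
static void init_clamp_table(void)
{
    for (int y = 0; y < 256; ++y) {
        for (int x = 0; x < 256; ++x) {
            unsigned char out;
            if (y >= 0x80)   out = 0;                 /* sign bit set: negative */
            else if (y != 0) out = 0xff;              /* 256..32767 */
            else             out = (unsigned char)x;  /* 0..255 */
            clamp_table[y][x] = out;
        }
    }
}

/* Lookup with the same indexing as the answer's clamp(). */
static unsigned char clamp_lut(short value)
{
    const unsigned char x = value & 0xff;
    const unsigned char y = (value >> 8) & 0xff;
    return clamp_table[y][x];
}

/* Exhaustive self-check against the plain definition of clamping. */
static void check_clamp_lut(void)
{
    init_clamp_table();
    for (int v = SHRT_MIN; v <= SHRT_MAX; ++v) {
        int expected = v < 0 ? 0 : (v > 255 ? 255 : v);
        assert(clamp_lut((short)v) == expected);
    }
}
```

On a CPU this mostly trades arithmetic for cache pressure; whether it pays off as a texture fetch on a GPU would have to be measured.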

甜是你 2024-11-12 04:34:49

You can do it without an explicit if by using ?: as shown by another poster, or by using interesting properties of abs() which let you compute the maximum or minimum of two values.

For example, the expression (a + abs(a))/2 returns a for positive numbers and 0 otherwise (maximum of a and 0).

This gives

unsigned char clip(short value)
{
  short a = (value + abs(value)) / 2;
  return (a + 255 - abs(a - 255)) / 2;
}

To convince yourself that this works, here is a test program:

#include <stdio.h>
#include <stdlib.h>   /* abs() */

unsigned char clip(short value)
{
  short a = (value + abs(value)) / 2;
  return (a + 255 - abs(a - 255)) / 2;
}

void test(short value)
{
  printf("clip(%d) = %d\n", value, clip(value));
}

int main()
{
  test(0);
  test(10);
  test(-10);
  test(255);
  test(265);
  return 0;
}

When run, this prints

clip(0) = 0
clip(10) = 10
clip(-10) = 0
clip(255) = 255
clip(265) = 255

Of course, one may argue that there is probably a test in abs(), but gcc -O3 for example compiles it linearly:

clip:
    movswl  %di, %edi
    movl    %edi, %edx
    sarl    $31, %edx
    movl    %edx, %eax
    xorl    %edi, %eax
    subl    %edx, %eax
    addl    %edi, %eax
    movl    %eax, %edx
    shrl    $31, %edx
    addl    %eax, %edx
    sarl    %edx
    movswl  %dx, %edx
    leal    255(%rdx), %eax
    subl    $255, %edx
    movl    %edx, %ecx
    sarl    $31, %ecx
    xorl    %ecx, %edx
    subl    %ecx, %edx
    subl    %edx, %eax
    movl    %eax, %edx
    shrl    $31, %edx
    addl    %edx, %eax
    sarl    %eax
    ret

But note that this will be much less efficient than your original function, which compiles as:

clip:
    xorl    %eax, %eax
    testw   %di, %di
    js      .L1
    movl    $-1, %eax
    cmpw    $255, %di
    cmovle  %edi, %eax
.L1:
    rep
    ret

But at least it answers your question :)
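The trick rests on two identities: max(a, b) = (a + b + |a - b|) / 2 and min(a, b) = (a + b - |a - b|) / 2. The following sketch spells them out as helpers and checks the resulting clip against the plain definition for every short value. The names max_abs, min_abs, clip_abs and check_clip_abs are hypothetical, and the code assumes int is wider than short so the intermediate sums cannot overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>   /* abs() */

/* max(a, b) = (a + b + |a - b|) / 2; the numerator is always even. */
static int max_abs(int a, int b) { return (a + b + abs(a - b)) / 2; }

/* min(a, b) = (a + b - |a - b|) / 2; the numerator is always even. */
static int min_abs(int a, int b) { return (a + b - abs(a - b)) / 2; }

/* Clamp to 0..255 as min(max(value, 0), 255). */
static unsigned char clip_abs(short value)
{
    return (unsigned char)min_abs(max_abs(value, 0), 255);
}

/* Exhaustive check over all short values. */
static void check_clip_abs(void)
{
    for (int v = SHRT_MIN; v <= SHRT_MAX; ++v) {
        int expected = v < 0 ? 0 : (v > 255 ? 255 : v);
        assert(clip_abs((short)v) == expected);
    }
}
```

With b = 0 and b = 255 substituted, these helpers reduce to the two expressions used in clip() above.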

蓦然回首 2024-11-12 04:34:49

How about:

unsigned char clamp (short value) {
    unsigned char r = (value >> 15);          /* uses arithmetic right-shift */
    unsigned char s = !!(value & 0x7f00) * 0xff;
    unsigned char v = (value & 0xff);
    return (v | s) & ~r;
}

But I seriously doubt that it executes any faster than your original version involving branches.
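The version above relies on right-shifting a negative value, which C leaves implementation-defined. A possible variant that avoids the shift builds the two masks from comparison results instead. This is only a sketch (clamp_mask and check_clamp_mask are made-up names), and whether the comparisons compile to branch-free instructions is up to the compiler and target.

```c
#include <assert.h>
#include <limits.h>

static unsigned char clamp_mask(short value)
{
    int v     = value;
    int over  = -(v > 255);  /* all ones if v > 255, otherwise 0 */
    int under = -(v < 0);    /* all ones if v < 0, otherwise 0 */

    /* OR saturates the low byte to 0xff, AND clears it for negatives. */
    return (unsigned char)((v | over) & ~under);
}

/* Exhaustive check over all short values. */
static void check_clamp_mask(void)
{
    for (int v = SHRT_MIN; v <= SHRT_MAX; ++v) {
        int expected = v < 0 ? 0 : (v > 255 ? 255 : v);
        assert(clamp_mask((short)v) == expected);
    }
}
```

A relational operator in C yields exactly 0 or 1, so negating the result gives either 0 or an all-ones mask without any shift.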

自由如风 2024-11-12 04:34:49

Assuming a two-byte short, and at the cost of readability of the code:

clipped_x =  (x & 0x8000) ? 0 : ((x >> 8) ? 0xFF : x);

你是我的挚爱i 2024-11-12 04:34:49

You should time this ugly but arithmetic-only version.

unsigned char clamp(short value){
  short pmask = ((value & 0x4000) >> 7) | ((value & 0x2000) >> 6) |
    ((value & 0x1000) >> 5) | ((value & 0x0800) >> 4) |
    ((value & 0x0400) >> 3) | ((value & 0x0200) >> 2) |
    ((value & 0x0100) >> 1);
  pmask |= (pmask >> 1) | (pmask >> 2) | (pmask >> 3) | (pmask >> 4) |
    (pmask >> 5) | (pmask >> 6) | (pmask >> 7);
  value |= pmask;
  short nmask = (value & 0x8000) >> 8;
  nmask |= (nmask >> 1) | (nmask >> 2) | (nmask >> 3) | (nmask >> 4) |
    (nmask >> 5) | (nmask >> 6) | (nmask >> 7);
  value &= ~nmask;
  return value;
}

—━☆沉默づ 2024-11-12 04:34:49

One way to make it efficient is to declare this function as inline to avoid the function call overhead. You could also turn it into a macro using the ternary operator, but that will remove the compiler's return type checking.
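To make the trade-off concrete, here is a small sketch of both forms. The names clamp_inline, CLAMP_U8, next_sample and check_macro_pitfall are invented for this illustration. Besides losing type checking, a macro like this may evaluate its argument more than once, which matters when the argument has side effects.

```c
#include <assert.h>

/* Type-checked version: the argument is converted to short exactly once. */
static inline unsigned char clamp_inline(short value)
{
    return (unsigned char)(value < 0 ? 0 : (value > 0xff ? 0xff : value));
}

/* Hypothetical macro version: no parameter or return type checking,
 * and x can be evaluated two or three times. */
#define CLAMP_U8(x) ((x) < 0 ? 0 : ((x) > 0xff ? 0xff : (x)))

static int calls = 0;

/* Stand-in for an expression with a side effect. */
static int next_sample(void) { ++calls; return 300; }

static void check_macro_pitfall(void)
{
    calls = 0;
    int a = clamp_inline((short)next_sample());
    assert(a == 255);
    assert(calls == 1);   /* the function sees one evaluation */

    calls = 0;
    int b = CLAMP_U8(next_sample());
    assert(b == 255);
    assert(calls == 2);   /* the macro evaluated its argument twice here */
}
```

Modern compilers usually inline a small static function like this on their own at normal optimization levels, so the macro rarely buys anything.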
