达夫的设备如何工作?

发布于 2024-07-13 17:06:15 字数 177 浏览 15 评论 0原文

我已阅读有关 Duff 设备的维基百科文章,但我不明白它。 我真的很感兴趣,但我已经读了几次那里的解释,但我仍然不明白达夫的设备是如何工作的。

更详细的解释是什么?

I've read the article on Wikipedia on the Duff's device, and I don't get it. I am really interested, but I've read the explanation there a couple of times and I still don't get it how the Duff's device works.

What would a more detailed explanation be?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(11

胡渣熟男 2024-07-20 17:06:15

其他地方有一些很好的解释,但让我尝试一下。 (这在白板上要容易得多!)这是带有一些符号的维基百科示例。

假设您要复制 20 个字节。 第一遍程序的流程控制是:

int count;                        // Set to 20
{
    int n = (count + 7) / 8;      // n is now 3.  (The "while" is going
                                  //              to be run three times.)

    switch (count % 8) {          // The remainder is 4 (20 modulo 8) so
                                  // jump to the case 4

    case 0:                       // [skipped]
             do {                 // [skipped]
                 *to = *from++;   // [skipped]
    case 7:      *to = *from++;   // [skipped]
    case 6:      *to = *from++;   // [skipped]
    case 5:      *to = *from++;   // [skipped]
    case 4:      *to = *from++;   // Start here.  Copy 1 byte  (total 1)
    case 3:      *to = *from++;   // Copy 1 byte (total 2)
    case 2:      *to = *from++;   // Copy 1 byte (total 3)
    case 1:      *to = *from++;   // Copy 1 byte (total 4)
           } while (--n > 0);     // N = 3 Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //        greater than 0 (and it is)
}

现在,开始第二遍,我们只运行指示的代码:

int count;                        //
{
    int n = (count + 7) / 8;      //
                                  //

    switch (count % 8) {          //
                                  //

    case 0:                       //
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 5)
    case 7:      *to = *from++;   // Copy 1 byte (total 6)
    case 6:      *to = *from++;   // Copy 1 byte (total 7)
    case 5:      *to = *from++;   // Copy 1 byte (total 8)
    case 4:      *to = *from++;   // Copy 1 byte (total 9)
    case 3:      *to = *from++;   // Copy 1 byte (total 10)
    case 2:      *to = *from++;   // Copy 1 byte (total 11)
    case 1:      *to = *from++;   // Copy 1 byte (total 12)
           } while (--n > 0);     // N = 2 Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //       greater than 0 (and it is)
}

现在,开始第三遍:

int count;                        //
{
    int n = (count + 7) / 8;      //
                                  //

    switch (count % 8) {          //
                                  //

    case 0:                       //
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 13)
    case 7:      *to = *from++;   // Copy 1 byte (total 14)
    case 6:      *to = *from++;   // Copy 1 byte (total 15)
    case 5:      *to = *from++;   // Copy 1 byte (total 16)
    case 4:      *to = *from++;   // Copy 1 byte (total 17)
    case 3:      *to = *from++;   // Copy 1 byte (total 18)
    case 2:      *to = *from++;   // Copy 1 byte (total 19)
    case 1:      *to = *from++;   // Copy 1 byte (total 20)
           } while (--n > 0);     // N = 1  Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //       greater than 0 (and it's not, so bail)
}                                 // continue here...

现在复制了 20 个字节。

注意:原始 Duff 的设备(如上所示)复制到 to 地址处的 I/O 设备。 因此,没有必要增加指针*to。 在两个内存缓冲区之间复制时,您需要使用*to++

There are some good explanations elsewhere, but let me give it a try. (This is a lot easier on a whiteboard!) Here's the Wikipedia example with some notations.

Let's say you're copying 20 bytes. The flow control of the program for the first pass is:

int count;                        // Set to 20
{
    int n = (count + 7) / 8;      // n is now 3.  (The "while" is going
                                  //              to be run three times.)

    switch (count % 8) {          // The remainder is 4 (20 modulo 8) so
                                  // jump to the case 4

    case 0:                       // [skipped]
             do {                 // [skipped]
                 *to = *from++;   // [skipped]
    case 7:      *to = *from++;   // [skipped]
    case 6:      *to = *from++;   // [skipped]
    case 5:      *to = *from++;   // [skipped]
    case 4:      *to = *from++;   // Start here.  Copy 1 byte  (total 1)
    case 3:      *to = *from++;   // Copy 1 byte (total 2)
    case 2:      *to = *from++;   // Copy 1 byte (total 3)
    case 1:      *to = *from++;   // Copy 1 byte (total 4)
           } while (--n > 0);     // N = 3 Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //        greater than 0 (and it is)
}

Now, start the second pass, we run just the indicated code:

int count;                        //
{
    int n = (count + 7) / 8;      //
                                  //

    switch (count % 8) {          //
                                  //

    case 0:                       //
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 5)
    case 7:      *to = *from++;   // Copy 1 byte (total 6)
    case 6:      *to = *from++;   // Copy 1 byte (total 7)
    case 5:      *to = *from++;   // Copy 1 byte (total 8)
    case 4:      *to = *from++;   // Copy 1 byte (total 9)
    case 3:      *to = *from++;   // Copy 1 byte (total 10)
    case 2:      *to = *from++;   // Copy 1 byte (total 11)
    case 1:      *to = *from++;   // Copy 1 byte (total 12)
           } while (--n > 0);     // N = 2 Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //       greater than 0 (and it is)
}

Now, start the third pass:

int count;                        //
{
    int n = (count + 7) / 8;      //
                                  //

    switch (count % 8) {          //
                                  //

    case 0:                       //
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 13)
    case 7:      *to = *from++;   // Copy 1 byte (total 14)
    case 6:      *to = *from++;   // Copy 1 byte (total 15)
    case 5:      *to = *from++;   // Copy 1 byte (total 16)
    case 4:      *to = *from++;   // Copy 1 byte (total 17)
    case 3:      *to = *from++;   // Copy 1 byte (total 18)
    case 2:      *to = *from++;   // Copy 1 byte (total 19)
    case 1:      *to = *from++;   // Copy 1 byte (total 20)
           } while (--n > 0);     // N = 1  Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //       greater than 0 (and it's not, so bail)
}                                 // continue here...

20 bytes are now copied.

Note: The original Duff's Device (shown above) copied to an I/O device at the to address. Thus, it wasn't necessary to increment the pointer *to. When copying between two memory buffers you'd need to use *to++.

超可爱的懒熊 2024-07-20 17:06:15

Dr. Dobb's Journal 中的解释是我在该主题上找到的最好的解释。

这是我的 AHA 时刻:

for (i = 0; i < len; ++i) {
    HAL_IO_PORT = *pSource++;
}

变成:

int n = len / 8;
for (i = 0; i < n; ++i) {
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
}

n = len % 8;
for (i = 0; i < n; ++i) {
    HAL_IO_PORT = *pSource++;
}

变成:

int n = (len + 8 - 1) / 8;
switch (len % 8) {
    case 0: do { HAL_IO_PORT = *pSource++;
    case 7: HAL_IO_PORT = *pSource++;
    case 6: HAL_IO_PORT = *pSource++;
    case 5: HAL_IO_PORT = *pSource++;
    case 4: HAL_IO_PORT = *pSource++;
    case 3: HAL_IO_PORT = *pSource++;
    case 2: HAL_IO_PORT = *pSource++;
    case 1: HAL_IO_PORT = *pSource++;
               } while (--n > 0);
}

The explanation in Dr. Dobb's Journal is the best that I found on the topic.

This being my AHA moment:

for (i = 0; i < len; ++i) {
    HAL_IO_PORT = *pSource++;
}

becomes:

int n = len / 8;
for (i = 0; i < n; ++i) {
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
}

n = len % 8;
for (i = 0; i < n; ++i) {
    HAL_IO_PORT = *pSource++;
}

becomes:

int n = (len + 8 - 1) / 8;
switch (len % 8) {
    case 0: do { HAL_IO_PORT = *pSource++;
    case 7: HAL_IO_PORT = *pSource++;
    case 6: HAL_IO_PORT = *pSource++;
    case 5: HAL_IO_PORT = *pSource++;
    case 4: HAL_IO_PORT = *pSource++;
    case 3: HAL_IO_PORT = *pSource++;
    case 2: HAL_IO_PORT = *pSource++;
    case 1: HAL_IO_PORT = *pSource++;
               } while (--n > 0);
}
天煞孤星 2024-07-20 17:06:15

达夫的设备有两个关键点。 首先,我怀疑这是更容易理解的部分,循环被展开。 通过避免检查循环是否完成并跳回循环顶部所涉及的一些开销,以更大的代码大小换取更高的速度。 当CPU执行直线代码而不是跳跃代码时,它可以运行得更快。

第二个方面是switch语句。 它允许代码第一次跳到循环的中间。 对于大多数人来说,令人惊讶的是这样的事情是被允许的。 嗯,这是允许的。 执行从计算出的 case 标签开始,然后继续到每个连续的赋值语句,就像任何其他 switch 语句一样。 在最后一个 case 标签之后,执行到达循环底部,此时它跳回顶部。 循环的顶部位于 switch 语句内,因此不再重新评估 switch。

原始循环展开八次,因此迭代次数除以八。 如果要复制的字节数不是八的倍数,则剩余一些字节。 大多数一次复制字节块的算法将在末尾处理剩余字节,但 Duff 的设备在开头处理它们。 该函数计算 switch 语句的 count % 8 来计算余数,跳转到这些字节的 case 标签,然后复制它们。 然后循环继续复制八个字节的组。

There are two key things to Duff's device. First, which I suspect is the easier part to understand, the loop is unrolled. This trades larger code size for more speed by avoiding some of the overhead involved in checking whether the loop is finished and jumping back to the top of the loop. The CPU can run faster when it's executing straight-line code instead of jumping.

The second aspect is the switch statement. It allows the code to jump into the middle of the loop the first time through. The surprising part to most people is that such a thing is allowed. Well, it's allowed. Execution starts at the calculated case label, and then it falls through to each successive assignment statement, just like any other switch statement. After the last case label, execution reaches the bottom of the loop, at which point it jumps back to the top. The top of the loop is inside the switch statement, so the switch is not re-evaluated anymore.

The original loop is unwound eight times, so the number of iterations is divided by eight. If the number of bytes to be copied isn't a multiple of eight, then there are some bytes left over. Most algorithms that copy blocks of bytes at a time will handle the remainder bytes at the end, but Duff's device handles them at the beginning. The function calculates count % 8 for the switch statement to figure what the remainder will be, jumps to the case label for that many bytes, and copies them. Then the loop continues to copy groups of eight bytes.

天赋异禀 2024-07-20 17:06:15

duffs 设备的目的是减少在严格的 memcpy 实现中进行的比较次数。

假设您想要将“count”个字节从 b 复制到 a,直接的方法是执行以下操作:

  do {                      
      *a = *b++;            
  } while (--count > 0);

您需要比较 count 多少次来查看是否是0以上吗? ‘数’次。

现在,duff 设备使用了 switch case 的一个令人讨厌的无意副作用,它允许您减少 count / 8 所需的比较次数。

现在假设您想使用 duffs 设备复制 20 个字节,您需要多少次比较? 只有 3 个字节,因为您一次复制 8 个字节,除了 last 第一个字节,您只复制 4 个字节。

更新:您不必执行 8 次比较/case-in-switch 语句,但它是函数大小和速度之间的合理权衡。

The point of duffs device is to reduce the number of comparisons done in a tight memcpy implementation.

Suppose you want to copy 'count' bytes from b to a, the straight forward approach is to do the following:

  do {                      
      *a = *b++;            
  } while (--count > 0);

How many times do you need to compare count to see if it's a above 0? 'count' times.

Now, the duff device uses a nasty unintentional side effect of a switch case which allows you to reduce the number of comparisons needed to count / 8.

Now suppose you want to copy 20 bytes using duffs device, how many comparisons would you need? Only 3, since you copy eight bytes at a time except the last first one where you copy just 4.

UPDATED: You don't have to do 8 comparisons/case-in-switch statements, but it's reasonable a trade-off between function size and speed.

梦罢 2024-07-20 17:06:15

当我第一次阅读它时,我将其自动格式化为这样

void dsend(char* to, char* from, count) {
    int n = (count + 7) / 8;
    switch (count % 8) {
        case 0: do {
                *to = *from++;
                case 7: *to = *from++;
                case 6: *to = *from++;
                case 5: *to = *from++;
                case 4: *to = *from++;
                case 3: *to = *from++;
                case 2: *to = *from++;
                case 1: *to = *from++;
            } while (--n > 0);
    }
}

,但我不知道发生了什么。

也许不是在提出这个问题时,但现在维基百科有一个非常好的解释

凭借 C 中的两个属性,该设备是有效的、合法的 C:

  • 语言定义中 switch 语句的宽松规范。 在该设备发明时,这是 C 编程语言的第一版,它只要求 switch 的受控语句是语法上有效的(复合)语句,其中 case 标签可以出现在任何子语句的前面。 结合这样的事实,在没有break语句的情况下,控制流将从一个case标签控制的语句转移到下一个case标签控制的语句,这意味着代码指定了一系列的count副本顺序源地址到内存映射的输出端口。
  • 能够合法地跳转到 C 语言循环的中间。

When I read it for the first time, I autoformatted it to this

void dsend(char* to, char* from, count) {
    int n = (count + 7) / 8;
    switch (count % 8) {
        case 0: do {
                *to = *from++;
                case 7: *to = *from++;
                case 6: *to = *from++;
                case 5: *to = *from++;
                case 4: *to = *from++;
                case 3: *to = *from++;
                case 2: *to = *from++;
                case 1: *to = *from++;
            } while (--n > 0);
    }
}

and I had no idea what was happening.

Maybe not when this question was asked, but now Wikipedia has a very good explanation

The device is valid, legal C by virtue of two attributes in C:

  • Relaxed specification of the switch statement in the language's definition. At the time of the device's invention this was the first edition of The C Programming Language which requires only that the controlled statement of the switch be a syntactically valid (compound) statement within which case labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a break statement, the flow of control will fall-through from a statement controlled by one case label to that controlled by the next, this means that the code specifies a succession of count copies from sequential source addresses to the memory-mapped output port.
  • The ability to legally jump into the middle of a loop in C.
水晶透心 2024-07-20 17:06:15

1:Duffs 设备是循环展开的一种特殊实现。 如果您有一个操作要在循环中执行 N 次,则循环展开是一种适用的优化技术 - 您可以通过执行循环 N/n 次,然后在循环中内联(展开)循环代码 n 次来以程序大小换取速度,例如替换:

for (int i=0; i<N; i++) {
    // [The loop code...] 
}

with

for (int i=0; i<N/n; i++) {
    // [The loop code...]
    // [The loop code...]
    // [The loop code...]
    ...
    // [The loop code...] // n times!
}

如果 N % n == 0 则效果很好 - 不需要达夫! 如果情况并非如此,那么您必须处理其余部分 - 这很痛苦。

2:Duffs 设备与此标准循环展开有何不同?
当 N % n != 0 时,Duffs 设备只是处理剩余循环周期的一种巧妙方法。整个 do / while 按照标准循环展开执行 N / n 次(因为情况 0 适用)。 在last第一次运行循环时,情况启动,我们运行循环代码“剩余”次数 - 剩余的运行循环“正常”运行。

1: Duffs device is a particular implementation of loop unrolling. Loop unrolling is an optimisation technique applicable if you have an operation to perform N times in a loop - you can trade program size for speed by executing the loop N/n times and then in the loop inlining (unrolling) the loop code n times e.g. replacing:

for (int i=0; i<N; i++) {
    // [The loop code...] 
}

with

for (int i=0; i<N/n; i++) {
    // [The loop code...]
    // [The loop code...]
    // [The loop code...]
    ...
    // [The loop code...] // n times!
}

Which works great if N % n == 0 - no need for Duff! If that is not true then you have to handle the remainder - which is a pain.

2: How does Duffs device differ from this standard loop unrolling?
Duffs device is just a clever way of dealing with the remainder loop cycles when N % n != 0. The whole do / while executes N / n number of times as per standard loop unrolling (because the case 0 applies). On the last first run through the loop the case kicks in and we run the loop code the 'remainder' number of times - the remaining runs through the loop run 'normally'.

格子衫的從容 2024-07-20 17:06:15

尽管我不是 100% 确定您要什么,但这里是...

Duff 的设备解决的问题是循环展开问题之一(您无疑会在您发布的 Wiki 链接上看到这一点)。 这基本上等同于对内存占用的运行时效率的优化。 达夫的设备处理串行复制,而不仅仅是任何老问题,但它是如何通过减少需要在循环中进行比较的次数来进行优化的经典示例。

作为一个可能更容易理解的替代示例,假设您有一个想要循环的项目数组,并且每次都为它们添加 1...通常,您可能会使用 for 循环,并循环大约 100 次。 这看起来相当合乎逻辑,而且……但是,可以通过展开循环来进行优化(显然不会太远……或者您也可以不使用循环)。

所以一个常规的 for 循环:

for(int i = 0; i < 100; i++)
{
    myArray[i] += 1;
}

成为

for(int i = 0; i < 100; i+10)
{
    myArray[i] += 1;
    myArray[i+1] += 1;
    myArray[i+2] += 1;
    myArray[i+3] += 1;
    myArray[i+4] += 1;
    myArray[i+5] += 1;
    myArray[i+6] += 1;
    myArray[i+7] += 1;
    myArray[i+8] += 1;
    myArray[i+9] += 1;
}

Duff 的设备所做的就是用 C 语言实现这个想法,但是(正如您在 Wiki 上看到的那样)使用串行副本。 您在上面看到的展开示例中,与原始版本中的 100 次比较相比,有 10 次比较 - 这相当于一个次要但可能很重要的优化。

Though I'm not 100% sure what you're asking for, here goes...

The issue that Duff's device addresses is one of loop unwinding (as you'll no doubt have seen on the Wiki link you posted). What this basically equates to is an optimisation of run-time efficiency, over memory footprint. Duff's device deals with serial copying, rather than just any old problem, but is a classic example of how optimisations can be made by reducing the number of times that a comparison needs to be done in a loop.

As an alternative example, which may make it easier to understand, imagine you have an array of items you wish to loop over, and add 1 to them each time... ordinarily, you might use a for loop, and loop around 100 times. This seems fairly logical and, it is... however, an optimisation can be made by unwinding the loop (obviously not too far... or you may as well just not use the loop).

So a regular for loop:

for(int i = 0; i < 100; i++)
{
    myArray[i] += 1;
}

becomes

for(int i = 0; i < 100; i+10)
{
    myArray[i] += 1;
    myArray[i+1] += 1;
    myArray[i+2] += 1;
    myArray[i+3] += 1;
    myArray[i+4] += 1;
    myArray[i+5] += 1;
    myArray[i+6] += 1;
    myArray[i+7] += 1;
    myArray[i+8] += 1;
    myArray[i+9] += 1;
}

What Duff's device does is implement this idea, in C, but (as you saw on the Wiki) with serial copies. What you're seeing above, with the unwound example, is 10 comparisons compared to 100 in the original - this amounts to a minor, but possibly significant, optimisation.

心欲静而疯不止 2024-07-20 17:06:15

这是一个不详细的解释,我认为这是 Duff 设备的关键:

事实是,C 基本上是汇编语言的一个很好的外观(具体来说是 PDP-7 汇编;如果你研究过它,你会发现它有多么引人注目)相似之处是)。 而且,在汇编语言中,实际上并没有循环——只有标签和条件分支指令。 因此,循环只是整个指令序列的一部分,在某处带有标签和分支:

        instruction
label1: instruction
        instruction
        instruction
        instruction
        jump to label1 if some_condition

而 switch 指令在某种程度上向前分支/跳转:

        evaluate expression into register r
        compare r with first case value
        branch to first case label if equal
        compare r with second case value
        branch to second case label if equal
        etc....
first_case_label: 
        instruction
        instruction
second_case_label: 
        instruction
        instruction
        etc...

在汇编中,很容易想象如何组合这两个控制结构,当您想到时这样一来,他们在C语言中的组合就不再显得那么奇怪了。

Here's a non-detailed explanation which is what I feel to be the crux of Duff's device:

The thing is, C is basically a nice facade for assembly language (PDP-7 assembly to be specific; if you studied that you would see how striking the similarities are). And, in assembly language, you don't really have loops - you have labels and conditional-branch instructions. So the loop is just a part of the overall sequence of instructions with a label and a branch somewhere:

        instruction
label1: instruction
        instruction
        instruction
        instruction
        jump to label1 if some_condition

and a switch instruction is branching/jumping ahead somewhat:

        evaluate expression into register r
        compare r with first case value
        branch to first case label if equal
        compare r with second case value
        branch to second case label if equal
        etc....
first_case_label: 
        instruction
        instruction
second_case_label: 
        instruction
        instruction
        etc...

In assembly it's easily conceivable how to combine these two control structures, and when you think of it that way, their combination in C doesn't seem so weird anymore.

偷得浮生 2024-07-20 17:06:15

这是我发布到另一个有关达夫设备的问题的答案,该问题在问题作为重复项关闭之前得到了一些支持。 我认为它提供了一些有价值的背景来说明为什么您应该避免这种构造。

“这是Duff 的设备。这是一种展开循环的方法,可以避免添加辅助修复循环来处理循环迭代次数不知道是展开因子的精确倍数的情况,

我将强调它的缺点。

因为这里的大多数答案似乎总体上是积极的,所以 如果您只是将代码编写为简单的循环,那么编译器将很难对循环体进行任何优化,现代编译器应该能够为您处理展开,这样您就可以保持可读性和性能并拥有一些。希望将其他优化应用于循环体。

其他人引用的维基百科文章甚至说,当从 Xfree86 源代码中删除此“模式”时,性能实际上得到了提高,

这是盲目手动优化您碰巧认为可能的任何代码的典型结果 。它会阻止编译器正确完成其工作,使代码可读性较差,更容易出现错误,并且通常会减慢速度。 如果您一开始就以正确的方式做事,即编写简单的代码,然后分析瓶颈,然后进行优化,您甚至不会想到使用这样的东西。 无论如何,对于现代 CPU 和编译器来说是不行的。

理解它很好,但如果你真正使用它,我会感到惊讶。”

This is an answer I posted to another question about Duff's Device that got some upvaotes before the question was closed as a duplicate. I think it provides a bit of valuable context here on why you should avoid this construct.

"This is Duff's Device. It's a method of unrolling loops that avoids having to add a secondary fix-up loop to deal with times when the number of loop iteration isn't know to be an exact multiple of the unrolling factor.

Since most answers here seem to be generally positive about it I'm going to highlight the downsides.

With this code a compiler is going to struggle to apply any optimization to the loop body. If you just wrote the code as a simple loop a modern compiler should be able to handle the unrolling for you. This way you maintain readability and performance and have some hope of other optimizations being applied to the loop body.

The Wikipedia article referenced by others even says when this 'pattern' was removed from the Xfree86 source code performance actually improved.

This outcome is typical of blindly hand optimizing any code you happen to think might need it. It prevents the compiler from doing its job properly, makes your code less readable and more prone to bugs and typically slows it down. If you were doing things the right way in the first place, i.e. writing simple code, then profiling for bottlenecks, then optimizing, you'd never even think to use something like this. Not with a modern CPU and compiler anyway.

It's fine to understand it, but I'd be surprised if you ever actually use it."

蓝海 2024-07-20 17:06:15

这是使用 Duff 设备的 64 位 memcpy 的工作示例:

#include <iostream>
#include <memory>

inline void __memcpy(void* to, const void* from, size_t count)
{
    size_t numIter = (count  + 56) / 64;  // gives the number of iterations;  bit shift actually, not division
    size_t rest = count & 63; // % 64
    size_t rest7 = rest&7;
    rest -= rest7;

    // Duff's device with zero case handled:
    switch (rest) 
    {
        case 0:  if (count < 8)
                     break;
                 do { *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 56:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 48:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 40:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 32:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 24:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 16:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 8:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
                } while (--numIter > 0);
    }

    switch (rest7)
    {
        case 7: *(((unsigned char*)to)+6) = *(((unsigned char*)from)+6);
        case 6: *(((unsigned short*)to)+2) = *(((unsigned short*)from)+2); goto case4;
        case 5: *(((unsigned char*)to)+4) = *(((unsigned char*)from)+4);
        case 4: case4: *((unsigned long*)to) = *((unsigned long*)from); break; 
        case 3: *(((unsigned char*)to)+2) = *(((unsigned char*)from)+2);
        case 2: *((unsigned short*)to) = *((unsigned short*)from); break;
        case 1: *((unsigned char*)to) = *((unsigned char*)from);
    }
}

void main()
{
    static const size_t NUM = 1024;

    std::unique_ptr<char[]> str1(new char[NUM+1]);  
    std::unique_ptr<char[]> str2(new char[NUM+1]);

    for (size_t i = 0 ; i < NUM ; ++ i)
    {
        size_t idx = (i % 62);
        if (idx < 26)
            str1[i] = 'a' + idx;
        else
            if (idx < 52)
                str1[i] = 'A' + idx - 26;
            else
                str1[i] = '0' + idx - 52;
    }

    for (size_t i = 0 ; i < NUM ; ++ i)
    {
        memset(str2.get(), ' ', NUM); 
        __memcpy(str2.get(), str1.get(), i);
        if (memcmp(str1.get(), str2.get(), i) || str2[i] != ' ')
        {
            std::cout << "Test failed for i=" << i;
        }

    }

    return;
}


它处理零长度情况(在原始 Duff 设备中假设 num>0)。
函数 main() 包含 __memcpy 的简单测试用例。

Here is a working example for 64-bit memcpy with Duff's Device:

#include <iostream>
#include <memory>

inline void __memcpy(void* to, const void* from, size_t count)
{
    size_t numIter = (count  + 56) / 64;  // gives the number of iterations;  bit shift actually, not division
    size_t rest = count & 63; // % 64
    size_t rest7 = rest&7;
    rest -= rest7;

    // Duff's device with zero case handled:
    switch (rest) 
    {
        case 0:  if (count < 8)
                     break;
                 do { *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 56:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 48:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 40:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 32:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 24:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 16:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 8:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
                } while (--numIter > 0);
    }

    switch (rest7)
    {
        case 7: *(((unsigned char*)to)+6) = *(((unsigned char*)from)+6);
        case 6: *(((unsigned short*)to)+2) = *(((unsigned short*)from)+2); goto case4;
        case 5: *(((unsigned char*)to)+4) = *(((unsigned char*)from)+4);
        case 4: case4: *((unsigned long*)to) = *((unsigned long*)from); break; 
        case 3: *(((unsigned char*)to)+2) = *(((unsigned char*)from)+2);
        case 2: *((unsigned short*)to) = *((unsigned short*)from); break;
        case 1: *((unsigned char*)to) = *((unsigned char*)from);
    }
}

void main()
{
    static const size_t NUM = 1024;

    std::unique_ptr<char[]> str1(new char[NUM+1]);  
    std::unique_ptr<char[]> str2(new char[NUM+1]);

    for (size_t i = 0 ; i < NUM ; ++ i)
    {
        size_t idx = (i % 62);
        if (idx < 26)
            str1[i] = 'a' + idx;
        else
            if (idx < 52)
                str1[i] = 'A' + idx - 26;
            else
                str1[i] = '0' + idx - 52;
    }

    for (size_t i = 0 ; i < NUM ; ++ i)
    {
        memset(str2.get(), ' ', NUM); 
        __memcpy(str2.get(), str1.get(), i);
        if (memcmp(str1.get(), str2.get(), i) || str2[i] != ' ')
        {
            std::cout << "Test failed for i=" << i;
        }

    }

    return;
}


It handles zero-length case (in original Duff's Device there is assumption num>0).
Function main() contains simple test cases for __memcpy.

转身以后 2024-07-20 17:06:15

只是实验,发现了另一种变体,无需交错 switch 语句和 do-while-loop:

int n = (count + 1) / 8;
switch (count % 8)
{
    LOOP:
case 0:
    if(n-- == 0)
        break;
    putchar('.');
case 7:
    putchar('.');
case 6:
    putchar('.');
case 5:
    putchar('.');
case 4:
    putchar('.');
case 3:
    putchar('.');
case 2:
    putchar('.');
case 1:
    putchar('.');
default:
    goto LOOP;
}

从技术上讲,goto 仍然实现了一个循环,但是这个变体可能稍微更具可读性。

Just experimenting, found another variant getting along without interleaving switch statement and do-while-loop:

int n = (count + 1) / 8;
switch (count % 8)
{
    LOOP:
case 0:
    if(n-- == 0)
        break;
    putchar('.');
case 7:
    putchar('.');
case 6:
    putchar('.');
case 5:
    putchar('.');
case 4:
    putchar('.');
case 3:
    putchar('.');
case 2:
    putchar('.');
case 1:
    putchar('.');
default:
    goto LOOP;
}

Technically, the goto still implements a loop, but this variant might be slightly more readable.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文