clang++: Why is this memcpy loop-idiom not optimized when adding another struct member?

Posted 2025-02-09 14:01:00

Given this code snippet

#include <cstdint>
#include <cstddef>

struct Data {
  uint64_t a;
  //uint64_t b;
};

void foo(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
  for(uint64_t i = 0; i < count_in; ++i) {
    Data value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
    static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
  }
}

clang replaces the loop in foo with a memcpy call, just as expected (godbolt), and emits this -Rpass=loop-idiom remark:

example.cpp:16:59: remark: Formed a call to llvm.memcpy.p0.p0.i64() intrinsic from load and store instruction in _Z3fooPvPmPSt4bytem function [-Rpass=loop-idiom]
    static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;

However, when I uncomment the second member uint64_t b; in Data, clang no longer forms the memcpy call (godbolt). Is there a reason for this, or is it just a missed optimization? If it is a missed optimization, is there any trick to make clang apply it anyway?

I noticed that if I change value to be of type Data& instead (i.e., remove the temporary local copy), the memcpy optimization is still applied (godbolt).
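For concreteness, the reference variant described above might look like the sketch below, written against the two-member struct. The name bar_ref is hypothetical and only used for illustration. Whether clang actually forms the memcpy here likely depends on the compiler version and flags; compiling with -O2 -Rpass=loop-idiom should show the remark if it does.

```cpp
#include <cstdint>

struct Data {
  uint64_t a;
  uint64_t b;  // second member enabled, as in the case that is not optimized
};

// Hypothetical variant: binds a reference to the source element instead of
// copying it into a local Data object before the store.
void bar_ref(Data* __restrict data_out, Data* __restrict data_in, uint64_t count_in) {
  for (uint64_t i = 0; i < count_in; ++i) {
    Data& value = data_in[i];  // no temporary, local copy
    *data_out++ = value;       // element-wise struct assignment
  }
}
```

The observable behavior is the same as with the local copy; only the shape of the code the optimizer sees differs.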

Edit: Peter pointed out in the comments that the same thing happens with this simpler / less noisy method:

void bar(Data* __restrict data_out, Data* __restrict data_in, uint64_t count_in) {
  for(uint64_t i = 0; i < count_in; ++i) {
    Data value = data_in[i];
    *data_out++ = value;
  }
}

The question remains: Why is it not optimized?
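As a fallback that does not depend on loop-idiom recognition at all, the copy can be written as an explicit std::memcpy. This is only a sketch under the assumption that Data stays trivially copyable. The name bar_memcpy is hypothetical, and this does not explain why the loop form is not transformed.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

struct Data {
  uint64_t a;
  uint64_t b;
};

// memcpy is only valid for trivially copyable types, so check it at compile time.
static_assert(std::is_trivially_copyable_v<Data>,
              "Data must be trivially copyable to be copied with memcpy");

// Hypothetical replacement for bar: copies count_in elements in one call
// instead of relying on the optimizer to recognize the loop.
void bar_memcpy(Data* __restrict data_out, const Data* __restrict data_in, uint64_t count_in) {
  if (count_in != 0) {
    std::memcpy(data_out, data_in, count_in * sizeof(Data));
  }
}
```

The trade-off is that the intent is stated directly, at the cost of giving up the element-wise loop that a non-trivially-copyable type would require.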
