clang++: Why is this memcpy loop idiom not optimized when another struct member is added?

Posted on 2025-02-09 14:01:00

Given this code snippet

#include <cstdint>
#include <cstddef>

struct Data {
  uint64_t a;
  //uint64_t b;
};

void foo(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
  for(uint64_t i = 0; i < count_in; ++i) {
    Data value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
    static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
  }
}

clang replaces the loop in foo with a memcpy call, as expected (godbolt), and -Rpass=loop-idiom reports:

example.cpp:16:59: remark: Formed a call to llvm.memcpy.p0.p0.i64() intrinsic from load and store instruction in _Z3fooPvPmPSt4bytem function [-Rpass=loop-idiom]
    static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;

However, when I uncomment the second member uint64_t b; in Data, clang no longer performs this transformation (godbolt). Is there a reason for this, or is it just a missed optimization? If it is a missed optimization, is there a trick to make clang apply it anyway?

I noticed that if I change value to be of type Data& instead (i.e., remove the temporary local copy), the memcpy optimization is still applied (godbolt).
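For concreteness, the reference variant described above might look like the sketch below. The name foo_ref is hypothetical and only distinguishes it from foo; the second member b is enabled here. Whether clang actually forms the memcpy for a given version is best confirmed with -Rpass=loop-idiom rather than assumed.

```cpp
#include <cstddef>
#include <cstdint>

struct Data {
  uint64_t a;
  uint64_t b;  // second member enabled
};

// Hypothetical name: same body as foo, except that `value` is bound by
// reference instead of being copied into a local Data object first.
void foo_ref(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
  for (uint64_t i = 0; i < count_in; ++i) {
    // No temporary: the element is read directly through the reference.
    Data& value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
    static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
  }
}
```

The only source-level difference from foo is the type of value, which is what makes the differing optimization result surprising.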


Edit: Peter pointed out in the comments that the same thing happens with this simpler, less noisy function:

void bar(Data* __restrict data_out, Data* __restrict data_in, uint64_t count_in) {
  for(uint64_t i = 0; i < count_in; ++i) {
    Data value = data_in[i];
    *data_out++ = value;
  }
}
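To make the expected transformation explicit, here is a minimal sketch of bar next to a hand-written bulk copy. bar_memcpy is a hypothetical name, and the sketch assumes the two-member Data; it only shows what the loop is semantically equivalent to and does not explain why clang fails to form the memcpy on its own.

```cpp
#include <cstdint>
#include <cstring>

struct Data {
  uint64_t a;
  uint64_t b;
};

// Loop form from the question: each element goes through a local copy.
void bar(Data* __restrict data_out, Data* __restrict data_in, uint64_t count_in) {
  for (uint64_t i = 0; i < count_in; ++i) {
    Data value = data_in[i];
    *data_out++ = value;
  }
}

// Hypothetical hand-written equivalent: one bulk copy of count_in elements.
// Valid because Data is trivially copyable and the buffers do not overlap.
void bar_memcpy(Data* __restrict data_out, const Data* __restrict data_in, uint64_t count_in) {
  std::memcpy(data_out, data_in, count_in * sizeof(Data));
}
```

Calling std::memcpy directly sidesteps the loop-idiom pass entirely, at the cost of writing the copy by hand.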

The question remains: Why is it not optimized?
