clang++: Why is this memcpy loop idiom not optimized when adding another struct member?
Given this code snippet:
#include <cstdint>
#include <cstddef>
struct Data {
    uint64_t a;
    //uint64_t b;
};
void foo(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
    for(uint64_t i = 0; i < count_in; ++i) {
        Data value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
        static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
    }
}
clang replaces the loop in foo with a memcpy call, just as expected (godbolt), giving the -Rpass output:
example.cpp:16:59: remark: Formed a call to llvm.memcpy.p0.p0.i64() intrinsic from load and store instruction in _Z3fooPvPmPSt4bytem function [-Rpass=loop-idiom]
static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
However, when I uncomment the second member (uint64_t b;) in Data, it no longer does that (godbolt). Is there a reason for this, or is it just a missed optimization? In the latter case, is there any trick to make clang still apply this optimization?
I noticed that if I change value to be of type Data& instead (i.e., remove the temporary local copy), the memcpy optimization is still applied (godbolt).
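For concreteness, here is a minimal sketch of that reference variant, with both members of Data enabled. The name foo_ref is hypothetical (it is not in my original code), and whether clang actually forms the memcpy may depend on the compiler version and flags.

```cpp
#include <cstdint>
#include <cstddef>

// Two-member version of Data (second member uncommented).
struct Data {
    uint64_t a;
    uint64_t b;
};

// Hypothetical name: same loop as foo, but `value` is bound as a
// reference instead of being copied into a local temporary.
void foo_ref(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
    for (uint64_t i = 0; i < count_in; ++i) {
        // No local copy: the assignment below reads directly from the input.
        Data& value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
        static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
    }
}
```

Compiling this with something like clang++ -std=c++17 -O2 -Rpass=loop-idiom (the optimization level is an assumption) should show the loop-idiom remark if the memcpy is formed.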
Edit: Peter pointed out in the comments that the same thing happens with this simpler / less noisy method:
void bar(Data* __restrict data_out, Data* __restrict data_in, uint64_t count_in) {
    for(uint64_t i = 0; i < count_in; ++i) {
        Data value = data_in[i];
        *data_out++ = value;
    }
}
The question remains: Why is it not optimized?