如何使用movntdqa避免缓存污染?
我正在尝试编写一个 memcpy 函数,该函数不会将源内存加载到 CPU 缓存中。 目的是避免缓存污染。 下面的 memcpy 函数可以工作,但会像标准 memcpy 一样污染缓存。 我正在使用带有 Visual C++ 2008 Express 的 P8700 处理器。 我通过 intel vtune 查看了 cpu 缓存的使用情况。
void memcpy(char *dst,char*src,unsigned size){
char *dst_end=dst+size;
while(dst!=dst_end){
__m128i res = _mm_stream_load_si128((__m128i *)src);
*((__m128i *)dst)=res;
src+=16;
dst+=16;
}
}
我有另一个版本,具有相同的结果 - 有效但会污染缓存。
void memcpy(char *dst,char*src,unsigned size){
char *dst_end = dst+size;
__asm{
mov edi, dst
mov edx, dst_end
mov esi,src
inner_start:
LFENCE
MOVNTDQA xmm0, [esi ]
MOVNTDQA xmm1, [esi+16]
MOVNTDQA xmm2, [esi+32]
MOVNTDQA xmm3, [esi+48]
//19. ; Copy data to buffer
MOVDQA [edi], xmm0
MOVDQA [edi+16], xmm1
MOVDQA [edi+32], xmm2
MOVDQA [edi+48], xmm3
// 25. ; Increment pointers by cache line size and test for end of loop
add esi, 040h
add edi, 040h
cmp edi, edx
jne inner_start
}
}
更新:这是测试程序
void test(int table_size,int num_iter,int item_size){
char *src_table=alloc_aligned(table_size*item_size);//return value is aligned on 64 bytes
char *dst=alloc_aligned(item_size); //destination is always the same buffer
for (int i=0;i<num_iter;i++){
int location=my_rand()%table_size;
char *src=src_table+location*item_size;//selecting a different src every time
memcpy(dst,src,item_size);
}
}
main(){
test(1024*32,1024*1024,1024*32)
}
i am trying to write a memcpy function that does not load the source memory to the cpu cache. The purpose is to avoid cache pollution.
The memcpy function below works, but pollutes the cache like the standard memcpy does. i am using P8700 proccesoor with visual C++ 2008 express. i see the cpu cache usage with intel vtune.
void memcpy(char *dst,char*src,unsigned size){
char *dst_end=dst+size;
while(dst!=dst_end){
__m128i res = _mm_stream_load_si128((__m128i *)src);
*((__m128i *)dst)=res;
src+=16;
dst+=16;
}
}
i have another version, that have the same results - works but pollutes the cache.
void memcpy(char *dst,char*src,unsigned size){
char *dst_end = dst+size;
__asm{
mov edi, dst
mov edx, dst_end
mov esi,src
inner_start:
LFENCE
MOVNTDQA xmm0, [esi ]
MOVNTDQA xmm1, [esi+16]
MOVNTDQA xmm2, [esi+32]
MOVNTDQA xmm3, [esi+48]
//19. ; Copy data to buffer
MOVDQA [edi], xmm0
MOVDQA [edi+16], xmm1
MOVDQA [edi+32], xmm2
MOVDQA [edi+48], xmm3
// 25. ; Increment pointers by cache line size and test for end of loop
add esi, 040h
add edi, 040h
cmp edi, edx
jne inner_start
}
}
update: this is the test program
void test(int table_size,int num_iter,int item_size){
char *src_table=alloc_aligned(table_size*item_size);//return value is aligned on 64 bytes
char *dst=alloc_aligned(item_size); //destination is always the same buffer
for (int i=0;i<num_iter;i++){
int location=my_rand()%table_size;
char *src=src_table+location*item_size;//selecting a different src every time
memcpy(dst,src,item_size);
}
}
main(){
test(1024*32,1024*1024,1024*32)
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
引用自 英特尔:
这解释了为什么代码不起作用——内存是 WB 类型。
Quoting from Intel:
That explains why the code does not work — the memory is of type WB.