Is there a standard strided version of memcpy?
I have a column vector A which is 10 elements long. I have a matrix B which is 10 by 10. The memory storage for B is column major. I would like to overwrite the first row in B with the column vector A.
Clearly, I can do:
    for (int i = 0; i < 10; i++)
    {
        B[0 + 10 * i] = A[i];
    }
where I've left the zero in 0 + 10 * i to highlight that B uses column-major storage (zero is the row index).
After some shenanigans in CUDA-land tonight, it occurred to me that there might be a CPU function that performs a strided memcpy. I guess that, at a low level, performance would depend on the existence of a strided load/store instruction, which I don't recall there being in x86 assembly.
Comments (1)
Short answer: The code you have written is as fast as it's going to get.
Long answer: The memcpy function is written using some complicated intrinsics or assembly because it operates on memory operands that have arbitrary size and alignment. If you are overwriting a column of a matrix, then your operands will have natural alignment, and you won't need to resort to the same tricks to get decent speed.
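To make the point concrete, here is a minimal sketch of what a hand-rolled strided copy could look like in C. The names strided_copy and set_row_colmajor are hypothetical helpers, not standard library functions, and the element type double is an assumption, since the question does not say what A and B contain. Strides are counted in elements, not bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the C standard library.
   Copies n doubles, stepping through src by src_stride elements
   and through dst by dst_stride elements. */
static void strided_copy(double *dst, size_t dst_stride,
                         const double *src, size_t src_stride,
                         size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i * dst_stride] = src[i * src_stride];
    }
}

/* Hypothetical helper: overwrite row `row` of a column-major
   rows x cols matrix B with the cols elements of vector A. */
static void set_row_colmajor(double *B, size_t rows, size_t cols,
                             size_t row, const double *A)
{
    assert(row < rows);
    /* Element (row, j) is stored at B[row + rows * j], so consecutive
       elements of one row are `rows` elements apart in memory. */
    strided_copy(B + row, rows, A, 1, cols);
}
```

Under these assumptions, set_row_colmajor(B, 10, 10, 0, A) does the same work as the loop in the question. If a BLAS library is already linked, its dcopy routine (cblas_dcopy(n, x, incx, y, incy) in the C interface) takes explicit increments for source and destination and is probably the closest thing to a widely available strided copy, although for 10 elements it is unlikely to beat the plain loop.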