What is the difference between copying an unsigned int twice and an unsigned long once on a 64-bit system?
What is the difference between
*(unsigned*)d = *(unsigned*)s;
d+=4; s+=4;
*(unsigned*)d = *(unsigned*)s;
d+=4; s+=4;
and
*(unsigned long*)d = *(unsigned long*)s;
d+=8; s+=8;
on 64-bit systems?
Comments (4)
Provided that nothing unpleasant happens with respect to padding bits or the strict aliasing rules, and assuming the sizes of the types are what you expect, and provided that the memory regions don't overlap and are correctly aligned, each snippet copies 8 bytes from one place to another.
Of course, aside from the practical effect there may be a difference in performance and/or code size.
If you're seeing something break, look at the actual code the compiler emits; that may tell you what has gone wrong. Unless you have a lot of optimization switched on, and maybe even then, I don't immediately see why the two wouldn't be equivalent on AMD64, Ubuntu, and gcc.
Things I've mentioned that could go wrong:
- unsigned and unsigned long are permitted to have padding bits, and if they do, there could be bit patterns that are trap representations of one or both, which could explode as soon as you dereference.
- If s and d are the result of casting pointers-to-double to uint8_t*, and you then look at the resulting double, in one or both cases you might not see the effect of the change, because you have an illegal type-pun.
- If sizeof(long) == 4, then the two aren't equivalent. long is 32 bits on 64-bit Windows systems, just not on 64-bit Linux ones.
- If d == s + 4, then the two code snippets have different effects. Because of this, you won't see the first optimized into the second unless the compiler knows that d and s point to entirely different places (and that's what C99 restrict is for).
- If s or d is correctly aligned for int but not for long, then there's a difference. (Edit: apparently you can enable or disable hardware exceptions for unaligned access on x86-64.)
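To make the overlap case (d == s + 4) concrete, here is a minimal sketch. The helper names copy_2x32 and copy_1x64 are made up for illustration, and the copies go through memcpy() into a temporary so that the example itself does not depend on alignment or aliasing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: two 4-byte copies, advancing both pointers in between,
 * mirroring the first snippet in the question. */
static void copy_2x32(unsigned char *d, const unsigned char *s)
{
    uint32_t t;
    memcpy(&t, s, 4);
    memcpy(d, &t, 4);
    d += 4;
    s += 4;
    memcpy(&t, s, 4);   /* if d == s + 4, this re-reads bytes just written */
    memcpy(d, &t, 4);
}

/* Hypothetical helper: one 8-byte copy, mirroring the second snippet.
 * All 8 source bytes are read before any destination byte is written. */
static void copy_1x64(unsigned char *d, const unsigned char *s)
{
    uint64_t t;
    memcpy(&t, s, 8);
    memcpy(d, &t, 8);
}
```

Starting from a 12-byte buffer holding 0..11 and calling each helper with d = buf + 4 and s = buf, the two 4-byte copies leave 0 1 2 3 in bytes 8..11, while the single 8-byte copy leaves 4 5 6 7 there.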
If you need to copy exactly eight bytes, why not use memcpy()?
With GCC (at least with optimization enabled), a small fixed-size memcpy() is normally compiled to inline code rather than a call to the library function, so it should be as fast as your hand-written memory copy.
As an added bonus, your code will work on ILP32 systems, LP64 (most 64-bit Unix), and LLP64 (Win64), and even on systems with strict alignment requirements.
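A minimal sketch of that suggestion follows. The function name copy8 is hypothetical, and as with any memcpy() call the two regions must not overlap.

```c
#include <string.h>

/* Hypothetical helper: copy exactly eight bytes from s to d.
 * Makes no assumption about sizeof(long) or about pointer alignment;
 * the source and destination regions must not overlap. */
static void copy8(void *d, const void *s)
{
    memcpy(d, s, 8);
}
```

To mirror the question's code, the caller would follow copy8(d, s) with d += 8; s += 8;.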
If performance is not critical, you should probably just use memcpy(), as in another answer.
If this code occurs soon after a write to *s, match the types; if this code occurs soon before a read from *d, match the types. This ensures that store-to-load forwarding (moving the data from the store directly to the load, without waiting for the store to write the data back into the data cache) will work on as many CPUs as possible. Store-to-load forwarding almost always works if the addresses and sizes of the store and load match and are aligned, and may work more often depending on the CPU. If store-to-load forwarding fails, the penalty tends to be on the order of 10 clock cycles.
If you can avoid a store-to-load forwarding problem by adding additional shift/and/or operations, this is often faster.
If you use C's type system more effectively and avoid casts, many store-to-load forwarding problems will be avoided.
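As an illustration of the shift idea, here is a sketch of two ways to read the upper 32 bits of a location that is assumed to have just been written with a single 8-byte store. The function names are hypothetical, a little-endian layout (as on x86-64) is assumed, and whether the narrow reload actually misses forwarding depends on the CPU and on what the compiler does with the code.

```c
#include <stdint.h>
#include <string.h>

/* Narrow reload: a 4-byte load from the middle of a location that was
 * (by assumption) just written with one 8-byte store. The size and
 * address differ from the store, so forwarding may fail on some CPUs. */
static uint32_t high_half_narrow(const unsigned char *p)
{
    uint32_t hi;
    memcpy(&hi, p + 4, sizeof hi); /* bytes 4..7 are the high half on little-endian */
    return hi;
}

/* Matched reload: load with the same address and size as the store,
 * then extract the high half with a shift. */
static uint32_t high_half_shifted(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return (uint32_t)(v >> 32);
}
```

Both functions return the same value on a little-endian machine; the second one trades an extra shift for a load that matches the preceding store.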
Try casting to (unsigned long long*) instead; unsigned long long is at least 64 bits wide on every platform, including 64-bit Windows.