InterlockedExchange and memory alignment

Posted on 2024-07-20 06:15:34

I am confused: Microsoft says memory alignment is required for InterlockedExchange, yet the Intel documentation says memory alignment is not required for LOCK.
Am I missing something?
Thanks.

From the Microsoft MSDN Library:

Platform SDK: DLLs, Processes, and Threads
InterlockedExchange

The variable pointed to by the Target parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
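To make the 32-bit boundary requirement concrete, here is a minimal sketch of checking it before an exchange. It is only an illustration: `std::atomic<std::int32_t>::exchange` is used as a portable stand-in for InterlockedExchange, and the helper names `is_aligned_32` and `checked_exchange` are hypothetical, not part of any Microsoft API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when p lies on a 32-bit (4-byte) boundary.
inline bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 4 == 0;
}

// Portable stand-in for InterlockedExchange (an assumption for this sketch):
// stores `value` into `target` and returns the previous contents.
inline std::int32_t checked_exchange(std::atomic<std::int32_t>& target,
                                     std::int32_t value) {
    // Reject targets that violate the documented alignment requirement.
    assert(is_aligned_32(&target));
    return target.exchange(value);
}
```

An ordinary variable declared as `std::atomic<std::int32_t>` is normally placed on such a boundary by the compiler; the check matters mostly for pointers computed by hand, for example into a packed buffer.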

From the Intel Software Developer’s Manual:

  • LOCK instruction
    Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

    The integrity of the LOCK prefix is not affected by the alignment of the memory field.
    Memory locking is observed for arbitrarily misaligned fields.

  • Memory Ordering in P6 and More Recent Processor Families

    Locked instructions have a total order.

  • Software Controlled Bus Locking

    The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:
    • Any boundary for an 8-bit access (locked or otherwise).
    • 16-bit boundary for locked word accesses.
    • 32-bit boundary for locked doubleword accesses.
    • 64-bit boundary for locked quadword accesses.
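The natural boundaries listed above can be illustrated with a small sketch showing how a field ends up off its boundary and how to put it back. The struct names and the helper `on_natural_boundary` are made up for this example, and `#pragma pack` is assumed to be available (MSVC, GCC, and Clang all accept it).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Packing to 1 byte removes padding, so the 32-bit counter starts at offset 1.
#pragma pack(push, 1)
struct PackedRecord {
    char tag;
    std::int32_t counter;  // offset 1: not on a 32-bit boundary
};
#pragma pack(pop)

// With default layout (alignas makes the intent explicit) the counter
// is padded out to its natural 4-byte boundary.
struct NaturalRecord {
    char tag;
    alignas(4) std::int32_t counter;  // offset 4
};

// Hypothetical helper: an operand of operand_size bytes (1, 2, 4, or 8) is on
// its natural boundary when its offset is a multiple of its size. This assumes
// the enclosing object itself starts on a suitably aligned address.
inline bool on_natural_boundary(std::size_t offset, std::size_t operand_size) {
    return offset % operand_size == 0;
}
```

A locked access to `PackedRecord::counter` would still be atomic on x86 according to the quoted text, but it is exactly the case the manual recommends avoiding for performance.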


Comments (4)

旧人 2024-07-27 06:15:34

Once upon a time, Microsoft supported Windows NT on processors other than x86, such as MIPS, PowerPC, and Alpha. These processors all require alignment for their interlocked instructions, so Microsoft put the requirement in their spec to ensure that these primitives would be portable to different architectures.

温馨耳语 2024-07-27 06:15:34

Even though the LOCK prefix doesn't require the memory operand to be aligned, and the cmpxchg operation that's probably used to implement InterlockedExchange() doesn't require alignment either, if the OS has enabled alignment checking then cmpxchg will raise an alignment check exception (#AC) when executed with unaligned operands. Check the docs for cmpxchg and similar instructions, looking at the list of protected mode exceptions. I don't know for sure that Windows enables alignment checking, but it wouldn't surprise me.

狼性发作 2024-07-27 06:15:34

Hey, I answered a few questions related to this. Also keep in mind:

  1. There is NO byte-level InterlockedExchange; there IS, however, a 16-bit (short) InterlockedExchange.
  2. The documentation discrepancy you refer to is probably just a documentation oversight.
  3. If you want to do byte/bit-level atomic access, there ARE plenty of ways to do this with the existing intrinsics: Interlocked[And8|Or8|Xor8].
  4. Any operation where you're doing high-performance locking (using machine code like you discuss) should not be operating on unaligned data (a performance anti-pattern).
  5. xchg (an instruction with an implicit LOCK prefix, optimized due to its ability to cache-lock and avoid a full bus lock to main memory) CAN do 8-bit interlocked operations.
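As a rough, portable illustration of the byte-level operations in points 3 and 5, the sketch below uses `std::atomic<std::uint8_t>` rather than the Windows intrinsics, so it is an analogue and not the Interlocked API itself. The function names `set_flags`, `clear_flags`, and `swap_byte` are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically ORs mask into flags and returns the previous value
// (the role an 8-bit interlocked OR plays).
inline std::uint8_t set_flags(std::atomic<std::uint8_t>& flags, std::uint8_t mask) {
    return flags.fetch_or(mask);
}

// Atomically clears the bits in mask and returns the previous value
// (an 8-bit interlocked AND with the inverted mask).
inline std::uint8_t clear_flags(std::atomic<std::uint8_t>& flags, std::uint8_t mask) {
    return flags.fetch_and(static_cast<std::uint8_t>(~mask));
}

// Atomically replaces the whole byte and returns the previous value.
// On x86 a compiler will typically emit xchg for this, but that is
// an implementation detail, not a guarantee.
inline std::uint8_t swap_byte(std::atomic<std::uint8_t>& flags, std::uint8_t value) {
    return flags.exchange(value);
}
```

A single byte can never straddle a boundary, which is why alignment is a non-issue for this width.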

I nearly forgot: Intel's TBB has Load/Store 8 routines (8 bytes, judging by the qword accesses in the code) defined without the use of implicit or explicit locking (in some cases):

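.code
    ALIGN 4
    PUBLIC c __TBB_machine_load8
__TBB_machine_Load8:
    ; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
    mov ecx,4[esp]          ; ecx = address argument
    test ecx,7              ; any of the low three bits set means not 8-byte aligned
    jne load_slow           ; misaligned path (label not included in this excerpt)
    ; Load within a cache line
    sub esp,12
    fild qword ptr [ecx]    ; one 64-bit x87 load from the target
    fistp qword ptr [esp]   ; spill the value to the stack
    mov eax,[esp]           ; return the result in edx:eax
    mov edx,4[esp]
    add esp,12
    ret

EXTRN __TBB_machine_store8_slow:PROC
.code
    ALIGN 4
    PUBLIC c __TBB_machine_store8
__TBB_machine_Store8:
    ; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
    mov ecx,4[esp]          ; ecx = address argument
    test ecx,7              ; any of the low three bits set means not 8-byte aligned
    jne __TBB_machine_store8_slow ;; tail call to tbb_misc.cpp
    fild qword ptr 8[esp]   ; load the 64-bit value argument
    fistp qword ptr [ecx]   ; write it to the target with one 64-bit store
    ret
end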

Anyhow, hope that clears at least some of this up for you.

蓝眸 2024-07-27 06:15:34

I don't understand where your Intel information is coming from.

To me, it's pretty clear that Intel cares A LOT about alignment and/or spanning cache lines.

For example, on a Core i7 processor, you STILL have to make sure your data does not span cache lines, or else the operation is NOT guaranteed to be atomic.

In Volume 3-I, System Programming, for x86/x64, Intel clearly states:

8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following
basic memory operations will always be carried out atomically:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary
  • Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following
additional memory operations will always be carried out atomically:

  • Reading or writing a quadword aligned on a 64-bit boundary
  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors (and newer processors since) guarantee that the following
additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache
    line

Accesses to cacheable memory that are split across cache lines and page boundaries
are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core
Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors.
The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon,
and P6 family processors provide bus control signals that permit external memory
subsystems to make split accesses atomic; however, nonaligned data accesses will
seriously impact the performance of the processor and should be avoided.
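The cache-line condition in that passage is easy to test for. The following is a minimal sketch, assuming a 64-byte line (common on recent x86 parts, but an assumption here, as is the helper name `spans_cache_line`), that reports whether an access of a given size would be split across two lines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size in bytes; query the real value on the target CPU.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical helper: true when the bytes [address, address + size)
// do not all fall inside one cache line.
inline bool spans_cache_line(std::uintptr_t address, std::size_t size,
                             std::size_t line = kAssumedCacheLine) {
    // Compare the line index of the first and the last byte of the access.
    return size != 0 && (address / line) != ((address + size - 1) / line);
}
```

An operand on its natural boundary (size 1, 2, 4, or 8) can never trip this check, because the line size is a multiple of each of those sizes. That is why natural alignment is the simplest way to stay inside the guarantees quoted above.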
