Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?

Posted 2024-11-25 20:43:59


I wrote some lock-free code that works fine with local
reads, under most conditions.

Does local spinning on a memory read necessarily imply I
have to ALWAYS insert a memory barrier before the spinning
read?

(To validate this, I managed to produce a reader/writer
combination which results in a reader never seeing the
written value, under certain very specific
conditions--dedicated CPU, process attached to CPU,
optimizer turned all the way up, no other work done in the
loop--so the arrows do point in that direction, but I'm not
entirely sure about the cost of spinning through a memory
barrier.)

What is the cost of spinning through a memory barrier if
there is nothing to be flushed in the cache's store buffer?
i.e., all the process is doing (in C) is

while ( 1 ) {
    __sync_synchronize();
    v = value;
    if ( v != 0 ) {
        ... something ...
    }
}

Am I correct to assume that it's free and it won't encumber
the memory bus with any traffic?

Another way to put this is to ask: does a memory barrier do
anything more than: flush the store buffer, apply the
invalidations to it, and prevent the compiler from
reordering reads/writes across its location?
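
One way to put an empirical number on this is to time the spin with and without the barrier. Below is a rough microbenchmark sketch (my own names, not from the original question; assumes GCC and POSIX clock_gettime, compile with something like gcc -O2 bench.c, adding -lrt on older glibc):

#include <stdio.h>
#include <time.h>

volatile int value = 0;   /* never set, so both loops run all n iterations */

static void spin_with_barrier( long n ) {
    for ( long i = 0; i < n; i++ ) {
        __sync_synchronize();      /* the barrier whose cost is in question */
        if ( value != 0 ) break;
    }
}

static void spin_plain( long n ) {
    for ( long i = 0; i < n; i++ )
        if ( value != 0 ) break;   /* volatile read only, no barrier */
}

static double ns_per_iter( void (*loop)(long), long n ) {
    struct timespec a, b;
    clock_gettime( CLOCK_MONOTONIC, &a );
    loop( n );
    clock_gettime( CLOCK_MONOTONIC, &b );
    return ( ( b.tv_sec - a.tv_sec ) * 1e9 + ( b.tv_nsec - a.tv_nsec ) ) / n;
}

int main( void ) {
    long n = 100000000;
    printf( "barrier: %.2f ns/iter\n", ns_per_iter( spin_with_barrier, n ) );
    printf( "plain:   %.2f ns/iter\n", ns_per_iter( spin_plain, n ) );
    return 0;
}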


Disassembling, __sync_synchronize() appears to translate into:

lock orl

From the Intel manual (similarly nebulous for the neophyte):

Volume 3A: System Programming Guide, Part 1 --   8.1.2

Bus Locking

Intel 64 and IA-32 processors provide a LOCK# signal that
is asserted automatically during certain critical memory
operations to lock the system bus or equivalent link.
While this output signal is asserted, requests from other
processors or bus agents for control of the bus are
blocked.

[...]

For the P6 and more recent processor families, if the
memory area being accessed is cached internally in the
processor, the LOCK# signal is generally not asserted;
instead, locking is only applied to the processor’s caches
(see Section 8.1.4, “Effects of a LOCK Operation on
Internal Processor Caches”).

My translation: "when you say LOCK, this would be expensive, but we're
only doing it where necessary."
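
(If you want to check what your own toolchain emits, a quick way is to disassemble a one-line function; these are my commands, and the exact instruction varies by target and flags:)

/* barrier.c */
void barrier_only( void ) {
    __sync_synchronize();
}

/* $ gcc -O2 -c barrier.c && objdump -d barrier.o
   On x86 this typically shows either mfence or, on older 32-bit
   targets, lock orl $0x0,(%esp); both act as full barriers. */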


@BlankXavier:

I did test that if the writer does not explicitly push out the write from the store buffer and it is the only process running on that CPU, the reader may never see the effect of the writer (I can reproduce it with a test program, but as I mentioned above, it happens only with a specific test, with specific compilation options and dedicated core assignments--my algorithm works fine, it's only when I got curious about how this works and wrote the explicit test that I realized it could potentially have a problem down the road).

I think by default simple writes are WB writes (Write Back), which means they don't get flushed out immediately, but reads will take their most recent value (I think they call that "store forwarding"). So I use a CAS instruction for the writer. I discovered in the Intel manual all these different types of write implementations (UC, WC, WT, WB, WP), Intel vol 3A chap 11-10, still learning about them.
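
(For illustration, a writer along those lines, sketched with GCC's __sync builtins; the actual code may differ. __sync_bool_compare_and_swap emits a LOCK'ed cmpxchg on x86, which drains the store buffer:)

volatile int value;   /* the shared flag the reader spins on */

void writer_flag( void ) {
    /* LOCK cmpxchg: set value to 1 only if it is still 0; the LOCK
       prefix makes the store globally visible before returning. */
    __sync_bool_compare_and_swap( &value, 0, 1 );
}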

My uncertainty is on the reader's side: I understand from McKenney's paper that there is also an invalidation queue, a queue of incoming invalidations from the bus into the cache. I'm not sure how this part works. In particular, you seem to imply that looping through a normal read (i.e., non-LOCK'ed, without a barrier, and using volatile only to ensure the optimizer leaves the read in once compiled) will check the "invalidation queue" every time (if such a thing exists). If a simple read is not good enough (i.e., it could read an old cache line which still appears valid pending a queued invalidation (that sounds a bit incoherent to me too, but how do invalidation queues work then?)), then an atomic read would be necessary, and my question is: in this case, will this have any impact on the bus? (I think probably not.)
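
(For comparison, here is a reader spin written with C11 atomics rather than the __sync builtins; this is my sketch, not the original code. On x86 an acquire load compiles to a plain MOV, so it adds no LOCK'ed traffic on the bus:)

#include <stdatomic.h>

atomic_int value;   /* the shared flag, now an atomic */

void reader_spin( void ) {
    /* The acquire load keeps the compiler from caching the value in a
       register and keeps later reads from being hoisted above it. */
    while ( atomic_load_explicit( &value, memory_order_acquire ) == 0 )
        ;   /* spin */
    /* ... something that uses data published before value was set ... */
}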

I'm still reading my way through the Intel manual and while I see a great discussion of store forwarding, I haven't found a good discussion of invalidation queues. I've decided to convert my C code into ASM and experiment, I think this is the best way to really get a feel for how this works.

Comments (3)

停顿的约定 2024-12-02 20:43:59


The "xchg reg,[mem]" instruction will signal its lock intention over the LOCK pin of the core. This signal weaves its way past other cores and caches down to the bus-mastering buses (PCI variants etc) which will finish what they are doing and eventually the LOCKA (acknowledge) pin will signal the CPU that the xchg may complete. Then the LOCK signal is shut off. This sequence can take a long time (hundreds of CPU cycles or more) to complete. Afterwards the appropriate cache lines of the other cores will have been invalidated and you will have a known state, i e one that has ben synchronized between the cores.

The xchg instruction is all that is necessary to implement an atomic lock. If the lock itself is successful you have access to the resource that you have defined the lock to control access to. Such a resource could be a memory area, a file, a device, a function or what have you. Still, it is always up to the programmer to write code that uses this resource when it's been locked and doesn't when it hasn't. Typically the code sequence following a successful lock should be made as short as possible such that other code will be hindered as little as possible from acquiring access to the resource.

Keep in mind that if the lock wasn't successful you need to try again by issuing a new xchg.

"Lock free" is an appealing concept but it requires the elimination of shared resources. If your application has two or more cores simultaneously reading from and writing to a common memory address "lock free" is not an option.

—━☆沉默づ 2024-12-02 20:43:59


I may well not properly have understood the question, but...

If you're spinning, one problem is the compiler optimizing your spin away. Volatile solves this.

The memory barrier, if you have one, will be issued by the writer to the spin lock, not the reader. The writer doesn't actually have to use one - doing so ensures the write is pushed out immediately, but it'll go out pretty soon anyway.

The barrier prevents the thread executing that code from reordering across its location, which is its other cost.
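
(Sketched with assumed names, the writer side described here might look like this; the barrier after the store forces the store buffer to drain before the writer moves on:)

volatile int value;   /* the flag the reader spins on */

void writer_publish( void ) {
    value = 1;              /* plain store; sits in the store buffer briefly */
    __sync_synchronize();   /* full barrier: the store is globally visible
                               before the writer does anything else */
}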

美人骨 2024-12-02 20:43:59


Keep in mind that barriers typically are used to order sets of memory accesses, so your code could very likely also need barriers in other places. For example, it wouldn't be uncommon for the barrier requirement to look like this instead:

while ( 1 ) {

    v = pShared->value;
    __acquire_barrier() ;

    if ( v != 0 ) {
        foo( pShared->something ) ;
    }
}

This barrier would prevent loads and stores in the if block (i.e., pShared->something) from executing before the value load is complete. A typical example is that you have some "producer" that uses a store to value (making v != 0) to flag that some other memory (pShared->something) is in some expected state, as in:

pShared->something = 1 ;  // was 0
__release_barrier() ;
pShared->value = 1 ;  // was 0

In this typical producer consumer scenario, you'll almost always need paired barriers, one for the store that flags that the auxiliary memory is visible (so that the effects of the value store aren't seen before the something store), and one barrier for the consumer (so that the something load isn't started before the value load is complete).

Those barriers are also platform specific. For example, on powerpc (using the xlC compiler), you'd use __isync() and __lwsync() for the consumer and producer respectively. What barriers are required may also depend on the mechanism that you use for the store and load of value. If you've used an atomic intrinsic that results in an Intel LOCK (perhaps implicit), then this will introduce an implicit barrier, so you may not need anything. Additionally, you'll likely also need to make judicious use of volatile (or preferably use an atomic implementation that does so under the covers) in order to get the compiler to do what you want.
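
(For reference, the same paired-barrier pattern in portable C11 atomics instead of the xlC intrinsics; this is a sketch using the answer's names, and foo is assumed to be defined elsewhere:)

#include <stdatomic.h>

struct shared {
    int        something;   /* payload */
    atomic_int value;       /* flag */
};

void foo( int );             /* assumed consumer of the payload */

void producer( struct shared *pShared ) {
    pShared->something = 1;  /* publish the payload first */
    atomic_store_explicit( &pShared->value, 1,
                           memory_order_release );   /* release pairs with
                                                        the acquire below */
}

void consumer( struct shared *pShared ) {
    while ( atomic_load_explicit( &pShared->value,
                                  memory_order_acquire ) == 0 )
        ;   /* spin until the flag is set */
    foo( pShared->something );   /* guaranteed to see the payload store */
}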
