Java compare-and-swap semantics and performance



What are the semantics of compare and swap in Java? Namely, does the compareAndSet method of an AtomicInteger just guarantee ordered access between different threads to the particular memory location of the atomic integer instance, or does it guarantee ordered access to all locations in memory, i.e. does it act as if it were a volatile (a memory fence)?

From the docs:

  • weakCompareAndSet atomically reads and conditionally writes a variable but does not create any happens-before orderings, so provides no guarantees with respect to previous or subsequent reads and writes of any variables other than the target of the weakCompareAndSet.
  • compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.

It's apparent from the API documentation that compareAndSet acts as if it were a volatile variable. However, weakCompareAndSet is supposed to just change its specific memory location. Thus, if that memory location is exclusive to the cache of a single processor, weakCompareAndSet is supposed to be much faster than the regular compareAndSet.
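To make the quoted difference concrete, here is a minimal sketch (class and field names are illustrative, not from any real codebase) of a situation where the specification distinguishes the two: a successful compareAndSet is a volatile write, so a reader that observes the new flag value is also guaranteed to see the earlier plain write, whereas the weakCompareAndSet wording promises no such ordering for other variables.

import java.util.concurrent.atomic.AtomicInteger;

class PublishSketch {
  static int payload;                                  // deliberately non-volatile
  static final AtomicInteger flag = new AtomicInteger(0);

  static void writer() {
    payload = 42;
    flag.compareAndSet(0, 1);                          // has volatile-write memory effects
  }

  static void reader() {
    while (flag.get() != 1) { /* spin */ }             // volatile read
    // Because the successful CAS above is a volatile write, this read is guaranteed
    // to see 42; with weakCompareAndSet the spec makes no such promise for payload.
    assert payload == 42;
  }
}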

I'm asking this because I've benchmarked the following methods by running threadnum different threads, varying threadnum from 1 to 8, and setting totalwork=1e9 (the code is written in Scala, a statically compiled JVM language, but both its meaning and its bytecode translation are isomorphic to those of Java in this case - these short snippets should be clear):

import java.util.concurrent.atomic.AtomicInteger

// totalwork and threadnum are fields of the surrounding benchmark harness (not shown here)
val atomic_cnt = new AtomicInteger(0)
val atomic_tlocal_cnt = new java.lang.ThreadLocal[AtomicInteger] {
  override def initialValue = new AtomicInteger(0)
}

def loop_atomic_tlocal_cas = {
  var i = 0
  val until = totalwork / threadnum
  val acnt = atomic_tlocal_cnt.get
  while (i < until) {
    i += 1
    acnt.compareAndSet(i - 1, i)
  }
  acnt.get + i
}

def loop_atomic_weakcas = {
  var i = 0
  val until = totalwork / threadnum
  val acnt = atomic_cnt
  while (i < until) {
    i += 1
    acnt.weakCompareAndSet(i - 1, i)
  }
  acnt.get + i
}

def loop_atomic_tlocal_weakcas = {
  var i = 0
  val until = totalwork / threadnum
  val acnt = atomic_tlocal_cnt.get
  while (i < until) {
    i += 1
    acnt.weakCompareAndSet(i - 1, i)
  }
  acnt.get + i
}

on an AMD machine with four dual-core 2.8 GHz processors and on a 2.67 GHz quad-core i7. The JVM is Sun's Server HotSpot JVM 1.6. The results show no performance difference.
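For concreteness, here is a minimal Java sketch of the kind of driver such a measurement implies (the original harness was not included in the question; THREAD_NUM and TOTAL_WORK stand in for threadnum and totalwork, and the loop body mirrors the CAS loops above):

import java.util.concurrent.atomic.AtomicInteger;

class CasBench {
  static final int THREAD_NUM = 4;              // varied from 1 to 8 in the runs below
  static final long TOTAL_WORK = 1000000000L;   // totalwork = 1e9

  public static void main(String[] args) throws InterruptedException {
    Thread[] workers = new Thread[THREAD_NUM];
    long start = System.nanoTime();
    for (int t = 0; t < THREAD_NUM; t++) {
      workers[t] = new Thread(new Runnable() {
        public void run() {
          AtomicInteger acnt = new AtomicInteger(0);   // per-thread instance, as in the tlocal variants
          long until = TOTAL_WORK / THREAD_NUM;
          for (int i = 1; i <= until; i++) {
            acnt.compareAndSet(i - 1, i);
          }
        }
      });
      workers[t].start();
    }
    for (Thread w : workers) w.join();
    System.out.println((System.nanoTime() - start) / 1e6 + " ms");
  }
}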

Specs: AMD 8220 4x dual-core @ 2.8 GHz

Test name: loop_atomic_tlocal_cas

  • Thread num.: 1

Run times: (showing last 3)
7504.562 7502.817 7504.626 (avg = 7415.637 min = 7147.628 max = 7504.886 )

  • Thread num.: 2

Run times: (showing last 3)
3751.553 3752.589 3751.519 (avg = 3713.5513 min = 3574.708 max = 3752.949 )

  • Thread num.: 4

Run times: (showing last 3)
1890.055 1889.813 1890.047 (avg = 2065.7207 min = 1804.652 max = 3755.852 )

  • Thread num.: 8

Run times: (showing last 3)
960.12 989.453 970.842 (avg = 1058.8776 min = 940.492 max = 1893.127 )


Test name: loop_atomic_weakcas

  • Thread num.: 1

Run times: (showing last 3)
7325.425 7057.03 7325.407 (avg = 7231.8682 min = 7057.03 max = 7325.45 )

  • Thread num.: 2

Run times: (showing last 3)
3663.21 3665.838 3533.406 (avg = 3607.2149 min = 3529.177 max = 3665.838 )

  • Thread num.: 4

Run times: (showing last 3)
3664.163 1831.979 1835.07 (avg = 2014.2086 min = 1797.997 max = 3664.163 )

  • Thread num.: 8

Run times: (showing last 3)
940.504 928.467 921.376 (avg = 943.665 min = 919.985 max = 997.681 )


Test name: loop_atomic_tlocal_weakcas

  • Thread num.: 1

Run times: (showing last 3)
7502.876 7502.857 7502.933 (avg = 7414.8132 min = 7145.869 max = 7502.933 )

  • Thread num.: 2

Run times: (showing last 3)
3752.623 3751.53 3752.434 (avg = 3710.1782 min = 3574.398 max = 3752.623 )

  • Thread num.: 4

Run times: (showing last 3)
1876.723 1881.069 1876.538 (avg = 4110.4221 min = 1804.62 max = 12467.351 )

  • Thread num.: 8

Run times: (showing last 3)
959.329 1010.53 969.767 (avg = 1072.8444 min = 959.329 max = 1880.049 )

Specs: Intel i7 quad-core @ 2.67 GHz

Test name: loop_atomic_tlocal_cas

  • Thread num.: 1

Run times: (showing last 3)
8138.3175 8130.0044 8130.1535 (avg = 8119.2888 min = 8049.6497 max = 8150.1950 )

  • Thread num.: 2

Run times: (showing last 3)
4067.7399 4067.5403 4068.3747 (avg = 4059.6344 min = 4026.2739 max = 4068.5455 )

  • Thread num.: 4

Run times: (showing last 3)
2033.4389 2033.2695 2033.2918 (avg = 2030.5825 min = 2017.6880 max = 2035.0352 )


Test name: loop_atomic_weakcas

  • Thread num.: 1

Run times: (showing last 3)
8130.5620 8129.9963 8132.3382 (avg = 8114.0052 min = 8042.0742 max = 8132.8542 )

  • Thread num.: 2

Run times: (showing last 3)
4066.9559 4067.0414 4067.2080 (avg = 4086.0608 min = 4023.6822 max = 4335.1791 )

  • Thread num.: 4

Run times: (showing last 3)
2034.6084 2169.8127 2034.5625 (avg = 2047.7025 min = 2032.8131 max = 2169.8127 )


Test name: loop_atomic_tlocal_weakcas

  • Thread num.: 1

Run times: (showing last 3)
8132.5267 8132.0299 8132.2415 (avg = 8114.9328 min = 8043.3674 max = 8134.0418 )

  • Thread num.: 2

Run times: (showing last 3)
4066.5924 4066.5797 4066.6519 (avg = 4059.1911 min = 4025.0703 max = 4066.8547 )

  • Thread num.: 4

Run times: (showing last 3)
2033.2614 2035.5754 2036.9110 (avg = 2033.2958 min = 2023.5082 max = 2038.8750 )


While it's possible that thread locals in the example above end up in the same cache lines, it seems to me that there is no observable performance difference between regular CAS and its weak version.

This could mean that, in fact, a weak compare and swap acts as a fully fledged memory fence, i.e. acts as if it were a volatile variable.

Question: Is this observation correct? Also, is there a known architecture or Java distribution for which a weak compare and set is actually faster? If not, what is the advantage of using a weak CAS in the first place?

Comments (3)

待天淡蓝洁白时 2024-10-10 05:34:15


A weak compare and swap could act as a full volatile variable, depending on the implementation of the JVM, sure. In fact, I wouldn't be surprised if on certain architectures it is not possible to implement a weak CAS in a notably more performant way than the normal CAS. On these architectures, it may well be the case that weak CASes are implemented exactly the same as a full CAS. Or it might simply be that your JVM has not had much optimisation put into making weak CASes particularly fast, so the current implementation just invokes a full CAS because it's quick to implement, and a future version will refine this.

The JLS simply says that a weak CAS does not establish a happens-before relationship, so it's simply that there is no guarantee that the modification it causes is visible in other threads. All you get in this case is the guarantee that the compare-and-set operation is atomic, but with no guarantees about the visibility of the (potentially) new value. That's not the same as guaranteeing that it won't be seen, so your tests are consistent with this.
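The intended use of the weak form is therefore typically a retry loop, where an occasional spurious failure just costs one more pass; a minimal Java sketch (illustrative names, not from any particular codebase):

import java.util.concurrent.atomic.AtomicInteger;

class WeakIncrementSketch {
  // Retry until the CAS succeeds: each pass re-reads the current value with get(),
  // so a spurious failure of weakCompareAndSet is harmless here.
  static int increment(AtomicInteger counter) {
    int current;
    do {
      current = counter.get();
    } while (!counter.weakCompareAndSet(current, current + 1));
    return current + 1;
  }
}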

In general, try to avoid making any conclusions about concurrency-related behaviour through experimentation. There are so many variables to take into account, that if you don't follow what the JLS guarantees to be correct, then your program could break at any time (perhaps on a different architecture, perhaps under more aggressive optimisation that's prompted by a slight change in the layout of your code, perhaps under future builds of the JVM that don't exist yet, etc.). There's never a reason to assume you can get away with something that's stated not to be guaranteed, because experiments show that "it works".

笑看君怀她人 2024-10-10 05:34:15


The x86 instruction for "atomically compare and swap" is LOCK CMPXCHG. This instruction creates a full memory fence.

There is no instruction that does this job without creating a memory fence, so it is very likely that both compareAndSet and weakCompareAndSet map to LOCK CMPXCHG and perform a full memory fence.
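One way to check this on a given machine is to run a small probe class under HotSpot's diagnostic assembly printer and look for lock cmpxchg in the JIT-compiled loop (a sketch assuming the hsdis disassembler plugin is installed; CasProbe is a made-up name):

// Run with something like:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly CasProbe
import java.util.concurrent.atomic.AtomicInteger;

class CasProbe {
  public static void main(String[] args) {
    AtomicInteger a = new AtomicInteger(0);
    for (int i = 0; i < 5000000; i++) {      // enough iterations for the loop to be JIT-compiled
      a.compareAndSet(i, i + 1);
    }
    System.out.println(a.get());             // on x86 the compiled loop should show a lock cmpxchg
  }
}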

But that's for x86, other architectures (including future variants of x86) may do things differently.

萌吟 2024-10-10 05:34:15


weakCompareAndSet is not guaranteed to be faster; it's just permitted to be faster. You can look at the open-source code of the OpenJDK to see what some smart people decided to do with this permission:

Namely: They're both implemented as the one-liner

return unsafe.compareAndSwapObject(this, valueOffset, expect, update);
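For AtomicInteger specifically, the JDK 6/7-era sources read essentially the same way for both methods (paraphrased below, not a verbatim copy; unsafe and valueOffset are private fields of AtomicInteger):

public final boolean compareAndSet(int expect, int update) {
  return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}

public final boolean weakCompareAndSet(int expect, int update) {
  return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}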

They have exactly the same performance, because they have exactly the same implementation! (in OpenJDK at least). Other people have remarked on the fact that you can't really do any better on x86 anyway, because the hardware already gives you a bunch of guarantees "for free". It's only on simpler architectures like ARM that you have to worry about it.
