Performance of ThreadLocal variables

How much slower is a read from a ThreadLocal variable than a read from a regular field?

More concretely, is simple object creation faster or slower than access to a ThreadLocal variable?

I assume that it is fast enough that having a ThreadLocal<MessageDigest> instance is much faster than creating an instance of MessageDigest every time. But does that also apply to, say, byte[10] or byte[1000]?

Edit: the question is what really goes on when ThreadLocal's get is called? If that is just a field, like any other, then the answer would be "it's always fastest", right?
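
For concreteness, a minimal sketch of the two alternatives being compared follows; the algorithm and helper names are only illustrative:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestAccess {

    // Alternative 1: create a fresh instance every time (relies on cheap allocation and GC).
    static byte[] hashFresh(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-256").digest(data);
    }

    // Alternative 2: keep one instance per thread and reuse it through a ThreadLocal.
    private static final ThreadLocal<MessageDigest> DIGEST =
            ThreadLocal.withInitial(() -> {
                try {
                    return MessageDigest.getInstance("SHA-256");
                } catch (NoSuchAlgorithmException e) {
                    throw new IllegalStateException(e);
                }
            });

    static byte[] hashCached(byte[] data) {
        MessageDigest md = DIGEST.get();
        md.reset();                 // defensive: clear any half-finished previous use
        return md.digest(data);     // digest() also resets the instance when done
    }
}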

Comments (6)

染火枫林 2024-07-21 22:07:43

In 2009, some JVMs implemented ThreadLocal using an unsynchronised HashMap in the Thread.currentThread() object. This made it extremely fast (though not nearly as fast as using a regular field access, of course), as well as ensuring that the ThreadLocal object got tidied up when the Thread died. Updating this answer in 2016, it seems most (all?) newer JVMs use a ThreadLocalMap with linear probing. I am uncertain about the performance of those – but I cannot imagine it is significantly worse than the earlier implementation.

Of course, new Object() is also very fast these days, and the garbage collectors are also very good at reclaiming short-lived objects.

Unless you are certain that object creation is going to be expensive, or you need to persist some state on a thread-by-thread basis, you are better off going for the simpler allocate-when-needed solution, and only switching over to a ThreadLocal implementation when a profiler tells you that you need to.
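
A small runnable demo of what that per-thread map buys you (the names are illustrative, not from the answer above): each thread sees and mutates only its own value.

public class IsolationDemo {
    private static final ThreadLocal<int[]> COUNTER = ThreadLocal.withInitial(() -> new int[1]);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) {
                COUNTER.get()[0]++;            // bumps this thread's private counter only
            }
            System.out.println(Thread.currentThread().getName() + " -> " + COUNTER.get()[0]);
        };
        Thread a = new Thread(task, "thread-a");
        Thread b = new Thread(task, "thread-b");
        a.start();
        b.start();
        a.join();
        b.join();                              // both threads print 1000: no interference
    }
}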

迷荒 2024-07-21 22:07:43

Running unpublished benchmarks, ThreadLocal.get takes around 35 cycles per iteration on my machine. Not a great deal. In Sun's implementation, a custom linear-probing hash map in Thread maps ThreadLocals to values. Because it is only ever accessed by a single thread, it can be very fast.

Allocation of small objects takes a similar number of cycles, although because of cache exhaustion you may get somewhat lower figures in a tight loop.

Construction of MessageDigest is likely to be relatively expensive. It has a fair amount of state and construction goes through the Provider SPI mechanism. You may be able to optimise by, for instance, cloning or providing the Provider.
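
A hedged sketch of the cloning idea: MessageDigest supports clone() with most providers, so one prototype can pay the Provider SPI cost once and hand out cheap copies afterwards (the class and algorithm names are illustrative):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class DigestFactory {
    private final MessageDigest prototype;

    public DigestFactory(String algorithm) throws NoSuchAlgorithmException {
        this.prototype = MessageDigest.getInstance(algorithm); // pays the Provider SPI cost once
    }

    public MessageDigest newDigest() {
        try {
            return (MessageDigest) prototype.clone(); // cheap copy, no provider lookup
        } catch (CloneNotSupportedException e) {
            // not every provider supports clone(); a fallback to getInstance() would go here
            throw new UnsupportedOperationException(e);
        }
    }
}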

Just because it may be faster to cache in a ThreadLocal rather than create does not necessarily mean that the system performance will increase. You will have additional overhead related to GC, which slows everything down.

Unless your application very heavily uses MessageDigest you might want to consider using a conventional thread-safe cache instead.
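
A minimal sketch of what such a conventional thread-safe cache could look like, here assumed to be a lock-free pool of reusable digests (all names are illustrative):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentLinkedQueue;

public final class DigestPool {
    private final ConcurrentLinkedQueue<MessageDigest> pool = new ConcurrentLinkedQueue<>();
    private final String algorithm;

    public DigestPool(String algorithm) {
        this.algorithm = algorithm;
    }

    public byte[] digest(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = pool.poll();
        if (md == null) {
            md = MessageDigest.getInstance(algorithm); // pool empty: create a new instance
        }
        try {
            return md.digest(data);  // digest() resets the instance after computing
        } finally {
            pool.offer(md);          // return it for reuse by any thread
        }
    }
}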

凡间太子 2024-07-21 22:07:43

Good question, I've been asking myself that recently. To give you definite numbers, the benchmarks below (in Scala, compiled to virtually the same bytecodes as the equivalent Java code):

var cnt: String = ""
val tlocal = new java.lang.ThreadLocal[String] {
  override def initialValue = ""
}

// totalwork and threadnum come from the surrounding benchmark harness (not shown here)
def loop_heap_write = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (cnt ne "") cnt = "!"
    i += 1
  }
  cnt
}

def threadlocal = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (tlocal.get eq null) i = until + i + 1   // never taken: it just forces the read
    i += 1
  }
  if (i > until) println("thread local value was null " + i)
}

The benchmarks, available here, were performed on an AMD 4x 2.8 GHz dual-core and a quad-core i7 with hyperthreading (2.67 GHz).

These are the numbers:

i7

Specs: Intel i7 2x quad-core @ 2.67 GHz
Test: scala.threads.ParallelTests

Test name: loop_heap_read

Thread num.: 1
Total tests: 200

Run times: (showing last 5)
9.0069 9.0036 9.0017 9.0084 9.0074 (avg = 9.1034 min = 8.9986 max = 21.0306 )

Thread num.: 2
Total tests: 200

Run times: (showing last 5)
4.5563 4.7128 4.5663 4.5617 4.5724 (avg = 4.6337 min = 4.5509 max = 13.9476 )

Thread num.: 4
Total tests: 200

Run times: (showing last 5)
2.3946 2.3979 2.3934 2.3937 2.3964 (avg = 2.5113 min = 2.3884 max = 13.5496 )

Thread num.: 8
Total tests: 200

Run times: (showing last 5)
2.4479 2.4362 2.4323 2.4472 2.4383 (avg = 2.5562 min = 2.4166 max = 10.3726 )

Test name: threadlocal

Thread num.: 1
Total tests: 200

Run times: (showing last 5)
91.1741 90.8978 90.6181 90.6200 90.6113 (avg = 91.0291 min = 90.6000 max = 129.7501 )

Thread num.: 2
Total tests: 200

Run times: (showing last 5)
45.3838 45.3858 45.6676 45.3772 45.3839 (avg = 46.0555 min = 45.3726 max = 90.7108 )

Thread num.: 4
Total tests: 200

Run times: (showing last 5)
22.8118 22.8135 59.1753 22.8229 22.8172 (avg = 23.9752 min = 22.7951 max = 59.1753 )

Thread num.: 8
Total tests: 200

Run times: (showing last 5)
22.2965 22.2415 22.3438 22.3109 22.4460 (avg = 23.2676 min = 22.2346 max = 50.3583 )

AMD

Specs: AMD 8220 4x dual-core @ 2.8 GHz
Test: scala.threads.ParallelTests

Test name: loop_heap_read

Total work: 20000000
Thread num.: 1
Total tests: 200

Run times: (showing last 5)
12.625 12.631 12.634 12.632 12.628 (avg = 12.7333 min = 12.619 max = 26.698 )

Test name: loop_heap_read
Total work: 20000000

Run times: (showing last 5)
6.412 6.424 6.408 6.397 6.43 (avg = 6.5367 min = 6.393 max = 19.716 )

Thread num.: 4
Total tests: 200

Run times: (showing last 5)
3.385 4.298 9.7 6.535 3.385 (avg = 5.6079 min = 3.354 max = 21.603 )

Thread num.: 8
Total tests: 200

Run times: (showing last 5)
5.389 5.795 10.818 3.823 3.824 (avg = 5.5810 min = 2.405 max = 19.755 )

Test name: threadlocal

Thread num.: 1
Total tests: 200

Run times: (showing last 5)
200.217 207.335 200.241 207.342 200.23 (avg = 202.2424 min = 200.184 max = 245.369 )

Thread num.: 2
Total tests: 200

Run times: (showing last 5)
100.208 100.199 100.211 103.781 100.215 (avg = 102.2238 min = 100.192 max = 129.505 )

Thread num.: 4
Total tests: 200

Run times: (showing last 5)
62.101 67.629 62.087 52.021 55.766 (avg = 65.6361 min = 50.282 max = 167.433 )

Thread num.: 8
Total tests: 200

Run times: (showing last 5)
40.672 74.301 34.434 41.549 28.119 (avg = 54.7701 min = 28.119 max = 94.424 )

Summary

A thread-local read costs around 10-20x as much as a heap read. It also seems to scale well on this JVM implementation and these architectures as the number of processors grows.
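
(If re-running this measurement today, a small JMH harness along the following lines would avoid most of the pitfalls of hand-rolled timing loops; the class and method names are illustrative and not part of the original benchmark.)

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class ThreadLocalReadBench {
    private String field = "x";  // non-final so the JIT cannot constant-fold it away
    private final ThreadLocal<String> threadLocal = ThreadLocal.withInitial(() -> "x");

    @Benchmark
    public String fieldRead() {
        return field;             // baseline: plain instance-field read
    }

    @Benchmark
    public String threadLocalRead() {
        return threadLocal.get(); // per-thread map lookup keyed by the ThreadLocal
    }
}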

圈圈圆圆圈圈 2024-07-21 22:07:43

@Pete is correct: test before you optimise.

I would be very surprised if constructing a MessageDigest has any serious overhead compared to actually using it.

Misusing ThreadLocal can be a source of leaks and dangling references that don't have a clear life cycle. Generally, I don't ever use ThreadLocal without a very clear plan of when a particular resource will be removed.
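
A minimal sketch of what such a clear removal plan can look like, assuming the value is scoped to a single unit of work (the names are illustrative):

import java.util.HashMap;
import java.util.Map;

public final class RequestContext {
    private static final ThreadLocal<Map<String, Object>> CONTEXT = new ThreadLocal<>();

    public static void runWithContext(Runnable task) {
        CONTEXT.set(new HashMap<>());
        try {
            task.run();
        } finally {
            CONTEXT.remove(); // explicit end of life: no stale references left on pooled threads
        }
    }

    public static Map<String, Object> current() {
        return CONTEXT.get();
    }
}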

热风软妹 2024-07-21 22:07:43

Here is another test. The results show that ThreadLocal is a bit slower than a regular field, but of the same order: approximately 12% slower.

import java.util.HashMap;
import java.util.Map;

public class Test {
    private static final int N = 100000000;
    private static int fieldExecTime = 0;
    private static int threadLocalExecTime = 0;

    public static void main(String[] args) throws InterruptedException {
        int execs = 10;
        for (int i = 0; i < execs; i++) {
            new FieldExample().run(i);
            new ThreadLocaldExample().run(i);
        }
        System.out.println("Field avg:" + (fieldExecTime / execs));
        System.out.println("ThreadLocal avg:" + (threadLocalExecTime / execs));
    }

    private static class FieldExample {
        private Map<String, String> map = new HashMap<String, String>();

        public void run(int z) {
            System.out.println(z + "-Running field sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                map.put(s, "a");
                map.remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            fieldExecTime += t;
            System.out.println(z + "-End field sample:" + t);
        }
    }

    private static class ThreadLocaldExample {
        private ThreadLocal<Map<String, String>> myThreadLocal = new ThreadLocal<Map<String, String>>() {
            @Override protected Map<String, String> initialValue() {
                return new HashMap<String, String>();
            }
        };

        public void run(int z) {
            System.out.println(z + "-Running thread local sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                myThreadLocal.get().put(s, "a");    // one extra ThreadLocal.get() per operation
                myThreadLocal.get().remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            threadLocalExecTime += t;
            System.out.println(z + "-End thread local sample:" + t);
        }
    }
}

Output:

0-Running field sample
0-End field sample:6044
0-Running thread local sample
0-End thread local sample:6015
1-Running field sample
1-End field sample:5095
1-Running thread local sample
1-End thread local sample:5720
2-Running field sample
2-End field sample:4842
2-Running thread local sample
2-End thread local sample:5835
3-Running field sample
3-End field sample:4674
3-Running thread local sample
3-End thread local sample:5287
4-Running field sample
4-End field sample:4849
4-Running thread local sample
4-End thread local sample:5309
5-Running field sample
5-End field sample:4781
5-Running thread local sample
5-End thread local sample:5330
6-Running field sample
6-End field sample:5294
6-Running thread local sample
6-End thread local sample:5511
7-Running field sample
7-End field sample:5119
7-Running thread local sample
7-End thread local sample:5793
8-Running field sample
8-End field sample:4977
8-Running thread local sample
8-End thread local sample:6374
9-Running field sample
9-End field sample:4841
9-Running thread local sample
9-End thread local sample:5471

Field avg:5051
ThreadLocal avg:5664

Env:

openjdk version "1.8.0_131"
Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
Ubuntu 16.04 LTS

过期以后 2024-07-21 22:07:43

Build it and measure it.

Also, you only need one ThreadLocal if you encapsulate your message-digesting behaviour into an object. If you need a local MessageDigest and a local byte[1000] for some purpose, create an object with a MessageDigest field and a byte[] field and put that object into the ThreadLocal rather than both individually.
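
A minimal sketch of that suggestion, with illustrative names and an arbitrarily chosen digest algorithm:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class DigestScratch {
    final MessageDigest digest;
    final byte[] buffer = new byte[1000];

    DigestScratch() {
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // one ThreadLocal lookup gives access to both per-thread pieces
    static final ThreadLocal<DigestScratch> LOCAL = ThreadLocal.withInitial(DigestScratch::new);
}

A caller then does DigestScratch s = DigestScratch.LOCAL.get(); and uses s.digest and s.buffer, paying for a single ThreadLocal lookup.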
