How can assigning a variable cause a serious performance drop while the execution order remains (almost) unchanged?

Published on 2024-10-31 20:11:16


When playing around with multithreading, I observed some unexpected but serious performance issues related to AtomicLong (and classes that use it, such as java.util.Random), for which I currently have no explanation. I created a minimalistic example, which basically consists of two classes: a class "Container", which keeps a reference to a volatile variable, and a class "DemoThread", which operates on an instance of "Container" during thread execution. Note that the references to "Container" and the volatile long are private and never shared between threads (I know there's no need to use volatile here; it's just for demonstration purposes). Thus, multiple instances of "DemoThread" should run perfectly in parallel on a multiprocessor machine, but for some reason they do not (the complete example is at the bottom of this post).

private static class Container  {

    private volatile long value;

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}

private static class DemoThread extends Thread {

    private Container variable;

    public void prepare() {
        this.variable = new Container();
    }

    public void run() {
        for(int j = 0; j < 10000000; j++) {
            variable.set(variable.getValue() + System.nanoTime());
        }
    }
}

During my test, I repeatedly create 4 DemoThreads, which are then started and joined. The only difference in each loop is the time when "prepare()" gets called (which is obviously required for the thread to run, as it otherwise would result in a NullPointerException):

DemoThread[] threads = new DemoThread[numberOfThreads];
    for(int j = 0; j < 100; j++) {
        boolean prepareAfterConstructor = j % 2 == 0;
        for(int i = 0; i < threads.length; i++) {
            threads[i] = new DemoThread();
            if(prepareAfterConstructor) threads[i].prepare();
        }

        for(int i = 0; i < threads.length; i++) {
            if(!prepareAfterConstructor) threads[i].prepare();
            threads[i].start();
        }
        joinThreads(threads);
    }

For some reason, if prepare() is executed immediately before starting the thread, it takes about twice as long to finish. Even without the "volatile" keyword, the performance difference was significant on at least two of the machines and OSes I tested the code on. Here's a short summary:


Mac OS Summary:

Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02-334
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Mac OS X
OS Arch: x86_64
OS Version: 10.6.5
Processors/Cores: 8

With volatile keyword:
Final results:
31979 ms. when prepare() was called after instantiation.
96482 ms. when prepare() was called before execution.

Without volatile keyword:
Final results:
26009 ms. when prepare() was called after instantiation.
35196 ms. when prepare() was called before execution.


Windows Summary:

Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Windows 7
OS Arch: amd64
OS Version: 6.1
Processors/Cores: 4

With volatile keyword:
Final results:
18120 ms. when prepare() was called after instantiation.
36089 ms. when prepare() was called before execution.

Without volatile keyword:
Final results:
10115 ms. when prepare() was called after instantiation.
10039 ms. when prepare() was called before execution.


Linux Summary:

Java Version: 1.6.0_20
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.0-b09
VM Name: OpenJDK 64-Bit Server VM
OS Name: Linux
OS Arch: amd64
OS Version: 2.6.32-28-generic
Processors/Cores: 4

With volatile keyword:
Final results:
45848 ms. when prepare() was called after instantiation.
110754 ms. when prepare() was called before execution.

Without volatile keyword:
Final results:
37862 ms. when prepare() was called after instantiation.
39357 ms. when prepare() was called before execution.


Mac OS Details (volatile):

Test 1, 4 threads, setting variable in creation loop
Thread-2 completed after 653 ms.
Thread-3 completed after 653 ms.
Thread-4 completed after 653 ms.
Thread-5 completed after 653 ms.
Overall time: 654 ms.

Test 2, 4 threads, setting variable in start loop
Thread-7 completed after 1588 ms.
Thread-6 completed after 1589 ms.
Thread-8 completed after 1593 ms.
Thread-9 completed after 1593 ms.
Overall time: 1594 ms.

Test 3, 4 threads, setting variable in creation loop
Thread-10 completed after 648 ms.
Thread-12 completed after 648 ms.
Thread-13 completed after 648 ms.
Thread-11 completed after 648 ms.
Overall time: 648 ms.

Test 4, 4 threads, setting variable in start loop
Thread-17 completed after 1353 ms.
Thread-16 completed after 1957 ms.
Thread-14 completed after 2170 ms.
Thread-15 completed after 2169 ms.
Overall time: 2172 ms.

(and so on; sometimes one or two of the threads in the 'slow' loop finish as expected, but most of the time they don't).

The given example admittedly looks contrived, as it is of no practical use and 'volatile' is not needed here. However, if you use a 'java.util.Random' instance instead of the 'Container' class and call, for instance, nextInt() multiple times, the same effect occurs: the thread executes fast if you create the object in the thread's constructor, but slow if you create it within the run() method. I believe that the performance issues described in Java Random Slowdowns on Mac OS more than a year ago are related to this effect, but I have no idea why it is so. Besides, I'm sure it shouldn't be like that, as it would mean that it is always dangerous to create a new object within the run method of a thread unless you know that no volatile variables are involved in the object graph. Profiling doesn't help, as the problem disappears under the profiler (the same observation as in Java Random Slowdowns on Mac OS cont'd), and it also does not happen on a single-core PC, so I'd guess it's some kind of thread synchronization problem... However, the strange thing is that there's actually nothing to synchronize, as all variables are thread-local.
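The java.util.Random variant described above could be sketched as follows. This is purely illustrative (the class name and the `finished` flag are not from the original post); the only difference between the fast and the slow case reported here is whether prepare() runs right after construction or right before start().

```java
import java.util.Random;

// Hypothetical sketch of the Random variant: same shape as DemoThread,
// but the hot object is a java.util.Random instead of a Container.
public class RandomVariant extends Thread {
    private Random random;
    public volatile boolean finished;

    // Called either right after construction (fast case in the post)
    // or immediately before start() (slow case in the post).
    public void prepare() {
        this.random = new Random();
    }

    @Override
    public void run() {
        long sum = 0;
        // java.util.Random stores its seed in an AtomicLong, so every
        // nextInt() call performs a compare-and-set on that field.
        for (int i = 0; i < 1_000_000; i++) {
            sum += random.nextInt();
        }
        finished = true;
    }
}
```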

Really looking forward to any hints. In case you want to confirm or falsify the problem, see the test case below.

Thanks,

Stephan

public class UnexpectedPerformanceIssue {

private static class Container  {

    // Remove the volatile keyword, and the problem disappears (on windows)
    // or gets smaller (on mac os)
    private volatile long value;

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}

private static class DemoThread extends Thread {

    private Container variable;

    public void prepare() {
        this.variable = new Container();
    }

    @Override
    public void run() {
        long start = System.nanoTime();
        for(int j = 0; j < 10000000; j++) {
            variable.set(variable.getValue() + System.nanoTime());
        }
        long end = System.nanoTime();
        System.out.println(this.getName() + " completed after "
                +  ((end - start)/1000000) + " ms.");
    }
}

public static void main(String[] args) {
    System.out.println("Java Version: " + System.getProperty("java.version"));
    System.out.println("Java Class Version: " + System.getProperty("java.class.version"));

    System.out.println("VM Vendor: " + System.getProperty("java.vm.specification.vendor"));
    System.out.println("VM Version: " + System.getProperty("java.vm.version"));
    System.out.println("VM Name: " + System.getProperty("java.vm.name"));

    System.out.println("OS Name: " + System.getProperty("os.name"));
    System.out.println("OS Arch: " + System.getProperty("os.arch"));
    System.out.println("OS Version: " + System.getProperty("os.version"));
    System.out.println("Processors/Cores: " + Runtime.getRuntime().availableProcessors());

    System.out.println();
    int numberOfThreads = 4;

    System.out.println("\nReference Test (single thread):");
    DemoThread t = new DemoThread();
    t.prepare();
    t.run();

    DemoThread[] threads = new DemoThread[numberOfThreads];
    long createTime = 0, startTime = 0;
    for(int j = 0; j < 100; j++) {
        boolean prepareAfterConstructor = j % 2 == 0;
        long overallStart = System.nanoTime();
        if(prepareAfterConstructor) {
            System.out.println("\nTest " + (j+1) + ", " + numberOfThreads + " threads, setting variable in creation loop");             
        } else {
            System.out.println("\nTest " + (j+1) + ", " + numberOfThreads + " threads, setting variable in start loop");
        }

        for(int i = 0; i < threads.length; i++) {
            threads[i] = new DemoThread();
            // Either call DemoThread.prepare() here (in odd loops)...
            if(prepareAfterConstructor) threads[i].prepare();
        }

        for(int i = 0; i < threads.length; i++) {
            // or here (in even loops). Should make no difference, but does!
            if(!prepareAfterConstructor) threads[i].prepare();
            threads[i].start();
        }
        joinThreads(threads);
        long overallEnd = System.nanoTime();
        long overallTime = (overallEnd - overallStart);
        if(prepareAfterConstructor) {
            createTime += overallTime;
        } else {
            startTime += overallTime;
        }
        System.out.println("Overall time: " + (overallTime)/1000000 + " ms.");
    }
    System.out.println("Final results:");
    System.out.println(createTime/1000000 + " ms. when prepare() was called after instantiation.");
    System.out.println(startTime/1000000 + " ms. when prepare() was called before execution.");
}

private static void joinThreads(Thread[] threads) {
    for(int i = 0; i < threads.length; i++) {
        try {
            threads[i].join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

}


Comments (4)

请你别敷衍 2024-11-07 20:11:17


Well, you're writing to a volatile variable, so I suspect that's forcing a memory barrier - undoing some optimization which can otherwise be achieved. The JVM doesn't know that that particular field isn't going to be observed on another thread.

EDIT: As noted, there are problems with the benchmark itself, such as printing while the timer is running. Also, it's usually a good idea to "warm up" the JIT before starting timing - otherwise you're measuring time which wouldn't be significant in a normal long-running process.
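The warm-up idea mentioned in the edit above could look something like this. It is a minimal sketch, not the poster's benchmark: the method names are illustrative, and a real micro-benchmark would also repeat the measured pass many times and keep printing out of the timed region.

```java
public class WarmupSketch {
    // The hot loop from the post, factored out so it can be run untimed first.
    static long workload() {
        long v = 0;
        for (int j = 0; j < 10_000_000; j++) {
            v += System.nanoTime();
        }
        return v;
    }

    public static void main(String[] args) {
        workload();  // warm-up pass: gives HotSpot a chance to compile workload()
        long start = System.nanoTime();
        workload();  // measured pass
        // Printing happens only after the timer has stopped.
        System.out.println("Measured run: "
                + (System.nanoTime() - start) / 1_000_000 + " ms.");
    }
}
```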

吲‖鸣 2024-11-07 20:11:17


I am not an expert in the internals of Java, but I read your question and find it fascinating. If I had to guess, I think what you have discovered:

  1. It does NOT have anything to do with the instantiation of the volatile property itself. However, from your data, where the property gets instantiated affects how expensive it is to read/write it.

  2. It does have to do with finding the reference to the volatile property at runtime. That is, I would be interested to see how the delay grows with more threads that loop more often. Is it the number of calls to the volatile property that causes the delay, or the addition itself, or the writing of the new value?

I would have to guess that there is probably a JVM optimization that attempts to quickly instantiate the property and later, if there is time, to alter the property in memory so it is easier to read/write. Maybe there is (1) a quick-to-create read/write queue for volatile properties and (2) a hard-to-create but quick-to-call queue, and the JVM begins with (1) and later migrates the volatile property to (2).

Perhaps if you prepare() right before the run() method gets called, the JVM does not have enough free cycles to optimize from (1) to (2).

The way to test this answer would be to:

prepare(), sleep(), then run(), and see if you get the same delay. If the sleep is the only thing that causes the optimization to take place, it could mean the JVM needs idle cycles to optimize the volatile property,

OR (a bit more risky) ...

prepare() and run() the threads, then later, in the middle of the loop, pause() or sleep() or somehow stop access to the property so that the JVM can attempt to move it to (2).

I'd be interested to see what you find out. Interesting question.
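The prepare()/sleep()/run() experiment proposed above could be sketched like this. This is a hypothetical harness, not code from the post: the 100 ms pause is arbitrary, the loop is shortened to keep the sketch quick, and Container/DemoThread are reproduced from the question for self-containment.

```java
public class SleepExperiment {
    static class Container {
        private volatile long value;
        long getValue() { return value; }
        void set(long v) { value = v; }
    }

    static class DemoThread extends Thread {
        private Container variable;
        void prepare() { variable = new Container(); }
        @Override public void run() {
            // 1,000,000 iterations instead of the post's 10,000,000,
            // just to keep the sketch quick to run.
            for (int j = 0; j < 1_000_000; j++) {
                variable.set(variable.getValue() + System.nanoTime());
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DemoThread[] threads = new DemoThread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new DemoThread();
            threads[i].prepare();            // allocate the Containers up front
        }
        Thread.sleep(100);                   // the proposed idle window for the JVM
        long start = System.nanoTime();
        for (DemoThread t : threads) t.start();
        for (DemoThread t : threads) t.join();
        System.out.println("Overall time: "
                + (System.nanoTime() - start) / 1_000_000 + " ms.");
    }
}
```

If the sleep alone made the slow case fast, that would support the "JVM needs idle cycles" hypothesis; if not, the allocation layout explanation below is the more likely cause.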

情痴 2024-11-07 20:11:17


Well, the big difference I see is in the order in which objects are allocated. When preparing after the constructor, your Container allocations are interleaved with your Thread allocations. When preparing prior to execution, your Threads are all allocated first, then your Containers are all allocated.

I don't know a whole lot about memory issues in multi-processor environments, but if I had to guess, maybe in the second case the Container allocations are more likely to be allocated in the same memory page, and perhaps the processors are slowed down due to contention for the same memory page?

[edit] Following this line of thought, I'd be interested to see what happens if you don't try to write back to the variable, and only read from it, in the Thread's run method. I would expect the timing difference to go away.

[edit2] See irreputable's answer; he explains it much better than I could

快乐很简单 2024-11-07 20:11:16


It's likely that the two volatile variables a and b are too close to each other and fall in the same cache line. Although CPU A only reads/writes variable a, and CPU B only reads/writes variable b, they are still coupled to each other through the same cache line. Such problems are called false sharing.

In your example, we have two allocation schemes:

new Thread                               new Thread
new Container               vs           new Thread
new Thread                               ....
new Container                            new Container
....                                     new Container

In the first scheme, it's very unlikely that two volatile variables are close to each other. In the second scheme, it's almost certainly the case.

CPU caches don't work with individual words; instead, they deal with cache lines. A cache line is a contiguous chunk of memory, say 64 neighboring bytes. Usually this is nice: if a CPU accesses a cell, it's very likely to access the neighboring cells too. In your example, however, that assumption is not only invalid but detrimental.

Suppose a and b fall in the same cache line L. When CPU A updates a, it notifies other CPUs that L is dirty. Since B caches L too, because it's working on b, B must drop its cached L. So next time B needs to read b, it must reload L, which is costly.

If B must access main memory to reload, that is extremely costly; it's usually about 100x slower.

Fortunately, A and B can communicate directly about the new values without going through main memory. Nevertheless it takes extra time.

To verify this theory, you can stuff an extra 128 bytes into Container, so that the volatile variables of two Container instances will not fall in the same cache line; then you should observe that the two schemes take about the same time to execute.

Lesson learned: CPUs usually assume that adjacent variables are related. If we want independent variables, we had better place them far away from each other.
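The padding experiment suggested above could be sketched like this. The class name and the padding-field names are illustrative; note that manual padding is only a best-effort trick, since a JVM is free to reorder fields (on Java 8+ the supported route is the @Contended annotation with -XX:-RestrictContended).

```java
// Sketch: surround the hot volatile long with enough dead space that two
// PaddedContainer instances are unlikely to share a 64-byte cache line.
public class PaddedContainer {
    private long p1, p2, p3, p4, p5, p6, p7, p8;   // ~64 bytes of padding before
    private volatile long value;
    private long q1, q2, q3, q4, q5, q6, q7, q8;   // ~64 bytes of padding after

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}
```

Swapping this in for Container in the benchmark should, per the theory above, make the "creation loop" and "start loop" timings converge.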
