Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()?

Posted 2024-09-18


I'm working on some SocketChannel-to-SocketChannel code which will do best with a direct byte buffer--long lived and large (tens to hundreds of megabytes per connection.) While hashing out the exact loop structure with FileChannels, I ran some micro-benchmarks on ByteBuffer.allocate() vs. ByteBuffer.allocateDirect() performance.

There was a surprise in the results that I can't really explain. In the graph below, there is a very pronounced cliff at 256KB and 512KB for the ByteBuffer.allocate() transfer implementation--the performance drops by ~50%! There also seems to be a smaller performance cliff for ByteBuffer.allocateDirect(). (The %-gain series helps to visualize these changes.)

Buffer Size (bytes) versus Time (MS)

The Pony Gap

Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()? What exactly is going on behind the curtain?

It may very well be hardware- and OS-dependent, so here are those details:

  • MacBook Pro w/ Dual-core Core 2 CPU
  • Intel X25M SSD drive
  • OSX 10.6.4

Source code, by request:

package ch.dietpizza.bench;

import static java.lang.String.format;
import static java.lang.System.out;
import static java.nio.ByteBuffer.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class SocketChannelByteBufferExample {
    private static WritableByteChannel target;
    private static ReadableByteChannel source;
    private static ByteBuffer          buffer;

    public static void main(String[] args) throws IOException, InterruptedException {
        long timeDirect;
        long normal;
        out.println("start");

        for (int i = 512; i <= 1024 * 1024 * 64; i *= 2) {
            buffer = allocateDirect(i);
            timeDirect = copyShortest();

            buffer = allocate(i);
            normal = copyShortest();

            out.println(format("%d, %d, %d", i, normal, timeDirect));
        }

        out.println("stop");
    }

    private static long copyShortest() throws IOException, InterruptedException {
        int result = 0;
        for (int i = 0; i < 100; i++) {
            int single = copyOnce();
            result = (i == 0) ? single : Math.min(result, single);
        }
        return result;
    }


    private static int copyOnce() throws IOException, InterruptedException {
        initialize();

        long start = System.currentTimeMillis();

        while (source.read(buffer) != -1) {
            buffer.flip();  
            target.write(buffer);
            buffer.clear();  //pos = 0, limit = capacity
        }

        long time = System.currentTimeMillis() - start;

        rest();

        return (int)time;
    }   


    private static void initialize() throws UnknownHostException, IOException {
        InputStream  is = new FileInputStream(new File("/Users/stu/temp/robyn.in"));//315 MB file
        OutputStream os = new FileOutputStream(new File("/dev/null"));

        target = Channels.newChannel(os);
        source = Channels.newChannel(is);
    }

    private static void rest() throws InterruptedException {
        System.gc();
        Thread.sleep(200);      
    }
}


Comments (4)

无法言说的痛 2024-09-25 14:39:59


How ByteBuffer works, and why direct (Byte)Buffers are the only truly useful ones now.

First, I'm a bit surprised this isn't common knowledge, but bear with me.

Direct byte buffers allocate an address outside the Java heap.

This is of utmost importance: all OS (and native C) functions can use that address without locking the object on the heap and copying the data. A short example on copying: in order to send any data via Socket.getOutputStream().write(byte[]), the native code has to "lock" the byte[], copy it outside the Java heap, and then call the OS function, e.g. send. The copy is performed either on the stack (for smaller byte[]s) or via malloc/free for larger ones.
DatagramSockets are no different: they also copy, except they are limited to 64KB and allocated on the stack, which can even kill the process if the thread stack is not large enough or the code is deep in recursion.
Note: locking prevents the JVM/GC from moving/reallocating the object around the heap.

So with the introduction of NIO, the idea was to avoid the copy and the multitudes of stream pipelining/indirection; often there are 3-4 buffered stream types before the data reaches its destination. (yay, Poland equalizes(!) with a beautiful shot)
By introducing direct buffers, Java can communicate straight with native C code without any locking/copying necessary. Hence the send function can take the address of the buffer plus the position, and the performance is much the same as native C.
That's about it for direct buffers.

The main issue with direct buffers: they are expensive to allocate, expensive to deallocate, and quite cumbersome to use, nothing like byte[].

Non-direct buffers do not offer the true essence direct buffers do - i.e. a direct bridge to the native code/OS. Instead, they are lightweight and share exactly the same API - and even more, they can wrap a byte[], and their backing array is available for direct manipulation - what's not to love? Well, they have to be copied!

So how does Sun/Oracle handle non-direct buffers, since the OS/native code can't use them? Well, naively. When a non-direct buffer is used, a direct counterpart has to be created. The implementation is smart enough to use a ThreadLocal and cache a few direct buffers via SoftReference* to avoid the hefty cost of creation. The naive part comes when copying them: it attempts to copy the entire remaining() portion of the buffer each time.
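A simplified sketch of that per-thread caching scheme (the real logic lives inside the JDK's internal classes; the class and method names below are ours, chosen for illustration):

```java
import java.lang.ref.SoftReference;
import java.nio.ByteBuffer;

// Simplified sketch of how a temporary direct buffer can be cached per
// thread via ThreadLocal + SoftReference, so the GC may reclaim it
// under memory pressure. Not the actual JDK implementation.
public class TempDirectBufferCache {
    private static final ThreadLocal<SoftReference<ByteBuffer>> CACHE =
            ThreadLocal.withInitial(() -> new SoftReference<>(null));

    static ByteBuffer getTemporaryDirectBuffer(int size) {
        ByteBuffer cached = CACHE.get().get();
        if (cached != null && cached.capacity() >= size) {
            cached.clear();
            return cached;                         // reuse, no allocation
        }
        ByteBuffer fresh = ByteBuffer.allocateDirect(size);
        CACHE.set(new SoftReference<>(fresh));     // may be dropped by the GC
        return fresh;
    }

    public static void main(String[] args) {
        ByteBuffer a = getTemporaryDirectBuffer(1024);
        ByteBuffer b = getTemporaryDirectBuffer(512);
        // The larger cached buffer is reused for the smaller request.
        System.out.println(a == b);
    }
}
```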

Now imagine: a 512 KB non-direct buffer going to a 64 KB socket buffer; the socket buffer won't take more than its size. So the first time, 512 KB is copied from non-direct to thread-local direct, but only 64 KB of it is used. The next time, 512-64 KB is copied but only 64 KB used; the third time, 512-64*2 KB is copied but only 64 KB used, and so on... and that's optimistic, assuming the socket buffer always empties entirely. So you are copying not just n KB in total, but n × n ÷ m (n = 512, m = 16, the average space the socket buffer has left).
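The quadratic blow-up described above can be checked with a little arithmetic. This sketch assumes, for illustration, that the socket drains a fixed 64 KB per write (the optimistic case from the paragraph):

```java
// Sketch of the re-copy cost for a non-direct buffer: every write copies
// the whole remaining() span into the temporary direct buffer, but the
// socket only drains a fixed chunk of it.
public class CopyCostDemo {
    // Total KB memcpy'd when writing an n-KB non-direct buffer,
    // with m KB accepted by the socket per write.
    static int totalCopied(int n, int m) {
        int copied = 0;
        for (int remaining = n; remaining > 0; remaining -= m) {
            copied += remaining; // the entire remaining span is re-copied
        }
        return copied;
    }

    public static void main(String[] args) {
        int n = 512, m = 64;
        // 512 + 448 + ... + 64 = 2304 KB copied to move a 512 KB payload.
        System.out.println("payload: " + n + " KB, memcpy'd: "
                + totalCopied(n, m) + " KB");
    }
}
```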

The copying part is a common/abstract path for all non-direct buffers, so the implementation never knows the target capacity. Copying trashes the caches and so on, reduces memory bandwidth, etc.

*A note on SoftReference caching: it depends on the GC implementation, and the experience can vary. Sun's GC uses the free heap memory to determine the lifespan of SoftReferences, which leads to some awkward behavior when they are freed: the application needs to allocate the previously cached objects again - i.e. more allocation. (Direct ByteBuffers take up a minor part of the heap, so at least they do not add extra cache trashing, but they get affected by it instead.)

My rule of thumb: a pooled direct buffer sized to match the socket read/write buffer. The OS never copies more than necessary.
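That rule of thumb can be sketched as a minimal fixed-size pool (the class and the 64 KB default below are illustrative assumptions, not an existing API):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Minimal direct-buffer pool sketch: one fixed size, matched to the
// socket send/receive buffer, so the allocation cost is paid once.
public class DirectBufferPool {
    private final int bufferSize;
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();

    public DirectBufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    public synchronized ByteBuffer acquire() {
        ByteBuffer b = free.poll();
        return (b != null) ? b : ByteBuffer.allocateDirect(bufferSize);
    }

    public synchronized void release(ByteBuffer b) {
        b.clear();           // reset position/limit before reuse
        free.push(b);
    }

    public static void main(String[] args) {
        DirectBufferPool pool = new DirectBufferPool(64 * 1024);
        ByteBuffer b = pool.acquire();
        pool.release(b);
        // Re-acquiring returns the cached buffer, not a fresh allocation.
        System.out.println(pool.acquire() == b);
    }
}
```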

This micro-benchmark is mostly a memory throughput test; the OS will have the file entirely in cache, so it mostly tests memcpy. Once the buffers run out of the L2 cache, the drop in performance becomes noticeable. Also, running the benchmark like this imposes increasing, accumulating GC collection costs. (rest() will not collect the soft-referenced ByteBuffers.)

彼岸花似海 2024-09-25 14:39:59


Thread Local Allocation Buffers (TLAB)

I wonder if the thread local allocation buffer (TLAB) during the test is around 256K. Use of TLABs optimizes allocations from the heap so that the non-direct allocations of <=256K are fast.

What is commonly done is to give each thread a buffer that is used exclusively by that thread to do allocations. You have to use some synchronization to allocate the buffer from the heap, but after that the thread can allocate from the buffer without synchronization. In the hotspot JVM we refer to these as thread local allocation buffers (TLAB's). They work well.

Large allocations bypassing the TLAB

If my hypothesis about a 256K TLAB is correct, then information later in the article suggests that perhaps the >256K allocations for the larger non-direct buffers bypass the TLAB. These allocations go straight to the heap, requiring thread synchronization and thus incurring the performance hits.

An allocation that can not be made from a TLAB does not always mean that the thread has to get a new TLAB. Depending on the size of the allocation and the unused space remaining in the TLAB, the VM could decide to just do the allocation from the heap. That allocation from the heap would require synchronization but so would getting a new TLAB. If the allocation was considered large (some significant fraction of the current TLAB size), the allocation would always be done out of the heap. This cut down on wastage and gracefully handled the much-larger-than-average allocation.

Tweaking the TLAB parameters

This hypothesis could be tested using information from a later article indicating how to tweak the TLAB and get diagnostic info:

To experiment with a specific TLAB size, two -XX flags need
to be set, one to define the initial size, and one to disable
the resizing:

-XX:TLABSize= -XX:-ResizeTLAB

The minimum size of a tlab is set with -XX:MinTLABSize which
defaults to 2K bytes. The maximum size is the maximum size
of an integer Java array, which is used to fill the unallocated
portion of a TLAB when a GC scavenge occurs.

Diagnostic Printing Options

-XX:+PrintTLAB

Prints at each scavenge one line for each thread (starts with "TLAB: gc thread: " without the "'s) and one summary line.

昔梦 2024-09-25 14:39:59


I suspect that these knees are due to tripping across a CPU cache boundary. The "non-direct" buffer read()/write() implementation "cache misses" earlier, due to the additional memory-buffer copy, compared to the "direct" buffer read()/write() implementation.
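Under that hypothesis, a crude way to see the boundary is to time plain array copies at growing sizes and watch throughput fall once the working set leaves cache (results are machine-dependent; this is a sketch, not a rigorous benchmark):

```java
import java.util.Arrays;

// Rough sketch of the extra-copy hypothesis: time a bulk array copy at
// increasing sizes. Throughput typically drops once src + dst no longer
// fit in the CPU cache. Numbers will vary per machine.
public class CopyBandwidth {
    static long copyNanos(byte[] src, byte[] dst, int reps) {
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            System.arraycopy(src, 0, dst, 0, src.length);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        for (int size = 64 * 1024; size <= 8 * 1024 * 1024; size *= 2) {
            byte[] src = new byte[size];
            byte[] dst = new byte[size];
            Arrays.fill(src, (byte) 1);
            long ns = copyNanos(src, dst, 16);
            System.out.printf("%7d KB: %.2f GB/s%n",
                    size / 1024, (16.0 * size) / ns);
        }
    }
}
```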

木森分化 2024-09-25 14:39:59


There are many reasons why this could happen. Without code and/or more details about the data, we can only guess what is happening.

Some Guesses:

  • Maybe you hit the maximum number of bytes that can be read at a time, so I/O waits get longer or memory consumption rises without a corresponding decrease in loop iterations.
  • Maybe you hit a critical memory limit, or the JVM is trying to free memory before a new allocation. Try playing around with the -Xmx and -Xms parameters.
  • Maybe HotSpot can't/won't optimize, because the number of calls to some methods is too low.
  • Maybe there are OS or hardware conditions that cause this kind of delay.
  • Maybe the implementation of the JVM is just buggy ;-)