Minimizing Java Thread Context Switching Overhead

Posted 2024-09-03 15:05:00

I have a Java application running on a Sun 1.6 32-bit VM / Solaris 10 (x86) / Nehalem 8-core (2 threads per core).

A specific use case in the application is to respond to some external message. In my performance test environment, when I prepare and send the response in the same thread that receives the external input, I get about a 50 us advantage over handing the message off to a separate thread to send the response. I use a ThreadPoolExecutor with a SynchronousQueue to do the handoff.

In your experience, what is an acceptable expected delay between scheduling a task to a thread pool and the task getting picked up for execution? What ideas have worked for you in the past to improve this?
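
For reference, a minimal sketch of the handoff described above: a ThreadPoolExecutor fed by a SynchronousQueue, with a rough timestamp taken around the submission. The class name and timing scaffolding are illustrative only, the sketch targets a modern JDK (lambdas) rather than the 1.6 VM in the question, and a single nanoTime pair is nowhere near benchmark quality:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class HandoffSketch {
    public static void main(String[] args) throws InterruptedException {
        // With a SynchronousQueue there is no buffering: execute() either
        // hands the task straight to a waiting worker, or, if none is free
        // and the pool is already at maximumPoolSize, rejects it.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>());
        pool.prestartAllCoreThreads();   // avoid measuring thread creation

        long submitted = System.nanoTime();
        pool.execute(() -> {
            long handoffUs = (System.nanoTime() - submitted) / 1_000;
            System.out.println("handoff latency ~" + handoffUs + " us");
            // ... prepare and send the response here ...
        });

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```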

Comments (4)

一直在等你来 2024-09-10 15:05:00

The "acceptable delay" entirely depends on your application. Dealing with everything on the same thread can indeed help if you've got very strict latency requirements. Fortunately most applications don't have requirements quite that strict.

Of course, if only one thread is able to receive requests, then tying up that thread for computing the response will mean you can't accept any other requests. Depending on what you're doing you can use asynchronous IO (etc) to avoid the "thread per request" model, but it's significantly harder IMO, and still ends up with thread context switching.

Sometimes it's appropriate to queue requests to avoid having too many threads processing them: if your handling is CPU-bound, it doesn't make much sense to have hundreds of threads - better to have a producer/consumer queue of tasks and distribute them at roughly one thread per core. That's basically what ThreadPoolExecutor will do if you set it up properly of course. That doesn't work as well if your requests spend a lot of their time waiting for external services (including disks, but primarily other network services)... at that point you either need to use asynchronous execution models whenever you would potentially make a core idle with a blocking call, or you take the thread context switching hit and have lots of threads, relying on the thread scheduler to make it work well enough.
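
A minimal sketch of the producer/consumer setup described above, assuming purely CPU-bound tasks: Executors.newFixedThreadPool backs the pool with an unbounded LinkedBlockingQueue, so bursts of requests queue up instead of spawning extra threads. The class name and the busy-loop "work" are placeholders:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CpuBoundPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        // Roughly one thread per hardware thread for CPU-bound handling.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < 1_000; i++) {
            final int request = i;
            pool.execute(() -> {
                long acc = 0;                           // stand-in for real CPU work
                for (int j = 0; j < 100_000; j++) acc += (long) j * request;
                if (acc == -1) System.out.println();    // keep the JIT honest
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```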

The bottom line is that latency requirements can be tough - in my experience they're significantly tougher than throughput requirements, as they're much harder to scale out. It really does depend on the context though.

椒妓 2024-09-10 15:05:00

50 us sounds somewhat high for a handoff. In my experience (Solaris 10/Opteron), LinkedBlockingQueue (LBQ) is typically in the 30-35 us range, while LinkedTransferQueue (LTQ) is about 5 us faster than that. As stated in the other replies, SynchronousQueue may tend to be slightly slower because the offer doesn't return until the other thread has taken the item.

According to my results, Solaris 10 is markedly slower at this than Linux, which sees times under 10 us.

It really depends on a few things; under peak load:

  • how many requests per second are you servicing?
  • how long does it typically take to process a request?

If you know the answers to those questions, then it should be fairly clear, on performance grounds, whether you should handle the request in the receiving thread or hand it off to a processing thread.
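
For what it's worth, here is a rough single-producer/single-consumer harness for comparing handoff latency across queue implementations (LinkedTransferQueue has been in java.util.concurrent since Java 7). This is a sketch, not a proper benchmark: the results are dominated by scheduler wake-ups and JIT state, and the class and method names are made up for illustration:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.atomic.AtomicLong;

public class HandoffLatencyProbe {
    static double avgMicros(BlockingQueue<Long> q, int iters) throws InterruptedException {
        final AtomicLong totalNanos = new AtomicLong();
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < iters; i++) {
                    long sentAt = q.take();                      // producer's timestamp
                    totalNanos.addAndGet(System.nanoTime() - sentAt);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        for (int i = 0; i < iters; i++) {
            q.put(System.nanoTime());                            // blocks on SynchronousQueue
        }
        consumer.join();
        return totalNanos.get() / (iters * 1_000.0);
    }

    public static void main(String[] args) throws InterruptedException {
        int iters = 200_000;
        avgMicros(new SynchronousQueue<Long>(), iters);          // warm-up pass
        System.out.printf("SynchronousQueue    ~%.1f us%n",
                avgMicros(new SynchronousQueue<Long>(), iters));
        System.out.printf("LinkedBlockingQueue ~%.1f us%n",
                avgMicros(new LinkedBlockingQueue<Long>(), iters));
        System.out.printf("LinkedTransferQueue ~%.1f us%n",
                avgMicros(new LinkedTransferQueue<Long>(), iters));
    }
}
```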

淡笑忘祈一世凡恋 2024-09-10 15:05:00

Is there a reason why you don't use a LinkedBlockingQueue, so your producer can queue up a couple of items, instead of a SynchronousQueue? At the very least, have a queue with a capacity of 1 so you can get better parallelism.

What is the cost of the "prepare" step versus the "response" step? If the responses are too expensive, can you use a thread pool to have multiple threads handle them?
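
A sketch of that suggestion, assuming a pool of two workers: with a LinkedBlockingQueue of capacity 1, execute() returns as soon as the task is enqueued (whenever the slot is free), rather than rendezvousing with a worker the way a SynchronousQueue handoff does. Names and pool sizes here are illustrative:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedHandoffSketch {
    public static void main(String[] args) {
        // Capacity-1 queue: the producer drops the task in and carries on;
        // a worker picks it up moments later. If the slot is occupied and
        // both workers are busy, execute() falls back to rejection.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>(1));
        pool.prestartAllCoreThreads();

        pool.execute(() -> System.out.println(
                "response sent by " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```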

錯遇了你 2024-09-10 15:05:00

Not the same task, but "yes" - a queue is too general-purpose a mechanism for time-critical tasks. We concentrated on avoiding synchronization in event handling altogether. Consider the following hints:

  • Don't use synchronized containers (arrays, lists, maps...). Think about container-per-thread.
  • We used a round-robin pool of threads. The pool consists of pre-allocated threads, and (!) exactly one of them listens for an event to appear, without any queue. When an event is raised, that thread is removed from the round-robin and another one becomes the listener. Once handling is complete, the thread returns to the pool. A sketch of this arrangement follows.
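
What this answer describes is essentially the leader/follower pattern. Below is a rough, hypothetical sketch of one reading of it: a fixed set of pre-allocated threads in which exactly one thread at a time blocks on the event source, hands leadership to a follower before processing, and so never goes through a queue. The awaitEvent() stand-in and all names here are assumptions, not the original poster's code:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LeaderFollowerSketch {
    private static final ReentrantLock leaderLock = new ReentrantLock();
    private static final AtomicInteger eventIds = new AtomicInteger();

    // Hypothetical stand-in for a real blocking event source
    // (e.g. a socket read); only the current leader calls this.
    static int awaitEvent() throws InterruptedException {
        Thread.sleep(5);
        return eventIds.incrementAndGet();
    }

    static void process(int event) {
        System.out.println(Thread.currentThread().getName() + " handled event " + event);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 4; i++) {                     // pre-allocated thread pool
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        leaderLock.lockInterruptibly();   // exactly one listener at a time
                        int event;
                        try {
                            event = awaitEvent();         // leader blocks here; no queue
                        } finally {
                            leaderLock.unlock();          // a follower becomes the leader
                        }
                        process(event);                   // handle outside the lock
                    }
                } catch (InterruptedException e) {
                    // fall through and let the worker exit
                }
            }, "worker-" + i);
            worker.setDaemon(true);
            worker.start();
        }
        Thread.sleep(100);                                // let a few events flow, then exit
    }
}
```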