C++ Socket Server - Unable to Saturate the CPU
I've developed a mini HTTP server in C++, using boost::asio, and now I'm load testing it with multiple clients, but I've been unable to get close to saturating the CPU. I'm testing on an Amazon EC2 instance and getting about 50% usage of one CPU, 20% of another, and the remaining two are idle (according to htop).
Details:
- The server fires up one thread per core
- Requests are received, parsed, processed, and responses are written out
- The requests are for data, which is read out of memory (read-only for this test)
- I'm 'loading' the server from two machines, each running a Java application with 25 threads that send requests
- I'm seeing about 230 requests/sec throughput (these are application requests, each of which is composed of many HTTP requests)
So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.
Ideas I've had:
- The requests are very small and often fulfilled within a few ms; I could modify the client to send/compose bigger requests (perhaps using batching)
- I could modify the HTTP server to use the Select design pattern, is this appropriate here?
- I could do some profiling to try to understand what the bottleneck(s) are
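For context, a minimal sketch of the thread-per-core setup described above, assuming a single shared io_service with one run() thread per core (the acceptor and request-handling code are omitted; this is only an assumed reconstruction of the described design, not the actual server):

```cpp
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main()
{
    boost::asio::io_service io_service;

    // Keep run() from returning while there is no pending work.
    boost::asio::io_service::work work(io_service);

    // ... set up the acceptor and start the first async_accept here ...

    // One worker thread per hardware core, all pulling completion
    // handlers from the same io_service.
    std::vector<std::thread> workers;
    unsigned cores = std::thread::hardware_concurrency();
    for (unsigned i = 0; i < cores; ++i)
        workers.emplace_back([&io_service] { io_service.run(); });

    for (auto& t : workers)
        t.join();
}
```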
6 Answers
230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation - get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of unneeded locking may get things up to speed.
This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?
ASIO is fine for small to medium tasks, but it isn't very good at leveraging the power of the underlying system. Neither are raw socket calls, or even IOCP on Windows, but if you are experienced you will always do better than ASIO. Either way there is a lot of overhead with all of those methods, just more with ASIO.
For what it is worth: using raw socket calls, my custom HTTP server can serve 800K dynamic requests per second on a 4-core i7. It serves from RAM, which is where you need to be for that level of performance. At this level of performance the network driver and the OS are consuming about 40% of the CPU. Using ASIO I can get around 50 to 100K requests per second; its performance is quite variable and mostly bound inside my app. The post by @cmeerw mostly explains why.
One way to improve performance is by implementing a UDP proxy. By intercepting HTTP requests and routing them over UDP to your backend UDP-HTTP server, you can bypass a lot of TCP overhead in the operating system stacks. You can also have front ends which pipe through on UDP themselves, which shouldn't be too hard to do yourself. An advantage of an HTTP-UDP proxy is that it allows you to use any good frontend without modification, and you can swap them out at will without any impact. You just need a couple more servers to implement it. This modification on my example lowered the OS CPU usage to 10%, which increased my requests per second to just over a million on that single backend. FWIW, you should always have a frontend-backend setup for any performant site, because the frontends can cache data without slowing down the more important dynamic-request backend.
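As a rough illustration of that HTTP-over-UDP relay idea only (not the poster's actual code: the backend address, the ports, the blocking single-connection handling, and the one-request-per-datagram assumption are all made up for the sketch), a frontend could relay bytes like this:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
    // TCP listener that the HTTP clients connect to.
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in front{};
    front.sin_family = AF_INET;
    front.sin_addr.s_addr = INADDR_ANY;
    front.sin_port = htons(8080);
    bind(listener, reinterpret_cast<sockaddr*>(&front), sizeof front);
    listen(listener, 128);

    // UDP socket used to talk to the backend HTTP-over-UDP server
    // (hypothetical address and port).
    int backend = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in back{};
    back.sin_family = AF_INET;
    inet_pton(AF_INET, "10.0.0.2", &back.sin_addr);
    back.sin_port = htons(9000);

    char buf[64 * 1024];
    for (;;) {
        int client = accept(listener, nullptr, nullptr);
        ssize_t n = recv(client, buf, sizeof buf, 0);           // request in
        if (n > 0) {
            sendto(backend, buf, n, 0,
                   reinterpret_cast<sockaddr*>(&back), sizeof back);
            ssize_t m = recv(backend, buf, sizeof buf, 0);      // backend reply
            if (m > 0)
                send(client, buf, m, 0);                        // response out
        }
        close(client);
    }
}
```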
The future seems to be writing your own driver that implements its own network stack, so you can get as close to the requests as possible and implement your own protocol there. That probably isn't what most programmers want to hear, as it is more complicated. In my case I would be able to use 40% more CPU and move to over 1 million dynamic requests per second. The UDP proxy method can get you close to optimal performance without needing to do this, but you will need more servers - though if you are doing this many requests per second you will usually need multiple network cards and multiple frontends to handle the bandwidth anyway, so having a couple of lightweight UDP proxies in there isn't that big a deal.
Hope some of this can be useful to you.
How many instances of io_service do you have? Boost.Asio has an example that creates one io_service per CPU and uses them in a round-robin manner.
You can still create four threads and assign one per CPU, but each thread can poll on its own io_service.
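For reference, a rough sketch of that io_service-per-CPU idea (loosely modelled on the Boost.Asio "HTTP Server 2" example; the class and member names here are just illustrative):

```cpp
#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

// A pool of io_services, one per core, handed out round-robin.
class io_service_pool
{
public:
    explicit io_service_pool(std::size_t size)
    {
        for (std::size_t i = 0; i < size; ++i) {
            auto ios = std::make_shared<boost::asio::io_service>();
            io_services_.push_back(ios);
            // Keep each io_service's run() loop alive.
            work_.push_back(
                std::make_shared<boost::asio::io_service::work>(*ios));
        }
    }

    // The acceptor asks for the io_service to use for the next connection.
    boost::asio::io_service& get_io_service()
    {
        boost::asio::io_service& ios = *io_services_[next_];
        next_ = (next_ + 1) % io_services_.size();
        return ios;
    }

    // Run each io_service on its own dedicated thread.
    void run()
    {
        std::vector<std::thread> threads;
        for (auto& ios : io_services_)
            threads.emplace_back([ios] { ios->run(); });
        for (auto& t : threads)
            t.join();
    }

private:
    std::vector<std::shared_ptr<boost::asio::io_service>> io_services_;
    std::vector<std::shared_ptr<boost::asio::io_service::work>> work_;
    std::size_t next_ = 0;
};
```

Each accepted socket is then constructed from get_io_service(), so connections are spread across the per-core event loops and each loop only ever runs on its own thread.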
boost::asio is not as thread-friendly as you would hope - there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp which means that only one thread can call into the kernel's epoll syscall at a time. And for very small requests this makes all the difference (meaning you will only see roughly single-threaded performance).
Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
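To make the edge-triggered point concrete, here is a deliberately minimal sketch (not production code) of several threads blocking on one shared epoll instance; the hard parts the answer alludes to - draining sockets until EAGAIN, re-arming, and fd lifetime - are only hinted at in comments:

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <thread>
#include <vector>

void worker(int epfd)
{
    epoll_event events[64];
    for (;;) {
        // With edge-triggered (EPOLLET) registrations the kernel wakes only
        // one of the threads blocked in epoll_wait() for a given event.
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            // ... read from fd until EAGAIN, process the request(s),
            //     and write the response(s) back ...
            (void)fd;
        }
    }
}

int main()
{
    int epfd = epoll_create1(0);

    // Non-blocking sockets would be registered roughly like this:
    //   epoll_event ev{};
    //   ev.events = EPOLLIN | EPOLLET;   // possibly EPOLLONESHOT as well
    //   ev.data.fd = client_fd;
    //   epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, epfd);
    for (auto& t : threads)
        t.join();

    close(epfd);
}
```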
BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.
As you are using EC2, all bets are off.
Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.
I have not yet worked out what EC2 is useful for; if someone finds out, please let me know.
From your comments on network utilization, you do not seem to have much network movement: 3 + 2.5 MiB/sec is in the 50 Mbps ball-park (compared to your 1 Gbps port), so I'd say you are hitting one of two problems. Looking at cmeerw's notes and your CPU utilization figures (idling at 50% + 20% + 0% + 0%), it seems most likely to be a limitation in your server implementation.
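(For concreteness, the arithmetic behind that ball-park: 3 + 2.5 = 5.5 MiB/s, and 5.5 MiB/s × 8 ≈ 46 Mbit/s, i.e. roughly 5% of a 1 Gbps link, so the network itself is nowhere near saturated.)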
I second cmeerw's answer (+1).