C++ Socket Server - Unable to Saturate the CPU
I've developed a mini HTTP server in C++, using boost::asio, and now I'm load testing it with multiple clients, but I've been unable to get close to saturating the CPU. I'm testing on an Amazon EC2 instance and getting about 50% usage of one CPU, 20% of another, and the remaining two are idle (according to htop).
Details:
- The server fires up one thread per core
- Requests are received, parsed, processed, and responses are written out
- The requests are for data, which is read out of memory (read-only for this test)
- I'm 'loading' the server from two machines, each running a Java application with 25 threads that send requests
- I'm seeing about 230 requests/sec throughput (these are application requests, each of which is composed of many HTTP requests)
So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.
Ideas I've had:
- The requests are very small and often fulfilled within a few ms; I could modify the client to send/compose bigger requests (perhaps using batching)
- I could modify the HTTP server to use the Select design pattern, is this appropriate here?
- I could do some profiling to try to understand what the bottleneck(s) are
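For context, a minimal sketch of the thread-per-core setup described above, assuming a single shared io_service with one run() thread per core (the acceptor and request-handling code are omitted; this is only an assumed reconstruction of the described design, not the actual server):

```cpp
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main()
{
    boost::asio::io_service io_service;

    // Keep run() from returning while there is no pending work.
    boost::asio::io_service::work work(io_service);

    // ... set up the acceptor and start the first async_accept here ...

    // One worker thread per hardware core, all pulling completion
    // handlers from the same io_service.
    std::vector<std::thread> workers;
    unsigned cores = std::thread::hardware_concurrency();
    for (unsigned i = 0; i < cores; ++i)
        workers.emplace_back([&io_service] { io_service.run(); });

    for (auto& t : workers)
        t.join();
}
```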
6 Answers
230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation - get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of unneeded locking may get things up to speed.
This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?
ASIO is fine for small to medium tasks, but it isn't very good at leveraging the power of the underlying system. Neither are raw socket calls, or even IOCP on Windows, but if you are experienced you will always do better than ASIO. Either way there is a lot of overhead with all of those methods, just more with ASIO.
For what it is worth: using raw socket calls, my custom HTTP server can serve 800K dynamic requests per second on a 4-core i7. It serves from RAM, which is where you need to be for that level of performance. At this level of performance the network driver and the OS are consuming about 40% of the CPU. Using ASIO I can get around 50 to 100K requests per second; its performance is quite variable and mostly bound inside my app. The post by @cmeerw mostly explains why.
One way to improve performance is by implementing a UDP proxy. By intercepting HTTP requests and routing them over UDP to your backend UDP-HTTP server, you can bypass a lot of TCP overhead in the operating system stacks. You can also have front ends which pipe through on UDP themselves, which shouldn't be too hard to do yourself. An advantage of an HTTP-UDP proxy is that it allows you to use any good frontend without modification, and you can swap them out at will without any impact. You just need a couple more servers to implement it. This modification on my example lowered the OS CPU usage to 10%, which increased my requests per second to just over a million on that single backend. FWIW, you should always have a frontend-backend setup for any performant site, because the frontends can cache data without slowing down the more important dynamic-request backend.
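As a rough illustration of that HTTP-over-UDP relay idea only (not the poster's actual code: the backend address, the ports, the blocking single-connection handling, and the one-request-per-datagram assumption are all made up for the sketch), a frontend could relay bytes like this:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
    // TCP listener that the HTTP clients connect to.
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in front{};
    front.sin_family = AF_INET;
    front.sin_addr.s_addr = INADDR_ANY;
    front.sin_port = htons(8080);
    bind(listener, reinterpret_cast<sockaddr*>(&front), sizeof front);
    listen(listener, 128);

    // UDP socket used to talk to the backend HTTP-over-UDP server
    // (hypothetical address and port).
    int backend = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in back{};
    back.sin_family = AF_INET;
    inet_pton(AF_INET, "10.0.0.2", &back.sin_addr);
    back.sin_port = htons(9000);

    char buf[64 * 1024];
    for (;;) {
        int client = accept(listener, nullptr, nullptr);
        ssize_t n = recv(client, buf, sizeof buf, 0);           // request in
        if (n > 0) {
            sendto(backend, buf, n, 0,
                   reinterpret_cast<sockaddr*>(&back), sizeof back);
            ssize_t m = recv(backend, buf, sizeof buf, 0);      // backend reply
            if (m > 0)
                send(client, buf, m, 0);                        // response out
        }
        close(client);
    }
}
```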
The future seems to be writing your own driver that implements its own network stack, so you can get as close to the requests as possible and implement your own protocol there. That probably isn't what most programmers want to hear, as it is more complicated. In my case I would be able to use 40% more CPU and move to over 1 million dynamic requests per second. The UDP proxy method can get you close to optimal performance without needing to do this, but you will need more servers - though if you are doing this many requests per second you will usually need multiple network cards and multiple frontends to handle the bandwidth anyway, so having a couple of lightweight UDP proxies in there isn't that big a deal.
Hope some of this can be useful to you.
How many instances of io_service do you have? Boost.Asio has an example that creates one io_service per CPU and uses them in a round-robin manner.
You can still create four threads and assign one per CPU, but each thread can poll on its own io_service.
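For reference, a rough sketch of that io_service-per-CPU idea (loosely modelled on the Boost.Asio "HTTP Server 2" example; the class and member names here are just illustrative):

```cpp
#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

// A pool of io_services, one per core, handed out round-robin.
class io_service_pool
{
public:
    explicit io_service_pool(std::size_t size)
    {
        for (std::size_t i = 0; i < size; ++i) {
            auto ios = std::make_shared<boost::asio::io_service>();
            io_services_.push_back(ios);
            // Keep each io_service's run() loop alive.
            work_.push_back(
                std::make_shared<boost::asio::io_service::work>(*ios));
        }
    }

    // The acceptor asks for the io_service to use for the next connection.
    boost::asio::io_service& get_io_service()
    {
        boost::asio::io_service& ios = *io_services_[next_];
        next_ = (next_ + 1) % io_services_.size();
        return ios;
    }

    // Run each io_service on its own dedicated thread.
    void run()
    {
        std::vector<std::thread> threads;
        for (auto& ios : io_services_)
            threads.emplace_back([ios] { ios->run(); });
        for (auto& t : threads)
            t.join();
    }

private:
    std::vector<std::shared_ptr<boost::asio::io_service>> io_services_;
    std::vector<std::shared_ptr<boost::asio::io_service::work>> work_;
    std::size_t next_ = 0;
};
```

Each accepted socket is then constructed from get_io_service(), so connections are spread across the per-core event loops and each loop only ever runs on its own thread.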
boost::asio is not as thread-friendly as you would hope - there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp which means that only one thread can call into the kernel's epoll syscall at a time. And for very small requests this makes all the difference (meaning you will only see roughly single-threaded performance).
Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
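To make the edge-triggered point concrete, here is a deliberately minimal sketch (not production code) of several threads blocking on one shared epoll instance; the hard parts the answer alludes to - draining sockets until EAGAIN, re-arming, and fd lifetime - are only hinted at in comments:

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <thread>
#include <vector>

void worker(int epfd)
{
    epoll_event events[64];
    for (;;) {
        // With edge-triggered (EPOLLET) registrations the kernel wakes only
        // one of the threads blocked in epoll_wait() for a given event.
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            // ... read from fd until EAGAIN, process the request(s),
            //     and write the response(s) back ...
            (void)fd;
        }
    }
}

int main()
{
    int epfd = epoll_create1(0);

    // Non-blocking sockets would be registered roughly like this:
    //   epoll_event ev{};
    //   ev.events = EPOLLIN | EPOLLET;   // possibly EPOLLONESHOT as well
    //   ev.data.fd = client_fd;
    //   epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, epfd);
    for (auto& t : threads)
        t.join();

    close(epfd);
}
```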
BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.
As you are using EC2, all bets are off.
Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.
I have not yet worked out what EC2 is useful for; if someone finds out, please let me know.
From your comments on network utilization, you do not seem to have much network movement: 3 + 2.5 MiB/sec is in the 50 Mbps ball-park (compared to your 1 Gbps port), so I'd say you are hitting one of two problems. Looking at cmeerw's notes and your CPU utilization figures (idling at 50% + 20% + 0% + 0%), it seems most likely to be a limitation in your server implementation.
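(For concreteness, the arithmetic behind that ball-park: 3 + 2.5 = 5.5 MiB/s, and 5.5 MiB/s × 8 ≈ 46 Mbit/s, i.e. roughly 5% of a 1 Gbps link, so the network itself is nowhere near saturated.)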
I second cmeerw's answer (+1).