Unexpected socket CPU usage
I'm having a performance issue that I don't understand. The system I'm working on has two threads that look something like this:
Version A:
- Thread 1: Data Processing -> Data Selection -> Data Formatting -> FIFO
- Thread 2: FIFO -> Socket
Where 'Selection' thins down the data and the FIFO at the end of thread 1 is the FIFO at the beginning of thread 2 (the FIFOs are actually TBB Concurrent Queues). For performance reasons, I've altered the threads to look like this:
Version B:
- Thread 1: Data Processing -> Data Selection -> FIFO
- Thread 2: FIFO -> Data Formatting -> Socket
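For reference, thread 2 in version B boils down to the loop sketched below (Record and format() are stand-ins rather than my real code; I've used tbb::concurrent_bounded_queue here for its blocking pop()):

    // Simplified sketch of the Version B socket thread. 'Record' and
    // 'format' are placeholders; the queue pop and the blocking write
    // match the pipeline described above.
    #include <vector>
    #include <boost/asio.hpp>
    #include <tbb/concurrent_queue.h>

    struct Record { /* high-level data produced by the selection stage */ };

    // Hypothetical formatting helper: serializes a record into 'buffer'.
    void format(const Record& rec, std::vector<char>& buffer);

    void socket_thread(tbb::concurrent_bounded_queue<Record>& fifo,
                       boost::asio::ip::tcp::socket& socket)
    {
        std::vector<char> buffer;
        buffer.reserve(32 * 1024);            // 32k chunks, as in both versions

        Record rec;
        for (;;)
        {
            fifo.pop(rec);                    // blocks until a record is available
            buffer.clear();
            format(rec, buffer);              // fill the buffer (stand-in helper)
            boost::asio::write(socket,        // blocking write of the whole chunk
                               boost::asio::buffer(buffer));
        }
    }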
Initially, this optimization proved to be successful. Thread 1 is capable of much higher throughput. I didn't look too hard at Thread 2's performance because I expected the CPU usage would be higher and (due to data thinning) it wasn't a major concern. However, one of my colleagues asked for a performance comparison of version A and version B. To test the setup I had thread 2's socket (a boost asio tcp socket) write to an instance of iperf on the same box (127.0.0.1) with the goal of showing the maximum throughput.
To compare the two setups I first tried forcing the system to write data out of the socket at 500 Mbps. As part of the performance testing I monitored top. What I saw surprised me. Version A did not show up on 'top -H', nor did iperf (this was actually as suspected). However, version B (my 'enhanced version') was showing up on 'top -H' with ~10% CPU utilization and (oddly) iperf was showing up with 8%.
Obviously, that implied to me that I was doing something wrong. I can't seem to prove that I am though! Things I've confirmed:
- Both versions are giving the socket 32k chunks of data
- Both versions are using the same boost library (1.45)
- Both have the same optimization setting (-O3)
- Both receive the exact same data, write out the same data, and write it at the same rate.
- Both use the same blocking write call.
- I'm testing from the same box with the exact same setup (Red Hat)
- The 'formatting' part of thread 2 is not the issue (I removed it and reproduced the problem)
- Small packets across the network are not the issue (I'm using TCP_CORK, and I've confirmed via Wireshark that the TCP segments are all ~16k; a sketch of how the cork is set follows this list).
- Putting a 1 ms sleep right after the socket write makes the CPU usage on both the socket thread and iperf(?!) go back to 0%.
- Poor man's profiler reveals very little (the socket thread is almost always sleeping).
- Callgrind reveals very little (the socket write barely even registers)
- Switching iperf for netcat (writing to /dev/null) doesn't change anything (actually netcat's CPU usage was ~20%).
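For reference, the cork is applied with a plain setsockopt on the socket's native descriptor, roughly like the sketch below (set_cork is just an illustrative wrapper, not my actual code):

    // Sketch of enabling TCP_CORK on the boost::asio socket through its
    // native descriptor. TCP_CORK is the standard Linux option that makes
    // the kernel coalesce writes into full-sized segments.
    #include <netinet/tcp.h>   // TCP_CORK
    #include <sys/socket.h>    // setsockopt, IPPROTO_TCP
    #include <boost/asio.hpp>

    void set_cork(boost::asio::ip::tcp::socket& socket, bool on)
    {
        const int flag = on ? 1 : 0;
        // Boost 1.45 exposes the raw fd as native(); newer releases renamed
        // it to native_handle().
        ::setsockopt(socket.native(), IPPROTO_TCP, TCP_CORK,
                     &flag, sizeof(flag));
    }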
The only thing I can think of is that I've introduced a tighter loop around the socket write. However, at 500 Mbps I wouldn't expect the CPU usage of both my process and iperf to increase, would I?
I'm at a loss as to why this is happening. My coworkers and I are basically out of ideas. Any thoughts or suggestions? I'll happily try anything at this point.
3 Answers
This is going to be very hard to analyze without code snippets or actual data quantities.
One thing that comes to mind: if the pre-formatted data stream is significantly larger than post-format, you may be expending more bandwidth/cycles copying a bunch more data through the FIFO (socket) boundary.
Try estimating or measuring the data rate at each stage. If the data rate is higher at the output of 'selection', consider the effects of moving formatting to the other side of the boundary. Is it possible that no copy is required for the select->format transition in configuration A, and configuration B imposes lots of copies?
... just guesses without more insight into the system.
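As a crude way to measure those per-stage rates, each stage could count the bytes it hands downstream and a monitoring thread could turn the counts into Mbps; a rough sketch (StageMeter and the usage comments are made up for illustration):

    // Rough per-stage byte counting: each stage bumps a counter for every
    // buffer it passes on, and a monitoring thread turns the counts into a
    // rate. All names here are illustrative.
    #include <atomic>
    #include <cstddef>
    #include <cstdio>

    struct StageMeter
    {
        const char*            name;
        std::atomic<long long> bytes;

        explicit StageMeter(const char* n) : name(n), bytes(0) {}

        void add(std::size_t n) { bytes += static_cast<long long>(n); }

        // Call periodically (e.g. once a second) from a monitoring thread.
        void report(double seconds)
        {
            const long long b = bytes.exchange(0);
            std::printf("%-12s %.1f Mbps\n", name, (b * 8.0) / (seconds * 1e6));
        }
    };

    // Usage: one meter per boundary, e.g.
    //   StageMeter selection_out("selection"), socket_in("socket");
    //   selection_out.add(buf.size());  // just before pushing into the FIFO
    //   socket_in.add(buf.size());      // just before the socket write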
What if the FIFO was the bottleneck in version A? Then both threads would sit and wait for the FIFO most of the time. And in version B, you'd be handing the data off to iperf faster.
What exactly do you store in the FIFO queues? Do you store packets of data, i.e. buffers?
In version A, you were writing formatted data (probably bytes) to the queue. So, sending it on the socket involved just writing out a fixed size buffer.
However, in version B, you are storing high-level data in the queues. Formatting it now creates bigger buffer sizes that are written directly to the socket. This causes the TCP/IP stack to spend CPU cycles on fragmentation and overhead...
This is my theory based on what you have said so far.
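To make the theory concrete, the difference in what crosses the FIFO might look something like the sketch below (the types are made up, not your actual code):

    // Illustrative contrast of the two FIFO payloads (names are made up).
    #include <vector>
    #include <tbb/concurrent_queue.h>

    struct Record { /* high-level data item from the selection stage */ };

    // Version A: the queue carries already-formatted byte buffers, so the
    // socket thread writes each element out unchanged.
    tbb::concurrent_bounded_queue<std::vector<char> > fifo_a;

    // Version B: the queue carries higher-level records; the socket thread
    // formats each one just before the write, so the size of the buffer
    // handed to the socket depends on the data.
    tbb::concurrent_bounded_queue<Record> fifo_b;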