Windows 上套接字发送/接收的速度

发布于 2025-01-13 03:30:57 字数 951 浏览 0 评论 0原文

在 Windows + Python 3.7 + i5 笔记本电脑上，通过socket接收 100MB 数据需要 200ms，这与 RAM 速度相比显然非常低。

如何提高 Windows 上的套接字速度？

# SERVER
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
t0 = time.time()
while True:
    data = conn.recv(8192)  # 8192 instead of 1024 improves from 0.5s to 0.2s
    if data == b'':
        break
print(time.time() - t0)  # ~ 0.200s

# CLIENT
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 1234))
a = b"a" * 100_000_000  # 100 MB of data
t0 = time.time()
s.send(a)
print(time.time() - t0)  # ~ 0.020s

注意：问题如何提高此 Python 套接字的发送/接收速度？是关于 socket 的包装，所以我想要直接使用纯socket进行测试，无需包装器。

原文

On Windows + Python 3.7 + i5 laptop, it takes 200ms to receive 100MB of data via a socket, that's obviously very low compared to the RAM speed.

How to improve this socket speed on Windows?

# SERVER
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
t0 = time.time()
while True:
    data = conn.recv(8192)  # 8192 instead of 1024 improves from 0.5s to 0.2s
    if data == b'':
        break
print(time.time() - t0)  # ~ 0.200s

# CLIENT
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 1234))
a = b"a" * 100_000_000  # 100 MB of data
t0 = time.time()
s.send(a)
print(time.time() - t0)  # ~ 0.020s

Note: The question How to improve the send/receive speed of this Python socket? is about a wrapper around socket, so I wanted to test directly with a pure socket, and no wrapper.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

鸩远一方 2025-01-20 03:30:58

TL;DR：基于主要在我的机器上收集的事实（并在另一台机器上确认），我可以在该机器上重现类似的行为，该问题似乎主要来自低效的实现Windows 网络 TCP 堆栈。更具体地说，Windows 执行大量临时缓冲区复制，导致 RAM 被密集使用。此外，整体资源没有得到有效利用。话虽这么说，基准也可以改进。

设置

用于执行基准测试的主要目标平台具有以下属性：

操作系统：Windows 10 Famille N（版本 21H1）
处理器：i5-9600KF
RAM：2 个 8GiB DDR4 通道 @ 3200GHz，实际速度可达 40 GiB/s。
CPython 3.8.1

请记住，不同平台的结果可能有所不同。

改进代码/基准

首先，行 a = b"a" * 100_000_000 需要一些时间，该时间包含在服务器的计时中，因为客户端在执行它之前已连接并且服务器应该在此期间接受客户端。最好在 s.connect 调用之前移动此行。

另外，8192的缓冲区非常小。按 8 KiB 的块读取 100 MB 意味着必须执行 12208 个 C 调用，并且可能执行类似数量的系统调用。由于系统调用非常昂贵，因为在大多数平台上它们往往至少需要几毫秒，因此最好在主流处理器上将缓冲区大小增加到至少 32 KiB。缓冲区应该足够小以适合快速 CPU 缓存，但也足够大以减少系统调用量。在我的机器上，使用 256 KiB 缓冲区可实现 70% 的加速。

此外，您需要在客户端代码中关闭套接字，以使服务器代码不会挂起。事实上，否则 conn.recv 应该等待传入的数据。事实上，检查 data == b'' 是否是一个好主意，因为这不是检查流是否结束的安全方法。您需要发送发送的缓冲区的大小或等待给定的预定义大小。例如，流可能会提前中断。或者，客户端可以关闭连接，并且服务器并不总是会直接收到通知（尽管环回速度很快，但有时可能需要很长时间）。

这是修改/改进的基准：

# CLIENT
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a = b"a" * 100_000_000  # 100 MB of data
s.connect(('127.0.0.1', 1234))
t0 = time.time()
s.send(a)
s.close()
print(time.time() - t0)

# SERVER
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
s = 0
t0 = time.time()
while True:
    data = conn.recv(256*1024)
    s += len(data)
    if s == 100_000_000:
        break
print(time.time() - t0)

我重复了 s.send 调用和基于 recv 的循环 100 次，以获得稳定的结果。这样我就可以达到 2.2 GiB/s。 TCP 套接字在大多数平台上往往相当慢，但这个结果显然不是很好（Linux 成功地实现了明显更好的吞吐量）。

在另一台装有 Windows 10 Professional、Skylake Xeon 处理器和 RAM 达到 40 GiB/s 的机器上，我达到了 0.8~1.0 GiB/s，这是非常糟糕的。

分析

性能分析表明，客户端进程经常使 TCP 缓冲区饱和并休眠一小段时间（20~40 ms），等待服务器接收数据。下面是两个进程的调度示例（上面一个是服务器，中间一个是客户端，下面一个是内核线程，浅绿色部分是空闲时间）：

一可以看到服务器没有当客户端填满 TCP 缓冲区时立即唤醒，这是 Windows 调度程序的错过优化。事实上，调度程序可以在服务器饥饿之前唤醒客户端，以减少延迟问题。请注意，不可忽略的部分时间花费在内核进程上，并且时间片与客户端活动相匹配。

总体而言，55% 的时间花在 ws2_32.dll 的 recv 函数上，10% 花在同一 DLL 的 send 函数上，25% 花在同步函数上， 10% 用于其他函数，包括 CPython 解释器的函数。因此，修改后的基准测试不会因 CPython 而减慢。此外，同步并不是速度下降的主要根源。

当进程被调度时，内存吞吐量从 16 GiB/s 上升到 34 GiB/s，平均约为 20 GiB/s，这是相当大的（特别是考虑到同步所花费的时间）。这意味着Windows 执行大量临时缓冲区复制，特别是在 recv 调用期间。

请注意，基于 Xeon 的平台速度较慢的原因肯定是因为处理器只能连续达到 14 GiB/s，而 i5-9600KF 处理器连续达到 24 GiB/s。 Xeon 处理器也以较低的频率运行。对于主要关注可扩展性的基于服务器的处理器来说，这种情况很常见。

对 ws2_32.dll 的深入分析表明，几乎所有的 recv 时间都花在了晦涩的指令 call qword ptr [rip+0x3440f] 上，我猜这是一个内核调用将数据从内核缓冲区复制到用户缓冲区。同样的事情也适用于send。这意味着副本不是在用户态完成的，而是在 Windows 内核本身中完成...

如果您想在 Windows 上的两个进程之间共享数据，我强烈建议您使用 shared内存而不是套接字。一些消息传递库在此基础上提供了抽象（例如 ZeroMQ）。

注意

这里是注释中指出的一些注意事项：

如果增加缓冲区大小不会显着影响性能，那么它肯定意味着代码已经在目标机器上受到内存限制。例如，对于 3 年旧 PC 上常见的 1 DDR4 内存通道 @ 2400 GHz，则最大实际吞吐量约为 14 GiB/s，我预计套接字吞吐量将明显小于 1 GiB/s。在具有基本 1 通道 DDR3 的较旧 PC 上，吞吐量甚至应该接近 500 MiB/s。速度应受诸如 maxMemThroughput / K 之类的限制，其中 K = (N+1) * P，其中：

N 是内存数量复制操作系统执行；
在具有直写式缓存策略的处理器或使用非临时 SIMD 指令的操作系统上，P 等于 2，否则等于 3。

低级分析器显示 Windows 上的 K ~= 8。他们还表明，send 执行有效的复制，受益于非临时存储，并使 RAM 吞吐量相当饱和，而 recv 似乎不使用非临时存储，显然确实如此。不会使 RAM 吞吐量饱和，并且执行的读取次数多于写入次数（由于某些未知原因）。

在 NUMA 系统（例如最近的 AMD 处理器 (Zen) 或多插槽系统）上，情况会更糟，因为 NUMA 节点的互连和饱和会减慢传输速度。众所周知，Windows 在这种情况下表现不佳。

AFAIK，ZeroMQ 有多个后端（又名“多传输”），其中一个使用 TCP（默认）运行，而另一个则使用共享内存运行。

TL;DR: based on the fact gathered mainly on my machine (and confirmed on another machine) on which I can reproduce a similar behaviour, the issue appear to mainly come from an inefficient implementation of the Windows networking TCP stack. More specifically, Windows performs a lot of temporary-buffer copies that cause the RAM to be intensively used. Furthermore, the overall resources are not efficiently used. That being said, the benchmark can be also improved.

Setup

The main target platform used in to perform the benchmark have the following attributes:

OS: Windows 10 Famille N (version 21H1)
Processor: i5-9600KF
RAM: 2 x 8GiB DDR4 channels @ 3200GHz capable of reaching up to 40 GiB/s in practice.
CPython 3.8.1

Please keep in mind that results can differ from one platform to another.

Improving the code/benchmark

First of all, the line a = b"a" * 100_000_000 take a bit of time which is included in the timing of the server since the client is connected before executing it and the server should accept the client during this time. It is better to move this line before the s.connect call.

Additionally, a buffer of 8192 is very small. Reading 100 MB by chunks of 8 KiB means that 12208 C calls must be performed and probably a similar number of system calls. Since system calls are pretty expensive as they tend to take at least few millisecond on most platform, it is better to increase the buffer size to at least 32 KiB on mainstream processors. The buffer should be small enough to fit in fast CPU cache but also big enough to reduce the amount of system calls. On my machine, using a 256 KiB buffer results in a 70% speed up.

Moreover, you need to close the socket in the client code for the server code not to hang on. Indeed, otherwise conn.recv should wait for incoming data. In fact, checking if data == b'' is not a good idea as this is not a safe way to check if the stream is over. You need to send the size of the buffer sent or wait for a given predefined size. For example, the stream can be interrupted prematurely. Alternatively, the client can close the connection and the server will not always be directly notified (it can sometime take a very long time although it is fast on the loopback).

Here is the modified/improved benchmark:

# CLIENT
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a = b"a" * 100_000_000  # 100 MB of data
s.connect(('127.0.0.1', 1234))
t0 = time.time()
s.send(a)
s.close()
print(time.time() - t0)

# SERVER
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
s = 0
t0 = time.time()
while True:
    data = conn.recv(256*1024)
    s += len(data)
    if s == 100_000_000:
        break
print(time.time() - t0)

I repeated the s.send call and the recv-based loop 100 times so to get stable results. With that I can reach 2.2 GiB/s. TCP sockets tends to be pretty slow on most platforms, but this result is clearly not great (Linux succeed to achieve a substantially better throughput).

On a different machine with Windows 10 Professional, a Skylake Xeon processor and a RAM reaching 40 GiB/s, I achieved 0.8~1.0 GiB/s which is very bad.

Analysis

A profiling analysis show that the client process often saturate the TCP buffer and sleep for a short time (20~40 ms) waiting for the server to receive data. Here is an example of scheduling of the two processes (the top one is the server, the middle one is the client, the bottom one is a kernel thread and the light-green parts are idle time):

One can see that the server is not immediately awaken when the client fill the TCP buffer which is a missed-optimization of the Windows scheduler. In fact, the scheduler could wake up the client before the server starvation so to reduce latency issues. Note that a non-negligible part of the time is spent in a kernel process and the time slice are matching with the client activity.

Overall, 55% of the time is spend in the recv function of ws2_32.dll, 10% in the send function of the same DLL, 25% in synchronization functions, and 10% in other functions including ones of the CPython interpreter. Thus, the modified benchmark is not slowed down by CPython. Additionally, synchronizations are not the main source of slowdown.

When processes are scheduled, the memory throughput goes from 16 GiB/s up to 34 GiB/s with an average of ~20 GiB/s which is pretty big (especially considering the time taken by synchronizations). This means Windows performs a lot of big temporary buffer copies, especially during the recv calls.

Note that the reason why the Xeon-based platform is slower is certainly because the processor only succeed to reach 14 GiB/s in sequential while the i5-9600KF processor reach 24 GiB/s in sequential. The Xeon processor also operate at a lower frequency. Such things are common for server-based processors that mainly focus on scalability.

A deeper analysis of ws2_32.dll show that nearly all the time of recv is spent in the obscure instruction call qword ptr [rip+0x3440f] which I guess is a kernel call to copy data from a kernel buffer to the user one. The same thing applies for send. This means that the copies are not done in user-land but in the Windows kernel itself...

If you want to share data between two processes on Windows, I strongly advise you to use shared memory instead of sockets. Some message passing libraries provide an abstraction on top of this (like ZeroMQ for example).

Notes

Here is some notes as pointed out in the comments:

If increasing the buffer size does not impact significantly the performance, then it certainly means that the code is already memory bound on the target machine. For example, with a 1 DDR4 memory channel @ 2400 GHz common on 3-year old PC, then the maximum practical throughput will be about 14 GiB/s and I expect the sockets throughput to be clearly less than 1 GiB/s. On much older PC with a basic 1 channel DDR3, the throughput should even be close to 500 MiB/s. The speed should be bounded by something like maxMemThroughput / K where K = (N+1) * P and where:

N is the number of copy the operating system perform;
P is equal to 2 on processor with a write-through cache policy or operating system using non-temporal SIMD instructions, and 3 otherwise.

Low-level profilers show that K ~= 8 on Windows. They also show that send performs an efficient copy that benefit from non-temporal stores and quite saturate the RAM throughput, while recv seems not to use non-temporal stores, clearly does not saturate the RAM throughput and performs a lot more reads than writes (for some unknown reason).

On NUMA system like recent AMD processors (Zen) or multi-socket systems, this should be even be worse since the interconnect and the saturation of NUMA nodes can slow down transfers. Windows is known to behave badly in this case.

AFAIK, ZeroMQ has multiple backends (aka. "Multi-Transport") and one of them operate with TCP (default) while another operate with shared memory.

回复收藏 0 原文