Low-latency networking techniques and silver bullets
After some basic googling of low-latency networking I've come up with the following list of things programmers and system designers should consider when embarking on low latency networking:
The design of the hardware, systems and protocols has to be considered together
Develop protocols using UDP instead of TCP and implement simple ack-nak, resend logic at the application level
Reduce the number of context switches (preferably to zero) for the process or thread that consumes and packetizes data off the wire
Use the best selector for the OS (select, kqueue, epoll etc); a minimal UDP/epoll sketch follows this list
Use good quality NICs and Switches with large amounts of on-board buffer (fifo)
Use multiple NICs, specifically for down-stream and up-stream data flows
Reduce the number of IRQs being generated by other devices or software (in short remove them if they are not required)
Reduce the usage of mutexes and conditions. Instead where possible use Lock-Free programming techniques. Make use of the architecture's CAS capabilities. (Lock-Free containers)
Consider single-threaded over multi-threaded designs - context switches are very expensive.
Understand and properly utilize your architecture's cache system (L1/L2, RAM etc)
Prefer complete control over memory management, rather than delegating to Garbage Collectors
Use good quality cables, keep the cables as short as possible, reduce the number of twists and curls
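For concreteness, here's roughly the shape I have in mind for the UDP and selector points above: a minimal, Linux-only sketch (error handling omitted; the port and buffer sizes are arbitrary placeholders) of a non-blocking UDP socket drained with epoll so datagrams don't linger in the kernel's receive buffer.

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <fcntl.h>
    #include <string.h>

    #define PORT 9000            /* placeholder */

    int main(void)
    {
        /* Non-blocking UDP socket bound to the placeholder port. */
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        fcntl(fd, F_SETFL, O_NONBLOCK);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(PORT);
        bind(fd, (struct sockaddr *)&addr, sizeof addr);

        /* epoll as the selector; kqueue or select would fill the same role elsewhere. */
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

        char buf[2048];
        for (;;) {
            struct epoll_event events[16];
            int n = epoll_wait(ep, events, 16, -1);
            for (int i = 0; i < n; i++) {
                /* Drain the socket completely so nothing sits in kernel buffers. */
                ssize_t len;
                while ((len = recv(events[i].data.fd, buf, sizeof buf, 0)) > 0) {
                    /* packetize/process the 'len'-byte datagram here,
                       then ack/nak it at the application level */
                }
            }
        }
    }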
My question: I was wondering what other things fellow SOers believe are important when embarking on low latency networking.
Feel free to critique any of the above points
Cable quality is usually kind of a red herring. I'd think more about connecting up a network analyzer to see whether you're getting enough re-transmissions to care about. If you're getting very many, try to isolate where they're happening, and replace the cable(s) that is/are causing the problem. If you're not getting errors that result in re-transmissions, then the cable has (virtually) no effect on latency.
Large buffers on NICs and (especially) switches won't, themselves, reduce latency. In fact, to truly minimize latency, you normally want to use the smallest buffers you can, not larger ones. Data sitting in a buffer instead of being processed immediately increases latency. Truthfully, it's rarely worth worrying about, but still. If you really want to minimize latency (and care a lot less about bandwidth) you'd be better off using a hub than a switch (kind of hard to find anymore, but definitely low latency as long as network congestion is low enough).
Multiple NICs can help bandwidth a lot, but their effect on latency is generally pretty minimal.
Edit: My primary advice, however, would be to get a sense of scale. Reducing a network cable by a foot saves you about a nanosecond -- on the same general order as speeding up packet processing by a couple of assembly language instructions.
Bottom line: Like any other optimization, to get very far you need to measure where you're getting latency before you can do much to reduce it. In most cases, reducing wire lengths (to use one example) won't make enough difference to notice, simply because it's fast to start with. If something starts out taking 10 microseconds, nothing you can do is going to speed it up any more than 10 microseconds, so unless you have things so fast that 10 us is a significant percentage of your time, it's not worth attacking.
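To make the measurement point concrete, the sketch below is the kind of bare-bones instrumentation I mean: CLOCK_MONOTONIC timestamps around each stage of a receive path, so you can see whether your microseconds go to the kernel wait, the copy, or your own processing (handle_message() is just a stand-in for whatever your application does):

    #include <stdio.h>
    #include <time.h>
    #include <sys/socket.h>

    void handle_message(const char *buf, size_t len);   /* stand-in for app processing */

    static long long ns_between(struct timespec a, struct timespec b)
    {
        return (long long)(b.tv_sec - a.tv_sec) * 1000000000LL
             + (b.tv_nsec - a.tv_nsec);
    }

    void timed_receive(int fd)
    {
        char buf[2048];
        struct timespec t0, t1, t2;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        ssize_t len = recv(fd, buf, sizeof buf, 0);      /* wait + copy out of the kernel */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (len > 0)
            handle_message(buf, (size_t)len);            /* application-level processing */
        clock_gettime(CLOCK_MONOTONIC, &t2);

        fprintf(stderr, "recv: %lld ns, process: %lld ns\n",
                ns_between(t0, t1), ns_between(t1, t2));
    }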
Others:
1: use userland networking stacks
2: service interrupts on the same CPU socket as the handling code (shared cache)
3: prefer fixed length protocols, even if they are a little larger in bytes (quicker parsing)
4: ignore the network byte order convention and just use native ordering
5: never allocate in hot routines; use object pools instead (esp. in garbage-collected languages) - a minimal pool sketch follows this list
6: try to prevent byte copying as much as possible (hard in TCP send)
7: use cut-through switching mode
8: hack networking stack to remove TCP slow start
9: advertise a huge TCP window (but don't use it) so the other side can have a lot of inflight packets at a time
10: turn off NIC coalescing, especially for send (packetize in the app stack if you need to)
11: prefer copper over optic
I can keep going, but that should get people thinking
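For point 5, this is roughly what I mean by pooling: a minimal single-threaded sketch (sizes are arbitrary; a pool shared across threads would need a lock-free or per-thread free list):

    #include <stddef.h>

    #define POOL_SIZE 1024           /* arbitrary */
    #define MSG_BYTES 1536           /* arbitrary */

    struct msg {
        struct msg *next;            /* free-list link */
        size_t      len;
        char        data[MSG_BYTES];
    };

    /* All storage is carved out up front; the hot path never touches malloc/free. */
    static struct msg  pool_storage[POOL_SIZE];
    static struct msg *free_list;

    void pool_init(void)
    {
        for (int i = 0; i < POOL_SIZE; i++) {
            pool_storage[i].next = free_list;
            free_list = &pool_storage[i];
        }
    }

    struct msg *pool_get(void)       /* O(1), no allocation */
    {
        struct msg *m = free_list;
        if (m)
            free_list = m->next;
        return m;                    /* NULL means the pool is exhausted */
    }

    void pool_put(struct msg *m)     /* O(1), no free */
    {
        m->next = free_list;
        free_list = m;
    }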
One I don't agree with:
1: network cables are rarely an issue except when gone bad (there is an exception to this in terms of cable type)
This may be a bit obvious, but it's a technique that I'm happy with and it works with both UDP and TCP, so I'll write about it:
1) Never queue up significant amounts of outgoing data: specifically, try to avoid marshalling your in-memory data structures into serialized-byte-buffers until the last possible moment. Instead, when your sending socket select()s as ready-for-write, flatten the current state of the relevant/dirty data structures at that time, and send() them out immediately. That way data will never "build up" on the sending side. (also, be sure to set the SO_SNDBUF of your socket to as small as you can get away with, to minimize data queueing inside the kernel)
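A rough sketch of what that looks like in C (serialize_dirty_state() is a stand-in for whatever flattens your dirty structures; a real loop would only put the socket in the write set while something is actually dirty, rather than spinning):

    #include <stddef.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* Stand-in: flattens whatever is dirty *right now* into out, returns byte count. */
    size_t serialize_dirty_state(char *out, size_t cap);

    void send_when_ready(int fd)
    {
        /* Keep the kernel's send queue as small as we can get away with. */
        int sndbuf = 4096;           /* illustrative value */
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf);

        for (;;) {
            fd_set wfds;
            FD_ZERO(&wfds);
            FD_SET(fd, &wfds);
            if (select(fd + 1, NULL, &wfds, NULL, NULL) <= 0)
                continue;

            /* Marshal at the last possible moment, from the current state. */
            char buf[2048];
            size_t len = serialize_dirty_state(buf, sizeof buf);
            if (len > 0)
                send(fd, buf, len, 0);
        }
    }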
2) You can do something similar on the receiving side, assuming your data is keyed in some way: instead of doing a (read data message, process data message, repeat) loop, you can read all available data messages and just place them into a keyed data structure (e.g. a hash table) until the socket has no more data available to read, and then (and only then) iterate over the data structure and process the data. The advantage of this is that if your receiving client has to do any non-trivial processing on the received data, then obsolete incoming messages will be automatically/implicitly dropped (as their replacement overwrites them in the keyed data structure) and so incoming packets won't back up in the kernel's incoming message queue. (You could just let the kernel's queue fill up and drop packets, of course, but then your program ends up reading the 'old' packets and dropping the 'newer' ones, which isn't usually what you want). As a further optimization, you could have the I/O thread hand the keyed data structure over to a separate processing thread, so that the I/O won't get held off by the processing.
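A condensed sketch of that receive loop, assuming (purely for illustration) that every datagram starts with a small integer key in native byte order:

    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    #define MAX_KEYS 4096
    #define MAX_MSG  2048

    struct slot {
        int    valid;
        size_t len;
        char   data[MAX_MSG];
    };

    static struct slot latest[MAX_KEYS];   /* keyed structure: newest message per key wins */

    void handle_update(uint32_t key, const char *payload, size_t len);  /* stand-in */

    void drain_and_process(int fd)
    {
        char buf[MAX_MSG];
        ssize_t len;

        /* 1. Drain every pending datagram; later updates simply overwrite earlier ones. */
        while ((len = recv(fd, buf, sizeof buf, MSG_DONTWAIT)) > (ssize_t)sizeof(uint32_t)) {
            uint32_t key;
            memcpy(&key, buf, sizeof key);
            key %= MAX_KEYS;
            latest[key].len = (size_t)len - sizeof key;
            memcpy(latest[key].data, buf + sizeof key, latest[key].len);
            latest[key].valid = 1;
        }

        /* 2. Only now do the (possibly expensive) processing, once per surviving key. */
        for (uint32_t k = 0; k < MAX_KEYS; k++) {
            if (latest[k].valid) {
                handle_update(k, latest[k].data, latest[k].len);
                latest[k].valid = 0;
            }
        }
    }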