Server-side performance comparison: Go standard library net/http vs. fasthttp

Published 2023-08-08 08:41:11

1. Background

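After writing the classic hello, world program, a Go beginner may be eager to try Go's powerful standard library, for example by writing a fully functional web server in a few lines of code, like the example below: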

// From https://tip.golang.org/pkg/net/http/#example_ListenAndServe
package main
import (
 "io"
 "log"
 "net/http"
)
func main() {
 helloHandler := func(w http.ResponseWriter, req *http.Request) {
  io.WriteString(w, "Hello, world!\n")
 }
 http.HandleFunc("/hello", helloHandler)
 log.Fatal(http.ListenAndServe(":8080", nil))
}

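Go's net/http package is a well-balanced, general-purpose implementation that covers more than 90% of the scenarios most gophers encounter, and it has the following advantages: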

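It is a standard library package, so no third-party dependency is required;
It conforms to the HTTP specification fairly well;
It delivers relatively high performance without any tuning;
It supports HTTP proxies;
It supports HTTPS;
It supports HTTP/2 seamlessly.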

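However, precisely because net/http is a "balanced" general-purpose implementation, its performance may fall short in areas with strict performance requirements, and it leaves little room for tuning. In those cases we turn to third-party HTTP server frameworks.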

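Among third-party HTTP server frameworks, fasthttp[1], a framework that lives up to its name, is one of the most frequently mentioned and adopted. Its official site claims that it is ten times faster than net/http (based on go test benchmark results).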

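fasthttp applies many performance best practices[2], especially around reusing in-memory objects: it makes heavy use of sync.Pool[3] to reduce pressure on the Go GC. So how much faster than net/http is fasthttp in a real environment? I happen to have two reasonably powerful servers available, so in this article we look at how the two actually perform in that real environment.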
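To make the object-reuse idea concrete, here is a minimal sketch of the general sync.Pool pattern. It is not taken from fasthttp's source; bufPool and render are hypothetical names chosen for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values so that each call
// does not have to allocate a fresh buffer. (Hypothetical name.)
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render builds a greeting in a pooled buffer and returns it as a string.
// It is a hypothetical helper, not part of fasthttp.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still hold old data
	defer bufPool.Put(buf) // hand the buffer back for the next caller

	buf.WriteString("Hello, ")
	buf.WriteString(name)
	buf.WriteString("!")
	return buf.String() // String copies the bytes, so reuse of buf is safe
}

func main() {
	fmt.Println(render("Go"))
}
```

Under steady load most calls get a previously used buffer back from the pool, so fewer short-lived objects reach the garbage collector.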

2. Performance tests

We implement two test targets with almost "zero business logic", one with net/http and one with fasthttp:

nethttp:

// github.com/bigwhite/experiments/blob/master/http-benchmark/nethttp/main.go
package main
import (
 _ "expvar"
 "log"
 "net/http"
 _ "net/http/pprof"
 "runtime"
 "time"
)
func main() {
 // Log the number of goroutines once per second.
 go func() {
  for {
   log.Println("current goroutine count:", runtime.NumGoroutine())
   time.Sleep(time.Second)
  }
 }()
 // The expvar and pprof imports register their handlers on the default mux.
 http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
  w.Write([]byte("Hello, Go!"))
 }))
 log.Fatal(http.ListenAndServe(":8080", nil))
}

fasthttp:

// github.com/bigwhite/experiments/blob/master/http-benchmark/fasthttp/main.go
package main
import (
 "fmt"
 "log"
 "net/http"
 "runtime"
 "time"
 _ "expvar"
 _ "net/http/pprof"
 "github.com/valyala/fasthttp"
)
type HelloGoHandler struct {
}
func fastHTTPHandler(ctx *fasthttp.RequestCtx) {
 fmt.Fprintln(ctx, "Hello, Go!")
}
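func main() {
 // Serve the expvar and pprof endpoints (registered on the default mux) on :6060.
 go func() {
  log.Println(http.ListenAndServe(":6060", nil))
 }()
 // Log the number of goroutines once per second.
 go func() {
  for {
   log.Println("current goroutine count:", runtime.NumGoroutine())
   time.Sleep(time.Second)
  }
 }()
 s := &fasthttp.Server{
  Handler: fastHTTPHandler,
 }
 log.Fatal(s.ListenAndServe(":8081"))
}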

The load-generating client is based on hey[4], an HTTP load-testing tool. To make it easy to adjust the load level, we wrap hey in the following shell script (Linux only):

# github.com/bigwhite/experiments/blob/master/http-benchmark/client/http_client_load.sh
# Example: ./http_client_load.sh 3 10000 10 GET http://10.10.195.181:8080
echo "$0 task_num count_per_hey conn_per_hey method url"
task_num=$1
count_per_hey=$2
conn_per_hey=$3
method=$4
url=$5
start=$(date +%s%N)
# Start task_num hey processes in parallel, each writing to its own log file.
for((i=1; i<=$task_num; i++)); do {
 tm=$(date +%T.%N)
 echo "$tm: task $i start"
 hey -n "$count_per_hey" -c "$conn_per_hey" -m "$method" "$url" > "hey_$i.log"
 tm=$(date +%T.%N)
 echo "$tm: task $i done"
} & done
wait
end=$(date +%s%N)
# Overall throughput = total requests / wall-clock seconds.
count=$(( task_num * count_per_hey ))
runtime_ns=$(( end - start ))
runtime=$(echo "scale=2; $runtime_ns / 1000000000" | bc)
echo "runtime: $runtime"
speed=$(echo "scale=2; $count / $runtime" | bc)
echo "speed: $speed"

A sample run of the script:

bash http_client_load.sh 8 1000000 200 GET   http://10.10.195.134:8080  
http_client_load.sh task_num count_per_hey conn_per_hey method url
16:58:09.146948690: task 1 start
16:58:09.147235080: task 2 start
16:58:09.147290430: task 3 start
16:58:09.147740230: task 4 start
16:58:09.147896010: task 5 start
16:58:09.148314900: task 6 start
16:58:09.148446030: task 7 start
16:58:09.148930840: task 8 start
16:58:45.001080740: task 3 done
16:58:45.241903500: task 8 done
16:58:45.261501940: task 1 done
16:58:50.032383770: task 4 done
16:58:50.985076450: task 7 done
16:58:51.269099430: task 5 done
16:58:52.008164010: task 6 done
16:58:52.166402430: task 2 done
runtime: 43.02
speed: 185960.01

As the arguments show, the script starts 8 tasks in parallel (each task starts one hey); each task opens 200 concurrent connections to http://10.10.195.134:8080 and sends 1,000,000 HTTP GET requests.

We use two servers, one for the test target and one for the load script:

Server running the test target: 10.10.195.181 (physical machine, Intel x86-64 CPU, 40 logical cores, 128 GB RAM, CentOS 7.6)

$ cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core) 
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:              4
CPU MHz:               800.000
CPU max MHz:           2201.0000
CPU min MHz:           800.0000
BogoMIPS:              4400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr 
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good 
nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl 
vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic 
movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 
3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd mba ibrs ibpb stibp 
tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle 
avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq 
rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl 
xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total 
cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d

Server running the load tool: 10.10.195.133 (physical machine, Kunpeng arm64 CPU, 96 cores, 80 GB RAM, CentOS 7.9)

# cat /etc/redhat-release 
CentOS Linux release 7.9.2009 (AltArch)
# lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          4
Model:                 0
CPU max MHz:           2600.0000
CPU min MHz:           200.0000
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              49152K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
NUMA node2 CPU(s):     48-71
NUMA node3 CPU(s):     72-95
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 
atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

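I use dstat (dstat -tcdngym) to monitor resource usage on the target host, CPU load in particular, and use expvarmon to monitor memstats[5] and see how the target program's consumption of various resources ranks.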

The table below was compiled from multiple test runs:

Figure: test data

3. Brief analysis of the results

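Given the specific scenario, the accuracy of the test tool and script, and the test environment, the results above have their limitations, but they do reflect the performance trend of the test targets. Under the same load, fasthttp is not 10 times faster than net/http; in this particular scenario it does not even reach twice the performance of net/http. In the test cases where CPU usage on the target host was close to 70%, fasthttp was only about 30% to 70% faster than net/http.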

So why does fasthttp fall short of expectations? To answer that, we need to look at how net/http and fasthttp each work. Let's start with a diagram of how net/http works:

Figure: how net/http works

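As a server, the http package works on a simple principle: after accepting a connection (conn), it hands the conn to a worker goroutine, which lives until the conn's lifetime ends, that is, until the connection is closed.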

Below is a diagram of how fasthttp works:

Figure: how fasthttp works

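fasthttp, by contrast, is designed to reuse goroutines as much as possible instead of creating a new one every time. After the fasthttp Server accepts a conn, it tries to take a channel from the ready slice in the workerpool; each such channel corresponds one-to-one to a worker goroutine. Once it has a channel, it writes the accepted conn to it, and the worker goroutine at the other end of the channel handles reads and writes on that conn. When it finishes with the conn, the worker goroutine does not exit; it puts its channel back into the ready slice in the workerpool and waits to be picked again.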
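The following is a simplified model of that worker pool, written only to illustrate the reuse mechanism. It is not fasthttp's actual implementation: the types conn and pool and the helpers sequential and longLived are hypothetical, a conn is modelled as a plain function, and idle workers are never cleaned up. It also shows why reuse helps with short-lived connections but not with connections that all stay open at once.

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for a network connection; serving it means calling it.
type conn func()

// pool is a simplified model of a worker pool with a "ready" slice.
type pool struct {
	mu      sync.Mutex
	ready   []chan conn // channels of idle workers, one channel per worker
	spawned int         // number of worker goroutines created so far
	wg      sync.WaitGroup
}

// serve hands c to an idle worker, or starts a new worker if none is idle.
func (p *pool) serve(c conn) {
	p.mu.Lock()
	var ch chan conn
	if n := len(p.ready); n > 0 {
		ch = p.ready[n-1] // reuse the most recently idle worker
		p.ready = p.ready[:n-1]
	} else {
		ch = make(chan conn, 1)
		p.spawned++
		go p.worker(ch)
	}
	p.wg.Add(1)
	p.mu.Unlock()
	ch <- c
}

// worker serves conns from its channel, then puts the channel back into ready.
func (p *pool) worker(ch chan conn) {
	for c := range ch {
		c()
		p.mu.Lock()
		p.ready = append(p.ready, ch)
		p.mu.Unlock()
		p.wg.Done()
	}
}

// sequential serves n short-lived conns one after another.
func sequential(n int) int {
	p := &pool{}
	for i := 0; i < n; i++ {
		p.serve(func() {})
		p.wg.Wait() // this conn is finished before the next one arrives
	}
	return p.spawned
}

// longLived serves n conns that all stay open until release is closed.
func longLived(n int) int {
	p := &pool{}
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		p.serve(func() { <-release })
	}
	close(release)
	p.wg.Wait()
	return p.spawned
}

func main() {
	fmt.Println("workers for 100 sequential conns:", sequential(100)) // 1
	fmt.Println("workers for 100 long-lived conns:", longLived(100))  // 100
}
```

In this model, 100 connections that arrive one after another are all served by a single worker, while 100 connections held open at the same time force 100 workers into existence.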

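fasthttp's goroutine reuse strategy is well-intentioned, but it has little effect in this test scenario. The results show that, under the same client concurrency and load, net/http and fasthttp use almost the same number of goroutines. This is a consequence of the test model: each hey in each task opens a fixed number of persistent (keep-alive) connections to the target and then sends requests at saturation on each connection.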

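As a result, once a goroutine in the fasthttp workerpool receives a conn, it can be returned to the pool only after communication on that conn ends, and the conn is not closed until the test finishes. This scenario effectively makes fasthttp "degrade" to the net/http model and inherit its weakness: once the number of goroutines grows large, the cost of Go runtime scheduling becomes non-negligible and can even exceed the share of resources spent on business logic. Below are the CPU profiles of fasthttp with 200, 8,000, and 16,000 persistent connections: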

200 persistent connections:
(pprof) top -cum
Showing nodes accounting for 88.17s, 55.35% of 159.30s total
Dropped 150 nodes (cum <= 0.80s)
Showing top 10 nodes out of 60
      flat  flat%   sum%        cum   cum%
     0.46s  0.29%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*Server).serveConn
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.04s 0.025%  0.31%     89.46s 56.16%  internal/poll.ignoringEINTRIO (inline)
    87.38s 54.85% 55.17%     89.27s 56.04%  syscall.Syscall
     0.12s 0.075% 55.24%     60.39s 37.91%  bufio.(*Writer).Flush
         0     0% 55.24%     60.22s 37.80%  net.(*conn).Write
     0.08s  0.05% 55.29%     60.21s 37.80%  net.(*netFD).Write
     0.09s 0.056% 55.35%     60.12s 37.74%  internal/poll.(*FD).Write
         0     0% 55.35%     59.86s 37.58%  syscall.Write (inline)
(pprof) 
8000 persistent connections:
(pprof) top -cum
Showing nodes accounting for 108.51s, 54.46% of 199.23s total
Dropped 204 nodes (cum <= 1s)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.69s  0.35%  0.35%    119.05s 59.76%  github.com/valyala/fasthttp.(*Server).serveConn
     0.04s  0.02%  0.37%    104.22s 52.31%  internal/poll.ignoringEINTRIO (inline)
   101.58s 50.99% 51.35%    103.95s 52.18%  syscall.Syscall
     0.10s  0.05% 51.40%     79.95s 40.13%  runtime.mcall
     0.06s  0.03% 51.43%     79.85s 40.08%  runtime.park_m
     0.23s  0.12% 51.55%     79.30s 39.80%  runtime.schedule
     5.67s  2.85% 54.39%     77.47s 38.88%  runtime.findrunnable
     0.14s  0.07% 54.46%     68.96s 34.61%  bufio.(*Writer).Flush
16000 persistent connections:
(pprof) top -cum
Showing nodes accounting for 239.60s, 87.07% of 275.17s total
Dropped 190 nodes (cum <= 1.38s)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
     0.04s 0.015% 0.015%    153.38s 55.74%  runtime.mcall
     0.01s 0.0036% 0.018%    153.34s 55.73%  runtime.park_m
     0.12s 0.044% 0.062%       153s 55.60%  runtime.schedule
     0.66s  0.24%   0.3%    152.66s 55.48%  runtime.findrunnable
     0.15s 0.055%  0.36%    127.53s 46.35%  runtime.netpoll
   127.04s 46.17% 46.52%    127.04s 46.17%  runtime.epollwait
         0     0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.41s  0.15% 46.67%    120.18s 43.67%  github.com/valyala/fasthttp.(*Server).serveConn
   111.17s 40.40% 87.07%    111.99s 40.70%  syscall.Syscall
(pprof)

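Comparing these profiles, we see that as the number of persistent connections grows (that is, as the number of goroutines in the workerpool grows), the share of Go runtime scheduling gradually rises. At 16,000 connections, the runtime scheduling functions occupy the top four positions.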

4. Optimization approaches

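The results above show that fasthttp's model is not well suited to scenarios where connections, once established, carry sustained "saturating" requests. It is a better fit for short-lived connections, or for persistent connections without sustained saturation; only in those scenarios can its goroutine reuse model pay off.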

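Yet even when it degrades to the net/http model, fasthttp still performs somewhat better than net/http. Why? The gain comes mainly from fasthttp's memory allocation optimizations, such as heavy use of sync.Pool and avoiding conversions between []byte and string.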
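As an illustration of the second trick, the sketch below formats data directly into a reusable byte slice instead of building a string and converting it. It shows the general technique using only the standard library; appendStatus is a hypothetical helper and is not taken from fasthttp.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendStatus appends "status=<code>" to dst and returns the extended slice.
// The Append* helpers write straight into the byte slice, so no intermediate
// string is created and then converted to []byte. (Hypothetical helper.)
func appendStatus(dst []byte, code int) []byte {
	dst = append(dst, "status="...)
	return strconv.AppendInt(dst, int64(code), 10)
}

func main() {
	buf := make([]byte, 0, 64) // allocated once, reused for every message
	for _, code := range []int{200, 404} {
		buf = appendStatus(buf[:0], code) // buf[:0] keeps the backing array
		fmt.Printf("%s\n", buf)
	}
}
```

The string-based alternative, []byte("status=" + strconv.Itoa(code)), typically allocates for the integer string, the concatenation, and the final conversion, which is the kind of per-request garbage this style avoids.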

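So, under sustained saturating requests, how can we keep the number of goroutines in the fasthttp workerpool from growing linearly with the number of conns? fasthttp does not offer an official answer, but one path worth considering is OS-level I/O multiplexing (epoll on Linux), the same mechanism that the Go runtime netpoll uses.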

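With multiplexing, each goroutine in the workerpool can handle multiple connections at the same time, so we can size the workerpool according to business scale rather than letting the number of goroutines grow almost without bound as it does now. Of course, introducing epoll at the user level may also increase the share of system calls and raise response latency. Whether this path is viable depends on the concrete implementation and test results.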

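Note: the Concurrency field of fasthttp.Server can limit the number of goroutines in the workerpool that process connections concurrently. But because each goroutine handles only one connection, if Concurrency is set too low, subsequent connections may be refused by fasthttp. That is why fasthttp's default Concurrency is: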