Confusing results from Golang function benchmarks and goroutine call overhead

Posted 2025-01-17 14:51:28

Out of curiosity, I am trying to understand what the function and goroutine call overhead is in Go. I therefore wrote the benchmarks below, which gave the results shown after them. The result for BenchmarkNestedFunctions confuses me, as it seems far too high, so I naturally assume I have done something wrong. I was expecting BenchmarkNestedFunctions to be slightly higher than BenchmarkNopFunc and very close to BenchmarkSplitNestedFunctions. Can anyone suggest what I may be misunderstanding or doing wrong?

package main

import (
    "testing"
)

// Intended to allow me to see the iteration overhead being used in the benchmarking
func BenchmarkTestLoop(b *testing.B) {
    for i := 0; i < b.N; i++ {
    }
}

//go:noinline
func nop() {
}

// Intended to allow me to see the overhead from making a do nothing function call which I hope is not being optimised out
func BenchmarkNopFunc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        nop()
    }
}

// Intended to allow me to see the added cost from creating a channel, closing it and then reading from it
func BenchmarkChannelMakeCloseRead(b *testing.B) {
    for i := 0; i < b.N; i++ {
        done := make(chan struct{})
        close(done)
        _, _ = <-done
    }
}

//go:noinline
func nestedfunction(n int, done chan<- struct{}) {
    n--
    if n > 0 {
        nestedfunction(n, done)
    } else {
        close(done)
    }
}

// Intended to allow me to see the added cost of making 1 function call doing a set of channel operations for each call
func BenchmarkUnnestedFunctions(b *testing.B) {
    for i := 0; i < b.N; i++ {
        done := make(chan struct{})
        nestedfunction(1, done)
        _, _ = <-done
    }
}

// Intended to allow me to see the added cost of repeated nested calls and stack growth with an upper limit on the call depth to allow examination of a particular stack size
func BenchmarkNestedFunctions(b *testing.B) {
    // Max number of nested function calls to prevent excessive stack growth
    const max int = 200000
    if b.N > max {
        b.N = max
    }
    done := make(chan struct{})
    nestedfunction(b.N, done)
    _, _ = <-done
}

// Intended to allow me to see the added cost of repeated nested call with any stack reuse the runtime supports (presuming it doesn't free and the realloc the stack as it grows)
func BenchmarkSplitNestedFunctions(b *testing.B) {
    // Max number of nested function calls to prevent excessive stack growth
    const max int = 200000
    for i := 0; i < b.N; i += max {
        done := make(chan struct{})
        if (b.N - i) > max {
            nestedfunction(max, done)
        } else {
            nestedfunction(b.N-i, done)
        }
        _, _ = <-done
    }
}

// nestedgoroutines was not defined in the posted code; this reconstruction is
// assumed to mirror nestedfunction, launching each level in a new goroutine.
//go:noinline
func nestedgoroutines(n int, done chan<- struct{}) {
    n--
    if n > 0 {
        go nestedgoroutines(n, done)
    } else {
        close(done)
    }
}

// Intended to allow me to see the added cost of spinning up a go routine to perform comparable useful work as the nested function calls
func BenchmarkNestedGoRoutines(b *testing.B) {
    done := make(chan struct{})
    go nestedgoroutines(b.N, done)
    _, _ = <-done
}

The benchmarks are invoked as follows:

$ go test -bench=. -benchmem -benchtime=200ms
goos: windows
goarch: amd64
pkg: golangbenchmarks
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkTestLoop-24                    1000000000               0.2247 ns/op          0 B/op          0 allocs/op
BenchmarkNopFunc-24                     170787386                1.402 ns/op           0 B/op          0 allocs/op
BenchmarkChannelMakeCloseRead-24         3990243                52.72 ns/op           96 B/op          1 allocs/op
BenchmarkUnnestedFunctions-24            4791862                58.63 ns/op           96 B/op          1 allocs/op
BenchmarkNestedFunctions-24               200000                50.11 ns/op            0 B/op          0 allocs/op
BenchmarkSplitNestedFunctions-24        155160835                1.528 ns/op           0 B/op          0 allocs/op
BenchmarkNestedGoRoutines-24              636734               412.2 ns/op            24 B/op          1 allocs/op
PASS
ok      golangbenchmarks        1.700s

The BenchmarkTestLoop, BenchmarkNopFunc and BenchmarkSplitNestedFunctions results seem reasonably consistent with each other and make sense: BenchmarkSplitNestedFunctions does more work than BenchmarkNopFunc on average per benchmark operation, but not by much, because the expensive channel make/close/read sequence is only performed about once every 200,000 benchmark operations.

Similarly, the BenchmarkChannelMakeCloseRead and BenchmarkUnnestedFunctions results seem consistent, since each BenchmarkUnnestedFunctions operation does slightly more work than each BenchmarkChannelMakeCloseRead operation: an extra decrement and an if test, which could cause a branch prediction failure (although I would have hoped the branch predictor could use the last branch result; I don't know how complex the close implementation is, and it may be overwhelming the branch history).

However, BenchmarkNestedFunctions and BenchmarkSplitNestedFunctions are radically different and I don't understand why. They should be similar, with the only intentional difference being any reuse of the already-grown stack, and I did not expect the stack growth cost to be nearly so high. (Or is that the explanation, and is it just coincidence that the result is so similar to the BenchmarkChannelMakeCloseRead result, making me think it is not actually doing what I thought it was?)

It should also be noted that the BenchmarkSplitNestedFunctions result can occasionally take significantly different values; I have seen values in the range of 10 to 200 ns/op when running it repeatedly. It can also fail to report any ns/op time at all while still passing when I run it; I have no idea what is going on there:

BenchmarkChannelMakeCloseRead-24         5724488                54.26 ns/op           96 B/op          1 allocs/op
BenchmarkUnnestedFunctions-24            3992061                57.49 ns/op           96 B/op          1 allocs/op
BenchmarkNestedFunctions-24               200000               0 B/op          0 allocs/op
BenchmarkNestedFunctions2-24            154956972                1.590 ns/op           0 B/op          0 allocs/op
BenchmarkNestedGoRoutines-24             1000000               342.1 ns/op            24 B/op          1 allocs/op

If anyone can point out my mistake in the benchmark, or in my interpretation of the results, and explain what is really happening, that would be greatly appreciated.

Background info:

  1. Stack growth and function inlining: https://dave.cheney.net/2020/04/25/inlining-optimisations-in-go
  2. Stack growth limitations: https://dave.cheney.net/2013/06/02/why-is-a-goroutines-stack-infinite
  3. Golang stack structure: https://blog.cloudflare.com/how-stacks-are-handled-in-go/
  4. Branch prediction: https://en.wikipedia.org/wiki/Branch_predictor
  5. Top level 3900X architecture overview: https://www.techpowerup.com/review/amd-ryzen-9-3900x/3.html
  6. 3900X branch prediction history/buffer size 16/512/7k: https://www.techpowerup.com/review/amd-ryzen-9-3900x/images/arch3.jpg
