Performance difference when looping over an array of structs
I have recently started using Julia to speed up some code previously written in Python. I only have prior experience with Python, so this is my first time caring about performance, and I have found some strange behavior when looping over an array of structs. I am defining a new struct Gaussian, which represents a 2D Gaussian function, and a function intensity() which calculates the amplitude of the function at a given position:
struct Gaussian{T<:Float32}
x0::T
y0::T
A::T
a::T
b::T
c::T
end
function intensity(
model::Gaussian,
x::Float32,
y::Float32
)
gaussian_value::Float32 = model.A*exp(
-(
model.a * (x - model.x0)^2 +
2 * model.b * (x - model.x0) * (y - model.y0) +
model.c * (y - model.y0)^2
)
)
return gaussian_value
end
Then, I make an array of 2000 random instances of Gaussian:
function build_array()
length = 2000
random_pos = [rand(Float32, (1, 2)) for i in 1:length]
random_A = rand(Float32, (length, 1))
random_a = rand(Float32, (length, 1))
random_b = rand(Float32, (length, 1))
random_c = rand(Float32, (length, 1));
gaussians::Array{Gaussian} = []
for (pos, A, a, b, c) in zip(
random_pos,
random_A,
random_a,
random_b,
random_c
)
new_gaussian = Gaussian(pos..., A, a, b, c)
push!(gaussians, new_gaussian)
end
return gaussians
end
gaussians = build_array()
When I benchmark a single call to the intensity() function, it takes ~100 ns with 1 allocation (makes sense). I would expect that looping over the array of Gaussians should then take 2000 × 100 ns = 200 μs. However, it actually takes about twice as long:
function total_intensity1(gaussian_list::Array{Gaussian})
total = sum(intensity.(gaussian_list, Float32(0.75), Float32(0.11)))
end
function total_intensity2(gaussian_list::Array{Gaussian})
total::Float32 = 0.
for gaussian in gaussian_list
total += intensity(gaussian, Float32(0.75), Float32(0.11))
end
return total
end
@btime sum(intensity.(gaussians, Float32(0.75), Float32(0.11)))
@btime begin
total::Float32 = 0.
for gauss in gaussians
total += intensity(gauss, Float32(0.75), Float32(0.11))
end
total
end
@btime total_intensity1(gaussians)
@btime total_intensity2(gaussians)
397.700 μs (16004 allocations: 258.02 KiB)
285.800 μs (8980 allocations: 234.06 KiB)
396.100 μs (16002 allocations: 257.95 KiB)
396.700 μs (16001 allocations: 250.02 KiB)
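For reference, the single-call benchmark was along these lines (a sketch; BenchmarkTools loaded, exact invocation assumed):
using BenchmarkTools
@btime intensity($(gaussians[1]), Float32(0.75), Float32(0.11))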
The number of allocations is also much larger than I would expect, and there is a difference between the second and fourth methods even though the code is pretty much the same. My questions:
- Where do these differences come from?
- How can I improve the performance of the code?
EDIT:
For reference, I ended up changing my code to the following:
struct Gaussian
x0::Float32
y0::Float32
A::Float32
a::Float32
b::Float32
c::Float32
end
function build_array()
N = 2000
random_pos = [rand(Float32, (1, 2)) for i in 1:N]
random_A = rand(Float32, N)
random_a = rand(Float32, N)
random_b = rand(Float32, N)
random_c = rand(Float32, N);
gaussians = Gaussian[]
for (pos, A, a, b, c) in zip(
random_pos,
random_A,
random_a,
random_b,
random_c
)
new_gaussian = Gaussian(pos..., A, a, b, c)
push!(gaussians, new_gaussian)
end
return gaussians
end
gaussians = build_array()
function intensity(
model::Gaussian,
x,
y
)
(;x0, y0, A, a, b, c) = model
A*exp(-(a * (x - x0)^2 + 2 * b * (x - x0) * (y - y0) + c * (y - y0)^2))
end
function total_intensity(gaussian_list::Vector{<:Gaussian})
total = sum(g->intensity(g, Float32(0.75), Float32(0.11)), gaussian_list)
end
@btime total_intensity($gaussians)
Which runs much faster:
10.900 μs (0 allocations: 0 bytes)
Thank you to Nils Gudat and DNF for their suggestions!
2 Answers
TLDR version: Vector{Gaussian} should be Vector{Gaussian{Float32}}.

Your struct definition Gaussian{T<:Float32} is somewhat nonsensical. Float32 cannot have any subtypes, so T can only be Float32. Therefore, either remove the restriction, replace it with something else (e.g. Real), or just take away the type parameter entirely.

This is bad:

gaussians::Array{Gaussian} = []
It creates a Vector{Any}, which is then converted to a Vector{Gaussian}. Worse, Vector{Gaussian} is not a Vector{Gaussian{Float32}}. So either remove the whole type parameter, or make sure to use it, for example:

gaussians = Gaussian{Float32}[]

Same here, bad style:

total::Float32 = 0.

Do this instead:

total = 0.0f0

In dynamic languages, types belong to values, not to variables.
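For instance, a minimal sketch of a working parametric definition (keeping the question's field names):

struct Gaussian{T<:Real}
    x0::T
    y0::T
    A::T
    a::T
    b::T
    c::T
end

g = Gaussian(0.1f0, 0.2f0, 1.0f0, 0.3f0, 0.4f0, 0.5f0)
typeof(g)  # Gaussian{Float32} -- the parameter is inferred from the values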
BTW, you'll have to modify some of your function definitions:

function total_intensity1(gaussian_list::Array{Gaussian})

should be

function total_intensity1(gaussian_list::Array{<:Gaussian})

There's more, but this is a start.
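For reference, the underlying reason is that parametric types in Julia are invariant:

Vector{Gaussian{Float32}} <: Vector{Gaussian}    # false
Vector{Gaussian{Float32}} <: Vector{<:Gaussian}  # true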
Edit: OK, a few more things:

rand(Float32, (length, 1))

length is a super important function in Base, so it's normally good not to shadow it like this. And, make vectors instead of matrices:

rand(Float32, (N, 1))  # this is an Nx1 matrix
rand(Float32, N)       # this is a length-N vector

push!(gaussians, new_gaussian)

This iteratively resizes the vector over and over. When you know the size of the vector, as in your case, it is better to pre-allocate:

gaussians = Vector{Gaussian{Float32}}(undef, 2000)
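Putting these together, a sketch of build_array with preallocation (random fields as in the question):

function build_array(N = 2000)
    gaussians = Vector{Gaussian{Float32}}(undef, N)
    for i in 1:N
        # draw each field directly; no temporary matrices, no resizing
        gaussians[i] = Gaussian(rand(Float32), rand(Float32), rand(Float32),
                                rand(Float32), rand(Float32), rand(Float32))
    end
    return gaussians
end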
You can avoid an unnecessary allocation here:

total = sum(intensity.(gaussian_list, Float32(0.75), Float32(0.11)))

like this:

total = sum(g -> intensity(g, 0.75f0, 0.11f0), gaussian_list)

Explanation: sum(f.(x)) first creates the array f.(x), then sums it, while sum(f, x) just applies f to each element before adding it to the sum.

Here's an implementation with benchmarks:
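A minimal sketch of such an implementation (using the Gaussian{T<:Real} struct sketched above; constants are the question's):

function intensity(model::Gaussian, x, y)
    (; x0, y0, A, a, b, c) = model
    return A * exp(-(a * (x - x0)^2 + 2 * b * (x - x0) * (y - y0) + c * (y - y0)^2))
end

function total_intensity(gaussian_list::Vector{<:Gaussian})
    return sum(g -> intensity(g, 0.75f0, 0.11f0), gaussian_list)
end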
When benchmarking, remember to interpolate variables and avoid global scope. There's a slight difference in the final sums, since summing a vector uses a numerically superior method called pairwise summation.
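For instance, the difference can be seen by comparing against a strictly left-to-right sum:

v = rand(Float32, 2000)
sum(v)       # pairwise summation
foldl(+, v)  # naive left-to-right accumulation; may differ in the last bits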
Final bonus: Try Tullio.jl, which also uses multithreading. It doesn't make any difference for 2000 elements, but it does for larger arrays (using 12 threads here):
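A sketch of what that might look like (assuming the intensity function above; Julia needs to be started with threads, e.g. julia -t 12, for the multithreading to kick in):

using Tullio

function total_intensity_tullio(gaussian_list)
    @tullio total := intensity(gaussian_list[i], 0.75f0, 0.11f0)
    return total
end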
I don't have time to figure this one out in detail unfortunately, but the first thing I'd say is to check whether this is a benchmarking artifact - gaussians is a global variable, which should be interpolated into the benchmark using $.
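That is, something like:

using BenchmarkTools
@btime total_intensity2($gaussians)  # $ interpolates the global into the benchmark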
As to your function, the type annotations are not doing anything for performance here, and they will make your function less composable (e.g. you won't be able to autodiff through it, given you're restricting everything to Float32). Here's how I would write it:
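Something along these lines (a sketch: the struct parameter left unrestricted and the argument annotations dropped, so the function stays generic):

struct Gaussian{T}
    x0::T
    y0::T
    A::T
    a::T
    b::T
    c::T
end

function intensity(model::Gaussian, x, y)
    model.A * exp(-(model.a * (x - model.x0)^2 +
                    2 * model.b * (x - model.x0) * (y - model.y0) +
                    model.c * (y - model.y0)^2))
end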
With that I'm getting timings about 100 μs faster than your original version on my machine.