Performance difference when looping over an array of structs
I have recently started using Julia to speed up some code previously written in Python. I only have prior experience with Python, so this is my first time caring about performance, and I have found some strange behavior when looping over an array of structs. I am defining a new struct Gaussian, which represents a 2D Gaussian function, and a function intensity() which calculates the amplitude of the function at a given position:
struct Gaussian{T<:Float32}
x0::T
y0::T
A::T
a::T
b::T
c::T
end
function intensity(
model::Gaussian,
x::Float32,
y::Float32
)
gaussian_value::Float32 = model.A*exp(
-(
model.a * (x - model.x0)^2 +
2 * model.b * (x - model.x0) * (y - model.y0) +
model.c * (y - model.y0)^2
)
)
return gaussian_value
end
Then, I make an array of 2000 random instances of Gaussian:
function build_array()
length = 2000
random_pos = [rand(Float32, (1, 2)) for i in 1:length]
random_A = rand(Float32, (length, 1))
random_a = rand(Float32, (length, 1))
random_b = rand(Float32, (length, 1))
random_c = rand(Float32, (length, 1));
gaussians::Array{Gaussian} = []
for (pos, A, a, b, c) in zip(
random_pos,
random_A,
random_a,
random_b,
random_c
)
new_gaussian = Gaussian(pos..., A, a, b, c)
push!(gaussians, new_gaussian)
end
return gaussians
end
gaussians = build_array()
When I benchmark a single call to the intensity() function, it takes ~100 ns with 1 allocation (makes sense). I would expect that looping over the array of Gaussians should then take 2000 × 100 ns = 200 μs. However, it actually takes about twice as long:
function total_intensity1(gaussian_list::Array{Gaussian})
total = sum(intensity.(gaussian_list, Float32(0.75), Float32(0.11)))
end
function total_intensity2(gaussian_list::Array{Gaussian})
total::Float32 = 0.
for gaussian in gaussian_list
total += intensity(gaussian, Float32(0.75), Float32(0.11))
end
return total
end
@btime sum(intensity.(gaussians, Float32(0.75), Float32(0.11)))
@btime begin
total::Float32 = 0.
for gauss in gaussians
total += intensity(gauss, Float32(0.75), Float32(0.11))
end
total
end
@btime total_intensity1(gaussians)
@btime total_intensity2(gaussians)
397.700 μs (16004 allocations: 258.02 KiB)
285.800 μs (8980 allocations: 234.06 KiB)
396.100 μs (16002 allocations: 257.95 KiB)
396.700 μs (16001 allocations: 250.02 KiB)
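For reference, the single-call benchmark was along these lines (a sketch; BenchmarkTools loaded, exact invocation assumed):
using BenchmarkTools
@btime intensity($(gaussians[1]), Float32(0.75), Float32(0.11))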
The number of allocations is also much larger than I would expect, and there is a difference between the second and fourth methods even though the code is pretty much the same. My questions:
- Where do these differences come from?
- How can I improve the performance of the code?
EDIT:
For reference, I ended up changing my code to the following:
struct Gaussian
x0::Float32
y0::Float32
A::Float32
a::Float32
b::Float32
c::Float32
end
function build_array()
N = 2000
random_pos = [rand(Float32, (1, 2)) for i in 1:N]
random_A = rand(Float32, N)
random_a = rand(Float32, N)
random_b = rand(Float32, N)
random_c = rand(Float32, N);
gaussians = Gaussian[]
for (pos, A, a, b, c) in zip(
random_pos,
random_A,
random_a,
random_b,
random_c
)
new_gaussian = Gaussian(pos..., A, a, b, c)
push!(gaussians, new_gaussian)
end
return gaussians
end
gaussians = build_array()
function intensity(
model::Gaussian,
x,
y
)
(;x0, y0, A, a, b, c) = model
A*exp(-(a * (x - x0)^2 + 2 * b * (x - x0) * (y - y0) + c * (y - y0)^2))
end
function total_intensity(gaussian_list::Vector{<:Gaussian})
total = sum(g->intensity(g, Float32(0.75), Float32(0.11)), gaussian_list)
end
@btime total_intensity($gaussians)
Which runs much faster:
10.900 μs (0 allocations: 0 bytes)
Thank you to Nils Gudat and DNF for their suggestions!
2 Answers
TLDR version: Vector{Gaussian} should be Vector{Gaussian{Float32}}.

Your struct definition Gaussian{T<:Float32} is somewhat nonsensical. Float32 cannot have any subtypes, so T can only be Float32. Therefore, either remove the restriction, replace it with something else (e.g. Real), or just take away the type parameter entirely.

This is bad:

gaussians::Array{Gaussian} = []
It creates a Vector{Any}, which is then converted to a Vector{Gaussian}. Worse, Vector{Gaussian} is not a Vector{Gaussian{Float32}}. So either remove the whole type parameter, or make sure to use it, for example:

gaussians = Gaussian{Float32}[]

Same here, bad style:

total::Float32 = 0.

Do this instead:

total = 0.0f0

In dynamic languages, types belong to values, not to variables.
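For instance, a minimal sketch of a working parametric definition (keeping the question's field names):

struct Gaussian{T<:Real}
    x0::T
    y0::T
    A::T
    a::T
    b::T
    c::T
end

g = Gaussian(0.1f0, 0.2f0, 1.0f0, 0.3f0, 0.4f0, 0.5f0)
typeof(g)  # Gaussian{Float32} -- the parameter is inferred from the values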
BTW, you'll have to modify some of your function definitions:

function total_intensity1(gaussian_list::Array{Gaussian})

should be

function total_intensity1(gaussian_list::Array{<:Gaussian})

There's more, but this is a start.
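For reference, the underlying reason is that parametric types in Julia are invariant:

Vector{Gaussian{Float32}} <: Vector{Gaussian}    # false
Vector{Gaussian{Float32}} <: Vector{<:Gaussian}  # true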
Edit: OK, a few more things:

rand(Float32, (length, 1))

length is a super important function in Base, so it's normally good not to shadow it like this. And, make vectors instead of matrices:

rand(Float32, (N, 1))  # this is an Nx1 matrix
rand(Float32, N)       # this is a length-N vector

push!(gaussians, new_gaussian)

This iteratively resizes the vector over and over. When you know the size of the vector, as in your case, it is better to pre-allocate:

gaussians = Vector{Gaussian{Float32}}(undef, 2000)
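Putting these together, a sketch of build_array with preallocation (random fields as in the question):

function build_array(N = 2000)
    gaussians = Vector{Gaussian{Float32}}(undef, N)
    for i in 1:N
        # draw each field directly; no temporary matrices, no resizing
        gaussians[i] = Gaussian(rand(Float32), rand(Float32), rand(Float32),
                                rand(Float32), rand(Float32), rand(Float32))
    end
    return gaussians
end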
You can avoid an unnecessary allocation here:

total = sum(intensity.(gaussian_list, Float32(0.75), Float32(0.11)))

like this:

total = sum(g -> intensity(g, 0.75f0, 0.11f0), gaussian_list)

Explanation: sum(f.(x)) first creates the array f.(x), then sums it, while sum(f, x) just applies f to each element before adding it to the sum.

Here's an implementation with benchmarks:
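A minimal sketch of such an implementation (using the Gaussian{T<:Real} struct sketched above; constants are the question's):

function intensity(model::Gaussian, x, y)
    (; x0, y0, A, a, b, c) = model
    return A * exp(-(a * (x - x0)^2 + 2 * b * (x - x0) * (y - y0) + c * (y - y0)^2))
end

function total_intensity(gaussian_list::Vector{<:Gaussian})
    return sum(g -> intensity(g, 0.75f0, 0.11f0), gaussian_list)
end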
When benchmarking, remember to interpolate variables and avoid global scope. There's a slight difference in the final sums, since summing a vector uses a numerically superior method called pairwise summation.
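For instance, the difference can be seen by comparing against a strictly left-to-right sum:

v = rand(Float32, 2000)
sum(v)       # pairwise summation
foldl(+, v)  # naive left-to-right accumulation; may differ in the last bits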
Final bonus: Try Tullio.jl, which also uses multithreading. It doesn't make any difference for 2000 elements, but it does for larger arrays (using 12 threads here):
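A sketch of what that might look like (assuming the intensity function above; Julia needs to be started with threads, e.g. julia -t 12, for the multithreading to kick in):

using Tullio

function total_intensity_tullio(gaussian_list)
    @tullio total := intensity(gaussian_list[i], 0.75f0, 0.11f0)
    return total
end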
I don't have time to figure this one out in detail unfortunately, but the first thing I'd say is to check whether this is a benchmarking artifact - gaussians is a global variable, which should be interpolated into the benchmark using $.
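That is, something like:

using BenchmarkTools
@btime total_intensity2($gaussians)  # $ interpolates the global into the benchmark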
As to your function, the type annotations are not doing anything for performance here, and they will make your function less composable (e.g. you won't be able to autodiff through it, given you're restricting everything to Float32). Here's how I would write it:
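Something along these lines (a sketch: the struct parameter left unrestricted and the argument annotations dropped, so the function stays generic):

struct Gaussian{T}
    x0::T
    y0::T
    A::T
    a::T
    b::T
    c::T
end

function intensity(model::Gaussian, x, y)
    model.A * exp(-(model.a * (x - model.x0)^2 +
                    2 * model.b * (x - model.x0) * (y - model.y0) +
                    model.c * (y - model.y0)^2))
end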
With that I'm getting timings about 100 μs faster than your original version on my machine.