Pearson similarity score: how can I optimise it further?
I have an implementation of Pearson's similarity score for comparing two dictionaries of values. More time is spent in this method than anywhere else (potentially many millions of calls), so it is clearly the critical method to optimise.
Even the slightest optimisation could have a big impact on my code, so I'm keen to explore even the smallest improvements.
Here's what I have so far:
from math import sqrt

def simple_pearson(v1, v2):
    si = [val for val in v1 if val in v2]
    n = len(si)
    if n == 0: return 0.0
    sum1 = 0.0
    sum2 = 0.0
    sum1_sq = 0.0
    sum2_sq = 0.0
    p_sum = 0.0
    for v in si:
        val_1 = v1[v]
        val_2 = v2[v]
        sum1 += val_1
        sum2 += val_2
        sum1_sq += pow(val_1, 2)
        sum2_sq += pow(val_2, 2)
        p_sum += val_1 * val_2
    # Calculate Pearson score
    num = p_sum - (sum1 * sum2 / n)
    temp = (sum1_sq - pow(sum1, 2) / n) * (sum2_sq - pow(sum2, 2) / n)
    if temp < 0.0:
        temp = -temp
    den = sqrt(temp)
    if den == 0: return 1.0
    r = num / den
    return r
Answers (8)
The real speed increase would be gained by moving to numpy or scipy. Short of that, there are micro-optimisations: e.g. x*x is faster than pow(x, 2), and you could extract the values at the same time as the keys, instead of looking each one up again inside the loop; whether each of these brings a time advantage needs microbenchmarking. Depending on how you're using these coefficients, returning the square would save you a sqrt (that's a similar idea to using squares of distances between points in geometry, rather than the distances themselves, and for the same reason -- it saves you a sqrt; which makes sense because the coefficient IS a distance, kinda... ;-).
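A sketch of those two micro-optimisations (the names here are illustrative, not from the original post): pull both values while intersecting the keys, and work with the signed square of the coefficient to skip a sqrt when only comparisons are needed.

```python
# Extract both values at the same time as the key intersection,
# so the loop body needs no further dict lookups.
def shared_pairs(v1, v2):
    return [(v1[k], v2[k]) for k in v1 if k in v2]

# If callers only rank or compare coefficients, the squared value
# (signed, to preserve direction) avoids a sqrt per call.
def signed_square(num, den_sq):
    if den_sq == 0:
        return 1.0
    r_sq = (num * num) / den_sq
    return r_sq if num >= 0 else -r_sq
```

Whether either change actually pays off depends on the interpreter and the data, so microbenchmark before committing to it.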
Scipy is the fastest!
I have done some tests with the code above and also with a version I found on my computer; see below for the results and the code:
If you can use scipy, you could use the pearson function: http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#pearsonr
Or you could copy/paste the code (it has a liberal licence) from http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py (search for def pearson()). In the code, np is just numpy (the code does import numpy as np).
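For reference, a minimal call to scipy's function looks like this (guarded import, since scipy may not be installed; pearsonr returns the coefficient and a p-value):

```python
# scipy.stats.pearsonr takes two sequences of equal length and
# returns (r, p_value).  The guard keeps the snippet importable
# even where scipy is absent.
try:
    from scipy.stats import pearsonr
    r, p = pearsonr([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly correlated
except ImportError:
    r = None
```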
I'd suggest changing:
…
to
…
do
…
instead of
…
and it's worth replacing the last 6 lines with:
…
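The code snippets in the answer above did not survive, so the following is only a guess at what "replacing the last 6 lines" might look like: folding the tail of the original function into a single expression, with abs() standing in for the manual sign flip before the sqrt.

```python
from math import sqrt

# Illustrative only: the final six lines of the original function
# (num, temp, the sign flip, den, the zero check, and the return)
# collapsed into one expression.
def pearson_tail(p_sum, sum1, sum2, sum1_sq, sum2_sq, n):
    num = p_sum - sum1 * sum2 / n
    den = sqrt(abs((sum1_sq - sum1 * sum1 / n) * (sum2_sq - sum2 * sum2 / n)))
    return 1.0 if den == 0 else num / den
```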
Since it looks like you're doing quite a bit of numeric computation, you should give Psyco a shot. It's a JIT compiler that analyses running code and optimises certain operations. Install it, then at the top of your file put:
This will enable Psyco's JIT and should speed up your code somewhat, for free :) (well, not quite free -- it takes up more memory).
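The lines the answer refers to are presumably the standard Psyco-enabling idiom; Psyco is Python-2-only, so the import is guarded here to keep the snippet harmless elsewhere:

```python
# Standard Psyco idiom from the Python 2 era; the guarded import
# makes the JIT optional on interpreters where Psyco does not exist.
try:
    import psyco
    psyco.full()  # JIT-compile every function in the program
    psyco_enabled = True
except ImportError:
    psyco_enabled = False
```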
If the inputs to any of your math functions are fairly constrained, you can use a lookup table instead of the math function. This can gain you some performance (speed) at the cost of the extra memory needed to store the table.
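As a hypothetical illustration of that idea: if the dictionary values were, say, integer ratings in a small fixed range, the repeated pow() calls could be replaced by a precomputed table (the range 0..5 here is an assumption, not from the original post).

```python
# Precomputed squares for integer inputs 0..5; indexing a list is
# cheaper than calling pow() for each value.
SQUARES = [x * x for x in range(6)]

def square(x):
    return SQUARES[x]  # table lookup instead of computing x*x
```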
I'm not sure whether this holds in Python, but calculating the sqrt is a processor-intensive calculation.
You might go for a fast approximation such as Newton's method.
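A minimal sketch of that approximation (whether it beats math.sqrt in CPython is doubtful, since math.sqrt is a C call, so benchmark before using it):

```python
# Newton's method for sqrt(x): repeatedly average the guess with
# x/guess.  A handful of iterations gives a rough approximation;
# more iterations converge quickly to the true root.
def approx_sqrt(x, iterations=3):
    if x == 0.0:
        return 0.0
    guess = x
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess
```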
I'll post what I've got so far as an answer, to differentiate it from the question. This is a combination of some of the techniques described above, which seem to have given the best improvement so far.
Edit: it looks like psyco gives a 15% improvement for this version, which isn't massive but is enough to justify its use.
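The poster's final code did not survive; the sketch below only illustrates how the suggestions above combine (one-pass value extraction, x*x instead of pow, abs() instead of the manual sign flip), and is not the poster's exact version.

```python
from math import sqrt

# Combined micro-optimisations: extract both values while
# intersecting the keys, use x*x rather than pow(x, 2), and fold
# the denominator computation into one expression.
def simple_pearson(v1, v2):
    pairs = [(v1[k], v2[k]) for k in v1 if k in v2]
    n = len(pairs)
    if n == 0:
        return 0.0
    sum1 = sum(a for a, b in pairs)
    sum2 = sum(b for a, b in pairs)
    sum1_sq = sum(a * a for a, b in pairs)
    sum2_sq = sum(b * b for a, b in pairs)
    p_sum = sum(a * b for a, b in pairs)
    num = p_sum - sum1 * sum2 / n
    den = sqrt(abs((sum1_sq - sum1 * sum1 / n) * (sum2_sq - sum2 * sum2 / n)))
    return 1.0 if den == 0 else num / den
```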