皮尔逊相似度评分,我如何进一步优化?

发布于 2024-08-02 06:05:09 字数 804 浏览 2 评论 0原文

我有一个皮尔逊相似度分数的实现,用于比较两个值字典。此方法花费的时间比其他任何地方都多(可能有数百万次调用),因此这显然是需要优化的关键方法。

即使是最轻微的优化也可能对我的代码产生很大的影响,因此我热衷于探索即使是最小的改进。

这是我到目前为止所拥有的:

def simple_pearson(v1,v2):

    si = [val for val in v1 if val in v2]

    n = len(si)

    if n==0: return 0.0

    sum1 = 0.0
    sum2 = 0.0
    sum1_sq = 0.0
    sum2_sq = 0.0
    p_sum = 0.0

    for v in si:
        val_1 = v1[v]
        val_2 = v2[v]
        sum1+=val_1
        sum2+=val_2
        sum1_sq+=pow(val_1,2)
        sum2_sq+=pow(val_2,2)
        p_sum+=val_1*val_2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n)
    if temp < 0.0:
        temp = -temp
    den = sqrt(temp)
    if den==0: return 1.0

    r = num/den

    return r

I have an implemented of Pearson's Similarity score for comparing two dictionaries of values. More time is spent in this method than anywhere else (potentially many millions of calls), so this is clearly the critical method to optimise.

Even the slightest optimisation could have a big impact on my code, so I'm keen to explore even the smallest improvements.

Here's what I have so far:

def simple_pearson(v1,v2):

    si = [val for val in v1 if val in v2]

    n = len(si)

    if n==0: return 0.0

    sum1 = 0.0
    sum2 = 0.0
    sum1_sq = 0.0
    sum2_sq = 0.0
    p_sum = 0.0

    for v in si:
        val_1 = v1[v]
        val_2 = v2[v]
        sum1+=val_1
        sum2+=val_2
        sum1_sq+=pow(val_1,2)
        sum2_sq+=pow(val_2,2)
        p_sum+=val_1*val_2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n)
    if temp < 0.0:
        temp = -temp
    den = sqrt(temp)
    if den==0: return 1.0

    r = num/den

    return r

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

凉城已无爱 2024-08-09 06:05:09

真正的速度提升将通过迁移到 numpy 或 scipy 来获得。除此之外,还存在微优化:例如,x*xpow(x,2) 更快;您可以通过执行以下操作与键同时提取值,而不是:

si = [val for val in v1 if val in v2]

类似

vs = [ (v1[val],v2[val]) for val in v1 if val in v2]

然后

sum1 = sum(x for x, y in vs)

等等;这些是否都带来时间优势需要微基准测试。根据您如何使用这些系数,返回平方将为您节省一个开方(这与在几何中使用点之间距离的平方类似,而不是距离本身,并且出于同样的原因 - 为您节省一个开方;这是有道理的,因为系数是一个距离,有点......;-)。

The real speed increase would be gained by moving to numpy or scipy. Short of that, there are microoptimizations: e.g. x*x is faster than pow(x,2); you could extract the values at the same time as the keys by doing, instead of:

si = [val for val in v1 if val in v2]

something like

vs = [ (v1[val],v2[val]) for val in v1 if val in v2]

and then

sum1 = sum(x for x, y in vs)

and so on; whether each of these brings time advantage needs microbenchmarking. Depending on how you're using these coefficients returning the square would save you a sqrt (that's a similar idea to using squares of distances between points, in geometry, rather than the distances themselves, and for the same reason -- saves you a sqrt; which makes sense because the coefficient IS a distance, kinda...;-).

凉薄对峙 2024-08-09 06:05:09

Scipy 是最快的!

我已经使用上面的代码以及我在我的计算机上找到的版本进行了一些测试,请参阅下面的结果和代码:

pearson 14.7597990757
sim_pearson 15.6806837987
scipy:pearsonr 0.451986019188

try:
    import psyco
    psyco.full()
except ImportError:
    pass

from math import sqrt

def sim_pearson(set1, set2):
    si={}
    for item in set1:
        if item in set2:
            si[item] = 1

    #number of elements
    n = len(si)

    #if none common, return 0 similarity
    if n == 0: return 0

    #add up all the preferences
    sum1 = sum([set1[item] for item in si])
    sum2 = sum([set2[item] for item in si])

    #sum up the squares
    sum_sq1 = sum([pow(set1[item], 2) for item in si])
    sum_sq2 = sum([pow(set2[item], 2) for item in si])

    #sum up the products
    sum_p = sum([set1[item] * set2[item] for item in si])

    nom = sum_p - ((sum1 * sum2) / n )
    den = sqrt( (sum_sq1 - (sum1)**2 / n) * (sum_sq2 - (sum2)**2 / n) )

    if den==0: return 0
    return nom/den



# from http://stackoverflow.com/questions/1307016/pearson-similarity-score-how-can-i-optimise-this-further
def pearson(v1, v2):
    vs = [(v1[val],v2[val]) for val in v1 if val in v2]

    n = len(vs)

    if n==0: return 0.0

    sum1,sum2,sum1_sq,sum2_sq,p_sum = 0.0, 0.0, 0.0, 0.0, 0.0

    for v1,v2 in vs:
        sum1+=v1
        sum2+=v2
        sum1_sq+=v1*v1
        sum2_sq+=v2*v2
        p_sum+=v1*v2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = max((sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n),0)
    if temp:
        return num / sqrt(temp)
    return 1.0






if __name__ == "__main__":
    import timeit

    tsetup = """
from random import randrange
from __main__ import pearson, sim_pearson
from scipy.stats import pearsonr
v1 = [randrange(0,1000) for x in range(1000)]
v2 = [randrange(0,1000) for x in range(1000)]
#gc.enable()
"""
    t1 = timeit.Timer(stmt="pearson(v1,v2)", setup=tsetup)
    t2 = timeit.Timer(stmt="sim_pearson(v1,v2)", setup=tsetup)
    t3 = timeit.Timer(stmt="pearsonr(v1,v2)", setup=tsetup)

    tt = 1000

    print 'pearson', t1.timeit(tt)
    print 'sim_pearson', t2.timeit(tt)
    print 'scipy:pearsonr', t3.timeit(tt)

Scipy is the fastest!

I have don some tests with the code above and also with a version I found on my comp, see below for results and the code:

pearson 14.7597990757
sim_pearson 15.6806837987
scipy:pearsonr 0.451986019188

try:
    import psyco
    psyco.full()
except ImportError:
    pass

from math import sqrt

def sim_pearson(set1, set2):
    si={}
    for item in set1:
        if item in set2:
            si[item] = 1

    #number of elements
    n = len(si)

    #if none common, return 0 similarity
    if n == 0: return 0

    #add up all the preferences
    sum1 = sum([set1[item] for item in si])
    sum2 = sum([set2[item] for item in si])

    #sum up the squares
    sum_sq1 = sum([pow(set1[item], 2) for item in si])
    sum_sq2 = sum([pow(set2[item], 2) for item in si])

    #sum up the products
    sum_p = sum([set1[item] * set2[item] for item in si])

    nom = sum_p - ((sum1 * sum2) / n )
    den = sqrt( (sum_sq1 - (sum1)**2 / n) * (sum_sq2 - (sum2)**2 / n) )

    if den==0: return 0
    return nom/den



# from http://stackoverflow.com/questions/1307016/pearson-similarity-score-how-can-i-optimise-this-further
def pearson(v1, v2):
    vs = [(v1[val],v2[val]) for val in v1 if val in v2]

    n = len(vs)

    if n==0: return 0.0

    sum1,sum2,sum1_sq,sum2_sq,p_sum = 0.0, 0.0, 0.0, 0.0, 0.0

    for v1,v2 in vs:
        sum1+=v1
        sum2+=v2
        sum1_sq+=v1*v1
        sum2_sq+=v2*v2
        p_sum+=v1*v2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = max((sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n),0)
    if temp:
        return num / sqrt(temp)
    return 1.0






if __name__ == "__main__":
    import timeit

    tsetup = """
from random import randrange
from __main__ import pearson, sim_pearson
from scipy.stats import pearsonr
v1 = [randrange(0,1000) for x in range(1000)]
v2 = [randrange(0,1000) for x in range(1000)]
#gc.enable()
"""
    t1 = timeit.Timer(stmt="pearson(v1,v2)", setup=tsetup)
    t2 = timeit.Timer(stmt="sim_pearson(v1,v2)", setup=tsetup)
    t3 = timeit.Timer(stmt="pearsonr(v1,v2)", setup=tsetup)

    tt = 1000

    print 'pearson', t1.timeit(tt)
    print 'sim_pearson', t2.timeit(tt)
    print 'scipy:pearsonr', t3.timeit(tt)

忘年祭陌 2024-08-09 06:05:09

如果您可以使用 scipy,则可以使用 pearson 函数: http ://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#pearsonr

或者您可以从 http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats。 py(搜索 def pearson())。
在代码中,np 只是 numpy(代码确实import numpy as np)。

If you can use scipy, you could use the pearson function: http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#pearsonr

Or you could copy/paste the code (it has a liberal license) from http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py (search for def pearson()).
In the code np is just numpy (the code does import numpy as np).

止于盛夏 2024-08-09 06:05:09

我建议更改:

[val for val in v1 if val in v2]

to

set(v1) & set(v2)

do

if not n: return 0.0    # and similar for den

而不是

if n == 0: return 0.0

,并且值得将最后 6 行替换为:

try:
    return num / sqrt(abs(temp))
except ZeroDivisionError:
    return 1.0

I'd suggest changing:

[val for val in v1 if val in v2]

to

set(v1) & set(v2)

do

if not n: return 0.0    # and similar for den

instead of

if n == 0: return 0.0

and it's worth replacing last 6 lines with:

try:
    return num / sqrt(abs(temp))
except ZeroDivisionError:
    return 1.0
少跟Wǒ拽 2024-08-09 06:05:09

由于看起来您正在进行大量数值计算,因此您应该给出 Psyco一枪。它是一个 JIT 编译器,可以分析正在运行的代码并优化某些操作。安装它,然后在文件顶部放置:

try:
    import psyco
    psyco.full()
except ImportError:
    pass

这将启用 Psyco 的 JIT,并且应该会在某种程度上免费加速您的代码:)(实际上不是,它占用更多内存)

Since it looks like you're doing quite a bit of numeric computation you should give Psyco a shot. It's a JIT compiler that analyzes running code and optimizes certain operations. Install it, then at the top of your file put:

try:
    import psyco
    psyco.full()
except ImportError:
    pass

This will enable Psyco's JIT and should speed up your code somewhat, for free :) (actually not, it takes up more memory)

断桥再见 2024-08-09 06:05:09

如果任何数学函数的输入受到相当大的限制,则可以使用查找表而不是数学函数。这可以为您赢得一些性能(速度),但需要额外的内存来存储表。

If the inputs to any of your math functions are fairly constrained, you can use a lookup table instead of the math function. This can earn you some performance (speed) at the cost of extra memory to store the table.

能怎样 2024-08-09 06:05:09

我不确定这在 Python 中是否成立。但计算 sqrt 是一项处理器密集型计算。

您可能会寻求快速近似 牛顿

I'm not sure if this holds in Python. But calculating the sqrt is a processor intensive calculation.

You might go for a fast approximation newton

像你 2024-08-09 06:05:09

我将发布到目前为止我所得到的答案,以将其与问题区分开来。这是上述一些技术的组合,这些技术似乎已经给出了迄今为止最好的改进。

def pearson(v1,v2):
    vs = [(v1[val],v2[val]) for val in v1 if val in v2]

    n = len(vs)

    if n==0: return 0.0

    sum1,sum2,sum1_sq,sum2_sq,p_sum = 0.0, 0.0, 0.0, 0.0, 0.0

    for v1,v2 in vs:
        sum1+=v1
        sum2+=v2
        sum1_sq+=v1*v1
        sum2_sq+=v2*v2
        p_sum+=v1*v2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = max((sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n),0)
    if temp:
        return num / sqrt(temp)
    return 1.0

编辑:看起来 psyco 为这个版本提供了 15% 的改进,虽然不是很大,但足以证明它的使用是合理的。

I'll post what I've got so far as an answer to differentiate it from the question. This is a combination of some techniques described above that seem to have given the best improvement s far.

def pearson(v1,v2):
    vs = [(v1[val],v2[val]) for val in v1 if val in v2]

    n = len(vs)

    if n==0: return 0.0

    sum1,sum2,sum1_sq,sum2_sq,p_sum = 0.0, 0.0, 0.0, 0.0, 0.0

    for v1,v2 in vs:
        sum1+=v1
        sum2+=v2
        sum1_sq+=v1*v1
        sum2_sq+=v2*v2
        p_sum+=v1*v2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = max((sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n),0)
    if temp:
        return num / sqrt(temp)
    return 1.0

Edit: It looks like psyco gives a 15% improvment for this version which isn't massive but is enough to justify its use.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文