Advice on how to speed up a distance calculation

Posted on 2024-10-03 06:34:31

Consider the following class:

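# Note: var() is not defined in the post; presumably a variance function such as numpy.var.
class SquareErrorDistance(object):
    def __init__(self, dataSample):
        variance = var(list(dataSample))
        if variance == 0:
            # Constant dimension: skip normalization to avoid dividing by zero.
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, u, v):  # u and v are floats
        return (u - v) ** 2 * self._norm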

I use it to calculate the distance between two elements of a vector. I create one instance of this class for every dimension of the vector that uses this distance measure (other dimensions use other distance measures). Profiling shows that the __call__ method of this class accounts for 90% of the running time of my k-NN implementation (who would have thought). I do not think there is any pure-Python way to speed this up, but maybe if I implement it in C?

If I run a simple C program that just calculates distances for random values using the formula above, it is orders of magnitude faster than Python. So I tried using ctypes to call a C function that does the computation, but apparently converting the parameters and return values is far too expensive, because the resulting code is much slower.

I could of course implement the entire k-NN in C and just call that, but the problem is that, as I described, I use different distance functions for some dimensions of the vectors, and translating these to C would be too much work.

So what are my alternatives? Will writing the C function using the Python C-API get rid of the overhead? Are there any other ways to speed this calculation up?

Comments (2)

空城旧梦 2024-10-10 06:34:31

Try the following Cython code. (The first line of __init__ is different: I replaced it with a placeholder because I don't know what var is, and it doesn't matter anyway, since you stated that __call__ is the bottleneck.)

cdef class SquareErrorDistance:
    cdef double _norm

    def __init__(self, dataSample):
        variance = round(sum(dataSample)/len(dataSample))
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, double u, double v): # u and v are floats
        return (u - v) ** 2 * self._norm

Compiled via a simple setup.py (just the example from the docs with the file name altered), it performs nearly 20 times better than the equivalent pure Python in a simple, contrived timeit benchmark. Note that the only changes were the cdef declarations for the _norm field and the typed __call__ parameters. I consider this pretty impressive.
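The benchmark itself is not shown in the answer. A minimal sketch of what the pure-Python side of such a timeit comparison might look like follows; `statistics.pvariance` stands in for the unspecified `var`, and `seconds_per_call` is a hypothetical helper, so treat both as assumptions rather than what was actually measured.

```python
import timeit
from statistics import pvariance  # assumption: stand-in for the post's unspecified var()


class SquareErrorDistance(object):
    """Pure-Python baseline, as in the question."""

    def __init__(self, dataSample):
        variance = pvariance(list(dataSample))
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, u, v):  # u and v are floats
        return (u - v) ** 2 * self._norm


def seconds_per_call(dist, number=200000):
    """Average wall-clock seconds per dist(u, v) call (hypothetical helper).

    The lambda wrapper adds its own call overhead, so the figure is only
    useful for comparing implementations run through the same harness.
    """
    total = timeit.timeit(lambda: dist(1.5, 4.0), number=number)
    return total / number


baseline = SquareErrorDistance([1.0, 2.0, 3.0, 4.0])
print("pure Python: %.3e s per call" % seconds_per_call(baseline))
```

Passing an instance of the compiled Cython class to the same helper would give a like-for-like number; the actual ratio will depend on the machine and the Python and Cython versions.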

苏辞 2024-10-10 06:34:31

This probably won't help much, but you can rewrite it using nested functions:

def SquareErrorDistance(dataSample):
    variance = var(list(dataSample))
    if variance == 0:
        def f(u, v):
            x = u - v
            return x * x
    else:
        norm = 1.0 / (2 * variance)
        def f(u, v):
            x = u - v
            return x * x * norm
    return f
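As a usage sketch, the factory above can be exercised as shown here. The function is renamed `square_error_distance` for the example, and `statistics.pvariance` again stands in for the unspecified `var`; both are assumptions. The closure reads `norm` from its enclosing scope instead of doing an attribute lookup on `self`, and it multiplies `x * x` instead of using `** 2`, which is where any saving would come from.

```python
from statistics import pvariance  # assumption: stand-in for the post's unspecified var()


def square_error_distance(dataSample):
    """Return a distance function specialised for this data sample."""
    variance = pvariance(list(dataSample))
    if variance == 0:
        # Constant dimension: no normalisation factor is needed.
        def f(u, v):
            x = u - v
            return x * x
    else:
        norm = 1.0 / (2 * variance)

        def f(u, v):
            x = u - v
            return x * x * norm
    return f


# One closure per dimension, mirroring one class instance per dimension.
dist = square_error_distance([1.0, 2.0, 3.0, 4.0])
flat = square_error_distance([2.0, 2.0, 2.0])  # zero variance branch

print(dist(1.5, 4.0))
print(flat(1.0, 3.0))
```

Because the returned object is a plain callable, it can replace an instance of the original class anywhere the k-NN code calls `distance(u, v)`.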