Suggestions on how to speed up a distance calculation

Posted 2024-10-03 06:34:31

Consider the following class:

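from numpy import var  # assumption: the post never says where var comes from

class SquareErrorDistance(object):
    def __init__(self, dataSample):
        variance = var(list(dataSample))
        if variance == 0:
            # Avoid dividing by zero when all samples are identical.
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, u, v): # u and v are floats
        return (u - v) ** 2 * self._norm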

I use it to calculate the distance between two elements of a vector. I create one instance of this class for every dimension of the vector that uses this distance measure (other dimensions use other distance measures). Profiling reveals that the __call__ method of this class accounts for 90% of the running time of my kNN implementation (who would have thought). I do not think there is any pure-Python way to speed this up, but maybe if I implement it in C?
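To make the setup concrete, here is a minimal sketch of how per-dimension metrics like this might be combined into one vector distance. It is only an illustration under assumptions: `statistics.pvariance` stands in for the unspecified `var`, and `absolute_distance` and `vector_distance` are hypothetical names, not code from my implementation.

```python
from statistics import pvariance  # assumption: stand-in for the post's var()


class SquareErrorDistance(object):
    def __init__(self, dataSample):
        variance = pvariance(list(dataSample))
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, u, v):  # u and v are floats
        return (u - v) ** 2 * self._norm


def absolute_distance(u, v):
    # Hypothetical second metric, standing in for the "other distance measures".
    return abs(u - v)


def vector_distance(metrics, a, b):
    # One metric call per dimension; this is the loop that hits __call__.
    return sum(m(x, y) for m, x, y in zip(metrics, a, b))


# Toy training data: three samples with two dimensions each.
samples = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
metrics = [
    SquareErrorDistance(s[0] for s in samples),  # dimension 0
    absolute_distance,                           # dimension 1
]
d = vector_distance(metrics, (1.0, 10.0), (3.0, 30.0))
print(d)
```

Every pair of vectors compared costs one Python-level call per dimension, which is why the per-call overhead dominates the profile.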

If I run a simple C program that just calculates distances for random values using the formula above, it is orders of magnitude faster than Python. So I tried using ctypes to call a C function that does the computation, but apparently converting the parameters and return values is far too expensive, because the resulting code is much slower.

I could of course implement the entire kNN in C and just call that, but the problem is that, as I described, I use different distance functions for some dimensions of the vectors, and translating these to C would be too much work.

So what are my alternatives? Will writing the C function using the Python C-API get rid of the overhead? Are there any other ways to speed this calculation up?

空城旧梦 2024-10-10 06:34:31

The following Cython code (I realize the first line of __init__ is different; I replaced it with random stuff because I don't know var, and it doesn't matter anyway since you stated that __call__ is the bottleneck):

cdef class SquareErrorDistance:
    cdef double _norm

    def __init__(self, dataSample):
        variance = round(sum(dataSample)/len(dataSample))
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, double u, double v): # u and v are floats
        return (u - v) ** 2 * self._norm

Compiled via a simple setup.py (just the example from the docs with the file name altered), it performs nearly 20 times better than the equivalent pure Python in a simple contrived timeit benchmark. Note that the only changes were the cdefs for the _norm field and the __call__ parameters. I consider this pretty impressive.
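The benchmark itself is not shown above, so the following is only a guess at its general shape for the pure Python side. As a simplifying assumption, the norm is passed straight into the constructor, since only __call__ is being timed; the number of calls is arbitrary.

```python
import timeit


class SquareErrorDistance(object):
    def __init__(self, norm):
        # Assumption: skip the variance computation, only __call__ is timed.
        self._norm = norm

    def __call__(self, u, v):  # u and v are floats
        return (u - v) ** 2 * self._norm


dist = SquareErrorDistance(0.5)

# Time 100,000 calls; the absolute figure depends on machine and interpreter.
elapsed = timeit.timeit("dist(1.5, 4.0)", globals={"dist": dist}, number=100000)
print("pure Python: %.4f s for 100,000 calls" % elapsed)
```

The compiled class would be timed the same way after importing it from the built extension module, and the ratio of the two timings gives the speedup.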

苏辞 2024-10-10 06:34:31

This probably won't help much, but you can rewrite it using nested functions:

def SquareErrorDistance(dataSample):
    variance = var(list(dataSample))
    if variance == 0:
        def f(u, v):
            x = u - v
            return x * x
    else:
        norm = 1.0 / (2 * variance)
        def f(u, v):
            x = u - v
            return x * x * norm
    return f
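As a quick sanity check, the closure version can be compared against the original class on a few inputs. This sketch assumes `statistics.pvariance` as a stand-in for `var`, and the names `SquareErrorDistanceClass` and `make_square_error_distance` are made up here so both variants can live in one file.

```python
from statistics import pvariance  # assumption: stand-in for the post's var()


class SquareErrorDistanceClass(object):
    # The class-based version from the question.
    def __init__(self, dataSample):
        variance = pvariance(list(dataSample))
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, u, v):
        return (u - v) ** 2 * self._norm


def make_square_error_distance(dataSample):
    # The nested-function version: norm is a closure variable, so each call
    # avoids the attribute lookup on self and the ** operator.
    variance = pvariance(list(dataSample))
    if variance == 0:
        def f(u, v):
            x = u - v
            return x * x
    else:
        norm = 1.0 / (2 * variance)
        def f(u, v):
            x = u - v
            return x * x * norm
    return f


sample = [1.0, 2.0, 4.0, 7.0]
by_class = SquareErrorDistanceClass(sample)
by_closure = make_square_error_distance(sample)

pairs = [(0.0, 1.0), (2.5, -1.5), (3.0, 3.0)]
agree = all(abs(by_class(u, v) - by_closure(u, v)) < 1e-12 for u, v in pairs)
print(agree)
```

Whether the closure is measurably faster is best checked with timeit on the actual workload.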
