Python: speeding up geographic comparisons
I've written some code that includes a nested loop where the inner loop is executed about 1.5 million times. I have a function in this loop that I'm trying to optimize. I've done some work, and got some results, but I need a little input to check if what I'm doing is sensible.
Some background:
I have two collections of geographic points (latitude, longitude), one relatively small collection and one relatively huge collection. For every point in the small collection, I need to find the closest point in the large collection.
The obvious way to do this would be to use the haversine formula. The benefit here is that the distances are definitely accurate.
from math import radians, sin, cos, asin, sqrt

def haversine(point1, point2):
    """Gives the distance in miles between two (lat, lon) points on earth."""
    earth_radius_miles = 3956
    # Convert both coordinates from degrees to radians.
    lat1, lon1 = (radians(coord) for coord in point1)
    lat2, lon2 = (radians(coord) for coord in point2)
    dlat, dlon = (lat2 - lat1, lon2 - lon1)
    # Haversine formula; min() guards against floating-point error
    # pushing sqrt(a) fractionally above 1.
    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    great_circle_distance = 2 * asin(min(1, sqrt(a)))
    d = earth_radius_miles * great_circle_distance
    return d
However, running this 1.5 million times takes about 9 seconds on my machine (according to timeit). Since an accurate distance is unimportant (I only need to find the closest point), I decided to try some other functions.
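A minimal timeit harness along those lines might look like this (the two test points are placeholders, not the actual data):

import timeit

# Hypothetical benchmark: ~1.5 million haversine calls, matching the
# inner-loop count described above. The two points are placeholders.
setup = "from __main__ import haversine"
stmt = "haversine((38.9, -77.0), (34.1, -118.2))"
print(timeit.timeit(stmt, setup=setup, number=1500000))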
A simple implementation of the Pythagorean theorem gives me a speedup of about 30%. Thinking that I can do better, I wrote the following:
def dumb(point1, point2):
    lat1, lon1 = point1
    lat2, lon2 = point2
    # Absolute value of the *sum* of the deltas (not a sum of absolute values).
    d = abs((lat2 - lat1) + (lon2 - lon1))
    return d
which gives me a factor of 10 improvement. However, now I'm worried that this will not preserve the triangle inequality.
So, my final question is twofold: I'd like to have a function that runs as fast as dumb but is still correct. Will dumb work? If not, any suggestions on how to improve my haversine function?
This is the kind of calculation that numpy is really good at. Rather than looping over the entire large set of coordinates, you can compute the distance between a single point and the entire dataset in a single calculation. With my tests below, you can get an order of magnitude speed increase.
Here are some timing tests with your haversine method, your dumb method (not really sure what that does), and my numpy haversine method, computing the distance between two points, one in Virginia and one in California, about 2293 miles apart.
The numpy method takes 1.55 seconds to compute the same number of distance calculations that take 44.24 seconds with your function-based method. You could probably get more of a speedup by combining some of the numpy functions into a single statement, but it would become a long, hard-to-read line.
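The original timing script is not reproduced here, but a minimal sketch of the vectorized approach it describes could look like this (array and function names are placeholders of my own):

import numpy as np

earth_radius_miles = 3956.0

def haversine_numpy(point, lats, lons):
    """Distances in miles from one (lat, lon) point to whole arrays of lats/lons."""
    lat1, lon1 = np.radians(point[0]), np.radians(point[1])
    lat2, lon2 = np.radians(lats), np.radians(lons)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return earth_radius_miles * 2.0 * np.arcsin(np.minimum(1.0, np.sqrt(a)))

# Usage: index of the closest large-collection point, in one vectorized call.
# lats = np.array([...]); lons = np.array([...])
# closest = np.argmin(haversine_numpy((37.5, -77.4), lats, lons))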
You could consider some kind of spatial hashing, i.e. find nearby candidate points fast and only then run the exact calculation on them. For example, you can create a uniform grid and distribute the points of the large collection into the bins the grid creates.

Then, given a point from the small collection, you'll only need to process a much smaller number of points (i.e. those in the relevant bins only), as in the sketch below.
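A minimal sketch of the grid idea (the bin size, the names, and the fixed 3x3 neighborhood are placeholders; a robust version would widen the search when the neighborhood is empty):

from collections import defaultdict

BIN_SIZE = 1.0  # degrees per grid cell; tune to the data density
# large_collection is assumed to be an iterable of (lat, lon) pairs.

def bin_key(point):
    lat, lon = point
    return (int(lat // BIN_SIZE), int(lon // BIN_SIZE))

# Distribute the large collection into grid bins, once up front.
bins = defaultdict(list)
for p in large_collection:
    bins[bin_key(p)].append(p)

def candidates(point):
    """Points from the 3x3 block of bins around `point`."""
    i, j = bin_key(point)
    return [p for di in (-1, 0, 1) for dj in (-1, 0, 1)
            for p in bins[(i + di, j + dj)]]

# For each small-collection point, run haversine only on the candidates:
# nearest = min(candidates(p), key=lambda q: haversine(p, q))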
The formula you wrote (d = abs((lat2 - lat1) + (lon2 - lon1))) does NOT preserve the triangle inequality: if you pick the point for which d is minimal, you don't find the closest point, but the point closest to the diagonal straight line through the point you are checking (the line along which dlat = -dlon, where d is zero)!

I think you should order the large collection of points by lat and lon (this means: (1,1), (1,2), (1,3) ... (2,1), (2,2), etc.).

Then use a bisection search (the "gunner method": repeatedly halving the range) to find the points closest in latitude and longitude. This should be really fast; it takes CPU time proportional to log2(n), where n is the number of points. You can do this easily, for example: choose all the points in a 10x10 square around the point you are checking, that is, find all the points from -10 to +10 in lat (bisection) and, among those, the ones from -10 to +10 in lon (bisection again). Now you have a really small amount of data to process, and it should be very fast; see the sketch below.
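A minimal sketch of the bisection idea using the standard bisect module (the window size is a placeholder; a robust version would also handle longitude wrap-around and the lat/lon-to-miles scale difference):

import bisect

# large_collection is assumed to be a list of (lat, lon) tuples;
# sorting tuples orders them lat-first, then lon.
points_by_lat = sorted(large_collection)

def candidates(point, window=10.0):
    """Large-collection points within `window` degrees of `point` on both axes."""
    lat, lon = point
    # Bisection on the latitude band: O(log n) to locate both edges.
    lo = bisect.bisect_left(points_by_lat, (lat - window,))
    hi = bisect.bisect_right(points_by_lat, (lat + window, float("inf")))
    in_lat_band = points_by_lat[lo:hi]
    # Filter the (much smaller) band by longitude.
    return [p for p in in_lat_band if abs(p[1] - lon) <= window]

# nearest = min(candidates(p), key=lambda q: haversine(p, q))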
abs(lat2 - lat1) + abs(lon2 - lon1) is the 1-norm or Manhattan metric, and thus the triangle inequality holds. (Note, though, that the dumb function in the question computes abs((lat2 - lat1) + (lon2 - lon1)), the absolute value of the sum, which is not the 1-norm.)
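A quick worked example of that difference (the values are my own):

p = (0.0, 0.0)

# 1-norm (Manhattan): a genuine metric.
manhattan = lambda a, b: abs(b[0] - a[0]) + abs(b[1] - a[1])

# The question's formula: collapses to zero along the diagonal dlat == -dlon.
dumb = lambda a, b: abs((b[0] - a[0]) + (b[1] - a[1]))

print(manhattan(p, (10.0, -10.0)), dumb(p, (10.0, -10.0)))  # 20.0 0.0
print(manhattan(p, (1.0, 1.0)), dumb(p, (1.0, 1.0)))        # 2.0 2.0

So dumb ranks the far point (10, -10) as closer than the near point (1, 1).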
I had a similar problem and decided to knock up a Cython function.
On my 2008 MBP it can do about 1.2M iterations per second. Taking the type checking out speeds it up by a further 25%. No doubt further optimisations are possible (at the expense of clarity).
You may also want to check out the scipy.spatial.distance.cdist function.
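For example, cdist can compute all pairwise distances between the two collections in one call; a minimal sketch (the sample coordinates are placeholders, and the default Euclidean metric on raw degrees is a flat approximation, like the Pythagorean shortcut in the question):

import numpy as np
from scipy.spatial.distance import cdist

small = np.array([(38.9, -77.0), (34.1, -118.2)])                 # (lat, lon) rows
large = np.array([(40.7, -74.0), (41.9, -87.6), (37.8, -122.4)])

# One (len(small), len(large)) matrix of distances, then the argmin per row.
dist = cdist(small, large)      # default metric='euclidean'
closest = dist.argmin(axis=1)   # index into `large` for each small point

# cdist also accepts a callable, e.g. cdist(small, large, metric=haversine),
# at the cost of falling back to a Python-level call per pair.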
The fastest way to do this is to avoid computing a function for each pair of points, assuming your relatively small collection isn't very tiny.
There are databases with geo-indexes you could use (MySQL, Oracle, MongoDB, ...), or you could implement something yourself.
You could use python-geohash. For each doc in the smaller collection, you need to quickly find the set of documents in the larger collection that share a hash from geohash.neighbors, for the longest hash length that has matches. You'll need to use an appropriate data structure for the lookup, or this will be slow; a sketch follows below.

For finding the distance between points, the error of the simple approach increases as the distance between the points increases, and also depends on the latitude. See http://www.movable-type.co.uk/scripts/gis-faq-5.1.html for example.
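A minimal sketch of the geohash lookup, assuming the python-geohash package (the collection name, the precision range, and the multi-precision index are placeholders of my own):

from collections import defaultdict
import geohash  # python-geohash package

MAX_PRECISION = 7  # geohash length; longer hashes mean smaller cells

# Index the large collection under every prefix length, once up front.
# large_collection is assumed to be an iterable of (lat, lon) pairs.
index = defaultdict(list)
for lat, lon in large_collection:
    full = geohash.encode(lat, lon, MAX_PRECISION)
    for precision in range(1, MAX_PRECISION + 1):
        index[full[:precision]].append((lat, lon))

def candidates(lat, lon):
    """Points sharing a cell (or neighboring cell) with the query point,
    at the longest hash length that has any matches."""
    full = geohash.encode(lat, lon, MAX_PRECISION)
    for precision in range(MAX_PRECISION, 0, -1):
        cells = [full[:precision]] + geohash.neighbors(full[:precision])
        found = [p for c in cells for p in index.get(c, [])]
        if found:
            return found
    return list(large_collection)  # fallback: no shared cell at any length

# nearest = min(candidates(*p), key=lambda q: haversine(p, q))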