Python: speeding up geographic comparisons
I've written some code that includes a nested loop where the inner loop is executed about 1.5 million times. I have a function in this loop that I'm trying to optimize. I've done some work, and got some results, but I need a little input to check if what I'm doing is sensible.
Some background:
I have two collections of geographic points (latitude, longitude), one relatively small collection and one relatively huge collection. For every point in the small collection, I need to find the closest point in the large collection.
The obvious way to do this would be to use the haversine formula. The benefit here is that the distances are definitely accurate.
from math import radians, sin, cos, asin, sqrt

def haversine(point1, point2):
    """Gives the distance in miles between two (lat, lon) points on earth."""
    earth_radius_miles = 3956
    # Convert both coordinates from degrees to radians.
    lat1, lon1 = (radians(coord) for coord in point1)
    lat2, lon2 = (radians(coord) for coord in point2)
    dlat, dlon = (lat2 - lat1, lon2 - lon1)
    # Haversine formula; min() guards against floating-point error
    # pushing sqrt(a) fractionally above 1.
    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    great_circle_distance = 2 * asin(min(1, sqrt(a)))
    d = earth_radius_miles * great_circle_distance
    return d
However, running this 1.5 million times takes about 9 seconds on my machine (according to timeit). Since an accurate distance is unimportant (I only need to find the closest point), I decided to try some other functions.
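A minimal timeit harness along those lines might look like this (the two test points are placeholders, not the actual data):

import timeit

# Hypothetical benchmark: ~1.5 million haversine calls, matching the
# inner-loop count described above. The two points are placeholders.
setup = "from __main__ import haversine"
stmt = "haversine((38.9, -77.0), (34.1, -118.2))"
print(timeit.timeit(stmt, setup=setup, number=1500000))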
A simple implementation of the Pythagorean theorem gives me a speedup of about 30%. Thinking that I can do better, I wrote the following:
def dumb(point1, point2):
    lat1, lon1 = point1
    lat2, lon2 = point2
    # Absolute value of the *sum* of the deltas (not a sum of absolute values).
    d = abs((lat2 - lat1) + (lon2 - lon1))
    return d
which gives me a factor of 10 improvement. However, now I'm worried that this will not preserve the triangle inequality.
So, my final question is twofold: I'd like to have a function that runs as fast as dumb but is still correct. Will dumb work? If not, any suggestions on how to improve my haversine function?
This is the kind of calculation that numpy is really good at. Rather than looping over the entire large set of coordinates, you can compute the distance between a single point and the entire dataset in a single calculation. With my tests below, you can get an order of magnitude speed increase.
Here are some timing tests with your haversine method, your dumb method (not really sure what that does), and my numpy haversine method, computing the distance between two points, one in Virginia and one in California, about 2293 miles apart.
The numpy method takes 1.55 seconds to compute the same number of distance calculations that take 44.24 seconds with your function-based method. You could probably get more of a speedup by combining some of the numpy functions into a single statement, but it would become a long, hard-to-read line.
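The original timing script is not reproduced here, but a minimal sketch of the vectorized approach it describes could look like this (array and function names are placeholders of my own):

import numpy as np

earth_radius_miles = 3956.0

def haversine_numpy(point, lats, lons):
    """Distances in miles from one (lat, lon) point to whole arrays of lats/lons."""
    lat1, lon1 = np.radians(point[0]), np.radians(point[1])
    lat2, lon2 = np.radians(lats), np.radians(lons)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return earth_radius_miles * 2.0 * np.arcsin(np.minimum(1.0, np.sqrt(a)))

# Usage: index of the closest large-collection point, in one vectorized call.
# lats = np.array([...]); lons = np.array([...])
# closest = np.argmin(haversine_numpy((37.5, -77.4), lats, lons))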
You could consider some kind of spatial hashing, i.e. find nearby candidate points fast and only then run the exact calculation on them. For example, you can create a uniform grid and distribute the points of the large collection into the bins the grid creates.

Then, given a point from the small collection, you'll only need to process a much smaller number of points (i.e. those in the relevant bins only), as in the sketch below.
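A minimal sketch of the grid idea (the bin size, the names, and the fixed 3x3 neighborhood are placeholders; a robust version would widen the search when the neighborhood is empty):

from collections import defaultdict

BIN_SIZE = 1.0  # degrees per grid cell; tune to the data density
# large_collection is assumed to be an iterable of (lat, lon) pairs.

def bin_key(point):
    lat, lon = point
    return (int(lat // BIN_SIZE), int(lon // BIN_SIZE))

# Distribute the large collection into grid bins, once up front.
bins = defaultdict(list)
for p in large_collection:
    bins[bin_key(p)].append(p)

def candidates(point):
    """Points from the 3x3 block of bins around `point`."""
    i, j = bin_key(point)
    return [p for di in (-1, 0, 1) for dj in (-1, 0, 1)
            for p in bins[(i + di, j + dj)]]

# For each small-collection point, run haversine only on the candidates:
# nearest = min(candidates(p), key=lambda q: haversine(p, q))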
The formula you wrote (d = abs((lat2 - lat1) + (lon2 - lon1))) does NOT preserve the triangle inequality: if you pick the point for which d is minimal, you don't find the closest point, but the point closest to the diagonal straight line through the point you are checking (the line along which dlat = -dlon, where d is zero)!

I think you should order the large collection of points by lat and lon (this means: (1,1), (1,2), (1,3) ... (2,1), (2,2), etc.).

Then use a bisection search (the "gunner method": repeatedly halving the range) to find the points closest in latitude and longitude. This should be really fast; it takes CPU time proportional to log2(n), where n is the number of points. You can do this easily, for example: choose all the points in a 10x10 square around the point you are checking, that is, find all the points from -10 to +10 in lat (bisection) and, among those, the ones from -10 to +10 in lon (bisection again). Now you have a really small amount of data to process, and it should be very fast; see the sketch below.
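A minimal sketch of the bisection idea using the standard bisect module (the window size is a placeholder; a robust version would also handle longitude wrap-around and the lat/lon-to-miles scale difference):

import bisect

# large_collection is assumed to be a list of (lat, lon) tuples;
# sorting tuples orders them lat-first, then lon.
points_by_lat = sorted(large_collection)

def candidates(point, window=10.0):
    """Large-collection points within `window` degrees of `point` on both axes."""
    lat, lon = point
    # Bisection on the latitude band: O(log n) to locate both edges.
    lo = bisect.bisect_left(points_by_lat, (lat - window,))
    hi = bisect.bisect_right(points_by_lat, (lat + window, float("inf")))
    in_lat_band = points_by_lat[lo:hi]
    # Filter the (much smaller) band by longitude.
    return [p for p in in_lat_band if abs(p[1] - lon) <= window]

# nearest = min(candidates(p), key=lambda q: haversine(p, q))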
abs(lat2 - lat1) + abs(lon2 - lon1) is the 1-norm or Manhattan metric, and thus the triangle inequality holds. (Note, though, that the dumb function in the question computes abs((lat2 - lat1) + (lon2 - lon1)), the absolute value of the sum, which is not the 1-norm.)
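A quick worked example of that difference (the values are my own):

p = (0.0, 0.0)

# 1-norm (Manhattan): a genuine metric.
manhattan = lambda a, b: abs(b[0] - a[0]) + abs(b[1] - a[1])

# The question's formula: collapses to zero along the diagonal dlat == -dlon.
dumb = lambda a, b: abs((b[0] - a[0]) + (b[1] - a[1]))

print(manhattan(p, (10.0, -10.0)), dumb(p, (10.0, -10.0)))  # 20.0 0.0
print(manhattan(p, (1.0, 1.0)), dumb(p, (1.0, 1.0)))        # 2.0 2.0

So dumb ranks the far point (10, -10) as closer than the near point (1, 1).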
I had a similar problem and decided to knock up a Cython function.
On my 2008 MBP it can do about 1.2M iterations per second. Taking the type checking out speeds it up by a further 25%. No doubt further optimisations are possible (at the expense of clarity).
You may also want to check out the scipy.spatial.distance.cdist function.
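For example, cdist can compute all pairwise distances between the two collections in one call; a minimal sketch (the sample coordinates are placeholders, and the default Euclidean metric on raw degrees is a flat approximation, like the Pythagorean shortcut in the question):

import numpy as np
from scipy.spatial.distance import cdist

small = np.array([(38.9, -77.0), (34.1, -118.2)])                 # (lat, lon) rows
large = np.array([(40.7, -74.0), (41.9, -87.6), (37.8, -122.4)])

# One (len(small), len(large)) matrix of distances, then the argmin per row.
dist = cdist(small, large)      # default metric='euclidean'
closest = dist.argmin(axis=1)   # index into `large` for each small point

# cdist also accepts a callable, e.g. cdist(small, large, metric=haversine),
# at the cost of falling back to a Python-level call per pair.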
The fastest way to do this is to avoid computing a function for each pair of points, assuming your relatively small collection isn't very tiny.
There are databases with geo-indexes you could use (MySQL, Oracle, MongoDB, ...), or you could implement something yourself.
You could use python-geohash. For each doc in the smaller collection, you need to quickly find the set of documents in the larger collection that share a hash from geohash.neighbors, for the longest hash length that has matches. You'll need to use an appropriate data structure for the lookup, or this will be slow; a sketch follows below.

For finding the distance between points, the error of the simple approach increases as the distance between the points increases, and also depends on the latitude. See http://www.movable-type.co.uk/scripts/gis-faq-5.1.html for example.
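A minimal sketch of the geohash lookup, assuming the python-geohash package (the collection name, the precision range, and the multi-precision index are placeholders of my own):

from collections import defaultdict
import geohash  # python-geohash package

MAX_PRECISION = 7  # geohash length; longer hashes mean smaller cells

# Index the large collection under every prefix length, once up front.
# large_collection is assumed to be an iterable of (lat, lon) pairs.
index = defaultdict(list)
for lat, lon in large_collection:
    full = geohash.encode(lat, lon, MAX_PRECISION)
    for precision in range(1, MAX_PRECISION + 1):
        index[full[:precision]].append((lat, lon))

def candidates(lat, lon):
    """Points sharing a cell (or neighboring cell) with the query point,
    at the longest hash length that has any matches."""
    full = geohash.encode(lat, lon, MAX_PRECISION)
    for precision in range(MAX_PRECISION, 0, -1):
        cells = [full[:precision]] + geohash.neighbors(full[:precision])
        found = [p for c in cells for p in index.get(c, [])]
        if found:
            return found
    return list(large_collection)  # fallback: no shared cell at any length

# nearest = min(candidates(*p), key=lambda q: haversine(p, q))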