Geo-Search (Distance) in PHP/MySQL (Performance)
I have a MySQL table (MyISAM) containing about 200k entries of lat/long pairs that I select from, based on each pair's distance (great circle formula) from another lat/long pair (e.g. all entries within a 10 km radius around 50.281852, 2.504883).
My problem is that this query takes about 0.28 sec. to run just for those 200k entries (and more are added every day). While 0.28 sec. would normally be fine, this query runs very often as it powers the main feature of my web app, and often it's part of a larger query.
Is there any way to speed this up? Obviously MySQL has to run through all 200k entries every time and perform the great circle formula for every entry. I read something about geohashing, R-Trees and the like here on Stack Overflow but I don't think that's the way I want to go. Partly because I've never been a big fan of maths, but mostly because I think that this problem has already been solved by someone smarter than me in a library/extension/etc. that has been tested extensively and is being updated regularly.
MySQL seems to have a spatial extension, but that one doesn't provide a distance function. Should I be looking at another database to put these coordinate pairs in? PostgreSQL seems to have a fairly mature spatial extension. Do you know anything about it? Or would PostgreSQL, too, simply use the great circle formula to get all entries within a certain region?
Is there maybe a specialized stand-alone product or MySQL extension that already does what I'm looking for?
Or is there maybe a PHP library I could use to do the calculations? Using APC I could easily fit the lat/long pairs into memory (those 200k entries take about 5 MB) and then run the query inside PHP. The problem with this approach, however, is that I'd then have a MySQL query like SELECT .. FROM .. WHERE id IN (id1, id2, ..) for all the results, which can be up to a few thousand. How well does MySQL handle queries like these? And then (since this is a number-crunching task) would doing this in PHP be fast enough?
Any other ideas about what I should/shouldn't do?
For completeness, here is the sample query, stripped of any irrelevant parts (as I said, usually this is part of a bigger query where I join multiple tables):
SELECT id,
6371 * acos( sin( radians( 50.281852 ) ) * sin( radians( lat ) ) + cos( radians( 50.281852 ) ) * cos( radians( lat ) ) * cos( radians( 2.504883 ) - radians( lon ) ) ) AS dst
FROM geoloc
HAVING dst < 10
ORDER BY dst ASC
Calculate a bounding box to select a subset of the rows in the WHERE clause of your SQL query, so that you're only executing the expensive distance calculation on that subset of rows rather than against the entire 200k records in your table. The method is described in this article on Movable Type (with PHP code examples). Then you can include the Haversine calculation in your query against that subset to calculate the actual distances, and factor in the HAVING clause at that point.
It's the bounding box that helps your performance, because it means you're only doing the expensive distance calculation on a small subset of your data. This is effectively the same method that Patrick has suggested, but the Movable Type link has extensive explanations of the method, as well as PHP code that you can use to build the bounding box and your SQL query.
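As a rough illustration of the bounding-box idea (not the code from the Movable Type article), the sketch below computes a box around the search point and builds a query that applies the great-circle formula only to rows inside it. The `geoloc` table and its `id`, `lat`, `lon` columns come from the question; the function names `boundingBox` and `nearbyQuery` and the placeholder names are made up for this example, and the box math assumes the search area is not near a pole and does not cross the antimeridian.

```php
<?php
// Sketch only: bounding-box prefilter for the great-circle query.
// Assumes the `geoloc` table with `id`, `lat`, `lon` columns from the question.

const EARTH_RADIUS_KM = 6371.0;

// Returns the min/max latitude and longitude of a box that contains the whole
// circle of $radiusKm around ($lat, $lon). Not valid near the poles or across
// the antimeridian.
function boundingBox(float $lat, float $lon, float $radiusKm): array
{
    $delta = $radiusKm / EARTH_RADIUS_KM;   // angular radius in radians
    $dLat  = rad2deg($delta);
    // Meridians converge away from the equator, so the longitude span widens.
    $dLon  = rad2deg(asin(sin($delta) / cos(deg2rad($lat))));

    return [
        'minLat' => $lat - $dLat, 'maxLat' => $lat + $dLat,
        'minLon' => $lon - $dLon, 'maxLon' => $lon + $dLon,
    ];
}

// Builds the SQL text and its parameters; the box goes in WHERE so an index
// on lat/lon can narrow the rows before the distance formula runs.
function nearbyQuery(float $lat, float $lon, float $radiusKm): array
{
    $sql = <<<SQL
SELECT id,
       6371 * ACOS(LEAST(1, SIN(RADIANS(:lat1)) * SIN(RADIANS(lat))
           + COS(RADIANS(:lat2)) * COS(RADIANS(lat))
           * COS(RADIANS(:lon1) - RADIANS(lon)))) AS dst
FROM geoloc
WHERE lat BETWEEN :minLat AND :maxLat
  AND lon BETWEEN :minLon AND :maxLon
HAVING dst < :radius
ORDER BY dst ASC
SQL;

    $params = boundingBox($lat, $lon, $radiusKm) + [
        'lat1' => $lat, 'lat2' => $lat, 'lon1' => $lon, 'radius' => $radiusKm,
    ];

    return [$sql, $params];
}

// Usage with an existing PDO connection ($pdo is assumed, not created here):
// [$sql, $params] = nearbyQuery(50.281852, 2.504883, 10.0);
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```

The prefilter only pays off if MySQL can use an index for the range conditions, so an index on `lat` (or on `lat, lon`) is assumed here. The latitude placeholder appears under two names because PDO does not reliably allow reusing one named placeholder when native prepared statements are in use.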
EDIT
If you don't think haversine is accurate enough, then there's also the Vincenty formula.
What if you approach the problem from a different angle?
10 km in a straight line is:
Using this as a basis, do some quick math and add conditions to the WHERE clause of your query that remove any locations outside the 'box' created by adding the buffer zone, with the assumption of 1' lat & 6' long.
Working from this image:
1. GPS location you are searching for: (34° 12' 34.0", -85° 1' 1.0") [34.2094444444, -85.0169444444]
2. You find the min/max latitude/longitude:
2a. Min Latitude - 34.1927777778, -85.0169444444
2b. Min Longitude - 34.2094444444, -85.1169444444
2c. Max Latitude - 34.2261111111, -85.0169444444
2d. Max Longitude - 34.2094444444, -84.9169444444
3. Run your query with the min and max of each direction:
SELECT *
FROM geoloc
WHERE
lat >= 34.1927777 AND
lat <= 34.2261111 AND
lon >= -85.1169444 AND
lon <= -84.9169444;
You can either integrate the distance calculation with the SQL query or you can use a PHP library/class to run the distance check after pulling the data. Either way you have reduced the number of calculations by a large percentage.
I use the following function to calculate the distance between two WGS84 GPS locations. Two parameters are passed; each is an array whose first element is the latitude and whose second element is the longitude. I believe it is accurate to within a few feet, which should be enough for all but the most hard-core GPS-ophiles. Also, I believe this uses the Haversine distance formula.
$distance = calculateGPSDistance(array(34.32343, -86.342343), array(34.433223, -96.0032344));
UPDATE
I forgot to mention, my distance function will return distance in feet.
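The body of the function referred to above is not included here. As a stand-in, this is a minimal Haversine sketch that takes the same kind of arguments (two `[lat, lon]` arrays in degrees) and returns feet. The name `haversineFeet` is hypothetical, it is not the answerer's `calculateGPSDistance`, and it assumes a spherical Earth with a mean radius of 6371 km, so its accuracy may differ from what the answer claims.

```php
<?php
// Sketch only: Haversine distance in feet between two [lat, lon] pairs (degrees).
// Assumes a spherical Earth with mean radius 6371 km; not the original function.
function haversineFeet(array $a, array $b): float
{
    $earthRadiusFeet = 6371000.0 / 0.3048;   // metres to feet

    $lat1 = deg2rad($a[0]);
    $lon1 = deg2rad($a[1]);
    $lat2 = deg2rad($b[0]);
    $lon2 = deg2rad($b[1]);

    // Haversine of the central angle between the two points.
    $h = sin(($lat2 - $lat1) / 2) ** 2
       + cos($lat1) * cos($lat2) * sin(($lon2 - $lon1) / 2) ** 2;

    // min() guards against rounding pushing sqrt($h) slightly above 1.
    return 2 * $earthRadiusFeet * asin(min(1.0, sqrt($h)));
}

// Usage, mirroring the call shown above:
// $distance = haversineFeet(array(34.32343, -86.342343), array(34.433223, -96.0032344));
```

Running this in PHP only on the rows that survive the bounding-box query keeps the number of trigonometric evaluations small.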
What I was doing till now is just what @Mark described above. A viable solution for small sites I guess, only not that good for my case (200k entries localized inside some 100x100 km box centered around a given point). I was using this same trick of Mark's but performance is just too poor: with 5 users/second querying for nearby lat/lon points for a few hours, the queries start taking up to 10-15 seconds, and this happens even after I adjusted the MySQL settings in my.cnf. I don't even want to think about what would happen when there are 2 million entries worldwide.
So, now time for step 2: Hilbert curve.
It should solve the problem that a B-tree index on the (lat, lon) columns is wasteful (on range scans, only one part of the B-tree index is used) by employing just one index on one column (hilbert_number), where hilbert_number is a number calculated from a point's lat/lon coordinates on the Hilbert curve.
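To make the hilbert_number idea concrete, here is a small sketch of the standard grid-to-Hilbert-index conversion. How lat/lon is mapped onto the grid is an assumption of this example (a plain equirectangular grid of 2^order cells per side), and the function names `hilbertIndex` and `latLonToHilbert` are hypothetical; the original poster's actual scheme is not shown in the post.

```php
<?php
// Sketch only: Hilbert curve index of cell ($x, $y) on a 2^$order x 2^$order grid.
function hilbertIndex(int $order, int $x, int $y): int
{
    $n = 1 << $order;
    $d = 0;
    for ($s = $n >> 1; $s > 0; $s >>= 1) {
        $rx = ($x & $s) > 0 ? 1 : 0;
        $ry = ($y & $s) > 0 ? 1 : 0;
        $d += $s * $s * ((3 * $rx) ^ $ry);
        // Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ($ry === 0) {
            if ($rx === 1) {
                $x = $n - 1 - $x;
                $y = $n - 1 - $y;
            }
            [$x, $y] = [$y, $x];
        }
    }
    return $d;
}

// Assumed mapping: scale lon/lat linearly onto the grid, then take the index.
function latLonToHilbert(float $lat, float $lon, int $order = 16): int
{
    $n = 1 << $order;
    $x = (int) max(0, min($n - 1, floor(($lon + 180.0) / 360.0 * $n)));
    $y = (int) max(0, min($n - 1, floor(($lat + 90.0) / 180.0 * $n)));
    return hilbertIndex($order, $x, $y);
}
```

Storing the result in an indexed integer column lets a search area be expressed as one or more ranges of hilbert_number. Working out which ranges cover a given box is a separate step that this sketch does not attempt.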
But the second problem remains: testing the distance between the fixed point and everything in the previous result subset with the Haversine formula. That part can be very slow. So I was thinking about somehow testing for distance more directly, putting everything on the Hilbert curve and applying some bitmask to that result subset instead of applying the Haversine formula. I just don't know how I would go about that...
Anyway, another trick I have employed to reduce the number of points in the result subset was to use two bounding boxes and include in the subset only the gray / white points for further Haversine testing:
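The figure that originally followed is not available here. One plausible reading of the two-box trick is sketched below: an outer box that contains the whole search circle and an inner box that fits inside it, so points outside the outer box are rejected, points inside the inner box are accepted without further work, and only the points in between need the Haversine test. The function names and the 0.7 factor (slightly under 1/√2, to leave a margin) are assumptions of this sketch, which also assumes a small radius well away from the poles.

```php
<?php
// Sketch only: classify a point against an outer and an inner lat/lon box.
// Assumes a small search radius far from the poles and the antimeridian.

// Half-widths in degrees of a box reaching $km from a center at latitude $lat.
function boxHalfWidths(float $lat, float $km): array
{
    $delta = $km / 6371.0;   // angular distance in radians
    return [
        rad2deg($delta),
        rad2deg(asin(sin($delta) / cos(deg2rad($lat)))),
    ];
}

// Returns 'reject', 'accept', or 'test' (meaning: run Haversine on this point).
function classifyPoint(array $center, array $point, float $radiusKm): string
{
    [$outerLat, $outerLon] = boxHalfWidths($center[0], $radiusKm);
    [$innerLat, $innerLon] = boxHalfWidths($center[0], 0.7 * $radiusKm);

    $dLat = abs($point[0] - $center[0]);
    $dLon = abs($point[1] - $center[1]);

    if ($dLat > $outerLat || $dLon > $outerLon) {
        return 'reject';   // outside the box that contains the circle
    }
    if ($dLat <= $innerLat && $dLon <= $innerLon) {
        return 'accept';   // inside the box that fits within the circle
    }
    return 'test';         // in between: needs the real distance check
}
```

With a uniform spread of points, the inner box covers roughly half of the outer one, so about half of the candidates would skip the Haversine step under these assumptions.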
What I need to do right now is switch to Hilbert numbers and see how it behaves. But I doubt this is going to improve performance 10x!
You could try a quadkey. It's a spatial index that reduces the two dimensions to one. It subdivides a map into tiles, but you can use it to store points. You can download my PHP class hilbert-curve at phpclasses.org. It also includes a z-curve and a Moore curve. It's important to know that it uses a Mercator projection. You can look up Bing Maps tiling, which explains how to use a quadkey. You need x and y coordinates and a z (zoom or depth) value; it then gives you a quadkey.
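For readers who want to see what a quadkey looks like, the following sketch follows the general scheme of the Bing Maps tile system: project lat/lon with Web Mercator, find the tile at a given zoom level, and interleave the tile's x and y bits into a base-4 string. It is not the phpclasses.org class mentioned above, and the function names `tileToQuadKey` and `latLonToQuadKey` are chosen for this example only.

```php
<?php
// Sketch only: quadkeys in the style of the Bing Maps tile system.

// Interleaves the bits of tile x/y into one base-4 digit per zoom level.
function tileToQuadKey(int $tileX, int $tileY, int $level): string
{
    $key = '';
    for ($i = $level; $i > 0; $i--) {
        $mask  = 1 << ($i - 1);
        $digit = 0;
        if ($tileX & $mask) {
            $digit += 1;
        }
        if ($tileY & $mask) {
            $digit += 2;
        }
        $key .= $digit;
    }
    return $key;
}

// Projects lat/lon (degrees) with Web Mercator and returns the quadkey of the
// tile containing the point at the given zoom level.
function latLonToQuadKey(float $lat, float $lon, int $level): string
{
    // Web Mercator is only defined up to about +/-85.05 degrees latitude.
    $lat = max(-85.05112878, min(85.05112878, $lat));
    $n   = 1 << $level;   // tiles per side at this zoom level

    $x      = ($lon + 180.0) / 360.0;
    $sinLat = sin(deg2rad($lat));
    $y      = 0.5 - log((1 + $sinLat) / (1 - $sinLat)) / (4 * M_PI);

    $tileX = (int) max(0, min($n - 1, floor($x * $n)));
    $tileY = (int) max(0, min($n - 1, floor($y * $n)));

    return tileToQuadKey($tileX, $tileY, $level);
}
```

Because a tile's quadkey is a prefix of the quadkeys of all tiles inside it, storing a deep quadkey per point in an indexed string column would let a prefix match such as `WHERE quadkey LIKE '1202%'` (column name hypothetical) select everything in one tile.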