Proximity search
How does an application perform a proximity search? For example, a user types in a postal code, then the application lists all the businesses within 20 miles ordered by proximity.
I want to build something like that in PHP and MySQL. Is this approach correct?
- Get the addresses for the locations I'm interested in and store them in my database
- Geocode all the addresses with Google's geocoding service
- Write a database query that includes the Haversine formula to do the proximity search and ordering
Is this OK? In step 3, I'm going to calculate the proximity for every query. Is it better to have a PROXIMITY table that lists the distance between every business and a few reference locations?
Comments (3)
If there are enough records for speed to matter, here's a way to index them ahead of time.
Define a grid of bins about 20 miles on a side. Store the bin number with each store's record. At search time, compute the numbers of all bins that intersect a 20-mile radius from your search point. Then retrieve all stores in any of those bins, and proceed as before.
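The binning idea above can be sketched roughly as follows. This is only an illustration in Python (the same arithmetic ports to PHP); the cell size in degrees, the miles-per-degree constant, and the function names `bin_of`, `bins_near`, `build_index`, and `candidates` are all assumptions made for the example, not part of the answer. It enumerates every bin touching a bounding box around the search circle, which is a superset of the bins the circle intersects, and it does not handle the poles or the 180° meridian.

```python
import math
from collections import defaultdict

# Assumed constants: a deliberately low miles-per-degree figure so the
# search box errs on the large side, and a cell of roughly 20 miles of latitude.
MILES_PER_DEGREE = 68.0
CELL_DEG = 0.3


def bin_of(lat, lon):
    """Return the (row, col) bin number for a coordinate."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def bins_near(lat, lon, radius_miles=20.0):
    """Return all bins overlapping a bounding box around the search circle."""
    dlat = radius_miles / MILES_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = radius_miles / (MILES_PER_DEGREE * math.cos(math.radians(lat)))
    row_lo, col_lo = bin_of(lat - dlat, lon - dlon)
    row_hi, col_hi = bin_of(lat + dlat, lon + dlon)
    return [(r, c)
            for r in range(row_lo, row_hi + 1)
            for c in range(col_lo, col_hi + 1)]


def build_index(stores):
    """Group (name, lat, lon) tuples by bin; stands in for a stored bin column."""
    index = defaultdict(list)
    for name, lat, lon in stores:
        index[bin_of(lat, lon)].append((name, lat, lon))
    return index


def candidates(index, lat, lon, radius_miles=20.0):
    """Collect every store in a bin near the search point."""
    found = []
    for b in bins_near(lat, lon, radius_miles):
        found.extend(index.get(b, []))
    return found
```

The candidates still need an exact distance check afterwards, since a bin can stick out well beyond the 20-mile circle.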
We use this for many thousands of points. If you do this in SQL, it is important to have an index on the latitude and longitude columns. We tried it in SQL Server 2008 with spatial indexes, but we didn't see the performance increase we expected. If you want to find everything within a certain distance of a ZIP code, though, you need to decide whether to use the ZIP centroid or a polygon representation of the ZIP code.
The Haversine formula is a good place to start.
We have not had performance problems calculating the distance on the fly. We do calculate it ahead of time for some applications where we know the points in advance and there are going to be millions of records.
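For reference, here is a minimal sketch of the on-the-fly Haversine calculation, written in Python for illustration (the same arithmetic translates directly to PHP or a SQL expression). The Earth radius constant, the function names `haversine_miles` and `nearby`, and the `(name, lat, lon)` tuple layout are assumptions for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def nearby(businesses, lat, lon, radius_miles=20.0):
    """Return (distance, name) pairs within the radius, closest first."""
    hits = []
    for name, blat, blon in businesses:
        d = haversine_miles(lat, lon, blat, blon)
        if d <= radius_miles:
            hits.append((d, name))
    hits.sort()
    return hits
```

Here `lat` and `lon` would be the geocoded coordinates of the postal code the user typed in.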
We do this for about 1200 locations. I would just use the Haversine formula on the fly, although depending on your application it might be better to keep the calculation in PHP rather than in SQL. (Our implementation is in .NET, so your mileage may vary.)
Really, the biggest drawback of the way we implemented it was that every calculation (up until recently) had to be done on the data tier, which was painfully slow. When I say slow, I really mean non-instantaneous: it took a second or so. That was because it had to calculate the distance to all 1200 locations from the supplied ZIP code.
Depending on the route you choose, there are ways of reducing the number of distance calculations: look at the latitude and longitude and discard the locations outside a predefined range. For example, if you are looking for all addresses within 20 miles, you can calculate a latitude and longitude range that every address within 20 miles has to fall in. That can speed up your query if need be.
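A rough sketch of that range check, in Python for illustration: the miles-per-degree constant and the names `bounding_box` and `prefilter` are assumptions, and the approximation is only meant for mid-latitudes (it breaks down near the poles and ignores the 180° meridian). In SQL the same bounds would typically become a `BETWEEN` condition on the latitude and longitude columns, applied before the exact distance calculation.

```python
import math

# Assumed, deliberately low figure so the box is slightly larger than needed.
MILES_PER_DEGREE = 68.0


def bounding_box(lat, lon, radius_miles=20.0):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the search circle."""
    dlat = radius_miles / MILES_PER_DEGREE
    # A degree of longitude covers fewer miles as latitude increases.
    dlon = radius_miles / (MILES_PER_DEGREE * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def prefilter(locations, lat, lon, radius_miles=20.0):
    """Keep only (name, lat, lon) tuples that fall inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_miles)
    return [loc for loc in locations
            if min_lat <= loc[1] <= max_lat and min_lon <= loc[2] <= max_lon]
```

Anything that survives the prefilter can still be slightly more than 20 miles away (the corners of the box), so the exact distance is still computed for the survivors.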
We actually looked at storing all possible combinations in our database. It sounds like it could be a large data store, but in the grand scheme of things it really isn't. With indexes it can be quite fast, and you don't have to worry about algorithm optimization and so on. We decided against it because we had the equation in C#, which allowed us to cache the information needed to do all the calculations in the business tier. Either will work just fine; it's just a matter of preference.