Calculating distance between zip codes... and users.
This is more of a challenge question than something I urgently need, so don't spend all day on it guys.
I built a dating site (long gone) back in 2000 or so, and one of the challenges was calculating the distance between users so we could present your "matches" within an X mile radius. To just state the problem, given the following database schema (roughly):
USER TABLE
UserId
UserName
ZipCode
ZIPCODE TABLE
ZipCode
Latitude
Longitude
With USER and ZIPCODE being joined on USER.ZipCode = ZIPCODE.ZipCode.
What approach would you take to answer the following question: What other users live in Zip Codes that are within X miles of a given user's Zip Code.
We used the 2000 census data, which has tables for zip codes and their approximate latitude and longitude.
We also used the Haversine Formula to calculate distances between any two points on a sphere... pretty simple math really.
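For reference, a minimal sketch of the Haversine formula in Python. The function name and the Earth-radius constant are assumptions made for illustration, not the code the site actually ran.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumed constant)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

One degree of latitude comes out at roughly 69 miles with this radius, which is a handy sanity check.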
The question, at least for us, being the 19-year-old college students we were, really became how to efficiently calculate and/or store distances from all members to all other members. One approach (the one we used) would be to import all the data and calculate the distance FROM every zip code TO every other zip code. Then you'd store and index the results. Something like:
SELECT User.UserId
FROM ZipCode AS MyZipCode
INNER JOIN ZipDistance ON MyZipCode.ZipCode = ZipDistance.MyZipCode
INNER JOIN ZipCode AS TheirZipCode ON ZipDistance.OtherZipCode = TheirZipCode.ZipCode
INNER JOIN User AS User ON TheirZipCode.ZipCode = User.ZipCode
WHERE ( MyZipCode.ZipCode = 75044 )
AND ( ZipDistance.Distance < 50 )
The problem, of course, is that the ZipDistance table is going to have a LOT of rows in it. It isn't completely unworkable, but it is really big. It also requires complete pre-work on the whole data set, which is not unmanageable either, but not necessarily desirable.
Anyway, I was wondering what approach some of you gurus might take on something like this. Also, I think this is a common issue programmers have to tackle from time to time, especially if you consider problems that are just algorithmically similar. I'm interested in a thorough solution that includes at least HINTS on all the pieces to do this really quickly and efficiently. Thanks!
Ok, for starters, you don't really need to use the Haversine formula here. For large distances where a less accurate formula produces a larger error, your users don't care if the match is plus or minus a few miles, and for closer distances, the error is very small. There are easier (to calculate) formulas listed on the Geographical Distance Wikipedia article.
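As an illustration of a cheaper formula, here is a sketch of the equirectangular (flat-plane) approximation, which treats the small patch of the globe around the two points as flat. The function name and radius constant are assumptions; for dating-site distances the error should be small, but that is worth checking against your own data.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumed constant)

def equirectangular_miles(lat1, lon1, lat2, lon2):
    """Approximate distance in miles; avoids most of the trig in Haversine."""
    # Shrink the longitude difference by the cosine of the mean latitude.
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_MILES * math.hypot(x, y)
```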
Since zip codes are nothing like evenly spaced, any process that partitions them evenly is going to suffer mightily in areas where they are clustered tightly (east coast near DC being a good example). If you want a visual comparison, check out http://benfry.com/zipdecode and compare the zipcode prefix 89 with 07.
A far better way to deal with indexing this space is to use a data structure like a quadtree or an R-tree. These structures allow you to do spatial and distance searches over data which is not evenly spaced.
Here's what a quadtree looks like:
To search over it, you drill down through each larger cell using the index of smaller cells that are within it. Wikipedia explains it more thoroughly.
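The drill-down described above can be sketched as a small point quadtree. This is a toy version for illustration only: the class name, the capacity and depth limits, and the convention of x = longitude, y = latitude are all assumptions, and a real system would use a database's spatial index instead.

```python
MAX_DEPTH = 16  # stop splitting here so duplicate points cannot recurse forever

class QuadTree:
    """Point quadtree: a cell splits into four quadrants once it is full."""

    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.depth = depth
        self.points = []      # (x, y, payload) tuples held by a leaf
        self.children = None  # four sub-cells after a split

    def insert(self, x, y, payload):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= MAX_DEPTH:
                self.points.append((x, y, payload))
                return True
            self._split()
        # any() stops at the first child that accepts the point.
        return any(child.insert(x, y, payload) for child in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1)
        self.children = [
            QuadTree(x0, y0, mx, my, *args),
            QuadTree(mx, y0, x1, my, *args),
            QuadTree(x0, my, mx, y1, *args),
            QuadTree(mx, my, x1, y1, *args),
        ]
        # Push the points this cell held down into the new sub-cells.
        for point in self.points:
            any(child.insert(*point) for child in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1):
        """Return all points inside the query rectangle."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return []  # this cell cannot contain any match; skip the subtree
        found = [p for p in self.points
                 if qx0 <= p[0] <= qx1 and qy0 <= p[1] <= qy1]
        for child in self.children or []:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found
```

Densely packed areas simply end up with deeper, smaller cells, which is why uneven zip code spacing is not a problem for this kind of index.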
Of course, since this is a fairly common thing to do, someone else has already done the hard part for you. Since you haven't specified what database you're using, the PostgreSQL extension PostGIS will serve as an example. PostGIS includes the ability to do R-tree spatial indexes which allow you to do efficient spatial querying.
Once you've imported your data and built the spatial index, querying for distance is a query like:
I'll let you work through the rest of the tutorial yourself.
Here are some other references to get you started.
I'd simply create a zip_code_distances table and pre-compute the distances between all 42K zipcodes in the US which are within a 20-25 mile radius of each other.
Only including zipcodes within a 20-25 mile radius of each other reduces the number of rows you need to store in the distance table from its maximum of 1.7 billion ((42K ^ 2) - 42K) to a much more manageable 4 million or so.
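The 1.7 billion figure is just the number of ordered zipcode pairs. A quick check in Python (the function names are made up for illustration) also shows that storing each pair only once would halve it:

```python
def ordered_pairs(n):
    """Rows needed to store a distance FROM every zipcode TO every other one."""
    return n * n - n

def unordered_pairs(n):
    """Rows needed if each pair is stored once, regardless of direction."""
    return n * (n - 1) // 2

print(ordered_pairs(42_000))    # 1763958000
print(unordered_pairs(42_000))  # 881979000
```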
I downloaded a zipcode datafile from the web which contained the longitudes and latitudes of all the official US zipcodes in csv format:
I wrote a quick and dirty C# program to read the file and compute the distances between every zipcode but only output zipcodes that fall within a 25 mile radius:
The resultant output file looks as follows:
I would then just load this distance data into my zip_code_distances table using load data infile and then use it to limit the search space of my application.
For example if you have a user whose zipcode is 91210 and they want to find people who are within a 10 mile radius of them then you can now simply do the following:
Hope this helps
EDIT: I extended the radius to 100 miles, which increased the number of zipcode distances to 32.5 million rows.
A quick performance check for zipcode 91210 gave a runtime of 0.009 seconds.
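The overall approach can be sketched in Python with SQLite standing in for the real database. This is not the C# program or the MySQL table from this answer: the column names, index, and function names are assumptions, and the brute-force pair loop is only meant to show the idea.

```python
import math
import sqlite3

def haversine_miles(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(a))

def build_distance_table(conn, zips, max_miles):
    """zips maps zipcode -> (lat, lon). Only pairs within max_miles are stored."""
    conn.execute(
        "CREATE TABLE zip_code_distances (from_zip TEXT, to_zip TEXT, miles REAL)")
    codes = sorted(zips)
    rows = []
    for i, a in enumerate(codes):
        for b in codes[i + 1:]:
            d = haversine_miles(*zips[a], *zips[b])
            if d <= max_miles:
                rows += [(a, b, d), (b, a, d)]  # store both directions
    conn.executemany("INSERT INTO zip_code_distances VALUES (?, ?, ?)", rows)
    # The index makes "all zips within N miles of X" a simple range scan.
    conn.execute(
        "CREATE INDEX idx_from_miles ON zip_code_distances (from_zip, miles)")

def zips_within(conn, zip_code, miles):
    cur = conn.execute(
        "SELECT to_zip FROM zip_code_distances "
        "WHERE from_zip = ? AND miles <= ? ORDER BY miles",
        (zip_code, miles))
    return [row[0] for row in cur]
```

The list returned by `zips_within` can then be joined against the users table to limit the search space.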
You could shortcut the calculation by just assuming a box instead of a circular radius. Then when searching you simply calculate the lower/upper bound of lat/lon for a given point+"radius", and as long as you have an index on the lat/lon columns you could pull back all records that fall within the box pretty easily.
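A minimal sketch of that box query, using SQLite from Python. The table and column names follow the schema in the question; the index name, the helper names, and the rough 69 miles per degree of latitude are assumptions. It ignores the poles and the 180° meridian, which should not matter for US zip codes.

```python
import math
import sqlite3

SCHEMA = """
CREATE TABLE ZipCode (ZipCode TEXT PRIMARY KEY, Latitude REAL, Longitude REAL);
CREATE INDEX idx_zip_lat_lon ON ZipCode (Latitude, Longitude);
"""

def bounding_box(lat, lon, miles):
    """Return (lat_lo, lat_hi, lon_lo, lon_hi) for a box 2 * miles across."""
    dlat = miles / 69.0
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = miles / (69.0 * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

def zips_in_box(conn, lat, lon, miles):
    cur = conn.execute(
        "SELECT ZipCode FROM ZipCode "
        "WHERE Latitude BETWEEN ? AND ? AND Longitude BETWEEN ? AND ? "
        "ORDER BY ZipCode",
        bounding_box(lat, lon, miles))
    return [row[0] for row in cur]
```

The box overshoots the circle in its corners, so some results can be a bit more than the requested radius away.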
I know that this post is very old, but while doing some research for a client I found some useful functionality in the Google Maps API that is simple to implement. You just pass the origin and destination ZIP codes in the URL, and it calculates the distance, even taking traffic into account. You can use it from any language:
http://maps.googleapis.com/maps/api/distancematrix/json?origins=90210&destinations=93030&mode=driving&language=en-EN&sensor=false
Following the link, you can see that it returns JSON. Remember that you need an API key to use this on your own hosting.
Source:
http://stanhub.com/find-distance-between-two-postcodes-zipcodes-driving-time-in-current-traffic-using-google-maps-api/
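A small sketch of building that request URL in Python. It only assembles the URL and makes no network call; the function name is made up, and the exact set of required parameters and the response layout should be checked against Google's current documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"

def distance_matrix_url(origin_zip, destination_zip, api_key, mode="driving"):
    """Build a Distance Matrix request URL for one origin/destination pair."""
    params = {
        "origins": origin_zip,
        "destinations": destination_zip,
        "mode": mode,
        "key": api_key,  # placeholder; supply your own key
    }
    return BASE_URL + "?" + urlencode(params)
```

Keep in mind this returns travel distance rather than straight-line distance, and calling it per user pair would be far slower than a local calculation.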
You could divide your space into regions of roughly equal size -- for instance, approximate the earth as a buckyball or icosahedron. The regions could even overlap a bit, if that's easier (e.g. make them circular). Record which region(s) each ZIP code is in. Then you can precalculate the maximum distance possible between every region pair, which has the same O(n^2) problem as calculating all the ZIP code pairs, but for smaller n.
Now, for any given ZIP code, you can get a list of regions that are definitely within your given range, and a list of regions that cross the border. For the former, just grab all the ZIP codes. For the latter, drill down into each border region and calculate against individual ZIP codes.
It's certainly more complex mathematically, and in particular the number of regions would have to be chosen for a good balance between the size of the table vs. the time spent calculating on the fly, but it reduces the size of the precalculated table by a good margin.
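A rough sketch of the region idea, using the circular-region variant. Note the swap: instead of a precalculated region-to-region table, this version classifies each region on the fly from its center and radius, which relies on the same inside / border / outside split. The data layout and names are assumptions for illustration.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(a))

def zips_in_range(point, max_miles, regions):
    """regions: list of (center, radius_miles, {zip: (lat, lon)}).

    Assumes every ZIP code lies within radius_miles of its region's center.
    """
    result = []
    for center, radius, members in regions:
        d = haversine_miles(*point, *center)
        if d + radius <= max_miles:
            # Whole region is definitely in range: take every ZIP code.
            result.extend(members)
        elif d - radius <= max_miles:
            # Region crosses the border: check each ZIP code individually.
            result.extend(z for z, pos in members.items()
                          if haversine_miles(*point, *pos) <= max_miles)
        # Otherwise the whole region is out of range and is skipped.
    return sorted(result)
```

Only the border regions cost a per-ZIP calculation, which is where the savings come from.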
I would use latitude and longitude. For example, if you have a latitude of 45 and a longitude of 45 and were asked to find matches within 50 miles, then you could do it by moving 50/69ths of a degree up in latitude and 50/69ths of a degree down in latitude (1 deg latitude ~ 69 miles). Select zip codes with latitudes in this range. Longitudes are a little different, because a degree of longitude covers less distance as you move closer to the poles.
But at 45 deg latitude, 1 deg longitude ~ 49 miles, so you could move 50/49ths of a degree left in longitude and 50/49ths of a degree right in longitude, and select all zip codes from the latitude set that fall within this longitude range. This gives you all zip codes within a square with sides of a hundred miles. If you wanted to be really precise, you could then use the Haversine formula which you mentioned to weed out zips in the corners of the box, to give you a circle.
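The 69 and 49 figures above come from the fact that a degree of longitude shrinks with the cosine of the latitude. A quick sketch of that arithmetic (function names are made up, and 69 miles per degree of latitude is the rough figure from this answer):

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # rough figure used in this answer

def miles_per_degree_lon(lat_deg):
    """Approximate width of one degree of longitude at a given latitude."""
    return MILES_PER_DEGREE_LAT * math.cos(math.radians(lat_deg))

def degree_offsets(lat_deg, miles):
    """How many degrees of latitude and longitude correspond to `miles`."""
    return miles / MILES_PER_DEGREE_LAT, miles / miles_per_degree_lon(lat_deg)

print(round(miles_per_degree_lon(45)))  # 49
```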
Not every possible pair of zip codes is going to be used. I would build zipdistance as a 'cache' table. For each request, calculate the distance for that pair and save it in the cache. When a request for a distance pair comes in, first look in the cache, then compute the distance if it's not there.
I do not know the intricacies of distance calculations, so I would also check whether computing on the fly is cheaper than looking up (also taking into consideration how often you have to compute).
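A minimal sketch of the look-up-then-compute idea, with an in-memory dict standing in for the cache table. The class and attribute names are assumptions, and the Haversine function is just one possible distance calculation.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(a))

class DistanceCache:
    """Computes a zip-pair distance on first request and remembers it."""

    def __init__(self, coords):
        self.coords = coords  # zip code -> (lat, lon)
        self.cache = {}       # stands in for the zipdistance table
        self.misses = 0       # how many distances had to be computed

    def distance(self, zip_a, zip_b):
        # Distance is symmetric, so store each pair under one ordered key.
        key = (zip_a, zip_b) if zip_a <= zip_b else (zip_b, zip_a)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = haversine_miles(*self.coords[zip_a], *self.coords[zip_b])
        return self.cache[key]
```

Comparing `misses` against the total number of calls gives a rough sense of whether the cache is earning its keep compared with computing every time.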
I have the problem running great, and pretty much everyone's answer got used. I was thinking about this in terms of the old solution instead of just "starting over." Babtek gets the nod for stating it in the simplest terms.
I'll skip the code because I'll provide references to derive the needed formulas, and there is too much to cleanly post here.
1. Consider Point A on a sphere, represented by latitude and longitude. Figure out the North, South, East, and West edges of a box 2X miles across with Point A at the center.
2. Select all points within the box from the ZipCode table. This includes a simple WHERE clause with two BETWEEN conditions limiting by Lat and Long.
3. Use the haversine formula to determine the spherical distance between Point A and every point B returned in step 2.
4. Discard all points B where distance A -> B > X.
5. Select users where ZipCode is in the remaining set of points B.
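The steps above can be sketched end to end in Python with SQLite. This is not the code I ran: the table and column names follow the schema in the question, while the function names, the NearbyZip temp table, and the 69 miles per degree figure are assumptions for illustration.

```python
import math
import sqlite3

def haversine_miles(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(a))

def users_within(conn, zip_code, miles):
    """UserIds living in zip codes within `miles` of zip_code (that zip included)."""
    lat, lon = conn.execute(
        "SELECT Latitude, Longitude FROM ZipCode WHERE ZipCode = ?",
        (zip_code,)).fetchone()
    # Step 1: edges of a box 2 * miles across, centered on Point A.
    dlat = miles / 69.0
    dlon = miles / (69.0 * math.cos(math.radians(lat)))
    # Step 2: cheap pre-filter with two BETWEEN conditions.
    candidates = conn.execute(
        "SELECT ZipCode, Latitude, Longitude FROM ZipCode "
        "WHERE Latitude BETWEEN ? AND ? AND Longitude BETWEEN ? AND ?",
        (lat - dlat, lat + dlat, lon - dlon, lon + dlon)).fetchall()
    # Steps 3 and 4: exact distance, dropping the corners of the box.
    nearby = [(z,) for z, zlat, zlon in candidates
              if haversine_miles(lat, lon, zlat, zlon) <= miles]
    # Step 5: join through a temp table rather than a huge IN list.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS NearbyZip (ZipCode TEXT)")
    conn.execute("DELETE FROM NearbyZip")
    conn.executemany("INSERT INTO NearbyZip VALUES (?)", nearby)
    rows = conn.execute(
        'SELECT u.UserId FROM "User" AS u '
        "JOIN NearbyZip AS n ON u.ZipCode = n.ZipCode ORDER BY u.UserId").fetchall()
    return [row[0] for row in rows]
```

To list only other users, filter the requesting user's UserId out of the result.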
This is pretty fast for > 100 miles. Longest result was ~ 0.014 seconds to calculate the match, and trivial to run the select statement.
Also, as a side note, it was necessary to implement the math in a couple of functions and call them in SQL. Once I got past a certain distance the matching number of ZipCodes was too large to pass back to SQL and use as an IN statement, so I had to use a temp table and join the resulting ZipCodes to User on the ZipCode column.
I suspect that using a ZipDistance table will not provide a long-term performance gain. The number of rows just gets really big. If you calculate the distance from every zip code to every other zip code (eventually) then the resultant row count from 40,000 zip codes would be ~ 1.6B. Whoa!
Alternatively, I am interested in using SQL's built-in geography type to see if that will make this easier, but good old int/float types served fine for this sample.
So... final list of online resources I used, for your easy reference:
Maximum Difference, Latitude and Longitude.
The Haversine Formula.
Lengthy but complete discussion of the whole process, which I found from Googling stuff in your answers.