Is there an efficient way to calculate distances between a list of coordinates in Python?

Posted 2025-01-09 14:38:11

My data comes with zip/post code, longitude, and latitude information. I want to calculate the distance from one zip code to all the others, then repeat that for every zip code, without duplicated distance values, in Python. I can already do the distance calculation with the geosphere R library, but my objective is to get zip code distances from coordinates in Python. I found that GeoPandas or Geod might provide built-in functions for this, but I am still not getting the same output that I got from the R implementation. Does anyone know how to compute coordinate distances in Python? Can anyone suggest a possible workaround?

Minimal data

Here is the minimal data that I used in R for the distance calculation.

> dput(df)
structure(list(post_code = c(201L, 311L, 312L, 313L, 314L, 315L, 
317L, 318L, 319L, 370L, 371L, 372L, 373L, 374L, 390L, 391L, 392L, 
396L, 397L, 398L), latitude = c(30.82, 32.08, 32.39, 32.31, 32.38, 
32.31, 32.29, 32.14, 32.2, 32.13, 32.29, 32.38, 32.16, 32.16, 
32.18, 32.19, 32.19, 32.36, 32.27, 32.07), longtitude = c(-83.03, 
-82.62, -82.52, -82.52, -82.52, -82.1, -82.33, -82.92, -82.34, 
-82.2, -82.94, -82.82, -82.61, -82.39, -82.58, -82.86, -82.56, 
-82.89, -82.69, -82.5)), row.names = c(NA, 20L), class = "data.frame")

Current R attempt

Here is my current R implementation for calculating the distance between different postal codes; essentially, I want the distance from each zip/post code to every other one.

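library(geosphere)
df_src <- df
df_trg <- df

colnames(df_src) <- c("src_post_code", "src_lat", "src_long")
colnames(df_trg) <- c("trg_post_code", "trg_lat", "trg_long")

# Haversine distance from one post code to every post code in df_trg
get_distance <- function(post_code, scale = 1e-5) {
    tmp <- df_src[df_src$src_post_code == post_code, ]
    # distHaversine expects (longitude, latitude) columns and returns metres
    dist <- distHaversine(as.matrix(tmp[, c("src_long", "src_lat")]),
                          as.matrix(df_trg[, c("trg_long", "trg_lat")]))
    data.frame(
        src_post_code = post_code,
        trg_post_code = df_trg$trg_post_code,
        dist = dist * scale  # scaling factor kept from the original post
    )
}

# one block of rows per source post code, stacked into a single data frame
final_output <- do.call(rbind, lapply(df_src$src_post_code, get_distance))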

But this approach is not very efficient: the actual list has 40k+ post codes, and the calculation is a heavy computational burden even with parallel processing.

My objective is to do this in Python by porting the R logic above. I think Geod or GeoPandas might help, but I am still not getting the same output in Python. Can anyone point out how to find the coordinate distance from each zip/post code to all the others?

By "recursively" I mean something like the graph below:

The tabular view on the left shows what the original input data looks like; the graph on the right shows how I want to find the coordinate distance from one post code to the rest, repeated for each post code, in Python.

Current Python attempt:

from pyproj import Geod
import pandas as pd

# WGS84 ellipsoid used for the geodesic calculation
wgs84 = Geod(ellps='WGS84')

gist='https://gist.githubusercontent.com/adamFlyn/8f89821df2c09e3196849095d6203e07/raw/6348a43252966be69d4e2c826aaa1c39e113c899/zip_code_data.csv'
df = pd.read_csv(gist, index_col=0)

# Geod.inv expects lons1, lats1, lons2, lats2 (longitude first) and returns metres
df_coord = df[['src_long', 'src_lat', 'trg_long', 'trg_lat']].to_numpy().T
df['dist'] = wgs84.inv(*df_coord)[-1] / 1000  # kilometres

But the output is not the same as the one from the R code. Can anyone suggest a better or more efficient way of doing this in Python?
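One likely source of the mismatch is that the two libraries compute different things: `distHaversine` treats the Earth as a sphere (its documented default radius is 6378137 m), while `Geod(ellps="WGS84").inv` returns geodesic distances on the ellipsoid, so small differences are expected even with identical inputs. Both also expect longitude before latitude. The sketch below is a plain-Python haversine that should mirror the R function under those assumptions; the name `haversine_m` and its (lat, lon) argument order are illustrative choices, not part of either library.

```python
import math

# Assumed to match the default radius documented for geosphere::distHaversine
EARTH_RADIUS_M = 6378137.0

def haversine_m(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

# Post codes 312 and 314 from the sample data differ only by 0.01 degrees of latitude
d = haversine_m(32.39, -82.52, 32.38, -82.52)
```

Comparing a few values from this function against the R output is a quick way to check whether the remaining difference comes from the distance model or from swapped coordinate columns.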

Update

I tried @Benoit Fgt's solution below on the actual data, which has 40k+ zip codes with lat/long info, and it gave me a memory error. Is there a way to do parallel processing in Python?
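The memory error is plausible from the sizes alone: a full 40,000 x 40,000 matrix of float64 values needs about 12.8 GB, and parallel processing would not by itself reduce that. One possible workaround is to compute the distances one block of rows at a time and write out or filter each block before moving on. The following is a minimal NumPy sketch of that idea, assuming a spherical Earth with a 6371 km radius; the function name `pairwise_haversine_blocks` and the `block` parameter are hypothetical.

```python
import numpy as np

def pairwise_haversine_blocks(lat_deg, lon_deg, block=1024, radius_km=6371.0):
    """Yield (i, j, dist_km) arrays for all pairs i < j, one block of rows at a time."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    n = lat.size
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Broadcast the current rows against all n points: at most a (block, n) matrix
        dlat = lat[None, :] - lat[start:stop, None]
        dlon = lon[None, :] - lon[start:stop, None]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(lat[start:stop, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
        dist = 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        # Keep only columns to the right of the diagonal so each pair appears once
        rows, cols = np.nonzero(np.arange(n)[None, :] > np.arange(start, stop)[:, None])
        yield rows + start, cols, dist[rows, cols]

# Usage with three of the sample points (post codes 312, 313, 314)
lats = [32.39, 32.31, 32.38]
lons = [-82.52, -82.52, -82.52]
pairs = [(int(i), int(j), float(d))
         for ii, jj, dd in pairwise_haversine_blocks(lats, lons, block=2)
         for i, j, d in zip(ii, jj, dd)]
```

Even without duplicates, 40k points give roughly 800 million pairs, so collecting every distance in one list or DataFrame would still take several gigabytes; in practice each yielded block would be appended to disk or reduced (for example, keeping only pairs under some distance) rather than accumulated as in the small usage example.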


撩心不撩汉 2025-01-16 14:38:11

Not a full answer, just something to test.

Try with sklearn:

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# gist='https://gist.githubusercontent.com/adamFlyn/...'
df = pd.read_csv(gist, index_col=0)
# the haversine metric expects (latitude, longitude) pairs in radians
coords = np.radians(df[['latitude', 'longtitude']].to_numpy())

tree = BallTree(coords, metric='haversine')

# distances come back in radians; multiply by the Earth's radius (about 6371 km) for km
distances, indices = tree.query(coords, k=len(df))

Do you get a memory error with this code?


泼猴你往哪里跑 2025-01-16 14:38:11

I am not sure what the expected output from the sample is. How about something like this (the code can probably be improved):

from pyproj import Geod
import pandas as pd
import itertools

geod = Geod(ellps="WGS84")
gist='https://gist.githubusercontent.com/adamFlyn/8f89821df2c09e3196849095d6203e07/raw/6348a43252966be69d4e2c826aaa1c39e113c899/zip_code_data.csv'
df = pd.read_csv(gist)

# (longitude, latitude, post_code) triples; Geod.inv expects longitude first
coords = list(zip(df.longtitude, df.latitude, df.post_code))
# every unordered pair of post codes, exactly once
combs = list(itertools.combinations(coords, 2))
lons1, lats1, lons2, lats2, zip_code_from, zip_code_to = [], [], [], [], [], []
for pair in combs:
    lons1.append(pair[0][0])
    lats1.append(pair[0][1])
    lons2.append(pair[1][0])
    lats2.append(pair[1][1])
    zip_code_from.append(pair[0][2])
    zip_code_to.append(pair[1][2])

# geodesic distances in metres on the WGS84 ellipsoid
az12, az21, dist = geod.inv(lons1, lats1, lons2, lats2)
pd.DataFrame(list(zip(zip_code_from, zip_code_to, dist)), columns=['pos_code_from', 'pos_code_to', 'dist']).sort_values('dist')

This results in a DataFrame with one row per unordered pair of post codes (pos_code_from, pos_code_to, dist), sorted by distance in metres. For example, post codes 312 and 314 in the sample data differ only by 0.01° of latitude, so they come out roughly 1.1 km apart.
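The explicit Python loop over `itertools.combinations` can likely be replaced by index arrays, which scales better as the number of post codes grows. The sketch below uses `np.triu_indices` to enumerate every pair once; the three sample rows and the variable names are only for illustration.

```python
import numpy as np

# Three rows of the sample data (post codes 312, 313, 314)
post_code = np.array([312, 313, 314])
lat = np.array([32.39, 32.31, 32.38])
lon = np.array([-82.52, -82.52, -82.52])

# Row/column indices of the strict upper triangle: every pair (i, j) with i < j, once
i, j = np.triu_indices(len(post_code), k=1)

code_from, code_to = post_code[i], post_code[j]
lon1, lat1, lon2, lat2 = lon[i], lat[i], lon[j], lat[j]
```

The four coordinate arrays can then be passed in one call as `geod.inv(lon1, lat1, lon2, lat2)`, which accepts arrays, and `code_from` and `code_to` label the resulting distances.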
