Is there a workaround to efficiently calculate distances between lists of coordinates in Python?
I have data that comes with zip/post code, longitude, and latitude info. I want to calculate the distance between one zip code and the rest, then do the same recursively without duplicated distance values, in Python. I am able to use the geosphere R library for the distance calculation, but my objective is to get zip code distances by coordinates in Python. I found that GeoPandas or Geod might provide built-in functions to calculate coordinate distances, but I still don't get the same output that I got from the R implementation. Does anyone know how to find coordinate distances in Python? Can anyone suggest a possible workaround to do this? Any thoughts?
minimal data
here is the minimal data that I used in R for distance calculation.
> dput(df)
structure(list(post_code = c(201L, 311L, 312L, 313L, 314L, 315L,
317L, 318L, 319L, 370L, 371L, 372L, 373L, 374L, 390L, 391L, 392L,
396L, 397L, 398L), latitude = c(30.82, 32.08, 32.39, 32.31, 32.38,
32.31, 32.29, 32.14, 32.2, 32.13, 32.29, 32.38, 32.16, 32.16,
32.18, 32.19, 32.19, 32.36, 32.27, 32.07), longtitude = c(-83.03,
-82.62, -82.52, -82.52, -82.52, -82.1, -82.33, -82.92, -82.34,
-82.2, -82.94, -82.82, -82.61, -82.39, -82.58, -82.86, -82.56,
-82.89, -82.69, -82.5)), row.names = c(NA, 20L), class = "data.frame")
current R attempt
here is my current R implementation to calculate the distance between different postal codes; essentially, I want to calculate the post code distance from one to another recursively.
library(geosphere)

df_src <- df
df_trg <- df
colnames(df_src) <- c("src_post_code", "src_lat", "src_long")
colnames(df_trg) <- c("trg_post_code", "trg_lat", "trg_long")

get_distance <- function(post_code, radius = 1e-5) {
  tmp <- df_src[df_src$src_post_code == post_code, ]
  # distHaversine() expects (longitude, latitude) pairs and returns metres
  dist <- distHaversine(tmp[, c("src_long", "src_lat")],
                        df_trg[, c("trg_long", "trg_lat")])
  res <- data.frame(
    src_post_code = post_code,
    trg_post_code = df_trg$trg_post_code,
    lat  = df_trg$trg_lat,
    long = df_trg$trg_long,
    dist = dist * radius
  )
  return(res)
}

# one data frame per source post code, stacked into long format
final_output <- do.call(rbind, lapply(df_src$src_post_code, get_distance))
But doing it this way is not very efficient, because the actual list of post codes is 40k+, and this calculation is a computational burden even with parallel processing.
My objective is to do this in Python by porting the above R logic. I think Geod or GeoPandas might help me with that while still getting the same output. Can anyone point out how to find zip/post code coordinate distances from one to another recursively? Any thoughts?
By recursively I mean something like the graph below:
The tabular view on the left shows what the original input data looks like; the graph on the right shows how I want to find the coordinate distance from one post code to the rest, recursively, in Python (a small illustration of the pair structure follows).
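To make the idea concrete without the graph, here is a tiny sketch (my own illustration, using three of the sample post codes) of the pair structure I am after: each unordered pair of post codes appears exactly once, with no duplicated reverse pairs.

# Illustration only: combinations() yields each unordered pair once,
# e.g. (201, 311), (201, 312), (311, 312) -- never (311, 201) again.
from itertools import combinations

post_codes = [201, 311, 312]
for src, trg in combinations(post_codes, 2):
    print(src, trg)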
current python attempt:
from pyproj import Geod
import pandas as pd

gist = 'https://gist.githubusercontent.com/adamFlyn/8f89821df2c09e3196849095d6203e07/raw/6348a43252966be69d4e2c826aaa1c39e113c899/zip_code_data.csv'
df = pd.read_csv(gist, index_col=0)

wsg84 = Geod(ellps='WGS84')  # geodesic calculations on the WGS84 ellipsoid
# note: Geod.inv() expects (lons1, lats1, lons2, lats2); the columns below are passed latitude-first
df_coord = df[['src_lat', 'src_long', 'trg_lat', 'trg_long']].to_numpy().T
df['dist'] = wsg84.inv(*df_coord)[-1] / 1000
But the output is not the same as the one from the R code. Can anyone suggest a better way of doing this? Any better idea or approach to do this efficiently in Python?
update
I tried @Benoit Fgt's solution below on my actual data, which has 40k+ zip codes and lat/long info, and it gave me a memory error instead. Is there a way to do parallel processing in Python? Any idea?
Comments (2)
Not a full answer, just something to test.
Try with sklearn:
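A minimal sketch of the sklearn idea (my own reconstruction, not the exact code from this comment), assuming the sample's latitude/longtitude columns and an Earth radius of 6371 km:

# Sketch only: full pairwise distance matrix with scikit-learn's
# haversine_distances; inputs must be (lat, lon) in radians, and the
# unit-sphere result is scaled by the Earth radius (assumed 6371 km).
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

# first rows of the sample data from the question
df = pd.DataFrame({
    "post_code":  [201, 311, 312, 313, 314],
    "latitude":   [30.82, 32.08, 32.39, 32.31, 32.38],
    "longtitude": [-83.03, -82.62, -82.52, -82.52, -82.52],
})

coords = np.radians(df[["latitude", "longtitude"]].to_numpy())
dist_km = haversine_distances(coords) * 6371.0  # n x n matrix in kilometres
print(pd.DataFrame(dist_km, index=df["post_code"], columns=df["post_code"]))

For 40k+ post codes this builds a 40k x 40k matrix, which is where memory becomes the concern.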
Do you get a memory error with this code?
I am not sure what the expected output from the sample is. How about something like this (the code can be improved, I guess):
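A rough sketch of that idea (my reconstruction, not the answer's exact code), computing each unique pair once with pyproj's Geod on the sample data:

# Sketch only: one geodesic distance per unique post-code pair (no
# duplicated reverse pairs), using pyproj's Geod on the WGS84 ellipsoid.
from itertools import combinations

import pandas as pd
from pyproj import Geod

# first rows of the sample data from the question
df = pd.DataFrame({
    "post_code":  [201, 311, 312, 313, 314],
    "latitude":   [30.82, 32.08, 32.39, 32.31, 32.38],
    "longtitude": [-83.03, -82.62, -82.52, -82.52, -82.52],
})

geod = Geod(ellps="WGS84")

rows = []
for (_, src), (_, trg) in combinations(df.iterrows(), 2):
    # inv() takes (lons1, lats1, lons2, lats2) and returns azimuths plus distance in metres
    _, _, dist_m = geod.inv(src["longtitude"], src["latitude"],
                            trg["longtitude"], trg["latitude"])
    rows.append((int(src["post_code"]), int(trg["post_code"]), dist_m / 1000))

pairs = pd.DataFrame(rows, columns=["src_post_code", "trg_post_code", "dist_km"])
print(pairs)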
This results in a long-format table with one row per unique post-code pair and its distance in kilometres.