Fast way to match geospatial datasets in Python
I have a set of 2000 geospatial points (lon/lat), which I need to match with several other geospatial datasets (I am using Geopandas GeoDataFrames). I am using the sklearn BallTree function to find the neighbors within a certain radius of each point (in the function below, point is one of the 2000 points and right_gdf is the dataset that I need to get the neighbors from).
I am currently using a for-loop to go through all 2000 points and find the neighbors for each of them. However, depending on the size of right_gdf, this can take a long time. I am sure there is a way to speed this process up, potentially with parallel computing, but I am struggling to find it. I tried to use Dask delayed to parallelise the loop (see code below), but somehow this takes even longer than the simple for-loop.
import geopandas as gpd
from dask import compute, delayed
from sklearn.neighbors import BallTree

# Function that finds a point's neighbors within a certain radius
def neighbours_radius(point, right_gdf, R=1):
    # Create tree from the right gdf (use haversine for lat/lon coordinates)
    tree = BallTree(right_gdf, leaf_size=40, metric='haversine')
    # Find indices of all neighbors within R
    indices = tree.query_radius(point, r=R)[0]
    return indices

# Function that loops through the 2000 points
def knn_gpd(right_gdf, R=75):
    # Load the gdf with the 2000 points
    base = gpd.read_file(...)
    # Empty list to fill in the indices of the neighbors
    neighbors = []
    # Loop through the points and find the neighbors within R.
    for i in range(len(base)):
        point = base.iloc[i:i+1, :]
        ind = neighbours_radius(point, right_gdf, R=R)
        # Append index list
        neighbors.append(ind)
    return neighbors

# Function that loops through the 2000 points with Dask delayed
def knn_gpd_dask(right_gdf, R=75):
    # Load the gdf with the 2000 points
    base = gpd.read_file(...)
    # Empty list to fill in the (delayed) indices of the neighbors
    neighbors = []
    # Loop through the points and find the neighbors within R.
    for i in range(len(base)):
        point = base.iloc[i:i+1, :]
        ind = delayed(neighbours_radius)(point, right_gdf, R=R)
        # Append delayed index list
        neighbors.append(ind)
    # Trigger the delayed computations
    result = compute(neighbors)
    return result
Can anyone help me speed up this process?
Comments (1)
If you profile your code, I suspect you will find that creating the BallTree is taking up most of the time, because you are creating it 2000 times. You should try to create it only once, like this for example:
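A minimal sketch of the build-once idea follows; it is an illustration under stated assumptions, not the answerer's code. It assumes both datasets are available as (lon, lat) arrays in degrees (for a GeoDataFrame of points, something like `np.column_stack([gdf.geometry.x, gdf.geometry.y])`), that R is meant in kilometres, and a mean Earth radius of 6371 km. The helper names `to_radians_latlon` and `neighbours_within_radius` are hypothetical. scikit-learn's haversine metric expects [lat, lon] in radians, so the sketch converts before building the tree, and `query_radius` accepts all query points in one call, so no Python loop is needed.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Assumption: mean Earth radius, used to convert kilometres to radians.
EARTH_RADIUS_KM = 6371.0


def to_radians_latlon(lon_deg, lat_deg):
    """Stack lon/lat in degrees into the [lat, lon] radians layout haversine expects."""
    return np.deg2rad(np.column_stack([lat_deg, lon_deg]))


def neighbours_within_radius(base_lonlat, right_lonlat, radius_km=75.0):
    """For each base point, return the indices of right points within radius_km."""
    base = np.asarray(base_lonlat, dtype=float)
    right = np.asarray(right_lonlat, dtype=float)

    # Build the tree ONCE from the dataset being searched.
    tree = BallTree(
        to_radians_latlon(right[:, 0], right[:, 1]),
        leaf_size=40,
        metric="haversine",
    )

    # A single call handles every query point; r must be in radians.
    return tree.query_radius(
        to_radians_latlon(base[:, 0], base[:, 1]),
        r=radius_km / EARTH_RADIUS_KM,
    )
```

The result is an array with one entry per base point, each entry being an array of row positions in the searched dataset. With the tree built only once, Dask may well be unnecessary for 2000 query points, though that is worth confirming by timing on the real data.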