Improving the performance of LineString creation that is currently done with a lambda function
I have a dataframe like this (this example has only four rows, but in practice it has O(10^6) rows):
DF:
   nodeid   lon    lat  wayid
0       1  1.70  42.10     52
1       2  1.80  42.30     52
2       3  1.75  42.20     53
3       4  1.72  42.05     53
I need to group by wayid and concatenate the lon and lat columns of each element in the group, to obtain an output like this:
output:
wayid
52 LINESTRING (1.7 42.1, 1.8 42.3)
53 LINESTRING (1.75 42.2, 1.72 42.05)
dtype: object
I can create the example DataFrame by:
import numpy as np
import pandas as pd
DF = pd.DataFrame([[1, 1.7, 42.1, 52], [2, 1.8, 42.3, 52], [3, 1.75, 42.2, 53], [4, 1.72, 42.05, 53]])
DF.columns = ['nodeid', 'lon', 'lat', 'wayid']
I can obtain the desired output by applying a lambda function like this:
from shapely.geometry import LineString
DF.groupby('wayid').apply(lambda r: LineString(np.array(r[['lon','lat']])))
However, this is quite a slow process and I need to improve it somehow (besides, a warning message appears).
Any ideas on how I can obtain the same result with better performance?
NOTE: in case it helps to design a better solution, what I ultimately need is a GeoDataFrame like this:
import geopandas as gp
GDF = gp.GeoDataFrame(geometry=DF.groupby('wayid')\
    .apply(lambda r: LineString(np.array(r[['lon','lat']]))))
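One possible direction, which is not part of the original post, is to avoid calling a Python function once per group. The sketch below assumes shapely 2.0 or later, where shapely.linestrings accepts an indices argument that assigns each coordinate row to a geometry. The names ordered, codes, ways and lines are made up for illustration, and the speed-up on 10^6 rows has not been measured here.

```python
import pandas as pd
import shapely  # assumption: shapely >= 2.0, which provides shapely.linestrings

DF = pd.DataFrame(
    [[1, 1.7, 42.1, 52], [2, 1.8, 42.3, 52], [3, 1.75, 42.2, 53], [4, 1.72, 42.05, 53]],
    columns=["nodeid", "lon", "lat", "wayid"],
)

# A stable sort keeps the original node order inside each way
# and makes the group codes below non-decreasing.
ordered = DF.sort_values("wayid", kind="stable")

# codes[i] is the position of row i's way in `ways` (0, 0, 1, 1 for this data).
codes, ways = pd.factorize(ordered["wayid"])

# Build every LineString in a single vectorized call.
# Each way needs at least two nodes, as with LineString itself.
geoms = shapely.linestrings(ordered[["lon", "lat"]].to_numpy(), indices=codes)

lines = pd.Series(geoms, index=pd.Index(ways, name="wayid"))
```

The resulting Series can then be passed as the geometry argument of gp.GeoDataFrame, in the same way as the groupby result in the question.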
Comments (1)
Looks reasonable. I simulated a dataframe with 10^6 lat/lon pairs. There is a small optimisation: the np.array call can be dropped, because r is a DataFrame whose underlying NumPy array is available through .values
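The answer's original code did not survive on this page, so the following is only a minimal sketch of the tweak it describes, using the four-row example from the question. Selecting the lon and lat columns on the groupby object before apply is an assumption made here, not something the answer states. It keeps the grouping column out of each group r.

```python
import pandas as pd
from shapely.geometry import LineString

DF = pd.DataFrame(
    [[1, 1.7, 42.1, 52], [2, 1.8, 42.3, 52], [3, 1.75, 42.2, 53], [4, 1.72, 42.05, 53]],
    columns=["nodeid", "lon", "lat", "wayid"],
)

# Each group r is a DataFrame holding only lon and lat, so r.values is already
# an (N, 2) NumPy array and no explicit np.array(...) call is needed.
lines = DF.groupby("wayid")[["lon", "lat"]].apply(lambda r: LineString(r.values))
```

This still runs one Python call per way, so it trims overhead inside each call rather than removing the per-group loop.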
output