Improving the performance of LineString creation currently done with a lambda function

Posted on 2025-01-28 16:41:26

I have a dataframe like this (this example has only four rows, but in practice it has O(10^6) rows):

DF:

    nodeid   lon      lat   wayid
0        1  1.70    42.10      52
1        2  1.80    42.30      52
2        3  1.75    42.20      53
3        4  1.72    42.05      53

I need to group by wayid and concatenate the lon and lat columns of each element in the group, to obtain an output like this:

output:

wayid
52    LINESTRING (1.7 42.1, 1.8 42.3)
53    LINESTRING (1.75 42.2, 1.72 42.05)
dtype: object

I can create the example DataFrame by:

DF = pd.DataFrame([[1, 1.7, 42.1, 52], [2, 1.8, 42.3, 52], [3, 1.75, 42.2, 53], [4, 1.72, 42.05, 53]])
DF.columns = ['nodeid', 'lon', 'lat', 'wayid']

And I can obtain the desired output, applying a lambda function like this:

DF.groupby('wayid').apply(lambda r: LineString(np.array(r[['lon','lat']])))

However, this is quite a slow process, and I need to improve it somehow (it also triggers a warning message).

Any ideas on how I can obtain the same result with better performance?

NOTE: what I ultimately need is a GeoDataFrame like this:

GDF = gp.GeoDataFrame(geometry=DF.groupby('wayid')
                                 .apply(lambda r: LineString(np.array(r[['lon','lat']]))))

In case it helps to design a better solution.

Comments (1)

总以为 2025-02-04 16:41:26

Your approach looks reasonable. I simulated a DataFrame with 10^6 lon/lat pairs. There is a small optimisation available: skip the np.array() call, since r is already a DataFrame and its underlying NumPy array can be accessed directly with .values

import geopandas as gpd
import pandas as pd
import numpy as np
from shapely.geometry import LineString
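import warnings
warnings.filterwarnings('ignore')  # suppress all warnings

# Pick a random city as the centre of the simulated points.
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0,
# so this line needs an older GeoPandas release.
cities = gpd.read_file(gpd.datasets.get_path("naturalearth_cities"))
c = cities.sample(1)
p = c["geometry"].values[0]
SIZE = 1000
N = 20

# Random coordinates within 2 degrees of the city, and a SIZE x SIZE grid
# assigning each of the 10^6 cells a random way id in [0, SIZE//N).
x = np.random.uniform(p.x - 2, p.x + 2, SIZE)
y = np.random.uniform(p.y - 2, p.y + 2, SIZE)
grid = np.random.randint(0, SIZE//N, size=[SIZE, SIZE])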

DF = pd.DataFrame(
    [
        {"wayid": way, "lon": x[g[1]], "lat": y[g[0]]}
        for way in range(SIZE//N)
        for g in np.argwhere(grid == way)
    ]
)

# %timeit is an IPython/Jupyter magic: time the original version, then the .values version
%timeit GEOMETRY = DF.groupby('wayid').apply(lambda r: LineString(np.array(r[['lon','lat']])))
%timeit GEOMETRY = DF.groupby('wayid').apply(lambda r: LineString(r.loc[:,["lon","lat"]].values))

GEOMETRY = DF.groupby('wayid').apply(lambda r: LineString(r.loc[:,["lon","lat"]].values))
GDF = gpd.GeoDataFrame(geometry=GEOMETRY, crs=cities.crs)

print(f"""DF: {DF.shape}
GDF: {GDF.shape}""")

output

137 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
134 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
DF: (1000000, 3)
GDF: (50, 1)
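
If the per-group Python lambda is still the bottleneck, one alternative worth timing is to build all geometries in a single vectorized call. The sketch below is not part of the answer above. It assumes Shapely 2.0 or later (for `shapely.linestrings` with its `indices` argument) and that every way has at least two points. The names `ordered`, `way_ids`, `group_idx` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd
import shapely  # assumption: Shapely >= 2.0, which provides vectorized constructors

# Same example data as in the question.
DF = pd.DataFrame(
    [[1, 1.7, 42.1, 52], [2, 1.8, 42.3, 52], [3, 1.75, 42.2, 53], [4, 1.72, 42.05, 53]],
    columns=["nodeid", "lon", "lat", "wayid"],
)

# A stable sort keeps the original node order within each way.
ordered = DF.sort_values("wayid", kind="stable")

# way_ids: sorted unique way ids; group_idx: 0-based group number of each row.
# Because the rows are sorted by wayid, group_idx is non-decreasing,
# which is what shapely.linestrings expects for its indices argument.
way_ids, group_idx = np.unique(ordered["wayid"].to_numpy(), return_inverse=True)

# Build every LineString in one call: row i of coords goes to geometry group_idx[i].
coords = ordered[["lon", "lat"]].to_numpy()
lines = shapely.linestrings(coords, indices=group_idx)

# Same shape of result as the groupby/apply version: one geometry per wayid.
result = pd.Series(lines, index=pd.Index(way_ids, name="wayid"))
print(result)
```

The resulting Series can be passed as `geometry=` to `GeoDataFrame` in the same way as the groupby result in the question. Whether this is actually faster on the real 10^6-row data depends on the number and size of the groups, so it is best checked with `%timeit` against the versions above.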