How can I move the work of a for loop into pandas' faster apply function?

Posted on 2025-02-01 08:19:09

I have a DataFrame with longitude and latitude columns. I need to get the county name for each location from its longitude and latitude values, using the geopy package.

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -114.31     34.19                15.0       5612.0          1283.0   
1    -114.47     34.40                19.0       7650.0          1901.0   
2    -114.56     33.69                17.0        720.0           174.0   
3    -114.57     33.64                14.0       1501.0           337.0   
4    -114.57     33.57                20.0       1454.0           326.0   

   population  households  median_income  median_house_value  
0      1015.0       472.0         1.4936             66900.0  
1      1129.0       463.0         1.8200             80100.0  
2       333.0       117.0         1.6509             85700.0  
3       515.0       226.0         3.1917             73400.0  
4       624.0       262.0         1.9250             65500.0

I had success with a for loop:

import geopy

geolocator = geopy.Nominatim(user_agent='1234')

# Reverse-geocode the first 10 rows, one network request per row
for index, row in df.iloc[:10, :].iterrows():
    location = geolocator.reverse([row["latitude"], row["longitude"]])
    county = location.raw['address']['county']
    print(county)

The dataset has 17,000 rows, so that should be a problem, right?

So I've been trying to figure out how to build a function that I could use with pandas' DataFrame.apply() in order to get quicker results.

def get_zipcodes():
    location = geolocator.reverse([row["latitude"], row["longitude"]])
    county = location.raw['address']['county']
    print(county)

counties = get_zipcodes()

I'm stuck and don't know how to use apply (or any other clever method) in here. Help is much appreciated.

七禾 2025-02-08 08:19:09

The pandas calculations in your code are unlikely to be the speed bottleneck when using geopy (see this answer to a different geopy question).
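To put a rough number on that, here is a back-of-the-envelope estimate. It assumes about one second per reverse-geocoding request, which is my assumption about a typical public Nominatim setup and not a measured value; `estimate_hours` is a hypothetical helper name.

```python
def estimate_hours(rows, seconds_per_request):
    """Total wall-clock time in hours if every row costs one request."""
    return rows * seconds_per_request / 3600


# Assumption: roughly 1 second per request (network latency plus any throttling).
hours = estimate_hours(17_000, 1.0)
print(f"{hours:.1f} hours")  # prints "4.7 hours"
```

Under that assumption the total is dominated by the requests themselves, so swapping iterrows() for apply() changes the syntax but barely changes the run time.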

However, if a significant number of rows may share the same latitude, longitude coordinates, you can use the @cache decorator from functools (available from Python 3.9; on older versions, @lru_cache(maxsize=None) is equivalent).

Here is how to use apply() on your dataframe with no special caching:

df["county"] = df.apply(lambda row: geolocator.reverse([row["latitude"], row["longitude"]]).raw["address"]["county"], axis=1)

Full test code:

import geopy
import pandas as pd

geolocator = geopy.Nominatim(user_agent='1234')

df = pd.DataFrame({
    'longitude': [-114.31, -114.47, -114.56, -114.57, -114.57],
    'latitude': [34.19, 34.40, 33.69, 33.64, 33.57],
    'housing_median_age': [15] * 5,
    'total_rooms': [1000] * 5,
    'total_bedrooms': [500] * 5,
    'population': [800] * 5,
})

print(df)

df["county"] = df.apply(lambda row: geolocator.reverse([row["latitude"], row["longitude"]]).raw["address"]["county"], axis=1)
print(df)

Input:

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population
0    -114.31     34.19                  15         1000             500         800
1    -114.47     34.40                  15         1000             500         800
2    -114.56     33.69                  15         1000             500         800
3    -114.57     33.64                  15         1000             500         800
4    -114.57     33.57                  15         1000             500         800

Output:

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population                 county
0    -114.31     34.19                  15         1000             500         800  San Bernardino County
1    -114.47     34.40                  15         1000             500         800  San Bernardino County
2    -114.56     33.69                  15         1000             500         800       Riverside County
3    -114.57     33.64                  15         1000             500         800       Riverside County
4    -114.57     33.57                  15         1000             500         800       Riverside County
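One caveat with the one-liner: indexing `.raw["address"]["county"]` directly raises an exception if a lookup returns no result or the address has no "county" entry. The sketch below shows one defensive way to extract the value. `county_or_none` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for geopy results so the example runs without network access; it assumes only that a result exposes a `.raw` dict, as in the code above.

```python
from types import SimpleNamespace


def county_or_none(location):
    """Return the county from a geopy-style result, or None if unavailable."""
    if location is None:  # the lookup found nothing for these coordinates
        return None
    address = location.raw.get("address", {})
    return address.get("county")  # None when the key is absent


# Hypothetical stand-ins for geocoder results (not real API responses).
with_county = SimpleNamespace(raw={"address": {"county": "Riverside County"}})
without_county = SimpleNamespace(raw={"address": {"state": "California"}})

print(county_or_none(with_county))     # prints "Riverside County"
print(county_or_none(without_county))  # prints "None"
print(county_or_none(None))            # prints "None"
```

With a real geolocator this could be used as `df.apply(lambda row: county_or_none(geolocator.reverse([row["latitude"], row["longitude"]])), axis=1)`, leaving missing counties as empty values instead of stopping the whole run.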

Here is how to use a decorator to cache results (i.e., to avoid going all the way out to a geopy server multiple times) for identical latitude, longitude coordinates:

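from functools import cache

@cache
def bar(lat, long):
    # Runs once per distinct (lat, long) pair; repeats are served from the cache
    return geolocator.reverse([lat, long]).raw["address"]["county"]

def foo(row):
    # A row (Series) is not hashable, so pass the two scalar coordinates instead
    return bar(row["latitude"], row["longitude"])

df["county"] = df.apply(foo, axis=1)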
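To check that the cache really saves requests, here is a small self-contained sketch. `fake_reverse` is a hypothetical stand-in for the network call (its county rule is made up for the demo), and `lru_cache(maxsize=None)` is used because it behaves like `@cache` and also works on Python versions before 3.9.

```python
from functools import lru_cache

calls = []  # records every simulated "network request"


def fake_reverse(lat, lon):
    """Hypothetical stand-in for geolocator.reverse(); no network access."""
    calls.append((lat, lon))
    # Made-up rule for the demo only, not real geography.
    return "Riverside County" if lat < 34.0 else "San Bernardino County"


@lru_cache(maxsize=None)
def county_for(lat, lon):
    return fake_reverse(lat, lon)


# Four rows, but only two distinct coordinate pairs.
coords = [(34.19, -114.31), (33.64, -114.57), (34.19, -114.31), (33.64, -114.57)]
counties = [county_for(lat, lon) for lat, lon in coords]
info = county_for.cache_info()

print(counties)
print(len(calls), info.hits, info.misses)  # prints "2 2 2"
```

The benefit scales with how many coordinates repeat: if every row has a unique latitude, longitude pair, the cache saves nothing.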
