How can I move the work of a for loop into pandas' apply() function for faster results?
I have a dataframe with longitude and latitude columns. I need to get the county name for each location from its longitude and latitude values, using the geopy package.
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -114.31     34.19                15.0       5612.0          1283.0
1    -114.47     34.40                19.0       7650.0          1901.0
2    -114.56     33.69                17.0        720.0           174.0
3    -114.57     33.64                14.0       1501.0           337.0
4    -114.57     33.57                20.0       1454.0           326.0

   population  households  median_income  median_house_value
0      1015.0       472.0         1.4936             66900.0
1      1129.0       463.0         1.8200             80100.0
2       333.0       117.0         1.6509             85700.0
3       515.0       226.0         3.1917             73400.0
4       624.0       262.0         1.9250             65500.0
I had success with a for loop:
import geopy

geolocator = geopy.Nominatim(user_agent='1234')
# Reverse-geocode the first 10 rows, one request per row
for index, row in df.iloc[:10, :].iterrows():
    location = geolocator.reverse([row["latitude"], row["longitude"]])
    county = location.raw['address']['county']
    print(county)
The dataset has 17,000 rows, so that should be a problem, right?
So I've been trying to figure out how to build a function which I could use in pandas.apply() in order to get quicker results.
def get_zipcodes():
    location = geolocator.reverse([row["latitude"], row["longitude"]])
    county = location.raw['address']['county']
    print(county)

counties = get_zipcodes()
I'm stuck and don't know how to use apply (or any other clever method) in here. Help is much appreciated.
Comments (1)
The pandas calculations in your code are unlikely to be the speed bottleneck when using geopy (see this answer to a different geopy question).
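To make that concrete, here is a back-of-the-envelope sketch. It assumes one reverse-geocoding request per row and roughly one second per request (the public Nominatim service asks clients to stay at or below one request per second). The function name and its default are illustrative only, not measurements of your setup.

```python
def estimated_hours(n_requests: int, seconds_per_request: float = 1.0) -> float:
    """Rough wall-clock time if every row needs its own network request."""
    return n_requests * seconds_per_request / 3600


# 17,000 rows at an assumed one second per request.
print(round(estimated_hours(17_000), 1))  # prints 4.7
```

Under that assumption the network round trips take hours, while the pandas overhead of iterrows() versus apply() is small by comparison. If you do need to space out requests, geopy ships a RateLimiter helper in geopy.extra.rate_limiter.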
However, if there's a possibility that there may be a significant number of rows with duplicate latitude, longitude coordinates, you can use the @cache (or @lru_cache(None)) decorator from functools.
Here is how to use apply() on your dataframe with no special caching:
Full test code:
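As a minimal sketch of what such a row-wise apply() call could look like (this is not the answerer's original code): FakeLocation and fake_reverse are hypothetical offline stand-ins for geopy, and the county names they return are made up so the example runs without a network call. With real geopy you would pass geolocator.reverse instead.

```python
import pandas as pd


class FakeLocation:
    """Hypothetical stand-in for the object returned by geolocator.reverse()."""

    def __init__(self, county):
        self.raw = {"address": {"county": county}}


def fake_reverse(query):
    """Hypothetical offline replacement for geolocator.reverse((lat, lon))."""
    lat, lon = query
    # Made-up rule, purely so the example runs without contacting a server.
    return FakeLocation("North County" if lat >= 34.0 else "South County")


def get_county(row, reverse):
    """Look up the county for one dataframe row; None if it is missing."""
    location = reverse((row["latitude"], row["longitude"]))
    if location is None:
        return None
    return location.raw.get("address", {}).get("county")


df = pd.DataFrame(
    {
        "longitude": [-114.31, -114.47, -114.56],
        "latitude": [34.19, 34.40, 33.69],
    }
)

# axis=1 hands each row to get_county; extra keyword arguments are forwarded.
df["county"] = df.apply(get_county, axis=1, reverse=fake_reverse)
print(df)
```

The key differences from the get_zipcodes() attempt are that the function takes the row as a parameter and returns the county instead of printing it, so apply() can collect the results into a new column.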
Input:
Output:
Here is how to use a decorator to cache results (i.e., to avoid going all the way out to a geopy server multiple times) for identical latitude, longitude coordinates:
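A minimal sketch of the caching idea, assuming the lookup is wrapped in a function that takes plain latitude and longitude values: slow_reverse is a hypothetical stand-in for a real geolocator.reverse() call, and the calls list exists only to count simulated server round trips.

```python
from functools import lru_cache

calls = []  # records every simulated trip to the geocoding server


def slow_reverse(lat, lon):
    """Hypothetical stand-in for a real geolocator.reverse((lat, lon)) call."""
    calls.append((lat, lon))
    return f"county for ({lat}, {lon})"  # placeholder value, not real data


@lru_cache(maxsize=None)  # on Python 3.9+, @cache is equivalent
def cached_county(lat, lon):
    """Only the first call per distinct (lat, lon) pair reaches slow_reverse."""
    return slow_reverse(lat, lon)


coords = [(33.64, -114.57), (33.57, -114.57), (33.64, -114.57)]
counties = [cached_county(lat, lon) for lat, lon in coords]

print(len(calls))  # prints 2: the repeated coordinate was served from the cache
```

The cached function takes two floats rather than a dataframe row because lru_cache needs hashable arguments and a pandas Series is not hashable. On a dataframe it could be called as df.apply(lambda r: cached_county(r["latitude"], r["longitude"]), axis=1).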