Python: create all combinations of data points and filter them with a function
I have a table of locations (currently in a dataframe) and want to calculate all combinations and their distance from each other.
Input:
ID | Lat | Lon |
---|---|---|
1 | 6.4355 | 53.2245 |
2 | 5.3434 | 50.2345 |
3 | 4.3434 | 51.2345 |
Desired Outcome:
ID1 | ID2 | distance |
---|---|---|
1 | 1 | 0 |
1 | 2 | 1 |
1 | 3 | 2 |
2 | 1 | 0 |
2 | 2 | 3 |
2 | 3 | 4 |
3 | 1 | 0 |
3 | 2 | 5 |
3 | 3 | 6 |
from math import radians, sin, cos, atan2, sqrt

def distance(lat1, lon1, lat2, lon2):
    # convert degrees to radians for the trig functions
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    R = 6373.0  # Earth radius in km
    # haversine formula
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return round(R * c)
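For example, the function can be sanity-checked on the first two sample rows (repeating the function here so the snippet runs on its own; the decimal commas in the table are read as decimal points):

```python
from math import radians, sin, cos, atan2, sqrt

# The distance function from the question, repeated so this snippet is standalone
def distance(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    R = 6373.0  # Earth radius in km
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return round(R * 2 * atan2(sqrt(a), sqrt(1 - a)))

# Rows 1 and 2 from the sample input
d = distance(6.4355, 53.2245, 5.3434, 50.2345)
print(d)  # a few hundred km, so this pair would be dropped by the 10 km filter
```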
Right now I loop through the dataframe twice in such an ugly way that I'm not even going to show it, but it works. The problem is that it gets terribly slow when the table grows, and I know there must be a faster way to do this.
I'd like to do this in standard python/pandas/numpy (as long as it's fast, I don't want to use obscure packages!).
Any help would be much appreciated.
Oh, and I want to filter on distance < 10 km; forgot to add that!
Here is my current code that I want to improve:
import pandas

df_distance = pandas.DataFrame(columns=['ID1', 'ID2', 'distance'])

""" first all ids paired with themselves """
for index, row in df.iterrows():
    df_new_row = pandas.DataFrame([{'ID1': row['ID'], 'ID2': row['ID'],
                                    'distance': 0, 'lat1': row['Lat'], 'lon1': row['Lon'],
                                    'lat2': row['Lat'], 'lon2': row['Lon']}])
    df_distance = pandas.concat([df_distance, df_new_row])

""" then every remaining pair, added in both directions """
for index1, row1 in df.iterrows():
    for index2, row2 in df.iterrows():
        if index2 > index1:
            dist = distance(row1['Lat'], row1['Lon'], row2['Lat'], row2['Lon'])
            if dist <= 10:  # keep only pairs closer than 10 km
                df_new_row = pandas.DataFrame([{'ID1': row1['ID'], 'ID2': row2['ID'],
                                                'distance': dist, 'lat1': row1['Lat'], 'lon1': row1['Lon'],
                                                'lat2': row2['Lat'], 'lon2': row2['Lon']},
                                               {'ID1': row2['ID'], 'ID2': row1['ID'],
                                                'distance': dist, 'lat1': row2['Lat'], 'lon1': row2['Lon'],
                                                'lat2': row1['Lat'], 'lon2': row1['Lon']}])
                df_distance = pandas.concat([df_distance, df_new_row])
Comments (2)
Generally, use
from itertools import combinations
Of course you can use a key, list comprehensions etc. to get the correct values depending on your input, but programming is still about solving puzzles - you now have everything you need :)
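As a minimal sketch of that idea (the sample ids and coordinates are taken from the question; the tuple layout is an assumption):

```python
from itertools import combinations

# (id, lat, lon) records standing in for the question's dataframe rows
points = [(1, 6.4355, 53.2245), (2, 5.3434, 50.2345), (3, 4.3434, 51.2345)]

# combinations yields each unordered pair exactly once, so the
# `if index2 > index1` check from the question is no longer needed
pairs = [(a[0], b[0]) for a, b in combinations(points, 2)]
print(pairs)  # [(1, 2), (1, 3), (2, 3)]
```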
Little off-topic warning:
Calculating all combinations (full routes) is O(n!), which basically means that if you have more than ~30 points (depending on your computer), forget about calculating it in your lifetime. But it should be fine for pairs, depending on how many of them you have, since that is just O(n²) :)
@Edit: Generally you won't reduce the O(n²) complexity, but generating a two-dimensional numpy matrix and calculating the distances across that structure will speed up the process a lot, because numpy pushes slices of data into the processor cache, which is the bottleneck in regular iterative problems. Note that if the data exceeds your RAM it will be slow anyway, so for large inputs you should keep the data you calculate on as compact as possible and not hold anything unnecessary.
Another thing you might consider is using popular methods to do this in more threads: just split the data and merge the results at the end.
Generally you can find books via Google about optimizing your code in 'low-level' ways :)
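A sketch of that matrix approach, broadcasting the haversine formula over all pairs at once (the 6373 km radius and the coordinates come from the question; the function name and layout are assumptions):

```python
import numpy as np

def pairwise_haversine(lat, lon, R=6373.0):
    """Distance matrix in km for every pair of points, computed without a Python loop."""
    lat, lon = np.radians(lat), np.radians(lon)
    # Broadcasting a (n, 1) column against a (n,) row yields all n*n deltas at once
    dlat = lat[:, None] - lat
    dlon = lon[:, None] - lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat[:, None]) * np.cos(lat) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

lat = np.array([6.4355, 5.3434, 4.3434])
lon = np.array([53.2245, 50.2345, 51.2345])
dist = pairwise_haversine(lat, lon)  # dist[i, j] = km between point i and point j
```

The result is symmetric with a zero diagonal, which matches the "both directions" and "id with itself" rows the question builds by hand.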
@jop I was unable to answer this fully on my own, so I formulated your question differently; please check this solution:
Fastest way in numpy to get distance of product of n pairs in array
Filtering out every result above 10 km efficiently could easily be done by this code fragment:
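The fragment itself appears to have been lost from the post; a minimal boolean-mask filter over a distance matrix (the matrix values here are made up for illustration) could look like:

```python
import numpy as np

# dist: square matrix of pairwise distances in km, e.g. as produced by the
# numpy approach discussed above (values here are illustrative only)
dist = np.array([[0.0, 5.2, 12.8],
                 [5.2, 0.0, 9.1],
                 [12.8, 9.1, 0.0]])

# The boolean mask keeps only pairs closer than 10 km; np.nonzero recovers
# the row/column indices, so no information about which pair matched is lost
i, j = np.nonzero(dist < 10)
close_pairs = list(zip(i.tolist(), j.tolist()))
print(close_pairs)  # includes the diagonal (each point with itself) and both directions
```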
To load the data from a pandas dataframe you can follow these topics:
Selecting multiple columns in a Pandas dataframe
Convert Select Columns in Pandas Dataframe to Numpy Array
I hope it solves your performance issue; please let me know in the comments how the puzzle-solving went :)
To solve the puzzle you only need to know how to mark the correct indices so you don't lose track of which point is combined with which.
PS: I believe this should be at least 60 times faster than your current solution.