How to optimize Python code that calculates the distance between two GPS points

Posted 2025-01-21 02:55:27


I'm looking for a faster way to calculate the distance between two GPS points (latitude and longitude) in Python. Here is my code; I want to optimize it to run faster.

from math import radians, sin, cos, atan2, sqrt

def CalcDistanceKM(lat1, lon1, lat2, lon2):
    # Haversine formula; 6371 km is the mean Earth radius.
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return 6371 * c

This code calculates the distance between a latitude/longitude pair from each of two different Excel (CSV) files and returns the distance between them.

Some more code to explain the behavior:

for i in range(len(File1)):
    for j in range(len(File2)):
        if File1['AA'][i] == File2['BB'][j]:
            distance = CalcDistanceKM(File2['LATITUDE'][j], File2['LONGITUDE'][j],
                                      File1['Latitude'][i], File1['Longitude'][i])
            File3 = File3.append({'DistanceBetweenTwoPoints': distance}, ignore_index=True)

Thanks.

Comments (2)

温折酒 2025-01-28 02:55:27


Prepare your points into NumPy arrays and then call this haversine function once with the prepared arrays, to take advantage of C performance and vectorisation optimisations - both freebies from the brilliant NumPy library:


import numpy as np

def haversine(x1: np.ndarray,
              x2: np.ndarray,
              y1: np.ndarray,
              y2: np.ndarray
              ) -> np.ndarray:
    """
    Compute the haversine distance between coords (x1, y1) and (x2, y2).

    Input in degrees, arrays or numbers.

    Parameters
    ----------
    x1 : np.ndarray
        X/longitude in degrees for coords pair 1.
    x2 : np.ndarray
        X/longitude in degrees for coords pair 2.
    y1 : np.ndarray
        Y/latitude in degrees for coords pair 1.
    y2 : np.ndarray
        Y/latitude in degrees for coords pair 2.

    Returns
    -------
    np.ndarray or float
        Haversine distance (metres) between the two given points.
    """
    x1 = np.deg2rad(x1)
    x2 = np.deg2rad(x2)
    y1 = np.deg2rad(y1)
    y2 = np.deg2rad(y2)
    # 12742000 = 2 * 6371000 m (mean Earth radius), so the result is in metres.
    return 12742000 * np.arcsin(np.sqrt(np.sin((y2 - y1) * 0.5) ** 2
                                        + np.cos(y1) * np.cos(y2) * np.sin((x2 - x1) * 0.5) ** 2))

I see that you are iterating File1 and File2 repeatedly - are you searching for matches there? Python for loops are very slow, so that will be a big bottleneck, but without a bit more information on the CSVs being used and how records in File1 are matched with File2 I can't help with that. Maybe add the first couple of records from both files to the question to give it some context?
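If File1['AA'] and File2['BB'] are the matching key, that whole nested loop is just an equality join, which pandas can do in one merge, followed by a vectorised haversine over whole columns. A minimal sketch with hypothetical data and the column names from your snippet (untested against your real files):

```python
import pandas as pd
import numpy as np

# Hypothetical frames, using the column names from the question.
File1 = pd.DataFrame({'AA': [1, 2], 'Latitude': [52.0, 48.0], 'Longitude': [13.0, 11.0]})
File2 = pd.DataFrame({'BB': [2, 3], 'LATITUDE': [48.1, 50.0], 'LONGITUDE': [11.1, 8.0]})

# Equality join replaces the nested Python loops: one row per AA == BB match.
merged = File1.merge(File2, left_on='AA', right_on='BB')

# Vectorised haversine (mean Earth radius 6371 km) over the matched columns.
lat1, lon1 = np.radians(merged['Latitude']), np.radians(merged['Longitude'])
lat2, lon2 = np.radians(merged['LATITUDE']), np.radians(merged['LONGITUDE'])
a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
merged['DistanceBetweenTwoPoints'] = 6371 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
```

The merge builds File3 in one shot, so there is no row-by-row append either.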

Update: thanks for including the Colab link.

You start with two dataframes, drive_test and Cells. One of your "if" conditions:

if drive_test['Serving Cell Identity'][i] == Cells['CI'][j] \
  or drive_test['Serving Cell Identity'][i] == Cells['PCIG'][j] \
  and drive_test['E_ARFCN'][i] == Cells['EARFCN_DL'][j]:
# btw this is ambiguous - use brackets; Python reads this as a or (b and c), but that may not be the intention.

can be written as a pandas merge and filter, based on the cross-merge method from Create combination of two pandas dataframes in two dimensions:

new_df = drive_test.assign(merge_key=1).merge(Cells.assign(merge_key=1), on='merge_key').drop('merge_key', axis=1)
# pass suffixes=("_dt", "_cells") to merge() if your dataframes share column names

cond1_df = new_df[((new_df['Serving Cell Identity'] == new_df.CI) | (new_df['Serving Cell Identity'] == new_df.PCIG)) & (new_df.E_ARFCN == new_df.EARFCN_DL)]
cond1_df = cond1_df.assign(distance_between=haversine(cond1_df.Longitude.to_numpy(), cond1_df.LONGITUDE.to_numpy(), cond1_df.Latitude.to_numpy(), cond1_df.LATITUDE.to_numpy()))
# note that my haversine input args are ordered differently to yours

Then you should have all the results for the first condition, and this can be repeated for the remaining conditions. I'm not able to test this on your CSVs, so it might need a little debugging, but the idea should be fine.
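As an aside, newer pandas (1.2+) has a built-in cross join, which replaces the merge_key trick and avoids the suffixes issue for the key column entirely. A small sketch with hypothetical stand-ins for your two frames:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the Colab notebook.
drive_test = pd.DataFrame({'Serving Cell Identity': [10, 20], 'E_ARFCN': [100, 200]})
Cells = pd.DataFrame({'CI': [10, 30], 'PCIG': [99, 20], 'EARFCN_DL': [100, 200]})

# Built-in cross join: every drive_test row paired with every Cells row.
new_df = drive_test.merge(Cells, how='cross')

# Same filter as above for the first condition.
cond1_df = new_df[((new_df['Serving Cell Identity'] == new_df.CI)
                   | (new_df['Serving Cell Identity'] == new_df.PCIG))
                  & (new_df.E_ARFCN == new_df.EARFCN_DL)]
```

The memory caveat below applies to this form just the same: the cross join still materialises len(drive_test) * len(Cells) rows.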

Note that, depending on how big your CSVs are, this could explode into an extremely large dataframe and max out your RAM. In that case you are pretty much stuck with iterating one by one as you are now, unless you want to make a piecewise method where you iterate rows in one dataframe and match all rows subject to the conditions in the other. That is still faster than iterating both one at a time, but probably slower than doing it all at once.

Update - trying the second idea, since the new dataframe seems to crash the kernel.

In your loop, you can do something like this for the first condition (and similarly for all the next matching conditions):

for i in range(drive_test_size):
    matching_records = Cells[((Cells.CI == drive_test['Serving Cell Identity'][i])
                              | (Cells.PCIG == drive_test['Serving Cell Identity'][i]))
                             & (Cells.EARFCN_DL == drive_test['E_ARFCN'][i])]
    if len(matching_records) > 0:
        matching_records = matching_records.assign(
            distance_between=haversine(matching_records.Longitude.to_numpy(),
                                       matching_records.LONGITUDE.to_numpy(),
                                       matching_records.Latitude.to_numpy(),
                                       matching_records.LATITUDE.to_numpy()))

This should be considerably faster anyway, since you'll be using just one Python "for" loop and letting the superfast numpy/pandas query do the rest. The template should also be applicable to your remaining conditions.

燃情 2025-01-28 02:55:27


I'd suggest having a look at the geod module from pyproj...
Since pyproj is an interface to the C++ PROJ library, I'd expect a major speedup compared to pure Python...

https://pyproj4.github.io/pyproj/stable/examples.html#geodesic-line-length

from pyproj import CRS
geod = CRS.from_epsg(4326).get_geod()

lons, lats = [11, 12, 13, 14], [11, 12, 13, 14]

tot_distance = geod.line_length(lons, lats)
intermediate_distances = geod.line_lengths(lons, lats)

print("tot_distance =", tot_distance)
print("intermediate_distances =", intermediate_distances )
>>> tot_distance = 465249.2859017318
>>> intermediate_distances = [155366.4523864174, 155090.444205422, 154792.3893098924]
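Note that line_length sums distances along one polyline, while the question needs one distance per matched pair of points. For that, Geod.inv takes arrays of start and end coordinates and returns the geodesic distance for each pair. A sketch assuming the WGS84 ellipsoid, with made-up coordinates:

```python
from pyproj import Geod

geod = Geod(ellps='WGS84')

# Two hypothetical matched pairs: (lons1[i], lats1[i]) -> (lons2[i], lats2[i]).
lons1, lats1 = [11.0, 12.0], [48.0, 52.0]
lons2, lats2 = [11.1, 12.0], [48.1, 52.1]

# inv returns forward azimuths, back azimuths, and distances in metres.
_, _, dist_m = geod.inv(lons1, lats1, lons2, lats2)
```

This is the vectorised equivalent of calling CalcDistanceKM once per matched row, but with true ellipsoidal distances rather than the spherical haversine approximation.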