Fastest way to merge two pandas DataFrames
I am merging two files on exactly matching timestamps. These are field data. I am trying to find the fastest way of merging them accurately. I am listing my trial-and-error results here, and I would like to know of any better, faster method. Remember, file1 has around 20 million rows and file2 has 2 million. You guessed it correctly: I cannot share them here.
My code:
file1.shape
Out[13]: (19069591, 11) # 19.1 million rows
file2.shape
Out[14]: (1987321, 44) # 1.9 million rows
Method1: pd.merge (Correct result but slower)
%timeit df = pd.merge(file1,file2,how='inner',left_index=True,right_index=True)
28.3 s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
df = pd.merge(file1,file2,how='inner',left_index=True,right_index=True)
df.shape
Out[17]: (1776798, 55) # 1.7 million rows
Method2: pd.merge_asof (4 times faster but wrong result)
This method requires both files to be sorted, hence two additional lines of code. These two lines execute within 2 or 3 s and are not counted towards the %timeit of this method.
file1.sort_index(inplace=True,ascending=True)
file2.sort_index(inplace=True,ascending=True)
%timeit pd.merge_asof(file1,file2,left_index=True,right_index=True,direction='nearest',tolerance=pd.Timedelta('0s'))
8.72 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
df = pd.merge_asof(file1,file2,left_index=True,right_index=True,direction='nearest',tolerance=pd.Timedelta('0s'))
df.shape
(19069591, 55) # 19 million rows (NOT CORRECT)
Method2 is definitely faster than Method1, but there is an issue. Method1's result is correct: its resulting df of 1.7 million rows is smaller than both file1 (19 million rows) and file2 (1.9 million rows). Method2, though faster, gives the wrong result because it keeps all 19 million rows of file1 instead of only the matches. Two questions here: 1. How do I get only the exact matches out of the Method2 approach? 2. Is there a faster method than these two?
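For question 1, one possibility (a minimal sketch: the _matched indicator column is a name introduced here purely for illustration, not something from the original code) is to tag file2's rows before the asof merge and then drop the result rows where the tag did not survive:

import pandas as pd

# Sketch only: _matched is a throwaway indicator column, not part of the
# original data; both indexes are assumed to be sorted, as in Method2.
file2['_matched'] = True
df = pd.merge_asof(file1, file2, left_index=True, right_index=True,
                   direction='nearest', tolerance=pd.Timedelta('0s'))
# file1 rows with no exact timestamp match get NaN in _matched; dropping them
# should reproduce the 1.7 million inner-join rows of Method1.
df = df[df['_matched'].notna()].drop(columns='_matched')
file2.drop(columns='_matched', inplace=True)

Whether the extra column and the boolean filter eat into merge_asof's speed advantage would need timing on the real data.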
Solution update based on Pavel's suggestion:
%timeit df = pd.merge_asof(file1,file2,left_index=True,right_index=True,allow_exact_matches=True,direction='nearest')
10.7 s ± 758 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df.shape
Out[26]: (19069591, 55)
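This updated call still returns one row per file1 row (19 million), so on its own it does not give only the exact matches. For question 2, another candidate that might be worth timing (a sketch, not benchmarked on this data) is a plain inner join of the two already-sorted frames via DataFrame.join, which joins on the index by default:

# Assumption: file1 and file2 have no overlapping column names; if they do,
# lsuffix/rsuffix must be passed to join.
file1.sort_index(inplace=True)  # already sorted for Method2, repeated for completeness
file2.sort_index(inplace=True)
# Inner join on the index keeps only exactly matching timestamps, like Method1.
df = file1.join(file2, how='inner')

pandas can sometimes take a faster path when both indexes are monotonic, so this may or may not beat the plain pd.merge above; only %timeit on the actual data will tell.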