Fastest way to merge two pandas DataFrames
I am merging two files on exactly matching timestamps. These are field data. I am trying to find the fastest way of merging them accurately. I am listing my trial-and-error results here, and I would like to know of any better, faster method. Remember, file1 has around 20 million rows and file2 has 2 million. You guessed it correctly: I cannot share them here.
My code:
file1.shape
Out[13]: (19069591, 11) # 19.1 million rows
file2.shape
Out[14]: (1987321, 44) # 1.9 million rows
Method1: pd.merge (Correct result but slower)
%timeit df = pd.merge(file1,file2,how='inner',left_index=True,right_index=True)
28.3 s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
df = pd.merge(file1,file2,how='inner',left_index=True,right_index=True)
df.shape
Out[17]: (1776798, 55) # 1.7 million rows
Method2: pd.merge_asof (4 times faster but wrong result)
This method requires both files to be sorted, hence two additional lines of code. These two lines execute within 2 or 3 s and are not counted towards the %timeit of this method.
file1.sort_index(inplace=True,ascending=True)
file2.sort_index(inplace=True,ascending=True)
%timeit pd.merge_asof(file1,file2,left_index=True,right_index=True,direction='nearest',tolerance=pd.Timedelta('0s'))
8.72 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
df = pd.merge_asof(file1,file2,left_index=True,right_index=True,direction='nearest',tolerance=pd.Timedelta('0s'))
df.shape
(19069591, 55) # 19 million rows (NOT CORRECT)
Method2 is definitely faster than Method1, but there is an issue. Method1's result is correct: its resulting df of 1.7 million rows is smaller than both file1 (19 million rows) and file2 (1.9 million rows). Method2, though faster, gives the wrong result because it keeps all 19 million rows of file1 instead of only the matches. Two questions here: 1. How do I get only the exact matches out of the Method2 approach? 2. Is there a faster method than these two?
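For question 1, one possibility (a minimal sketch: the _matched indicator column is a name introduced here purely for illustration, not something from the original code) is to tag file2's rows before the asof merge and then drop the result rows where the tag did not survive:

import pandas as pd

# Sketch only: _matched is a throwaway indicator column, not part of the
# original data; both indexes are assumed to be sorted, as in Method2.
file2['_matched'] = True
df = pd.merge_asof(file1, file2, left_index=True, right_index=True,
                   direction='nearest', tolerance=pd.Timedelta('0s'))
# file1 rows with no exact timestamp match get NaN in _matched; dropping them
# should reproduce the 1.7 million inner-join rows of Method1.
df = df[df['_matched'].notna()].drop(columns='_matched')
file2.drop(columns='_matched', inplace=True)

Whether the extra column and the boolean filter eat into merge_asof's speed advantage would need timing on the real data.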
Solution update based on Pavel's suggestion:
%timeit df = pd.merge_asof(file1,file2,left_index=True,right_index=True,allow_exact_matches=True,direction='nearest')
10.7 s ± 758 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df.shape
Out[26]: (19069591, 55)
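This updated call still returns one row per file1 row (19 million), so on its own it does not give only the exact matches. For question 2, another candidate that might be worth timing (a sketch, not benchmarked on this data) is a plain inner join of the two already-sorted frames via DataFrame.join, which joins on the index by default:

# Assumption: file1 and file2 have no overlapping column names; if they do,
# lsuffix/rsuffix must be passed to join.
file1.sort_index(inplace=True)  # already sorted for Method2, repeated for completeness
file2.sort_index(inplace=True)
# Inner join on the index keeps only exactly matching timestamps, like Method1.
df = file1.join(file2, how='inner')

pandas can sometimes take a faster path when both indexes are monotonic, so this may or may not beat the plain pd.merge above; only %timeit on the actual data will tell.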