Pandas: filling NaN values after a left outer join with values from another dataframe when there are no common rows

Posted 2025-01-13 21:29:13

Consider the following two dataframes:

df1:

time                       p_value       
2022-03-12 00:00:00+00:00  38743.7
2022-03-12 00:05:00+00:00    38802
2022-03-12 00:10:00+00:00    38842
2022-03-12 00:15:00+00:00  38865.9
2022-03-12 00:20:00+00:00  38882.1
2022-03-12 23:35:00+00:00    38965
2022-03-12 23:40:00+00:00  38933.2
2022-03-12 23:45:00+00:00  38878.4
2022-03-12 23:50:00+00:00  38861.3
2022-03-12 23:55:00+00:00  38803.9

df2:

time                       b_value      
2022-03-12 00:27:12+00:00  13
2022-03-12 00:29:26+00:00  14
2022-03-12 00:31:37+00:00  15                     
2022-03-12 23:19:38+00:00  16
2022-03-12 23:33:10+00:00  17
2022-03-12 23:41:52+00:00  18

df1 covers one specific day with a complete datetime index at 5-minute resolution, i.e. it has 24*60/5 = 288 records.
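As a quick sanity check of that count, a full day at 5-minute resolution can be generated with pd.date_range. This is only a sketch: the variable name full_index is made up for illustration, and the UTC timezone is assumed from the +00:00 offsets in the sample.

```python
import pandas as pd

# One full UTC day at 5-minute resolution (date taken from the sample above).
full_index = pd.date_range(
    start="2022-03-12 00:00:00",
    end="2022-03-12 23:55:00",
    freq="5min",
    tz="UTC",
)

print(len(full_index))  # 24 * 60 / 5 = 288
print(full_index[0], full_index[-1])
```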

df2 has fewer rows than df1 (fewer than 288), and its timestamps may all differ from the timestamps in df1.

The question is how to merge these two dataframes so that each timestamp in df2 gets the value from the nearest timestamp in df1.

# Left outer join: no timestamps match exactly, so p_value is NaN in every row
desired_df = df2.merge(df1, on='time', how='left')

In other words, how can we fill the NaN values using the values in another dataframe?
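To see why the plain left join is not enough, here is a small sketch on a few of the rows above. The frames are rebuilt by hand, so the construction code is an assumption and not part of the original post.

```python
import pandas as pd

# A few rows copied from the df1 / df2 samples in the question.
df1 = pd.DataFrame({
    "time": pd.to_datetime(
        ["2022-03-12 00:15:00", "2022-03-12 00:20:00",
         "2022-03-12 23:35:00", "2022-03-12 23:40:00"],
        utc=True,
    ),
    "p_value": [38865.9, 38882.1, 38965.0, 38933.2],
})
df2 = pd.DataFrame({
    "time": pd.to_datetime(
        ["2022-03-12 00:27:12", "2022-03-12 23:41:52"], utc=True
    ),
    "b_value": [13, 18],
})

# An exact-match join finds no common timestamps, so p_value is all NaN.
joined = df2.merge(df1, on="time", how="left")
print(joined)
print(joined["p_value"].isna().all())
```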

Desired df:

time                       new_col      
2022-03-12 00:27:12+00:00  38882.1
2022-03-12 00:29:26+00:00  38882.1
2022-03-12 00:31:37+00:00  38882.1                     
2022-03-12 23:19:38+00:00  38965
2022-03-12 23:33:10+00:00  38965
2022-03-12 23:41:52+00:00  38933.2

Another idea: for each timestamp in df2, we could compute the mean of the values at the two consecutive df1 timestamps that surround it. For example, the value corresponding to 2022-03-12 23:41:52+00:00 would be the mean of the values at 2022-03-12 23:40:00+00:00 and 2022-03-12 23:45:00+00:00.
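One possible way to sketch that averaging idea is to run pd.merge_asof twice, once with direction="backward" and once with direction="forward", and average the two matches. This is not from the original post; the names prev_match, next_match and result are made up for illustration, and both frames are assumed to be sorted by time.

```python
import pandas as pd

# Rows around 23:40 copied from the df1 sample; one row from df2.
df1 = pd.DataFrame({
    "time": pd.to_datetime(
        ["2022-03-12 23:35:00", "2022-03-12 23:40:00",
         "2022-03-12 23:45:00", "2022-03-12 23:50:00"],
        utc=True,
    ),
    "p_value": [38965.0, 38933.2, 38878.4, 38861.3],
})
df2 = pd.DataFrame({
    "time": pd.to_datetime(["2022-03-12 23:41:52"], utc=True),
    "b_value": [18],
})

# Last df1 row at or before each df2 timestamp, and first row at or after it.
prev_match = pd.merge_asof(df2, df1, on="time", direction="backward")
next_match = pd.merge_asof(df2, df1, on="time", direction="forward")

# Average the two surrounding values (here: the 23:40 and 23:45 rows).
result = df2.copy()
result["new_col"] = (prev_match["p_value"] + next_match["p_value"]) / 2
print(result)
```

If a df2 timestamp falls before the first or after the last df1 row, one of the two matches is NaN and so is the mean.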

After JonClements's comment, here is a pretty simple solution!

pd.merge_asof(left=df2, right=df1, on='time')
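One thing worth checking: merge_asof defaults to direction="backward", i.e. it takes the last df1 row at or before each df2 timestamp. On the ten sample rows shown above, the desired table matches direction="nearest" instead. The sketch below rebuilds the sample by hand (the construction code and the names backward and nearest are assumptions); with the full 288-row df1 the backward result would use rows that are not shown in the question.

```python
import pandas as pd

# The ten df1 rows and six df2 rows shown in the question.
df1 = pd.DataFrame({
    "time": pd.to_datetime(
        ["2022-03-12 00:00:00", "2022-03-12 00:05:00", "2022-03-12 00:10:00",
         "2022-03-12 00:15:00", "2022-03-12 00:20:00", "2022-03-12 23:35:00",
         "2022-03-12 23:40:00", "2022-03-12 23:45:00", "2022-03-12 23:50:00",
         "2022-03-12 23:55:00"],
        utc=True,
    ),
    "p_value": [38743.7, 38802.0, 38842.0, 38865.9, 38882.1,
                38965.0, 38933.2, 38878.4, 38861.3, 38803.9],
})
df2 = pd.DataFrame({
    "time": pd.to_datetime(
        ["2022-03-12 00:27:12", "2022-03-12 00:29:26", "2022-03-12 00:31:37",
         "2022-03-12 23:19:38", "2022-03-12 23:33:10", "2022-03-12 23:41:52"],
        utc=True,
    ),
    "b_value": [13, 14, 15, 16, 17, 18],
})

# Default behaviour: last df1 row at or before each df2 timestamp.
backward = pd.merge_asof(left=df2, right=df1, on="time")

# Closest df1 row in either direction; both frames must be sorted by 'time'.
nearest = pd.merge_asof(left=df2, right=df1, on="time", direction="nearest")
nearest = nearest.rename(columns={"p_value": "new_col"})

print(backward[["time", "p_value"]])
print(nearest[["time", "new_col"]])
```

merge_asof also accepts a tolerance argument (a pd.Timedelta for datetime keys) if matches further away than some limit should stay NaN.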

薄情伤 2025-01-20 21:29:13

You could try pandas interpolate:

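import pandas as pd  # interpolate('nearest') also requires SciPy

# Outer join on 'time' keeps every timestamp from both frames
merged = df2.merge(df1, how='outer')
merged = merged.set_index(pd.to_datetime(merged['time'])).drop('time', axis=1).sort_index()[['p_value']]

# Fill each missing p_value from the nearest timestamp that has one
merged['new_col'] = merged['p_value'].interpolate('nearest')
merged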

Output:
                           p_value  new_col
time
2022-03-13 00:00:00+00:00  38743.7  38743.7
2022-03-13 00:05:00+00:00  38802.0  38802.0
2022-03-13 00:10:00+00:00  38842.0  38842.0
2022-03-13 00:15:00+00:00  38865.9  38865.9
2022-03-13 00:20:00+00:00  38882.1  38882.1
2022-03-13 00:27:12+00:00      NaN  38882.1
2022-03-13 00:29:26+00:00      NaN  38882.1
2022-03-13 00:31:37+00:00      NaN  38882.1
2022-03-13 23:19:38+00:00      NaN  38965.0
2022-03-13 23:33:10+00:00      NaN  38965.0
2022-03-13 23:35:00+00:00  38965.0  38965.0
2022-03-13 23:40:00+00:00  38933.2  38933.2
2022-03-13 23:41:52+00:00      NaN  38933.2
2022-03-13 23:45:00+00:00  38878.4  38878.4
2022-03-13 23:50:00+00:00  38861.3  38861.3
2022-03-13 23:55:00+00:00  38803.9  38803.9
