熊猫数据框架 - 删除重叠间隔

发布于 2025-01-28 22:18:53 字数 797 浏览 1 评论 0原文

假设您有一个可以使用以下代码创建的熊猫数据框架：

test_df = pd.DataFrame(
    {'start_date': ['2021-07-01', '2021-07-02', '2021-07-03',
                    '2021-07-04', '2021-07-05', '2021-07-06'],
     'end_date': ['2021-07-03', '2021-07-04', '2021-07-05',
                  '2021-07-06', '2021-07-07', '2021-07-08'],
     'returns': [1, 1, 0.99, 0.98, 0.99, 0.97]})
test_df = test_df.sort_values('returns', ascending=False)

假设返回总是被整理到删除重叠间隔的有效方法是什么？我不想使用循环，因为数据集很大，是否有一个矢量化方法可以实现以下输出？

预期输出

+------------+------------+---------+
| start_date |  end_date  | returns |
+------------+------------+---------+
| 2021-07-01 | 2021-07-03 |       1 |
| 2021-07-05 | 2021-07-07 |    0.99 |
+------------+------------+---------+

原文

Suppose that you have a Pandas data frame that can be created using code below:

test_df = pd.DataFrame(
    {'start_date': ['2021-07-01', '2021-07-02', '2021-07-03',
                    '2021-07-04', '2021-07-05', '2021-07-06'],
     'end_date': ['2021-07-03', '2021-07-04', '2021-07-05',
                  '2021-07-06', '2021-07-07', '2021-07-08'],
     'returns': [1, 1, 0.99, 0.98, 0.99, 0.97]})
test_df = test_df.sort_values('returns', ascending=False)

Assuming that returns are always sorted what would be the efficient way to remove overlapping intervals? I do not want to use loops as the data set is large is there a vectorized approach to achieve the output below?

Expected Output

+------------+------------+---------+
| start_date |  end_date  | returns |
+------------+------------+---------+
| 2021-07-01 | 2021-07-03 |       1 |
| 2021-07-05 | 2021-07-07 |    0.99 |
+------------+------------+---------+

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

戏剧牡丹亭 2025-02-04 22:18:53

@mozway 在这里使用 <<<代码> np.triu（）（numpy的上三角）。

为您的数据框架稍微更改它：

import numpy as np
test_df["start_date"] = pd.to_datetime(test_df["start_date"])
test_df["end_date"] = pd.to_datetime(test_df["end_date"])

a = np.triu(test_df['end_date'].values > test_df['start_date'].values[:, None])
b = np.triu(test_df['start_date'].values < test_df['end_date'].values[:, None])
test_df[(a & b).sum(0) == 1]
#  start_date   end_date  returns
#0 2021-07-01 2021-07-03     1.00
#4 2021-07-05 2021-07-07     0.99

说明：

test_df['end_date'].values > test_df['start_date'].values[:, None]
#array([[ True,  True,  True,  True,  True,  True],
#       [ True,  True,  True,  True,  True,  True],
#       [False,  True,  True,  True,  True,  True],
#       [False, False, False,  True,  True,  True],
#       [False, False,  True,  True,  True,  True],
#       [False, False, False,  True, False,  True]])

这返回一个数组，其中end_date值大于start_date值。每个开始日期作为行。

这查看开始日期和结束日期的每种组合（每个结束日期作为列，这意味着只有开始日期与结束日期或之前的组合（根据test_df行）为true，其余的 false false

B在所有实例中，结束日期大于开始日期（每个开始日期为列，每个末端日期都作为第一个行）。 test_df ['start_date']。值＆lt; test_df ['end_date']。值[：，none]，它正在查看第一个结束日期的组合是否大于开始的

日期

。代码> test_df [（a＆amp; b）.sum（0）== 1] ：

（a＆amp; b）是一个数组，其中true在两个数组均为true的情况下发生。
（a＆amp; b）.sum（0）总和此数组的每一列中，其中true == 1和false == 0。
（a＆amp; b）.sum（0）== 1是行，其中只有一个出现a and b 为b > true 一起。我们只想要这些，因为我们想要行的位置：

开始日期小于以前（或当前）结束日期

，

结束日期大于以前的（或当前）开始日期，

但是

如果发生多次，则必须重叠。因为领先的对角线将始终是true（作为给定数据框架行的结束日期必须始终大于开始日期的开始日期），如果此情况超过一次，这意味着：

开始日期小于以前的结束日期，当前结束日期

和

结束日期大于先前的开始日期和当前的开始日期，

这意味着这里必须有重叠。

我知道这很令人困惑，但确实有意义！

There was an answer from @mozway here that works without a loop by using np.triu() (numpy's upper triangle).

Slightly changing it for your DataFrame:

import numpy as np
test_df["start_date"] = pd.to_datetime(test_df["start_date"])
test_df["end_date"] = pd.to_datetime(test_df["end_date"])

a = np.triu(test_df['end_date'].values > test_df['start_date'].values[:, None])
b = np.triu(test_df['start_date'].values < test_df['end_date'].values[:, None])
test_df[(a & b).sum(0) == 1]
#  start_date   end_date  returns
#0 2021-07-01 2021-07-03     1.00
#4 2021-07-05 2021-07-07     0.99

Explanation:

test_df['end_date'].values > test_df['start_date'].values[:, None]
#array([[ True,  True,  True,  True,  True,  True],
#       [ True,  True,  True,  True,  True,  True],
#       [False,  True,  True,  True,  True,  True],
#       [False, False, False,  True,  True,  True],
#       [False, False,  True,  True,  True,  True],
#       [False, False, False,  True, False,  True]])

This returns an array where the end_date values are greater than the start_date values. This looks at each combination of start date and end date (with each end date as a column, and each start date as a row. If the condition is met, then True is returned.

Taking the upper triangle of this means that only combinations where the start date comes with or before the end date (in terms of test_df rows) are True, with the rest False

b on the other hand looks for all instances where the end date is greater than the start dates (with each start date as a column, and each end date as a row). For the first row of test_df['start_date'].values < test_df['end_date'].values[:, None], it is looking at combinations of the first end date 2021-07-03 and whether it is greater than the start dates.

The upper triangle of this means that only occurrences where the start date is before the end date in the dataframe are True.

Deconstructing the last line test_df[(a & b).sum(0) == 1]:

(a & b) is an array where True occurs where both arrays are True.
(a & b).sum(0) sums each column of this array where True == 1 and False == 0.
(a & b).sum(0) == 1 is rows where there is only one occurrence of a and b being True together. We only want these because we want rows where:

the start date is less than a previous (or current) end date

AND

the end date is greater than a previous (or current) start date

HOWEVER

If this occurs more than once, then there must be an overlap. Because the leading diagonal will always be True (as the end date for a given dataframe row must ALWAYS be greater than the start date) for this scenario, if this occurs more than once this means: