How do I delete a row whose datetime value is within a certain time of the row above?

Posted 2025-01-11 23:28:37 · 792 characters · 0 views · 0 comments

If I have a DataFrame as below:

  Letter                Time
0      x 2021-01-01 14:00:00
1      y 2021-01-01 18:00:00
2      y 2021-01-03 14:00:00

How would I delete a row if the value in its Time column (datetime) is within, say, 14 hours of the time in the row above?

I've tried using:

from datetime import datetime, timedelta
for i, row in enumerate(df):
    if i > 0:
        if df.at[i, 'Time'] - df.at[i-1, 'Time'] < timedelta(hours=14):
            df = df.drop(i)
        else:
            pass
    else:
        pass

but I get KeyError: 1 on the line

if df.at[i, 'Time'] - df.at[i-1, 'Time'] < timedelta(hours=14):
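
A likely cause, stated as an assumption since the full data is not shown: iterating over a DataFrame directly (as `enumerate(df)` does) yields column labels rather than rows, and once `df.drop(1)` has run, any later `df.at[1, 'Time']` lookup fails because label 1 no longer exists. A minimal sketch that avoids mutating the frame while looping, by collecting the labels first and dropping them once at the end, might look like this (the variable names `to_drop` and `result` are illustrative):

```python
from datetime import timedelta

import pandas as pd

df = pd.DataFrame({
    "Letter": ["x", "y", "y"],
    "Time": pd.to_datetime([
        "2021-01-01 14:00:00",
        "2021-01-01 18:00:00",
        "2021-01-03 14:00:00",
    ]),
})

# Walk the rows by position and compare each one with the row directly above it.
to_drop = []
for pos in range(1, len(df)):
    gap = df["Time"].iloc[pos] - df["Time"].iloc[pos - 1]
    if gap < timedelta(hours=14):
        # Remember the label; dropping inside the loop would break later lookups.
        to_drop.append(df.index[pos])

# Drop everything in one go after the loop has finished.
result = df.drop(to_drop)
print(result)
```

This compares each row with the row above in the original frame, regardless of whether that row is itself dropped.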

Comments (2)

乱世争霸 2025-01-18 23:28:37

If a timestamp is within 14 hours of an earlier timestamp, does its removal depend on whether that earlier timestamp is itself removed? This answer considers the case where the answer is "yes". (If the answer is "no", the result for the test data below would be the first timestamp only, because every later timestamp is within 14 hours of some earlier one.)

Setup

Test data:

import pandas as pd

timestamps = pd.Series([0, 6,10,14,16,29,33,45,46]).apply(pd.Timedelta, unit="hours") + pd.Timestamp("2022")

timestamps looks like this:

0   2022-01-01 00:00:00
1   2022-01-01 06:00:00
2   2022-01-01 10:00:00
3   2022-01-01 14:00:00
4   2022-01-01 16:00:00
5   2022-01-02 05:00:00
6   2022-01-02 09:00:00
7   2022-01-02 21:00:00
8   2022-01-02 22:00:00
dtype: datetime64[ns]

The solution we are aiming for consists of the 1st, 4th, 6th and 8th timestamps.

Solution

This solution uses the piso (pandas interval set operations) package. The idea is to create a 14-hour window, i.e. an interval, starting at each timestamp, and then iteratively remove the timestamps that fall inside the window of an earlier timestamp that is being kept.

import piso

# sort timestamps if not already sorted
timestamps = timestamps.sort_values()

# create 14 hour windows for each timestamp.  Can be left-closed or right-closed, but not both
intervals = pd.IntervalIndex.from_arrays(timestamps, timestamps+pd.Timedelta("14h"))

# create the "disjoint adjacency matrix", which indicates pairwise if intervals are disjoint
mat = piso.adjacency_matrix(intervals, edges="disjoint")

mat will be a DataFrame whose index and columns are the intervals. mat.values looks like this:

array([[False, False, False,  True,  True,  True,  True,  True,  True],
       [False, False, False, False, False,  True,  True,  True,  True],
       [False, False, False, False, False,  True,  True,  True,  True],
       [ True, False, False, False, False,  True,  True,  True,  True],
       [ True, False, False, False, False, False,  True,  True,  True],
       [ True,  True,  True,  True, False, False, False,  True,  True],
       [ True,  True,  True,  True,  True, False, False, False, False],
       [ True,  True,  True,  True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True,  True, False, False, False]])
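
As a cross-check that does not need piso, the same boolean pattern can be sketched with NumPy. This relies on an observation rather than on anything piso does internally: right-closed windows of equal width, `(t, t + 14h]`, are disjoint exactly when their start times differ by at least 14 hours. The names `hours` and `disjoint` are illustrative.

```python
import numpy as np

# Start of each window, in hours after 2022-01-01 00:00 (the test data above).
hours = np.array([0, 6, 10, 14, 16, 29, 33, 45, 46])

# Pairwise absolute difference between window starts.
gaps = np.abs(hours[:, None] - hours[None, :])

# Equal-width, right-closed windows are disjoint iff the starts are >= 14 hours apart.
disjoint = gaps >= 14
print(disjoint)
```

Note that the diagonal is False here as well, since no interval is disjoint from itself.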

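Set the diagonal of this matrix to True, so that each interval is retained when filtering on its own row:

import numpy as np
mat = mat | np.eye(len(mat), dtype=bool)  # iloc[range, range] would select the whole block, not just the diagonal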

We start with the first interval. From the first row of mat you can deduce that the second and third intervals need to be dropped. So we filter out the rows and columns corresponding to these intervals, then move to the next remaining interval (row), and so on until we reach the last row. Note that we do not need to check any intersections for the last row.

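i = 0
# len(mat) is re-evaluated on every pass because mat shrinks as intervals are filtered out
while i < len(mat) - 1:
    # keep only interval i itself and the intervals that are disjoint from it
    mat = mat.loc[mat.iloc[i], mat.iloc[i]]
    i += 1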

The result is a DataFrame whose values are all True. More importantly, its index (and columns) are the intervals whose left endpoints are the timestamps that remain after removing those within 14 hours of an earlier kept timestamp.

That is, pd.Series(mat.index.left) gives:

0   2022-01-01 00:00:00
1   2022-01-01 14:00:00
2   2022-01-02 05:00:00
3   2022-01-02 21:00:00
dtype: datetime64[ns]

You can use this to filter your original DataFrame with pandas.Series.isin.
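
A minimal sketch of that filtering step, applied to the DataFrame from the question. The `kept` Series is a hypothetical stand-in for the surviving timestamps; in practice it would come from the procedure above.

```python
import pandas as pd

df = pd.DataFrame({
    "Letter": ["x", "y", "y"],
    "Time": pd.to_datetime([
        "2021-01-01 14:00:00",
        "2021-01-01 18:00:00",
        "2021-01-03 14:00:00",
    ]),
})

# Hypothetical result of the removal step: the timestamps that survived.
kept = pd.Series(pd.to_datetime(["2021-01-01 14:00:00", "2021-01-03 14:00:00"]))

# Keep only the rows whose Time value is among the surviving timestamps.
filtered = df[df["Time"].isin(kept)]
print(filtered)
```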

note: I am the creator of piso. Please feel free to reach out with feedback or questions if you have any.
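
For comparison, the rule this answer applies can also be sketched as a single greedy pass in plain pandas, without piso: keep a timestamp only if it is at least 14 hours after the most recently kept one. This is an alternative formulation, not the piso-based method itself; the helper name `drop_within_window` is hypothetical, and the `>=` comparison is chosen to match right-closed windows, where a timestamp exactly 14 hours later is kept.

```python
import pandas as pd

# Same test data as above, built from hour offsets.
offsets = pd.to_timedelta([0, 6, 10, 14, 16, 29, 33, 45, 46], unit="h")
timestamps = pd.Series(offsets + pd.Timestamp("2022"))


def drop_within_window(times, window):
    """Greedily keep timestamps at least `window` after the last kept one."""
    kept = []
    last_kept = None
    for ts in times.sort_values():
        # Compare against the last kept timestamp, not simply the previous row.
        if last_kept is None or ts - last_kept >= window:
            kept.append(ts)
            last_kept = ts
    return pd.Series(kept)


result = drop_within_window(timestamps, pd.Timedelta("14h"))
print(result)
```

On the test data this keeps the 1st, 4th, 6th and 8th timestamps, the same four as the target solution stated earlier.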

揽清风入怀 2025-01-18 23:28:37

You could create a boolean mask using shift + rsub (the difference between consecutive times) + div (conversion to hours), and then filter with it:

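import numpy as np

# hours since the row above (NaN for the first row, which therefore fails the comparison)
msk = df['Time'].shift().rsub(df['Time']).div(np.timedelta64(1, 'h')) > 14
out = df[msk]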

Output:

  Letter                Time
2      y 2021-01-03 14:00:00
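
Row 0 is missing from the output above because it has no row above it: its difference is missing, and a missing value fails the `> 14` comparison. If the first row should be kept instead, one possible variant (an assumption about the desired behaviour, not part of the original answer) treats a missing difference as "keep". It also uses `>= 14` so that only gaps strictly under 14 hours are dropped, matching the `<` test in the question's loop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Letter": ["x", "y", "y"],
    "Time": pd.to_datetime([
        "2021-01-01 14:00:00",
        "2021-01-01 18:00:00",
        "2021-01-03 14:00:00",
    ]),
})

# Hours between each row and the row above; NaN for the first row.
hours = df["Time"].diff().div(np.timedelta64(1, "h"))

# Keep rows with no predecessor, or with a gap of at least 14 hours.
msk = hours.isna() | (hours >= 14)
out = df[msk]
print(out)
```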