Partitioning a pandas DataFrame by overlapping time ranges

Posted on 2025-01-11 00:46:35

I have a pandas data frame with a datetime column (and a number of other columns). I want to partition it into frames of a fixed duration, say 10 seconds, with a predefined overlap, say 2 seconds. In other words, I want a new partition to start every 8 seconds (= 10 seconds duration - 2 seconds overlap), with each partition collecting the next 10 seconds of data. The partitions would then cover the times (0, 10), (8, 18), (16, 26), and so on. How can I do this effectively?

As I understand it, the frequency in Grouper can partition by time, but it cannot handle the overlap I need.

Comments (1)

沧笙踏歌 2025-01-18 00:46:35

Assume that the source DataFrame has been created as:

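import numpy as np
import pandas as pd
np.random.seed(0)    # To get repeatable results
# Lowercase 's' is used because the uppercase 'S' alias is deprecated in recent pandas
i = pd.date_range('2022-01-01 08:00:00', '2022-01-01 08:00:30', freq='s')
df = pd.DataFrame(index=i, data={'Amount': np.random.randint(0, 100, len(i))})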

Because the index holds the timestamps, time ranges can be selected by index label, which is quite fast.
If your DataFrame has a different index, set the time column as the index first.
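
If the timestamps live in an ordinary column rather than the index, the conversion could look roughly like the sketch below. The column name 'ts' and the sample values are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical frame whose timestamps sit in a regular column named 'ts'
raw = pd.DataFrame({
    'ts': pd.to_datetime(['2022-01-01 08:00:02',
                          '2022-01-01 08:00:00',
                          '2022-01-01 08:00:01']),
    'Amount': [64, 44, 47],
})

# Move the time column into the index and sort it, so that
# label-based slicing such as indexed.loc[t1:t2] selects a time range
indexed = raw.set_index('ts').sort_index()
```

Sorting matters here because label slicing on a DatetimeIndex is only dependable when the index is in chronological order.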

Then, to perform some action on the overlapping slices you describe,
you can run, for example:

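for t1 in pd.date_range(df.index.min(), df.index.max()
        + pd.Timedelta('1s'), freq='8s'):    # a new window starts every 8 s
    # .loc slicing includes both endpoints, so +9 s gives 10 rows of 1-second data
    t2 = t1 + pd.Timedelta('9s')
    print(f'Range [{t1},  {t2}]')
    print(df.loc[t1:t2])
    print('----')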

The code above only prints each slice.
The first 2 slices I got are:

Range [2022-01-01 08:00:00,  2022-01-01 08:00:09]
                     Amount
2022-01-01 08:00:00      44
2022-01-01 08:00:01      47
2022-01-01 08:00:02      64
2022-01-01 08:00:03      67
2022-01-01 08:00:04      67
2022-01-01 08:00:05       9
2022-01-01 08:00:06      83
2022-01-01 08:00:07      21
2022-01-01 08:00:08      36
2022-01-01 08:00:09      87
----
Range [2022-01-01 08:00:08,  2022-01-01 08:00:17]
                     Amount
2022-01-01 08:00:08      36
2022-01-01 08:00:09      87
2022-01-01 08:00:10      70
2022-01-01 08:00:11      88
2022-01-01 08:00:12      88
2022-01-01 08:00:13      12
2022-01-01 08:00:14      58
2022-01-01 08:00:15      65
2022-01-01 08:00:16      39
2022-01-01 08:00:17      87
----
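
Printing is only a placeholder for whatever you need to do with each slice. As one possible illustration, the sketch below reduces every window to a row count and a sum and collects the results in a summary frame. The names rows and summary and the choice of statistics are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer
np.random.seed(0)
i = pd.date_range('2022-01-01 08:00:00', '2022-01-01 08:00:30', freq='s')
df = pd.DataFrame(index=i, data={'Amount': np.random.randint(0, 100, len(i))})

rows = []
for t1 in pd.date_range(df.index.min(), df.index.max(), freq='8s'):
    t2 = t1 + pd.Timedelta('9s')
    chunk = df.loc[t1:t2]              # both endpoints are included
    rows.append({'start': t1,
                 'count': len(chunk),
                 'total': chunk['Amount'].sum()})

# One row per window, indexed by the window start time
summary = pd.DataFrame(rows).set_index('start')
```

With this sample data there are four windows, starting at 08:00:00, 08:00:08, 08:00:16 and 08:00:24. The last one holds only 7 rows because the data ends at 08:00:30.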

In your actual code you can:

  1. Define a function returning a slice of df:

    def getSlice(t1):
        # .loc includes both endpoints, so +9 s yields 10 rows of 1-second data
        t2 = t1 + pd.Timedelta('9s')
        return df.loc[t1:t2]

  2. Generate the list of slices in a list comprehension:

    slices = [getSlice(t1) for t1 in pd.date_range(df.index.min(),
        df.index.max() + pd.Timedelta('1s'), freq='8s')]

When you print slices[0] and slices[1], you should get the same
result as printed above.
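
The 9-second offset works because the sample data has exactly one row per second and .loc includes both endpoints. If your timestamps have sub-second resolution, a half-open window [start, start + 10 s) may be closer to what the question asks for. The sketch below is one way to write that. The function name overlapping_windows and its parameters are hypothetical, and it assumes the frame has a sorted DatetimeIndex.

```python
import pandas as pd

def overlapping_windows(frame, window='10s', step='8s'):
    """Yield (start, chunk) pairs for windows of length `window`
    that begin every `step`. Assumes a sorted DatetimeIndex."""
    window = pd.Timedelta(window)
    step = pd.Timedelta(step)
    idx = frame.index
    for start in pd.date_range(idx.min(), idx.max(), freq=step):
        # Half-open interval [start, start + window): the right edge is excluded
        lo = idx.searchsorted(start, side='left')
        hi = idx.searchsorted(start + window, side='left')
        yield start, frame.iloc[lo:hi]

# Hypothetical sample: one row every 500 ms for 20 seconds
i = pd.date_range('2022-01-01 08:00:00', periods=40, freq='500ms')
demo = pd.DataFrame({'Amount': range(40)}, index=i)

slices = [chunk for _, chunk in overlapping_windows(demo)]
```

Here the windows start at 08:00:00, 08:00:08 and 08:00:16 and contain 20, 20 and 8 rows. Using searchsorted with iloc avoids scanning the whole index for every window, which helps when the frame is large.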
