Calculate the duration and key figures (mean, standard deviation, min, max) for certain actions/variables in R?

Posted 2025-01-14 01:16:52

I have a data frame with more than 100,000 rows and multiple variables/columns, from which I would like to calculate the duration of certain actions based on the values in column "Y". Column Y contains multiple sequences of 0s and 1s, and whenever an action takes place, the value is 1. The idea is to compute the time difference from the first 1 in a sequence of ones (right after the last 0) to the final 1 in that sequence (right before the next 0). Every row, whether Y is 1 or 0, has a timestamp in column "X" giving the current runtime, so the time difference comes down to a simple subtraction:

TIME_OF_FINAL_1_IN_SEQUENCE minus TIME_OF_FIRST_1_IN_SEQUENCE

The same calculation would be repeated for every sequence of ones, and a new data frame listing the durations of all the actions would be created.

In a similar manner, for the values in column "Z", calculate the average, standard deviation, min and max over the period from the first 1 to the final 1 of each sequence of ones. Then combine all the data into one data frame and export it as a CSV file, which should include the variables "action duration", "Z avg", "Z std", "Z min", "Z max" and the "id" column from the original data frame. How could I write a script like this in R?

The pseudo-style code could probably look something like this:

for all the rows in df {
   if (number 1 in column Y) {
      from first 1 until the last 1 in a sequence: calculate TIME_OF_FINAL_1_IN_SEQUENCE minus TIME_OF_FIRST_1_IN_SEQUENCE from column X
      ALSO from the range of first value of 1 to the last value of 1 in this sequence of 1: calculate avg, std, min, and max for the variable Z
   }
   if (number 0 in column Y) {
      add new element/row to the list (including the variables of: "action duration", "Z avg", "Z std", "Z min", "Z max" and the "id") and move to the next 1
   }
}

(Not sure if the algorithm in the pseudo code is exactly what I was describing in the text, but at least I tried my best to include some kind of "code example" here as well :-))

梦年海沫深 2025-01-21 01:16:52

I believe you have multiple sequences of consecutive rows of ones and zeros. I think the approach is to generate a unique identifier for each sequence and then compute the statistics you want over each of these identifiers. This is easily done with data.table and data.table::rleid.
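
To see what rleid does on its own, here is a minimal sketch on a toy vector. The vector `y` is made up for illustration and is not part of the sample data.

```r
library(data.table)

# Toy vector standing in for column Y (made up for illustration)
y <- c(1, 1, 0, 0, 0, 1, 1, 1, 0, 1)

# rleid() bumps its counter every time the value changes,
# so each run of identical consecutive values shares one id
run_id <- rleid(y)
print(run_id)
# 1 1 2 2 2 3 3 3 4 5

# Ids that belong to runs of ones
ones_runs <- unique(run_id[y == 1])
print(ones_runs)
# 1 3 5
```

Because this toy vector starts with a run of ones, the runs of ones get the odd ids. Applied to the whole table, the same idea looks like this: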

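library(data.table)
setDT(dt)  # convert to a data.table by reference

# rleid() gives every run of identical consecutive Y values its own id
dt[, seqid := rleid(Y)]

# keep only the runs of ones and summarise each run
# (no %>% here: the pipe would need magrittr, which is never loaded)
dt[Y == 1, .(
    # fix the units so longer runs are not silently reported in minutes
    seq_dur = as.numeric(difftime(max(time), min(time), units = "secs")),
    meanZ = mean(Z),
    stdZ = sd(Z),
    minZ = min(Z),
    maxZ = max(Z)), by = .(seqid)]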

Output:

      seqid seq_dur       meanZ      stdZ        minZ      maxZ
   1:     1       4 -0.41937718 0.7936389 -1.15956013 0.7945978
   2:     3       5 -0.17031761 0.8429274 -1.41463319 0.8502819
   3:     5      29 -0.01909116 1.1878013 -2.32238540 2.7392739
   4:     7      17 -0.14040415 0.9600719 -1.82184504 1.1401493
   5:     9       6  0.14154931 1.1930633 -1.55719089 1.6827525
  ---                                                          
2431:  4861      13 -0.17095911 0.9193558 -2.14869215 1.1571597
2432:  4863      27 -0.06239130 1.0546947 -2.46668844 2.2189060
2433:  4865      27 -0.22289381 1.0330064 -2.32818061 2.6114507
2434:  4867       2  0.42001740 0.8060201 -0.09206373 1.3491093
2435:  4869       3 -0.68025767 1.5678846 -2.96307855 0.5092229
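
As a sanity check, the first output row can be reproduced by hand from the first five rows of the sample data, which form the first run of ones. This is only a sketch in base R: the values are copied from the printed sample, and the `UTC` time zone is an assumption that does not affect the difference.

```r
# First run of ones in the sample data: rows 1 to 5
z_run <- c(-0.152767922, -1.159560131, 0.794597776, -1.065863531, -0.513292070)
t_run <- as.POSIXct("2022-01-01 00:00:00", tz = "UTC") + 0:4

# Last timestamp minus first timestamp, in seconds
dur <- as.numeric(difftime(max(t_run), min(t_run), units = "secs"))

stats <- c(seq_dur = dur, meanZ = mean(z_run), stdZ = sd(z_run),
           minZ = min(z_run), maxZ = max(z_run))
print(round(stats, 7))
```

The printed values should agree with the row for seqid 1 above, up to the rounding of the copied Z values.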

Input:

set.seed(123)
dt = data.table(
  Y = unlist(lapply(1:10000, \(x) rep(sample(c(1,0),1), times=sample(3:8,1))))
)
dt[, time := seq(as.POSIXct("2022/1/1"),by=1, length.out=nrow(dt))]
dt[, Z:=rnorm(nrow(dt))]

head(dt, 25)

    Y                time            Z
 1: 1 2022-01-01 00:00:00 -0.152767922
 2: 1 2022-01-01 00:00:01 -1.159560131
 3: 1 2022-01-01 00:00:02  0.794597776
 4: 1 2022-01-01 00:00:03 -1.065863531
 5: 1 2022-01-01 00:00:04 -0.513292070
 6: 0 2022-01-01 00:00:05 -1.065909875
 7: 0 2022-01-01 00:00:06 -0.643175787
 8: 0 2022-01-01 00:00:07  0.817414048
 9: 0 2022-01-01 00:00:08 -0.629111341
10: 0 2022-01-01 00:00:09  1.491066477
11: 0 2022-01-01 00:00:10  0.233849804
12: 0 2022-01-01 00:00:11 -0.007799405
13: 0 2022-01-01 00:00:12 -1.314916805
14: 0 2022-01-01 00:00:13  0.335385778
15: 0 2022-01-01 00:00:14 -0.093167347
16: 0 2022-01-01 00:00:15  0.646596214
17: 0 2022-01-01 00:00:16 -0.969331732
18: 0 2022-01-01 00:00:17  1.681191187
19: 0 2022-01-01 00:00:18  0.357307413
20: 1 2022-01-01 00:00:19 -0.940199141
21: 1 2022-01-01 00:00:20  0.059556026
22: 1 2022-01-01 00:00:21  0.098646529
23: 1 2022-01-01 00:00:22  0.324442236
24: 1 2022-01-01 00:00:23 -1.414633187
25: 1 2022-01-01 00:00:24  0.850281851
    Y                time            Z
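
The question also asks for the "id" column and a CSV export, which the snippet above does not cover. Here is a minimal sketch of one way to add them, under some loud assumptions: the table has a column named `id`, the id of a run is taken from its first row, and the output column names and file name are made up for illustration.

```r
library(data.table)

# Small made-up table; the `id` column and all values are hypothetical
dt <- data.table(
  id   = c("a", "a", "a", "b", "b", "b", "b", "b"),
  Y    = c(1, 1, 0, 1, 1, 1, 0, 0),
  time = as.POSIXct("2022-01-01", tz = "UTC") + 0:7,
  Z    = c(2, 4, 9, 1, 3, 5, 9, 9)
)

# One id per run of identical consecutive Y values
dt[, seqid := rleid(Y)]

# Summarise the runs of ones; id is taken from the first row of each run
# (an assumption about what "id" should mean per run)
res <- dt[Y == 1, .(
  id = id[1L],
  action_duration = as.numeric(difftime(max(time), min(time), units = "secs")),
  Z_avg = mean(Z),
  Z_std = sd(Z),
  Z_min = min(Z),
  Z_max = max(Z)
), by = seqid]
print(res)

# Write the result to a CSV file (here in the session's temp directory)
out_file <- file.path(tempdir(), "action_stats.csv")
fwrite(res, out_file)
```

Note that a run consisting of a single 1 gets a duration of 0 and an `NA` standard deviation, since `sd()` needs at least two values.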
