In R, how do I keep only the first row of each run of repeated values in a column?

Posted 2025-02-06 18:36:17

I want to keep only the rows where the value in a column (the last column in the example below) changes from the previous row. My dataframe is an xts object.

In the example below, I would keep the first row with a 2 in the last column, but not the next two rows, because they are unchanged from that first 2. I'd then keep the next three rows (the sequence 3, 2, 3) because the value changes each time, remove the next four rows because they don't change, and so on. The final data frame would look like the smaller one below the original.

Any help is appreciated!

Original Dataframe

2007-01-31 2.72   4.75        2
2007-02-28 2.82   4.75        2
2007-03-31 2.85   4.75        2
2007-04-30 2.74   4.75        3
2007-05-31 2.46   4.75        2
2007-06-30 2.98   4.75        3
2007-07-31 4.19   4.75        3
2007-08-31 4.55   4.75        3
2007-09-30 4.20   4.75        3
2007-10-31 4.36   4.75        3
2007-11-30 5.75   4.76        4
2007-12-31 5.92   4.76        4
2008-01-31 6.95   4.87        4
2008-02-29 7.67   4.87        4
2008-03-31 8.21   4.90        4
2008-04-30 6.86   4.91        1
2008-05-31 6.53   5.07        1
2008-06-30 7.35   5.08        1
2008-07-31 8.00   5.13        4
2008-08-31 8.36   5.19        4

Final Dataframe

2007-01-31 2.72   4.75        2
2007-04-30 2.74   4.75        3
2007-05-31 2.46   4.75        2
2007-06-30 2.98   4.75        3
2007-11-30 5.75   4.76        4
2008-04-30 6.86   4.91        1
2008-07-31 8.00   5.13        4


破晓 2025-02-13 18:36:17

Here's another solution using run length encoding rle().

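# rle() gives the length of each run of identical values in V4
lens <- rle(df$V4)$lengths
# cumsum(lens) marks the last row of each run; step back to its first row
df[cumsum(lens) - lens + 1, ]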

Output:

           V1   V2   V3 V4
1  2007-01-31 2.72 4.75  2
4  2007-04-30 2.74 4.75  3
5  2007-05-31 2.46 4.75  2
6  2007-06-30 2.98 4.75  3
11 2007-11-30 5.75 4.76  4
16 2008-04-30 6.86 4.91  1
19 2008-07-31 8.00 5.13  4
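
To see why `cumsum(lens) - lens + 1` lands on the first row of each run, here is a minimal base R sketch on a toy vector. The vector `v` and the names `runs`, `ends`, and `starts` are illustrative assumptions, not part of the answer; `v` copies the first eleven values of the last column in the question.

```r
# Toy vector: the first eleven values of the last column in the question
v <- c(2, 2, 2, 3, 2, 3, 3, 3, 3, 3, 4)

runs <- rle(v)
runs$values   # the value of each run:   2 3 2 3 4
runs$lengths  # how long each run lasts: 3 1 1 5 1

# cumsum(lengths) is the position of the LAST element of each run;
# subtracting the run length and adding 1 moves back to the FIRST element.
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
starts        # 1 4 5 6 11
```

These positions match the row names 1, 4, 5, 6 and 11 in the output above.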


舟遥客 2025-02-13 18:36:17

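You can filter with data.table::shift() and add the first row back with rbind(). The comparison with the shifted column is NA for the first row, so the filter alone would drop it:

library(data.table)
rbind(setDT(dt)[1], dt[v3 != shift(v3)])

Or an equivalent approach using dplyr:

library(dplyr)
bind_rows(dt[1, ], filter(dt, v3 != lag(v3)))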

Output:

         date    v1    v2    v3
       <IDat> <num> <num> <int>
1: 2007-01-31  2.72  4.75     2
2: 2007-04-30  2.74  4.75     3
3: 2007-05-31  2.46  4.75     2
4: 2007-06-30  2.98  4.75     3
5: 2007-11-30  5.75  4.76     4
6: 2008-04-30  6.86  4.91     1
7: 2008-07-31  8.00  5.13     4
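
The same previous-row comparison can be sketched in base R without either package, by building the logical filter so that the first row is always TRUE. This is only an illustration under assumed data: the small data frame below is hypothetical, and only the column name `v3` is taken from the answer.

```r
# Hypothetical toy data; only the column name v3 follows the answer above
dt <- data.frame(
  date = as.Date("2007-01-31") + 0:6,
  v3   = c(2, 2, 3, 2, 3, 3, 4)
)

n <- nrow(dt)
# TRUE for the first row, then TRUE wherever v3 differs from the previous row
changed <- c(TRUE, dt$v3[-1] != dt$v3[-n])

result <- dt[changed, ]
result$v3  # 2 3 2 3 4
```

Because the first element of `changed` is set explicitly, no separate rbind() step is needed to restore the first row.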


笑咖 2025-02-13 18:36:17

DATA

x <- "
2007-01-31 2.72   4.75        2
2007-02-28 2.82   4.75        2
2007-03-31 2.85   4.75        2
2007-04-30 2.74   4.75        3
2007-05-31 2.46   4.75        2
2007-06-30 2.98   4.75        3
2007-07-31 4.19   4.75        3
2007-08-31 4.55   4.75        3
2007-09-30 4.20   4.75        3
2007-10-31 4.36   4.75        3
2007-11-30 5.75   4.76        4
2007-12-31 5.92   4.76        4
2008-01-31 6.95   4.87        4
2008-02-29 7.67   4.87        4
2008-03-31 8.21   4.90        4
2008-04-30 6.86   4.91        1
2008-05-31 6.53   5.07        1
2008-06-30 7.35   5.08        1
2008-07-31 8.00   5.13        4
2008-08-31 8.36   5.19        4
"
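df <- read.table(text = x, header = FALSE)

Then use these two lines:

# V5 is the change from the previous row; the leading 1 keeps the first row
df$V5 <- c(1, diff(df$V4))
# keep rows where the value changed, and drop the helper column
df[abs(df$V5) > 0, ][1:4]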

#>            V1   V2   V3 V4
#> 1  2007-01-31 2.72 4.75  2
#> 4  2007-04-30 2.74 4.75  3
#> 5  2007-05-31 2.46 4.75  2
#> 6  2007-06-30 2.98 4.75  3
#> 11 2007-11-30 5.75 4.76  4
#> 16 2008-04-30 6.86 4.91  1
#> 19 2008-07-31 8.00 5.13  4

Created on 2022-06-12 by the reprex package (v2.0.1)
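
The question mentions an xts object, while the code in this thread works on a data frame or data.table. As a rough sketch, the diff() idea could be applied to xts by comparing the plain numeric values and then subsetting the xts object with the resulting logical vector. It assumes the xts package is installed; the object `x`, its column names, and the helper names `key`, `keep`, and `x_first` are hypothetical and not from the original post.

```r
library(xts)  # assumption: the xts package is installed

# Hypothetical small xts object; the column "key" plays the role of the last column
dates <- as.Date(c("2007-01-31", "2007-02-28", "2007-03-31", "2007-04-30",
                   "2007-05-31", "2007-06-30", "2007-07-31"))
x <- xts(cbind(a   = c(2.72, 2.82, 2.85, 2.74, 2.46, 2.98, 4.19),
               b   = rep(4.75, 7),
               key = c(2, 2, 2, 3, 2, 3, 3)),
         order.by = dates)

# Work on the plain numeric values so diff() returns an ordinary vector
key <- as.numeric(coredata(x[, "key"]))

# Keep the first row, plus every row whose key differs from the previous row
keep <- c(TRUE, diff(key) != 0)
x_first <- x[keep, ]

index(x_first)  # 2007-01-31, 2007-04-30, 2007-05-31, 2007-06-30
```

Note that diff() needs a numeric column. For a character or factor column, a direct comparison of each value with the previous one would be the safer choice.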
