How to number a list so that the numbering restarts whenever a value changes

Posted on 2024-12-10 09:04:18

I have two columns of values (col1 and col2), shown below.
I'd like to add two columns (col3 and col4) that do the following.
col3 numbers the rows within col2, restarting at 1 every time col2 changes (e.g. from b2 to b3).
col4 is TRUE on the last occurrence of each value in col2.

The data is sorted by col1, then by col2, to begin with.
Note: values in col2 can occur for different values of col1 (i.e. I can have b1 for every value of col1 (a, b, c)).

I can get this working fine for ~5000 rows (~6 sec), but when scaling to ~1 million rows it hangs.

Here is my code:

df$col3 <- 0
df$col4 <- FALSE
stopHere <- nrow(df)
c1 <- 'xxx'
c2 <- 'xxx'
for (i in 1:stopHere) {
  if (df[i, "col1"] != c1) {
    c2 <- 0
    c3 <- 1
    c1 <- df[i, "col1"]
  }
  if (df[i, "col2"] != c2) {
    df[i - 1, "col4"] <- TRUE
    c3 <- 1
    c2  <- df[i, "col2"]
  }
  df[i, "col3"] <- c3
  c3  <- c3 + 1
}

This is my desired output (row number, col1, col2, col3, col4):

1     a   b1    1 FALSE
2     a   b1    2 FALSE
3     a   b1    3  TRUE
4     a   b2    1 FALSE
5     a   b2    2  TRUE
6     a   b3    1 FALSE
7     a   b3    2 FALSE
8     a   b3    3 FALSE
9     a   b3    4 FALSE
10    a   b3    5  TRUE
11    b   b1    1 FALSE
12    b   b1    2 FALSE
13    b   b1    3 FALSE
14    b   b1    4  TRUE
15    b   b2    1 FALSE
16    b   b2    2 FALSE
17    b   b2    3 FALSE
18    b   b2    4  TRUE
19    c   b1    1  TRUE
20    c   b2    1 FALSE
21    c   b2    2 FALSE
22    c   b2    3  TRUE
23    c   b3    1 FALSE
24    c   b3    2  TRUE
25    c   b4    1 FALSE
26    c   b4    2 FALSE
27    c   b4    3 FALSE
28    c   b4    4 FALSE

别闹i 2024-12-17 09:04:18

Here is a vectorized solution that works for your sample data:

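dat <- data.frame(
  V1 = rep(letters[1:3], c(10, 8, 10)),
  V2 = rep(paste("b", c(1:3, 1:2, 1:4), sep = ""), c(3, 2, 5, 4, 4, 1, 3, 2, 4))
)

Create columns 3 and 4:

# Lengths of the consecutive runs of identical V2 values
zz <- rle(as.character(dat$V2))$lengths
# Count 1..n within each run
dat$V3 <- sequence(zz)
dat$V4 <- FALSE
# Flag the last row of every run except the final one (matches the sample output)
dat$V4[head(cumsum(zz), -1)] <- TRUE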

The results:

dat
   V1 V2 V3    V4
1   a b1  1 FALSE
2   a b1  2 FALSE
3   a b1  3  TRUE
4   a b2  1 FALSE
5   a b2  2  TRUE
6   a b3  1 FALSE
7   a b3  2 FALSE
8   a b3  3 FALSE
9   a b3  4 FALSE
10  a b3  5  TRUE
11  b b1  1 FALSE
12  b b1  2 FALSE
13  b b1  3 FALSE
14  b b1  4  TRUE
15  b b2  1 FALSE
16  b b2  2 FALSE
17  b b2  3 FALSE
18  b b2  4  TRUE
19  c b1  1  TRUE
20  c b2  1 FALSE
21  c b2  2 FALSE
22  c b2  3  TRUE
23  c b3  1 FALSE
24  c b3  2  TRUE
25  c b4  1 FALSE
26  c b4  2 FALSE
27  c b4  3 FALSE
28  c b4  4 FALSE
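
The code above runs rle() on V2 alone, so it relies on one V1 group never ending with the same V2 value that the next group starts with (true for the sample data). A minimal sketch, assuming that boundary case can occur in the real data, builds the runs from both columns instead. The toy data frame `d` and the `key` variable are illustrative names, and the "|" separator is assumed not to appear in the values:

```r
# Toy data: group "a" ends with b1 and group "b" starts with b1,
# so runs computed on V2 alone would merge across the group boundary.
d <- data.frame(
  V1 = c("a", "a", "b", "b", "b"),
  V2 = c("b1", "b1", "b1", "b2", "b2"),
  stringsAsFactors = FALSE
)

# Combine both columns into one key before run-length encoding.
key <- paste(d$V1, d$V2, sep = "|")
len <- rle(key)$lengths

d$V3 <- sequence(len)        # 1 2 1 1 2
d$V4 <- FALSE
d$V4[cumsum(len)] <- TRUE    # last row of every run, including the final one
```

In this sketch the final row is flagged too; wrapping cumsum(len) in head(..., -1), as in the code above, reproduces the sample output from the question.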

画骨成沙 2024-12-17 09:04:18

Some example data would be helpful. Nevertheless, this should be a good place to start. With 3 unique values in col1 and 4 in col2, it only takes a second for 10^6 rows:

n = 10^6

col1 = sample(c('a', 'b', 'c'), n, replace=TRUE)
col2 = sample(paste('b', 1:4, sep=''), n, replace=TRUE)

data = data.frame(col1, col2, col3=0, col4=FALSE)
data = data[do.call(order, data), ]   # sort by col1, then col2

# Count 1..n within each (col1, col2) group; t() puts the groups in sorted row order
data$col3 = unlist(t(tapply(data$col2, data[,1:2], function(x) seq_along(x))))
# A row is the last of its group when the next count does not increase
data$col4[c(diff(data$col3), -1) <= 0] = TRUE

如梦初醒的夏天 2024-12-17 09:04:18

First, make the starting data reproducible, with col1 and col2 as columns of a data frame.

dat <- read.table(textConnection(
"a   b1
a   b1
a   b1
a   b2
a   b2
a   b3
a   b3
a   b3
a   b3
a   b3
b   b1
b   b1
b   b1
b   b1
b   b2
b   b2
b   b2
b   b2
c   b1
c   b2
c   b2
c   b2
c   b3
c   b3
c   b4
c   b4
c   b4
c   b4"), stringsAsFactors=FALSE)
names(dat) <- c("col1", "col2")

Run-length encoding gives the lengths of your sequences, since the data starts out sorted.

runs <- rle(dat$col2)
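
As a quick illustration of what rle() returns, here is a sketch on a toy vector (the vector `x` is made up for this example and is not part of the data above):

```r
x <- c("b1", "b1", "b1", "b2", "b2", "b3")
r <- rle(x)

r$values            # value of each run:                  "b1" "b2" "b3"
r$lengths           # length of each run:                 3 2 1
cumsum(r$lengths)   # position of the last row of each run: 3 5 6
```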

Now manipulate that information. For each element of the lengths component, create a sequence of that length and concatenate them all. The indices of the TRUE values for col4 can be obtained from the cumulative sum (cumsum) of the lengths.

dat$col3 <- unlist(sapply(runs$lengths, function(l) seq(length.out=l)))
dat$col4 <- FALSE
dat$col4[cumsum(runs$lengths)] <- TRUE

The result:

> dat
   col1 col2 col3  col4
1     a   b1    1 FALSE
2     a   b1    2 FALSE
3     a   b1    3  TRUE
4     a   b2    1 FALSE
5     a   b2    2  TRUE
6     a   b3    1 FALSE
7     a   b3    2 FALSE
8     a   b3    3 FALSE
9     a   b3    4 FALSE
10    a   b3    5  TRUE
11    b   b1    1 FALSE
12    b   b1    2 FALSE
13    b   b1    3 FALSE
14    b   b1    4  TRUE
15    b   b2    1 FALSE
16    b   b2    2 FALSE
17    b   b2    3 FALSE
18    b   b2    4  TRUE
19    c   b1    1  TRUE
20    c   b2    1 FALSE
21    c   b2    2 FALSE
22    c   b2    3  TRUE
23    c   b3    1 FALSE
24    c   b3    2  TRUE
25    c   b4    1 FALSE
26    c   b4    2 FALSE
27    c   b4    3 FALSE
28    c   b4    4  TRUE

Note that the last row has col4 TRUE, which matches your written description (the last of a set is TRUE) but does not match your example output. I don't know which you want.
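
Either behaviour is a one-line change. A minimal sketch, wrapping the steps above in a hypothetical helper `number_runs()` whose `flag_final` argument (an assumed name, not from the question) controls whether the very last row is marked:

```r
# Hypothetical helper: number consecutive runs in x and flag the last row of each run.
number_runs <- function(x, flag_final = TRUE) {
  len  <- rle(as.character(x))$lengths
  ends <- cumsum(len)                       # last position of every run
  if (!flag_final) ends <- head(ends, -1)   # leave the end of the final run unflagged
  is_last <- logical(length(x))
  is_last[ends] <- TRUE
  data.frame(col3 = sequence(len), col4 = is_last)
}

x <- c("b1", "b1", "b2", "b3", "b3")
number_runs(x)                       # col4: FALSE TRUE TRUE FALSE TRUE
number_runs(x, flag_final = FALSE)   # col4: FALSE TRUE TRUE FALSE FALSE
```

With flag_final = TRUE the result follows the written description; with FALSE it follows the example output in the question.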

少女的英雄梦 2024-12-17 09:04:18

This solution needs no loops, rle, or other clever functions; just plain merge and aggregate.

First, prepare the data (using Andrie's code):

df <- data.frame(
  x = rep(letters[1:3], c(10, 8, 10)),
  y = rep(paste("b", c(1:3, 1:2, 1:4) ,sep=""), c(3, 2, 5, 4, 4, 1, 3, 2, 4))
)

The solution:

minmax <- with(df, merge(
                aggregate(seq(x), by = list(x = x, y = y), min),
                aggregate(seq(x), by = list(x = x, y = y), max)
          ))

names(minmax)[3:4] = c("min", "max") # unique pairs with min/max global order

result <- with(merge(df, minmax), 
    data.frame(x, y, count = seq(x) - min + 1, last = seq(x) == max))

This solution assumes that the input is sorted as you said, but can be easily modified to work on unsorted tables (and keep them unsorted).
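
One way the unsorted case might be handled is sketched below. It uses base R's ave() rather than the merge/aggregate code above, so it is a swapped-in technique and not this answer's own modification; the toy data frame `d` is illustrative:

```r
# Unsorted toy data: rows belonging to the same (x, y) pair are not adjacent.
d <- data.frame(
  x = c("a", "b", "a", "a", "b"),
  y = c("b1", "b1", "b1", "b2", "b1"),
  stringsAsFactors = FALSE
)

idx <- seq_len(nrow(d))
# Running count within each (x, y) pair, taken in original row order.
d$count <- ave(idx, d$x, d$y, FUN = seq_along)
# A row is the last of its pair when its count equals the size of the pair's group.
d$last <- d$count == ave(idx, d$x, d$y, FUN = length)
```

Rows keep their original order, and "last" here means the last occurrence of each (x, y) pair anywhere in the table.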
