How to assign a number of repeats to a data frame based on elements of an identifying vector in R?

Posted on 2024-12-08 07:09:59

I have a data frame in which individuals are assigned a text id that concatenates a place name with a personal id (see the data below). Ultimately, I need to transform the data set from "long" to "wide" (e.g., using "reshape") so that each individual occupies only one row. To do that, I need to assign a "time" variable that reshape can use to identify time-varying covariates, etc. I have (probably bad) code that does this for individuals that repeat up to two times, but I need to be able to identify up to 18 repeated occurrences. The code below works fine if I remove the line preceded by the hash, but it only identifies up to two repeats. If I leave that line in (which would seem necessary for individuals repeated more than twice), R chokes, giving the following error (presumably because the first individual is repeated only twice):

Error in if (data$uid[i] == data$uid[i - 2]) { : 
  argument is of length zero

Can anyone help with this? Thanks in advance!

place <- rep("ny",10)
pid <- c(1,1,2,2,2,3,4,4,5,5)
uid<- paste(place,pid,sep="")
time <- rep(0,10)
data <- cbind(uid,time)
data <- as.data.frame(data)
data$time <- as.numeric(data$time)

#bad code
data$time[1] <- 1 #need to set first so that loop doesn't go to a row that doesn't exist     (i.e., row 0)
for (i in 2:NROW(data)){
    data$time[i] <- 1 #set first occurrence to 1
    if (data$uid[i] == data$uid[i-1]) {data$time[i] <- 2} #set second occurrence to 2, etc.
    #if (data$uid[i] == data$uid[i-2]) {data$time[i] <- 3}
    i <- i+1
}
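
A note on the error itself: the loop starts at i = 2, so the commented-out line would ask for data$uid[i - 2], which is data$uid[0]. In R, index 0 selects nothing, so the comparison has length zero and if() refuses it. The following is a minimal sketch that reproduces this in isolation; it uses a plain character vector rather than the data frame above, and the names cmp and res are only illustrative.

```r
# Same ids as in the question, kept as a plain character vector
uid <- paste0("ny", c(1, 1, 2, 2, 2, 3, 4, 4, 5, 5))

i <- 2
length(uid[i - 2])            # index 0 selects nothing

# Comparing against a zero-length vector gives a zero-length logical
cmp <- uid[i] == uid[i - 2]
length(cmp)

# if() needs a condition of length one, so this raises the reported error;
# tryCatch captures the message instead of stopping the script
res <- tryCatch(
  if (cmp) "same" else "different",
  error = function(e) conditionMessage(e)
)
res
```

So the failure comes from the first loop iteration looking back past the start of the vector, whatever the number of repeats of the first individual.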

Comments (4)

云朵有点甜 2024-12-15 07:09:59

It's unclear what you are trying to do, but I think you're saying that you need to create a time index for each row by every unique uid. Is that right?

If so, give this a whirl

library(plyr)
ddply(data, "uid", transform, time = seq_along(uid))

Will give you something like:

   uid time
1  ny1    1
2  ny1    2
3  ny2    1
4  ny2    2
5  ny2    3
....
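
Once such a time index exists, the long-to-wide step mentioned in the question can be sketched with base R's reshape(). This is only an illustration under assumptions: the score column is a hypothetical time-varying covariate that does not appear in the question, and the time values are typed in by hand to match the index shown above.

```r
# Long format: one row per observation. 'score' is a made-up covariate.
d <- data.frame(
  uid   = paste0("ny", c(1, 1, 2, 2, 2, 3)),
  time  = c(1, 2, 1, 2, 3, 1),        # per-uid index, as in the output above
  score = c(10, 11, 20, 21, 22, 30),
  stringsAsFactors = FALSE
)

# Wide format: one row per uid, one score column per time point
wide <- reshape(d, idvar = "uid", timevar = "time", direction = "wide")
wide
```

Individuals with fewer occurrences than the maximum get NA in the unused columns, for example score.3 for ny1.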

2024-12-15 07:09:59

Is this what you have in mind?

> d <- data.frame(uid = paste("ny",c(1,2,1,2,2,3,4,4,5,5),sep=""))
> out <- do.call(rbind, lapply(split(d, d$uid), function(x) {x$time <- 1:nrow(x); x}))
> rownames(out) <- NULL
> out
   uid time
1  ny1    1
2  ny1    2
3  ny2    1
4  ny2    2
5  ny2    3
6  ny3    1
7  ny4    1
8  ny4    2
9  ny5    1
10 ny5    2
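
For comparison, the same split-and-number idea can be written more compactly with base R's ave(), which applies a function within each group and puts the results back in the original row positions. This is a sketch of an alternative, not part of the answer above; it reuses the same unsorted ids.

```r
# Same unsorted ids as in the example above
d <- data.frame(uid = paste0("ny", c(1, 2, 1, 2, 2, 3, 4, 4, 5, 5)),
                stringsAsFactors = FALSE)

# Number the rows within each uid; the row order of d is left untouched
d$time <- ave(seq_along(d$uid), d$uid, FUN = seq_along)
d
```

Unlike the split/rbind version, the rows are not regrouped by uid, so the third row (ny1) gets time 2 while staying in third place.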

心病无药医 2024-12-15 07:09:59

Using your data frame setup:

place <- rep("ny",10)
pid <- c(1,1,2,2,2,3,4,4,5,5)
uid<- paste(place,pid,sep="")
time <- rep(0,10)
data <- cbind(uid,time)
data <- as.data.frame(data)

You can use:

data$time <- sequence(table(data$uid))
data

To get:

> data
   uid time
1  ny1    1
2  ny1    2
3  ny2    1
4  ny2    2
5  ny2    3
6  ny3    1
7  ny4    1
8  ny4    2
9  ny5    1
10 ny5    2

NOTE: Your data.frame MUST be sorted by uid first for this to work.
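
To make the sorting requirement concrete, here is a small sketch that starts from deliberately unsorted ids and sorts them with order() before applying sequence(table(...)). The data frame d is illustrative and differs from the setup above only in the order of the ids.

```r
# Deliberately unsorted ids
d <- data.frame(uid = paste0("ny", c(1, 2, 1, 2, 2, 3, 4, 4, 5, 5)),
                stringsAsFactors = FALSE)

# Sort by uid first so the rows line up with the counts from table()
d <- d[order(d$uid), , drop = FALSE]

# table() counts rows per uid; sequence() expands each count n into 1..n
d$time <- sequence(as.vector(table(d$uid)))
d
```

Both order() and table() sort the ids as text, so an id such as ny10 would come before ny2; the two stay consistent with each other, but the row order may not be the numeric one.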

绿光 2024-12-15 07:09:59

After trying the above solutions on large data sets, I decided to write my own loop for this. It was very time-consuming and still required the data to be broken into 50k-element vectors, but it did work in the end:

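data$repeats <- 1 # assumed initialization (not shown in the original post): every row starts at 1
system.time(for (i in 2:length(data$uid)) {
  # same uid as the previous row: continue counting from that row
  if (data$uid[i] == data$uid[i-1]) data$repeats[i] <- data$repeats[i-1] + 1
  if ((i %% 1000) == 0) { # helps to keep track of how far the loop has gotten
    print(i)
  }
})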

Thanks to all for your help.
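
The loop above counts consecutive repeats of the same uid. A loop-free way to get the same kind of count is to combine rle(), which finds the length of each run, with sequence(). This is a sketch of an alternative and has not been timed against the loop on large data; the names runs and repeats are illustrative.

```r
# Ids in the same order as the question's data
uid <- paste0("ny", c(1, 1, 2, 2, 2, 3, 4, 4, 5, 5))

# rle() needs an atomic vector, so convert first in case uid is a factor
runs <- rle(as.character(uid))

# Expand each run length n into 1..n
repeats <- sequence(runs$lengths)
repeats
```

Like the loop, this only looks at adjacent rows, so the data still need to be grouped by uid beforehand.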
