Assigning group IDs with ddply

Posted on 2024-09-10 04:46:55

Pretty basic performance question from an R newbie. I'd like to assign a group ID to each row in a data frame by unique combinations of fields. Here's my current approach:

> # An example data frame
> df <- data.frame(name=c("Anne", "Bob", "Chris", "Dan", "Erin"), 
                   st.num=c("101", "102", "105", "102", "150"), 
                   st.name=c("Main", "Elm", "Park", "Elm", "Main"))
> df
   name st.num st.name
1  Anne    101    Main
2   Bob    102     Elm
3 Chris    105    Park
4   Dan    102     Elm
5  Erin    150    Main
> 
> # A function to generate a random string
> getString <- function(size=10) return(paste(sample(c(0:9, LETTERS, letters), size, replace=TRUE), collapse=''))
>
> # Assign a random string for each unique street number + street name combination
> df <- ddply(df, 
              c("st.num", "st.name"), 
              function(x) transform(x, household=getString()))
> df
   name st.num st.name  household
1  Anne    101    Main 1EZWm4BQel
2   Bob    102     Elm xNaeuo50NS
3   Dan    102     Elm xNaeuo50NS
4 Chris    105    Park Ju1NZfWlva
5  Erin    150    Main G2gKAMZ1cU

While this works well for data frames with relatively few rows or a small number of groups, I run into performance problems with larger data sets (> 100,000 rows) that have many unique groups.

Any suggestions to improve the speed of this task? Possibly with plyr's experimental idata.frame()? Or am I going about this all wrong?

Thanks in advance for your help.

Comments (2)

浮萍、无处依 2024-09-17 04:46:55

Try using the id function (also in plyr):

df$id <- id(df[c("st.num", "st.name")], drop = TRUE)

Update:

The id function in dplyr (as opposed to the plyr one used above) has been considered deprecated since dplyr version 0.5.0.
dplyr's group_indices function provides the same functionality.
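
For readers without plyr or dplyr at hand, the same idea (one integer per unique combination of columns) can be sketched in base R. This is an illustrative alternative, not a description of what id() does internally. The helper name group_id is hypothetical, the key separator is assumed not to occur in the data, and the IDs are numbered in order of first appearance, which may differ from the numbering id() returns.

```r
# Example data from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: integer group ID per unique combination of `cols`
group_id <- function(data, cols) {
  # Build one key per row; the separator is assumed not to occur in the data
  key <- do.call(paste, c(unname(as.list(data[cols])), sep = "\r"))
  # Position of each key among the unique keys, in order of first appearance
  match(key, unique(key))
}

df$id <- group_id(df, c("st.num", "st.name"))
```

Because paste(), unique() and match() are all vectorized, nothing is evaluated once per group, which is where the ddply approach spends most of its time.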

七七 2024-09-17 04:46:55

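Is it necessary that the ID be a random 10-character string? If not, why not just paste together the columns of the data frame? If the IDs need to be of similar length, convert the columns to their integer factor codes and paste those together. In R 4.0 and later the columns are character vectors rather than factors by default, so they are wrapped in factor() here, and a separator keeps codes such as 1 and 11 from running together:

df$ID <- paste(as.integer(factor(df$st.num)), as.integer(factor(df$st.name)), sep = "_")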

Then, if you really need 10-character IDs, I'd generate just n IDs, one per level, and rename the levels of ID with them:

df$ID <- as.factor(df$ID)
n <- nlevels(df$ID)

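getID <- function(n, size=10){
  # Preallocate one slot per ID
  out <- character(n)
  for(i in seq_len(n)){
    # Store the i-th ID instead of overwriting the previous one
    out[i] <- paste(sample(c(0:9, LETTERS, letters), size, replace=TRUE), collapse='')
  }
  return(out)
}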

newLevels <- getID(n = n)

levels(df$ID) <- newLevels
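
The relabelling step described above can also be written without an explicit per-ID loop. The sketch below is one possible way to do it, assuming that any unique random string per household is acceptable. make_ids is a hypothetical helper, not part of plyr or base R, and it redraws in the rare case that two random strings collide.

```r
# Example data from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: n distinct random strings of `size` characters
make_ids <- function(n, size = 10) {
  chars <- c(0:9, LETTERS, letters)
  ids <- character(0)
  # Keep drawing until n distinct strings exist (collisions are rare)
  while (length(ids) < n) {
    need <- n - length(ids)
    draws <- matrix(sample(chars, need * size, replace = TRUE), nrow = need)
    ids <- unique(c(ids, apply(draws, 1, paste, collapse = "")))
  }
  ids
}

# One factor level per unique street number + street name combination
grp <- interaction(df$st.num, df$st.name, drop = TRUE)

# Generate one string per level, then index by each row's level code
df$household <- make_ids(nlevels(grp))[as.integer(grp)]
```

The random strings are generated once per group rather than once per row, and the final indexing step is a single vectorized lookup.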

Also, as an aside, you don't need to wrap transform() in function(x) when calling ddply. This code would work just the same:

ddply(df, c("st.num", "st.name"), transform, household=getString())