Assigning group IDs with ddply
Pretty basic performance question from an R newbie. I'd like to assign a group ID to each row in a data frame by unique combinations of fields. Here's my current approach:
> library(plyr)
>
> # An example data frame
> df <- data.frame(name=c("Anne", "Bob", "Chris", "Dan", "Erin"),
+                  st.num=c("101", "102", "105", "102", "150"),
+                  st.name=c("Main", "Elm", "Park", "Elm", "Main"))
> df
   name st.num st.name
1  Anne    101    Main
2   Bob    102     Elm
3 Chris    105    Park
4   Dan    102     Elm
5  Erin    150    Main
>
> # A function to generate a random string
> getString <- function(size=10) paste(sample(c(0:9, LETTERS, letters), size, replace=TRUE), collapse='')
>
> # Assign a random string for each unique street number + street name combination
> df <- ddply(df,
+             c("st.num", "st.name"),
+             function(x) transform(x, household=getString()))
> df
   name st.num st.name  household
1  Anne    101    Main 1EZWm4BQel
2   Bob    102     Elm xNaeuo50NS
3   Dan    102     Elm xNaeuo50NS
4 Chris    105    Park Ju1NZfWlva
5  Erin    150    Main G2gKAMZ1cU
While this works well for data frames with relatively few rows or a small number of groups, I run into performance problems with larger data sets (> 100,000 rows) that have many unique groups.
Any suggestions to improve the speed of this task? Possibly with plyr's experimental idata.frame()? Or am I going about this all wrong?
Thanks in advance for your help.
Comments (2)
Try using the id function (also in plyr).
Update: The id function is considered deprecated since dplyr version 0.5.0. The function group_indices provides the same functionality.
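The code that accompanied this answer did not survive on this page, so the following is only a sketch of what the suggestion might look like, using the data frame from the question. It assumes the plyr and dplyr packages are installed; the column names household and household2 are picked here for illustration. Functions are called with their package prefix so that neither package has to be attached.

```r
# Example data from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"))

# plyr::id() returns one integer per row; rows that share the same
# st.num + st.name combination get the same integer.
# drop = TRUE numbers only the combinations that actually occur.
# as.integer() strips the "n" attribute that id() attaches to its result.
df$household <- as.integer(plyr::id(df[c("st.num", "st.name")], drop = TRUE))

# dplyr equivalent mentioned in the update: group first, then ask for the indices.
df$household2 <- dplyr::group_indices(dplyr::group_by(df, st.num, st.name))
```

Both calls work on whole columns at once instead of calling a function once per group, which is presumably why they scale better than the ddply approach when there are many groups. The result is an integer ID rather than a random string.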
Is it necessary that the ID be a random 10-character string? If not, why not just paste together the columns of the data frame? If the IDs must all be the same length in characters, convert the factors to numeric, then paste them together.
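The original snippet for this step is missing here, so this is a rough base R sketch of the two ideas. The column names ID and ID.num, the "_" separator, and the zero-padding are assumptions added for illustration. In recent R versions data.frame() keeps strings as character, so the columns are converted to factors explicitly before taking their integer codes.

```r
# Example data from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"))

# Variant 1: paste the key columns together.
# The separator keeps e.g. "10" + "1Main" distinct from "101" + "Main".
df$ID <- paste(df$st.num, df$st.name, sep = "_")

# Variant 2: IDs of equal length, built from the integer codes of each factor.
num.code  <- as.integer(factor(df$st.num))
name.code <- as.integer(factor(df$st.name))

# Zero-pad each code to the width of its largest value so every ID
# has the same number of characters (an assumption, not from the answer).
pad <- function(code) formatC(code, width = nchar(max(code)), flag = "0")
df$ID.num <- paste0(pad(num.code), pad(name.code))
```

Both variants are vectorised, so the cost does not grow with the number of groups the way a per-group function call does.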
Then, if you really need 10-character IDs, I'd generate just n IDs (one per unique group) and rename the levels of ID with them.
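One way this relabelling step could look, as a sketch only: getString() is the helper from the question, and the pasted key with a "_" separator is an assumption carried over for illustration.

```r
# Example data from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"))

# Random-string helper from the question
getString <- function(size = 10) {
  paste(sample(c(0:9, LETTERS, letters), size, replace = TRUE), collapse = "")
}

# Build a factor with one level per unique st.num + st.name combination.
df$ID <- factor(paste(df$st.num, df$st.name, sep = "_"))

# Generate one random string per level (n strings, not one per row)
# and use them as the new level labels.
n <- nlevels(df$ID)
levels(df$ID) <- replicate(n, getString())
```

getString() is called only once per unique group here, and the relabelling itself is a single assignment. If two generated strings happened to collide, the two levels would be merged; with 10 characters drawn from 62 symbols that is very unlikely, but a check with anyDuplicated() before assigning would rule it out.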
Also, as an aside, you don't need to wrap transform() in function(x) when calling ddply that way. Passing transform to ddply directly would work just the same.
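The snippet that followed is not preserved on this page; a plausible reconstruction, assuming plyr is installed and the code is run at top level so that getString() can be found when transform evaluates it, is:

```r
library(plyr)

# Example data and helper from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"))

getString <- function(size = 10) {
  paste(sample(c(0:9, LETTERS, letters), size, replace = TRUE), collapse = "")
}

# ddply forwards any extra arguments to the function it is given,
# so transform can be passed by name instead of wrapped in function(x).
df <- ddply(df, c("st.num", "st.name"), transform, household = getString())
```

This only shortens the call; transform is still invoked once per group, so it is not expected to fix the performance problem by itself.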