处理 R 中的重复任务

发布于 2024-11-06 06:37:30 字数 891 浏览 0 评论 0原文

我经常发现自己必须在 R 中执行重复性任务。必须不断地在一个或多个数据结构上一遍又一遍地运行相同的函数,这非常令人沮丧。

例如,假设我在 R 中有三个单独的数据帧,并且我想删除每个数据帧中具有缺失值的行。对于三个数据帧,在每个 df 上运行 na.omit() 并不是那么困难,但效率可能非常低 当一个人有一百个相似的数据结构需要相同的操作时。

df1 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
             variable=c(2004,2004,2004,2004,2004,2004), value=c(35,20,20,50,30,NA))

df2 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
            variable=c(2005,2005,2005,2005,2005,2005), value=c(55,350,40,90,99,NA))

df3 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
           variable=c(2006,2006,2006,2006,2006,2006), value=c(300,200,200,500,300,NA))

tot04 <- na.omit(df1)
tot05 <- na.omit(df2)
tot06 <- na.omit(df3)

在 R 中处理重复任务的一般准则是什么?

是的,我认识到这个问题的答案是特定于人们所面临的任务的,但我只是询问用户在执行重复性任务时应该考虑的一般事项。

I often find myself having to perform repetitive tasks in R. It gets extremely frustrating having to constantly run the same function on one or more data structures over and over again.

For example, let's say I have three separate data frames in R, and I want to delete the rows in each data frame which possess a missing value. With three data frames, it's not all that difficult to run na.omit() on each of the df's, but it can get extremely inefficient
when one has one hundred similar data structures which require the same action.

df1 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
             variable=c(2004,2004,2004,2004,2004,2004), value=c(35,20,20,50,30,NA))

df2 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
            variable=c(2005,2005,2005,2005,2005,2005), value=c(55,350,40,90,99,NA))

df3 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
           variable=c(2006,2006,2006,2006,2006,2006), value=c(300,200,200,500,300,NA))

tot04 <- na.omit(df1)
tot05 <- na.omit(df2)
tot06 <- na.omit(df3)

What are some general guidelines for dealing with repetitive tasks in R?

Yes, I recognise that the answer to this question is specific to the task that one faces, but I'm just asking about general things that a user should consider when they have a repetitive task.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

左耳近心 2024-11-13 06:37:30

作为一般准则,如果您想要对多个对象应用相同的操作,则应该将它们收集到一个数据结构中。然后你可以使用循环、[sl]apply 等一次性完成操作。在这种情况下,您可以将它们放入数据帧列表中,然后运行 ​​na.omit<,而不是使用单独的数据帧 df1df2 等。 /code> 所有这些:

dflist <- list(df1, df2, <...>)
dflist <- lapply(dflist, na.omit)

As a general guideline, if you have several objects that you want to apply the same operations to, you should collect them into one data structure. Then you can use loops, [sl]apply, etc to do the operations in one go. In this case, instead of having separate data frames df1, df2, etc, you could put them into a list of data frames and then run na.omit on all of them:

dflist <- list(df1, df2, <...>)
dflist <- lapply(dflist, na.omit)
鲸落 2024-11-13 06:37:30

除了 @Hong Ooi 的回答之外,我建议查看包 plyrreshape。在您的情况下,以下内容可能有用:

df1$name <- "var1"
df2$name <- "var2" 
df3$name <- "var3"
df <- rbind(df1,df2,df3)
df <- na.omit(df)

##Get various means:
> ddply(df,~name,summarise,AvgName=mean(value))
  name AvgName
  1 var1    31.0
  2 var2   126.8
  3 var3   300.0

> ddply(df,~Region,summarise,AvgRegion=mean(value)) 
     Region AvgRegion
1    Africa 190.00000
2      Asia 130.00000
3    Europe  86.66667
4 N.America 213.33333
5 S.America 143.00000


> ddply(df,~variable,summarise,AvgVar=mean(value))
  variable AvgVar
1     2004   31.0
2     2005  126.8
3     2006  300.0

##Transform the data.frame into another format   
> cast(Region+variable~name,data=df)
      Region variable var1 var2 var3
1     Africa     2004   20   NA   NA
2     Africa     2005   NA  350   NA
3     Africa     2006   NA   NA  200
4       Asia     2004   35   NA   NA
5       Asia     2005   NA   55   NA
6       Asia     2006   NA   NA  300
7     Europe     2004   20   NA   NA
8     Europe     2005   NA   40   NA
9     Europe     2006   NA   NA  200
10 N.America     2004   50   NA   NA
11 N.America     2005   NA   90   NA
12 N.America     2006   NA   NA  500
13 S.America     2004   30   NA   NA
14 S.America     2005   NA   99   NA
15 S.America     2006   NA   NA  300

Besides @Hong Ooi answer I suggest looking into packages plyr and reshape. In your case following might be useful:

df1$name <- "var1"
df2$name <- "var2" 
df3$name <- "var3"
df <- rbind(df1,df2,df3)
df <- na.omit(df)

##Get various means:
> ddply(df,~name,summarise,AvgName=mean(value))
  name AvgName
  1 var1    31.0
  2 var2   126.8
  3 var3   300.0

> ddply(df,~Region,summarise,AvgRegion=mean(value)) 
     Region AvgRegion
1    Africa 190.00000
2      Asia 130.00000
3    Europe  86.66667
4 N.America 213.33333
5 S.America 143.00000


> ddply(df,~variable,summarise,AvgVar=mean(value))
  variable AvgVar
1     2004   31.0
2     2005  126.8
3     2006  300.0

##Transform the data.frame into another format   
> cast(Region+variable~name,data=df)
      Region variable var1 var2 var3
1     Africa     2004   20   NA   NA
2     Africa     2005   NA  350   NA
3     Africa     2006   NA   NA  200
4       Asia     2004   35   NA   NA
5       Asia     2005   NA   55   NA
6       Asia     2006   NA   NA  300
7     Europe     2004   20   NA   NA
8     Europe     2005   NA   40   NA
9     Europe     2006   NA   NA  200
10 N.America     2004   50   NA   NA
11 N.America     2005   NA   90   NA
12 N.America     2006   NA   NA  500
13 S.America     2004   30   NA   NA
14 S.America     2005   NA   99   NA
15 S.America     2006   NA   NA  300
半岛未凉 2024-11-13 06:37:30

如果名称相似,您可以使用 lspattern 参数来迭代它们:

for (i in ls(pattern="df")){
  assign(paste("t",i,sep=""),na.omit(get(i)))
}

但是,更“R”的做法似乎是使用单独的环境和eapply

# setup environment
env <- new.env()

# copy dataframes across (using common pattern)
for (i in ls(pattern="df")){
  asssign(i,get(i),envir=env)
  }

# apply function on environment
eapply(env,na.omit)

结果是:

$df3
     Region variable value
1      Asia     2006   300
2    Africa     2006   200
3    Europe     2006   200
4 N.America     2006   500
5 S.America     2006   300

$df2
     Region variable value
1      Asia     2005    55
2    Africa     2005   350
3    Europe     2005    40
4 N.America     2005    90
5 S.America     2005    99

$df1
     Region variable value
1      Asia     2004    35
2    Africa     2004    20
3    Europe     2004    20
4 N.America     2004    50
5 S.America     2004    30

不幸的是,这是一个巨大的列表,因此将其作为单独的对象取出来有点棘手。类似以下内容的内容

lapply(eapply(env,na.omit),function(x) assign(paste("t",substitute(x),sep=""),x,envir=.GlobalEnv))

应该可以工作,但是 substitute 没有正确挑选出列表元素名称。

If the names are similar you could iterate over them using the pattern argument to ls:

for (i in ls(pattern="df")){
  assign(paste("t",i,sep=""),na.omit(get(i)))
}

However, a more "R" way of doing it seems to be to use separate environment and eapply:

# setup environment
env <- new.env()

# copy dataframes across (using common pattern)
for (i in ls(pattern="df")){
  asssign(i,get(i),envir=env)
  }

# apply function on environment
eapply(env,na.omit)

Which yields:

$df3
     Region variable value
1      Asia     2006   300
2    Africa     2006   200
3    Europe     2006   200
4 N.America     2006   500
5 S.America     2006   300

$df2
     Region variable value
1      Asia     2005    55
2    Africa     2005   350
3    Europe     2005    40
4 N.America     2005    90
5 S.America     2005    99

$df1
     Region variable value
1      Asia     2004    35
2    Africa     2004    20
3    Europe     2004    20
4 N.America     2004    50
5 S.America     2004    30

Unfortunately, this is one huge list so getting this out as seperate objects is a little tricky. Something on the lines of:

lapply(eapply(env,na.omit),function(x) assign(paste("t",substitute(x),sep=""),x,envir=.GlobalEnv))

should work, but the substitute is not picking out the list element names properly.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文