Getting summaries for every combination of grouping variables, like PROC SUMMARY in SAS

Posted on 2025-01-11 12:14:26


(I understand that my question is equivalent to this one: "R function equivalent to proc summary in SAS". But as a new user, I can't comment on the solutions to ask for details or explanations, and I can't get any of them to work.)

I'm trying to convert a script from SAS to R. The objective is to get a wide summary of a database across multiple variables.

The starting table looks like this:

Student ID	Flag1	Flag2	Flag3	other flags...	weight	score
code1	level1	A	first	smth~~	2	12
code23	level5	C	third	smth~else~	3	9

In the end I want something like this:

Flag1	Flag2	Flag3	other flags...	nb of students	weighted mean	std dev	min	1st quartile	...	max	nb of students in first decile	...	nb of students in last decile
level1	A	first	smth~~	5	10.96	1.5	1	...	...	...	...	...	...
level5	.All	third	smth~else~	1500	8.70	2.7	3	...	...	...	...	...	...

In SAS this was really easy because proc summary produces the summary for every possible combination of grouping variables, but in R you only get the lowest level of grouping.
With 9 grouping variables that's 2^9 = 512 combinations, so I think there should be a way to loop over some of the work.
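
The figure of 512 is the number of subsets of 9 grouping variables, counting the empty subset (the grand total). A quick check in base R, assuming hypothetical column names Flag1 to Flag9:

```r
# Hypothetical names for the 9 grouping variables
flags <- paste0("Flag", 1:9)

# Number of subsets of each size k = 0..9, then the total
subsets_by_size <- choose(length(flags), 0:length(flags))
n_subsets <- sum(subsets_by_size)

# Same count, obtained by enumerating the non-empty subsets with combn()
# and adding 1 for the empty subset
n_enumerated <- 1 + sum(sapply(seq_along(flags),
                               function(k) ncol(combn(flags, k))))
```

Both counts should agree with 2^9.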

Here's how I think I should proceed:

1- List all the different combinations in a dataframe:

Flag1	Flag2	Flag3
.All	.All	.All
.All	.All	first
.All	.All	second
.All	A	.All
.All	B	.All
LV1	.All	.All
LV2	.All	.All
.All	A	first
.All	A	second
.All	B	first
.All	B	second
LV1	.All	first
LV1	.All	second
LV2	.All	first
LV2	.All	second
LV1	A	.All
LV1	B	.All
LV2	A	.All
LV2	B	.All
LV1	A	first
LV1	A	second
LV1	B	first
LV1	B	second
LV2	A	first
LV2	A	second
LV2	B	first
LV2	B	second

2- Make a loop of length 2^n that calls the following function:

3- The function would take one row of the previous dataframe and output a dataframe containing the summary grouped by some of the variables, plus columns set to .All for the variables not used for grouping.

4- Stack the result of each iteration using bind_rows.
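
The four steps above could be sketched as follows. This is only an illustration under assumptions: the toy data, the helper name `summarise_subset`, and the choice to pass column names as strings are all hypothetical, and it loops over subsets of variables rather than over rows of a combinations table.

```r
library(dplyr)

# Toy data standing in for the real table (values are made up)
students <- data.frame(
  Flag1  = c("LV1", "LV1", "LV2", "LV2"),
  Flag2  = c("A", "B", "A", "B"),
  weight = c(2, 3, 1, 4),
  score  = c(12, 9, 10, 8),
  stringsAsFactors = FALSE
)
flags <- c("Flag1", "Flag2")

# Step 1: every subset of the grouping variables, the empty one included
subsets <- c(
  list(character(0)),
  unlist(lapply(seq_along(flags),
                function(k) combn(flags, k, simplify = FALSE)),
         recursive = FALSE)
)

# Step 3: summarise by one subset and fill the unused flags with ".All"
summarise_subset <- function(data, group_vars, all_vars) {
  if (length(group_vars) > 0) {
    data <- group_by(data, across(all_of(group_vars)))
  }
  out <- summarise(data,
                   nb_students   = n(),
                   weighted_mean = weighted.mean(score, weight),
                   .groups = "drop")
  for (v in setdiff(all_vars, group_vars)) out[[v]] <- ".All"
  out[, c(all_vars, "nb_students", "weighted_mean")]
}

# Steps 2 and 4: loop over the 2^n subsets and stack the results
result <- bind_rows(lapply(subsets, summarise_subset,
                           data = students, all_vars = flags))
```

With two flags this gives one grand-total row, two rows per single flag, and four rows for the full grouping.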


勿挽旧人 2025-01-18 12:14:26


I ran into several hurdles solving this problem, but I ended up with a satisfying solution:

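# packages used below; powerSet() is assumed to come from the rje package
library(readxl)
library(dplyr)
library(rlang)
library(rje)

# import the data
testbase <- read_excel("testbase.xlsx")
# list all the grouping variables
variables <- c(quo(Flag1), quo(Flag2), quo(Flag3))
# create the power set of the list of variables (every subset, the empty one included)
listevars <- powerSet(variables, length(variables), rev = FALSE)

for (i in seq_along(listevars)) {
  # reset the grouping left over from the previous iteration
  testbase <- ungroup(testbase)
  if (length(listevars[[i]]) != 0) {
    testbase <- group_by(testbase, !!!listevars[[i]])
  }
  # one row per group, or a single row for the empty subset
  resumepartiel <- summarize(testbase, weighted_mean = weighted.mean(score, weight))
  # variables left out of this grouping get a ".All" column
  varexcl <- variables[!(variables %in% listevars[[i]])]
  if (length(varexcl) != 0) {
    for (j in seq_along(varexcl)) {
      colonne <- data.frame(rep(".All", times = nrow(resumepartiel)))
      colonne <- setNames(colonne, as_name(varexcl[[j]]))
      resumepartiel <- bind_cols(colonne, resumepartiel)
    }
  }
  # stack the partial summaries
  if (i == 1) {
    resume <- resumepartiel
  } else {
    resume <- bind_rows(resume, resumepartiel)
  }
}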

This code outputs what I want for three variables and only the weighted mean, but adding more variables or more summary functions is straightforward.
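
As an illustration of adding more summary functions, the `summarize()` call in the loop could be replaced by a helper along these lines. This is a sketch, not part of the original answer: `resume_stats` and its column names are hypothetical, and the standard deviation shown is the plain unweighted `sd()`.

```r
library(dplyr)

# Hypothetical helper: the statistics computed for each group of rows
resume_stats <- function(data) {
  summarize(
    data,
    nb_students   = n(),
    weighted_mean = weighted.mean(score, weight),
    std_dev       = sd(score),  # unweighted; NA for single-row groups
    min_score     = min(score),
    q1_score      = quantile(score, 0.25, names = FALSE),
    median_score  = median(score),
    q3_score      = quantile(score, 0.75, names = FALSE),
    max_score     = max(score)
  )
}

# Toy data (made-up values) to show the call on a grouped table
toy <- data.frame(Flag1  = c("LV1", "LV1", "LV2"),
                  weight = c(1, 3, 2),
                  score  = c(10, 14, 8))
by_flag1 <- resume_stats(group_by(toy, Flag1))
```

Inside the loop, `resumepartiel <- resume_stats(testbase)` would then take the place of the single weighted-mean summary.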
