Assigning variables to groups based on fractions and several conditions

Posted 2025-01-18 18:02:45

I've tried for several days on something I think should be rather simple, with no luck. Hope someone can help me!

I have a data frame called "test" with the following variables: "Firm", "Year", "Firm_size" and "Expenditures".

I want to assign firms to size groups by year and then display the mean, median, std.dev and N of expenditures for these groups in a table (e.g. with stargazer). So the first size group (top 10% largest firms) should show the mean, median, etc. of expenditures for the 10% largest firms each year.

The size groups should be:

  • The 10% largest firms
  • The firms that are between 10-25% largest
  • The firms that are between 25-50% largest
  • The firms that are between 50-75% largest
  • The firms that are between 75-90% largest
  • The 10% smallest firms

This is what I have tried:

test<-arrange(test, -Firm_size)
test$Variable = 0
test[1:min(5715, nrow(test)),]$Variable <- "Expenditures, 0% size <10%"
test[5715:min(14288, nrow(test)),]$Variable <- "Expenditures, 10% size <25%"
test[14288:min(28577, nrow(test)),]$Variable <- "Expenditures, 25% size <50%"
# ... and so on for the remaining size groups


library(dplyr)
library(stargazer)
testtest = test%>%
 group_by(Variable)%>%
  dplyr::summarise(
    Mean=mean(Expenditures),
    Median=median(Expenditures),
    Std.dev=sd(Expenditures),
    N=n()
  )

stargazer(testtest, type = "text", title = "Expenditures firms", digits = 1, summary = FALSE)

As shown above, I don't know how to group by fractions or percentages. I have therefore tried to assign firms to groups based on their row positions after sorting Firm_size in descending order. The problem with doing so is that I don't take year into consideration, which I need to, and it is a lot of work to do this for each year (20 in total).
My intention was to make a new variable which gives each size group a name. E.g. the top 10% largest firms each year should get the value "Expenditures, 0% size <10%".

Further, I make a new data frame "testtest" where I calculate the different measures, before using stargazer to present it. This part works.

!!EDIT!!
Hi again,

Now I get the error "List object cannot be coerced to type double" when running the code on a new dataset (but it is the same variables as before).

The mutate step I'm referring to is the "mutate(gs = cut(..." call after "rowwise()" in the solution you provided.

Comments (1)

如梦亦如幻 2025-01-25 18:02:45

You can create the quantiles as a nested variable (size_groups), and then use cut() to create the group sizes (gs). Then group by Year and gs to summarize the indicators you want.

test %>% 
  group_by(Year) %>% 
  mutate(size_groups = list(quantile(Firm_size, probs=c(.1,.25,.5,.75,.9)))) %>% 
  rowwise() %>% 
  mutate(gs = cut(
    Firm_size,c(-Inf, size_groups, Inf),
    labels = c("Lowest 10%","10%-25%","25%-50%","50%-75%","75%-90%","Highest 10%"))) %>% 
  group_by(Year, gs) %>% 
  summarize(across(Expenditures,.fns = list(mean,median,sd,length)), .groups="drop") %>% 
  rename_all(~c("Year", "Group_Size", "Mean_Exp", "Med_Exp", "SD_Exp","N_Firms"))

Output:

# A tibble: 126 x 6
    Year Group_Size  Mean_Exp Med_Exp SD_Exp N_Firms
   <int> <fct>          <dbl>   <dbl>  <dbl>   <int>
 1  2000 Lowest 10%    20885.  21363.  3710.       3
 2  2000 10%-25%       68127.  69497. 19045.       4
 3  2000 25%-50%       42035.  35371. 30335.       6
 4  2000 50%-75%       36089.  29802. 17724.       6
 5  2000 75%-90%       53319.  54914. 19865.       4
 6  2000 Highest 10%   57756.  49941. 34162.       3
 7  2001 Lowest 10%    55945.  47359. 28283.       3
 8  2001 10%-25%       61825.  70067. 21777.       4
 9  2001 25%-50%       65088.  76340. 29960.       6
10  2001 50%-75%       57444.  53495. 32458.       6
# ... with 116 more rows
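
A possible variant of the pipeline above, offered as a sketch and not as the answer's own method: calling cut() directly inside the grouped mutate() computes the quantile breaks per Year without a list-column or rowwise(). The toy data frame and the names toy, probs, size_labels and result are assumptions made for illustration. Note that cut() stops with an error if two quantile breaks coincide, which can happen when Firm_size has many ties.

```r
library(dplyr)

# Hypothetical toy data with the same columns as the question's `test`
set.seed(1)
toy <- data.frame(
  Firm = paste0("F", 1:200),
  Year = rep(2000:2001, each = 100),
  Firm_size = runif(200, 2000, 5000),
  Expenditures = runif(200, 10000, 100000)
)

probs <- c(.1, .25, .5, .75, .9)
size_labels <- c("Lowest 10%", "10%-25%", "25%-50%",
                 "50%-75%", "75%-90%", "Highest 10%")

result <- toy %>%
  group_by(Year) %>%
  # quantile() is evaluated once per Year because of the grouping
  mutate(gs = cut(Firm_size,
                  breaks = c(-Inf, quantile(Firm_size, probs = probs), Inf),
                  labels = size_labels)) %>%
  group_by(Year, gs) %>%
  summarise(Mean_Exp = mean(Expenditures),
            Med_Exp = median(Expenditures),
            SD_Exp = sd(Expenditures),
            N_Firms = n(),
            .groups = "drop")

print(result)
```

Because the breaks are a plain numeric vector here, this form may also sidestep errors that arise when a list-column reaches cut() unwrapped.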

If you wanted to have an additional column with the yearly mean, you can remove the .groups="drop" from the summarize(across()) line, and then add this final line to the pipeline:

mutate(YrMean = sum(Mean_Exp*N_Firms/sum(N_Firms)))

Note that this is correctly weighted by the number of firms in each Group_Size, and thus returns the equivalent of doing this with the original data:

test %>% group_by(Year) %>% summarize(mean(Expenditures))
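
The equivalence claimed above can be checked on a small example. The following is a minimal sketch on hypothetical toy data (the names toy, weighted and direct, and the group labels, are assumptions): it compares the count-weighted mean of the group means with the yearly mean taken directly from the raw rows.

```r
library(dplyr)

# Hypothetical toy data: two years, three size groups of unequal size
set.seed(42)
toy <- data.frame(
  Year = rep(c(2000, 2001), each = 30),
  gs = rep(rep(c("small", "mid", "large"), times = c(5, 10, 15)), 2),
  Expenditures = runif(60, 10000, 100000)
)

# Group summaries, left grouped by Year, then the count-weighted yearly mean
weighted <- toy %>%
  group_by(Year, gs) %>%
  summarise(Mean_Exp = mean(Expenditures), N_Firms = n(),
            .groups = "drop_last") %>%
  mutate(YrMean = sum(Mean_Exp * N_Firms) / sum(N_Firms)) %>%
  ungroup() %>%
  distinct(Year, YrMean)

# Yearly mean computed directly from the raw rows
direct <- toy %>%
  group_by(Year) %>%
  summarise(YrMean = mean(Expenditures))

# TRUE when the two agree up to floating-point tolerance
print(isTRUE(all.equal(weighted$YrMean, direct$YrMean)))
```

The weighting matters because the size groups hold different numbers of firms; a plain average of the group means would give the 10% groups the same influence as the 25% groups.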

Input Data:

library(dplyr)

set.seed(123)
# Simulated firm-year data; keep one row per Firm-Year pair
test = data.frame(
  Firm = replicate(2000, sample(letters,1)),
  Year = sample(2000:2020, 2000, replace=TRUE),
  Firm_size= ceiling(runif(2000,2000,5000)),
  Expenditures = runif(2000, 10000,100000)
) %>% group_by(Firm,Year) %>% slice_head(n=1)
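
On the error reported in the question's edit: the screenshots are not available, so the cause cannot be confirmed. One situation that produces this kind of message is cut() receiving breaks that are still a list, which can happen if size_groups is used as a list-column without rowwise() unwrapping it. A minimal base-R sketch of that assumption (the names breaks_list, bad_breaks, good_breaks and msg are illustrative only):

```r
# A list-column cell holds the quantile vector wrapped in a list
breaks_list <- list(c(10, 20, 30))

# Combining numbers with a list yields a list, not a numeric vector
bad_breaks <- c(-Inf, breaks_list, Inf)
print(is.list(bad_breaks))

# cut() needs numeric breaks, so this call fails; capture the message
msg <- tryCatch(
  as.character(cut(25, bad_breaks)),
  error = function(e) conditionMessage(e)
)
print(msg)

# Unwrapping the list first gives usable numeric breaks
good_breaks <- c(-Inf, unlist(breaks_list), Inf)
print(cut(25, good_breaks))
```

If this is what happens on the new dataset, checking that the rowwise() step is still in place, or computing the breaks inline inside the grouped mutate(), would be the first things to try.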