How to add rows with a condition and a binary variable to a panel dataset in R

Posted on 2025-01-18 00:09:54

I am constructing a panel dataset from an original dataset in which each row contains a company (name) and its sales across 10 years.

Concretely, it looks like this:
(image: original dataset)

The panel dataset I am building has to look like this:

(image not included)

So far, I have the panel dataset for all companies, but only for the years in which they have sales.

For each company that stopped selling after showing positive sales (there is a "-" in year y after sales in years x, x+1), I need to add a row that copies the information about the company (the whole row: name, sales, year) and puts a 1 in the column "country exit". In the example above, I would have to do what has been done for company D in the last row of the second picture.

How can I avoid doing this manually in RStudio, given that approximately 250 companies in the dataset are in this situation?

Thanks

I've tried some functions in R, but I have been unable to do this in a simple way that is easy to repeat across the data.

Comments (1)

顾北清歌寒 2025-01-25 00:09:54

This is an example using tidyverse. Let's say this is your dataframe:

df <- structure(
  list(
    company_name = c("Company A", "Company B", "Company C", "Company D"),
    `2004` = c(NA, NA, NA, NA), `2005` = c(NA, NA, NA, NA),
    `2006` = c(NA, NA, NA, NA), `2007` = c(NA, NA, NA, NA),
    `2008` = c(NA, NA, NA, NA), `2009` = c(NA, NA, NA, NA),
    `2010` = c(NA, NA, NA, NA), `2011` = c(NA, NA, NA, NA),
    `2012` = c(0, NA, 0.2, 0.1),
    `2013` = c(0, 0.1, 0.3, NA),
    `2014` = c(0, 0.1, 0.5, NA)
  ),
  class = "data.frame", row.names = c(NA, -4L)
)

> df
  company_name 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
1    Company A   NA   NA   NA   NA   NA   NA   NA   NA  0.0  0.0  0.0
2    Company B   NA   NA   NA   NA   NA   NA   NA   NA   NA  0.1  0.1
3    Company C   NA   NA   NA   NA   NA   NA   NA   NA  0.2  0.3  0.5
4    Company D   NA   NA   NA   NA   NA   NA   NA   NA  0.1   NA   NA

The first step is to apply pivot_longer to every column except company_name, which creates the year column:

library(tidyverse)

df2 <- df %>% pivot_longer(-company_name, names_to = 'year', values_to = 'sales')
> df2
# A tibble: 44 x 3
   company_name year  sales
   <chr>        <chr> <dbl>
 1 Company A    2004     NA
 2 Company A    2005     NA
 3 Company A    2006     NA
 4 Company A    2007     NA
 5 Company A    2008     NA
 6 Company A    2009     NA
 7 Company A    2010     NA
 8 Company A    2011     NA
 9 Company A    2012      0
10 Company A    2013      0
# ... with 34 more rows
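
Note that year comes out as a character column (<chr>), so arrange() sorts it as text. That is harmless for four-digit years, but it can be converted during the pivot. A minimal sketch, assuming tidyr 1.1.0 or later (for the names_transform argument); df_small and df_long are hypothetical names used only for this illustration:

```r
library(tidyr)

# Hypothetical two-company subset of the example data
df_small <- data.frame(
  company_name = c("Company C", "Company D"),
  `2012` = c(0.2, 0.1),
  `2013` = c(0.3, NA),
  check.names = FALSE  # keep the year column names as they are
)

# Same pivot as above, but the year names are converted to integers
df_long <- pivot_longer(
  df_small,
  cols = -company_name,
  names_to = "year",
  values_to = "sales",
  names_transform = list(year = as.integer)
)

str(df_long$year)
```

With an integer year, sorting and any later arithmetic on the year (for example year + 1) work without further conversion.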

Make sure the data frame is sorted by company_name and year, then group_by company_name. For each company, check whether sales in a row is NA while sales in the previous row is greater than 0; if so, put 1 in the new column Country Exit:

df2 <- df2 %>%
  arrange(company_name, year) %>%
  group_by(company_name) %>%
  # 1 if sales are missing right after a positive value, NA if the previous value is missing too, 0 otherwise
  mutate(`Country Exit` = ifelse(is.na(sales) & lag(sales) > 0, 1, 0))

> df2
# A tibble: 44 x 4
# Groups:   company_name [4]
   company_name year  sales `Country Exit`
   <chr>        <chr> <dbl>          <dbl>
 1 Company A    2004     NA             NA
 2 Company A    2005     NA             NA
 3 Company A    2006     NA             NA
 4 Company A    2007     NA             NA
 5 Company A    2008     NA             NA
 6 Company A    2009     NA             NA
 7 Company A    2010     NA             NA
 8 Company A    2011     NA             NA
 9 Company A    2012      0              0
10 Company A    2013      0              0
# ... with 34 more rows
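
The NA values in Country Exit come from R's three-valued logic rather than from anything specific to dplyr: comparing a missing previous value with 0 is itself NA. A small base-R illustration of the three cases that occur in the example (the variable names are made up for this sketch):

```r
# Comparing a missing previous value with 0 gives NA, not FALSE
prev_cmp <- NA_real_ > 0          # NA

# Leading rows: sales is missing and so is the previous value
leading <- TRUE & prev_cmp        # NA
# Rows with sales: is.na(sales) is FALSE, which settles the result
has_sales <- FALSE & prev_cmp     # FALSE
# Missing sales right after positive sales: the exit row
exit_row <- TRUE & (0.1 > 0)      # TRUE

# ifelse() passes an NA condition through as NA
flags <- ifelse(c(leading, has_sales, exit_row), 1, 0)
print(flags)   # NA 0 1
```

This is why the years before a company's first sale, and the years after its exit row, end up as NA and can be filtered out in one step.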

To get cleaner output, like the table in your question, drop the rows where Country Exit is NA:

df2 %>% filter(!is.na(`Country Exit`))
# A tibble: 10 x 4
# Groups:   company_name [4]
   company_name year  sales `Country Exit`
   <chr>        <chr> <dbl>          <dbl>
 1 Company A    2012    0                0
 2 Company A    2013    0                0
 3 Company A    2014    0                0
 4 Company B    2013    0.1              0
 5 Company B    2014    0.1              0
 6 Company C    2012    0.2              0
 7 Company C    2013    0.3              0
 8 Company C    2014    0.5              0
 9 Company D    2012    0.1              0
10 Company D    2013   NA                1
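
The question also asks for the exit row to copy the company's information, whereas the output above leaves sales as NA in that row. One possible variant is sketched below. It rests on assumptions that are not confirmed by the question: the exit row keeps the year of the first missing observation and copies the sales figure from the year before, and the column names prev_sales and country_exit are made up for this sketch. It uses a shortened version of the example data (2012 to 2014 only).

```r
library(dplyr)
library(tidyr)

# Shortened example data: only the years that contain any sales
df <- data.frame(
  company_name = c("Company A", "Company B", "Company C", "Company D"),
  `2012` = c(0, NA, 0.2, 0.1),
  `2013` = c(0, 0.1, 0.3, NA),
  `2014` = c(0, 0.1, 0.5, NA),
  check.names = FALSE
)

panel <- df %>%
  pivot_longer(-company_name, names_to = "year", values_to = "sales",
               names_transform = list(year = as.integer)) %>%
  arrange(company_name, year) %>%
  group_by(company_name) %>%
  mutate(
    prev_sales = lag(sales),
    # 1 only where sales are missing directly after a positive value;
    # the !is.na() guard keeps the flag from ever being NA
    country_exit = as.integer(is.na(sales) & !is.na(prev_sales) & prev_sales > 0),
    # Assumption: the exit row carries over the last observed sales figure
    sales = if_else(country_exit == 1L, prev_sales, sales)
  ) %>%
  ungroup() %>%
  # Drop years with no sales that are not exit rows
  filter(!is.na(sales)) %>%
  select(-prev_sales)

print(panel)
```

Because the flag is never NA here, the final filter works on sales instead of on the flag, which also drops a missing year that follows a year with zero sales.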