如何创建一个列出复制行,然后仅删除R中的一个重复项的列?

发布于 2025-02-03 02:29:07 字数 801 浏览 4 评论 0 原文

我有一个非常庞大的数据集,我正在寻找最简单(最快)的方法来创建一列,从一个特定的列总和该值,然后继续仅留下一个重复项。

我的数据集看起来像这样:

data <- data.frame(DATE_INTER = c("2015-05-29", "2013-12-13", "2009-09-08"),
                   DATE_SAIDA = c("2015-06-10", "2013-12-15", "2009-10-20"),
                   GRUPO_AIH = c("09081997", "13122006", "13122006"),
                   DIAS_PERMANENCIA = c(12, 2, 42))

我需要使用“ grupo_aih”列检查重复。我的最终输出将是这样的:

”“在此处输入映像”

我已经尝试过,但是它花费了太长时间,完成后,我什至无法用r停止工作的dplyr过滤。

data <- data %>%
  group_by(GRUPO_AIH) %>%
  mutate(DIAS_PERMANENCIA2 = sum(DIAS_PERMANENCIA))

有什么建议吗?

I have a really huge dataset and I'm looking for the simplest (and fastest) way to create a column that sums the value from one specific column and then proceed to leave only one of the duplicates.

My dataset looks like this:

data <- data.frame(DATE_INTER = c("2015-05-29", "2013-12-13", "2009-09-08"),
                   DATE_SAIDA = c("2015-06-10", "2013-12-15", "2009-10-20"),
                   GRUPO_AIH = c("09081997", "13122006", "13122006"),
                   DIAS_PERMANENCIA = c(12, 2, 42))

I need to use the column "GRUPO_AIH" to check for duplicates. My final output would be something like this:

enter image description here

I've already tried this, but it takes too long and after it's done, I can't even filter anything with dplyr that R stops working.

data <- data %>%
  group_by(GRUPO_AIH) %>%
  mutate(DIAS_PERMANENCIA2 = sum(DIAS_PERMANENCIA))

Any suggestions?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

你穿错了嫁妆 2025-02-10 02:29:07

这是答案。只有一个观察结果,在您提供的数据集示例中,您在Grupo_aih变量中实际上没有任何重复的值,因此我更改为 grupo_aih = C(“ 09081997”,“ 131222006”,“ 131222006”

data %>%
  group_by(GRUPO_AIH) %>%
  mutate(DIAS_PERMANENCIA = sum(DIAS_PERMANENCIA)) %>%
  filter(duplicated(GRUPO_AIH) == FALSE)




DATE_INTER DATE_SAIDA GRUPO_AIH DIAS_PERMANENCIA
  <chr>      <chr>      <chr>                <dbl>
1 2015-05-29 2015-06-10 09081997                12
2 2013-12-13 2013-12-15 13122006                44

) sou sanitarista atuando com dados navigilânciakk

here is the answer. Just one observation, in your provided dataset example you dont really have any duplicated values in GRUPO_AIH variable so i changed to GRUPO_AIH = c("09081997", "13122006", "13122006"),

data %>%
  group_by(GRUPO_AIH) %>%
  mutate(DIAS_PERMANENCIA = sum(DIAS_PERMANENCIA)) %>%
  filter(duplicated(GRUPO_AIH) == FALSE)




DATE_INTER DATE_SAIDA GRUPO_AIH DIAS_PERMANENCIA
  <chr>      <chr>      <chr>                <dbl>
1 2015-05-29 2015-06-10 09081997                12
2 2013-12-13 2013-12-15 13122006                44

OBS: também sou sanitarista atuando com dados na vigilância kk

生生漫 2025-02-10 02:29:07

如果您有一个“非常大的数据集”(数百万行),也许这将是最快的:

library(data.table)
data <- data.frame(DATE_INTER = c("2015-05-29", "2013-12-13", "2009-09-08"),
                   DATE_SAIDA = c("2015-06-10", "2013-12-15", "2009-10-20"),
                   GRUPO_AIH = c("09081997", "13122206", "13122206"),
                   DIAS_PERMANENCIA = c(12, 2, 42))

data_dt <- setDT(data)
setkey(data_dt, GRUPO_AIH)

data_dt[, DIAS_PERMANENCIA := sum(DIAS_PERMANENCIA), by = "GRUPO_AIH"]
data_final <- as.data.frame(data_dt[!duplicated(data_dt, by = "GRUPO_AIH")])
data_final
#>   DATE_INTER DATE_SAIDA GRUPO_AIH DIAS_PERMANENCIA
#> 1 2015-05-29 2015-06-10  09081997               12
#> 2 2013-12-13 2013-12-15  13122206               44

。 nofollow noreferrer“> reprex软件包(v2.0.1)


基准标准:

#install.packages("data.table")
#install.packages("dplyr")
#install.packages("microbenchmark")

library(microbenchmark)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

data <- data.frame(DATE_INTER = rep(c("2015-05-29", "2013-12-13", "2009-09-08"), times = 10e5),
                   DATE_SAIDA = rep(c("2015-06-10", "2013-12-15", "2009-10-20"), times = 10e5),
                   GRUPO_AIH = rep(c("09081997", "13122206", "13122206"), times = 10e5),
                   DIAS_PERMANENCIA = rep(c(12, 2, 42), times = 10e5))

dplyr_func <- function(data){
  data_final <- data %>%
    dplyr::group_by(GRUPO_AIH) %>%
    dplyr::mutate(DIAS_PERMANENCIA = sum(DIAS_PERMANENCIA)) %>%
    dplyr::filter(duplicated(GRUPO_AIH) == FALSE)
  return(data_final)
}

dt_func <- function(data){
  data_dt <- data.table::setDT(data)
  data.table::setkey(data_dt, GRUPO_AIH)
  data_dt[, DIAS_PERMANENCIA := sum(DIAS_PERMANENCIA), by = "GRUPO_AIH"]
  data_final <- as.data.frame(data_dt[!duplicated(data_dt, by = "GRUPO_AIH")])
  return(data_final)
}

dplyr::all_equal(dplyr_func(data), dt_func(data))
#> [1] TRUE

res <- microbenchmark(dplyr_func(data), dt_func(data), times = 4)
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

# The difference in speed will likely become more pronounced as the size of the data increases
data <- data.frame(DATE_INTER = rep(c("2015-05-29", "2013-12-13", "2009-09-08"), times = 10e6),
                   DATE_SAIDA = rep(c("2015-06-10", "2013-12-15", "2009-10-20"), times = 10e6),
                   GRUPO_AIH = rep(c("09081997", "13122206", "13122206"), times = 10e6),
                   DIAS_PERMANENCIA = rep(c(12, 2, 42), times = 10e6))

dplyr_func <- function(data){
  data_final <- data %>%
    dplyr::group_by(GRUPO_AIH) %>%
    dplyr::mutate(DIAS_PERMANENCIA = sum(DIAS_PERMANENCIA)) %>%
    dplyr::filter(duplicated(GRUPO_AIH) == FALSE)
  return(data_final)
}

dt_func <- function(data){
  data_dt <- data.table::setDT(data)
  data.table::setkey(data_dt, GRUPO_AIH)
  data_dt[, DIAS_PERMANENCIA := sum(DIAS_PERMANENCIA), by = "GRUPO_AIH"]
  data_final <- as.data.frame(data_dt[!duplicated(data_dt, by = "GRUPO_AIH")])
  return(data_final)
}

res <- microbenchmark(dplyr_func(data), dt_func(data), times = 2)
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

“

由reprex软件包在2022-05-31创建(v2.0.1 )

If you have a "very large dataset" (millions of rows) perhaps this will be fastest:

library(data.table)
data <- data.frame(DATE_INTER = c("2015-05-29", "2013-12-13", "2009-09-08"),
                   DATE_SAIDA = c("2015-06-10", "2013-12-15", "2009-10-20"),
                   GRUPO_AIH = c("09081997", "13122206", "13122206"),
                   DIAS_PERMANENCIA = c(12, 2, 42))

data_dt <- setDT(data)
setkey(data_dt, GRUPO_AIH)

data_dt[, DIAS_PERMANENCIA := sum(DIAS_PERMANENCIA), by = "GRUPO_AIH"]
data_final <- as.data.frame(data_dt[!duplicated(data_dt, by = "GRUPO_AIH")])
data_final
#>   DATE_INTER DATE_SAIDA GRUPO_AIH DIAS_PERMANENCIA
#> 1 2015-05-29 2015-06-10  09081997               12
#> 2 2013-12-13 2013-12-15  13122206               44

Created on 2022-05-31 by the reprex package (v2.0.1)


Benchmarking:

#install.packages("data.table")
#install.packages("dplyr")
#install.packages("microbenchmark")

library(microbenchmark)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

data <- data.frame(DATE_INTER = rep(c("2015-05-29", "2013-12-13", "2009-09-08"), times = 10e5),
                   DATE_SAIDA = rep(c("2015-06-10", "2013-12-15", "2009-10-20"), times = 10e5),
                   GRUPO_AIH = rep(c("09081997", "13122206", "13122206"), times = 10e5),
                   DIAS_PERMANENCIA = rep(c(12, 2, 42), times = 10e5))

dplyr_func <- function(data){
  data_final <- data %>%
    dplyr::group_by(GRUPO_AIH) %>%
    dplyr::mutate(DIAS_PERMANENCIA = sum(DIAS_PERMANENCIA)) %>%
    dplyr::filter(duplicated(GRUPO_AIH) == FALSE)
  return(data_final)
}

dt_func <- function(data){
  data_dt <- data.table::setDT(data)
  data.table::setkey(data_dt, GRUPO_AIH)
  data_dt[, DIAS_PERMANENCIA := sum(DIAS_PERMANENCIA), by = "GRUPO_AIH"]
  data_final <- as.data.frame(data_dt[!duplicated(data_dt, by = "GRUPO_AIH")])
  return(data_final)
}

dplyr::all_equal(dplyr_func(data), dt_func(data))
#> [1] TRUE

res <- microbenchmark(dplyr_func(data), dt_func(data), times = 4)
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

image_1.png

# The difference in speed will likely become more pronounced as the size of the data increases
data <- data.frame(DATE_INTER = rep(c("2015-05-29", "2013-12-13", "2009-09-08"), times = 10e6),
                   DATE_SAIDA = rep(c("2015-06-10", "2013-12-15", "2009-10-20"), times = 10e6),
                   GRUPO_AIH = rep(c("09081997", "13122206", "13122206"), times = 10e6),
                   DIAS_PERMANENCIA = rep(c(12, 2, 42), times = 10e6))

dplyr_func <- function(data){
  data_final <- data %>%
    dplyr::group_by(GRUPO_AIH) %>%
    dplyr::mutate(DIAS_PERMANENCIA = sum(DIAS_PERMANENCIA)) %>%
    dplyr::filter(duplicated(GRUPO_AIH) == FALSE)
  return(data_final)
}

dt_func <- function(data){
  data_dt <- data.table::setDT(data)
  data.table::setkey(data_dt, GRUPO_AIH)
  data_dt[, DIAS_PERMANENCIA := sum(DIAS_PERMANENCIA), by = "GRUPO_AIH"]
  data_final <- as.data.frame(data_dt[!duplicated(data_dt, by = "GRUPO_AIH")])
  return(data_final)
}

res <- microbenchmark(dplyr_func(data), dt_func(data), times = 2)
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

image_2.png

Created on 2022-05-31 by the reprex package (v2.0.1)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文