Merge rows sharing the same observation in a selected column

Posted on 2025-02-10 12:16:04

I'm cleaning a data set and, after removing duplicates, I would like to merge the rows that share the same observation (value) in a specific column (e.g. the ID column).

I want to merge/aggregate so that only one row is left per chosen observation (here: one row per ID).
If possible, the aggregated row should sum all the other observations, leaving out the column used for merging (ID).

Here is a hypothetical setup:

    set.seed(18)
    dat <- data.frame(ID=c(1,2,1,2,2,3),value=c(5,5,7,8,3,2),location=c("NY","LA","NY","LA","LA","LA"))
    dat

And I would like to know how to obtain:

    set.seed(9)
    dat1 <- data.frame(id=c(1,2,3),value=c(5+7,5+8+3,2),location=c("NY","LA","LA"))
    dat1

That is, aggregate with respect to ID, sum the "value" observations, and keep the corresponding location.

Also, I would like to know if it's possible to group the data frame by location, so as to obtain:

    set.seed(6)
    dat2 <- data.frame(location=c("NY","LA"),value=c(5+7,5+8+3+2),meanvalue=c(mean(5+7),mean(5+8+3+2)))
    dat2

I did not put ID in this table because it does not matter in this case: it can be summed or dropped, and it will not be used in any further computation.
I know that my output for meanvalue is wrong: I want the mean of all rows sharing the same location (i.e. one mean for LA and one for NY). I would appreciate it if you could also correct me on this one.

Thank you for your help!

Comments (1)

季末如歌 2025-02-17 12:16:04

I see that you included set.seed, but I did not see any sampling or randomized procedure (unless I missed something), so it should not be needed here.

One approach with tidyverse is the following. Let me know if this is what you had in mind.

For the first part, use group_by to sum the value based on ID and location:

    library(tidyverse)

    # One row per ID/location pair, with value summed within each pair
    dat %>%
      group_by(ID, location) %>%
      summarise(sum_value = sum(value), .groups = "drop")

Output

         ID location sum_value
      <dbl> <chr>        <dbl>
    1     1 NY              12
    2     2 LA              16
    3     3 LA               2
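
For comparison, here is a minimal base R sketch of the same per-ID aggregation. It swaps the dplyr pipeline for `aggregate()`, which ships with R, and the name `by_id` is only an illustrative choice, not something from the question.

```r
# Same toy data as in the question.
dat <- data.frame(
  ID = c(1, 2, 1, 2, 2, 3),
  value = c(5, 5, 7, 8, 3, 2),
  location = c("NY", "LA", "NY", "LA", "LA", "LA")
)

# Sum `value` within each ID/location pair (base R, no packages needed).
by_id <- aggregate(value ~ ID + location, data = dat, FUN = sum)

# The row order returned by aggregate() follows the grouping columns,
# so sort by ID explicitly to get one row per ID in ascending order.
by_id <- by_id[order(by_id$ID), ]
by_id
```

Grouping on both ID and location works here because each ID has a single location. If an ID could appear with several locations, this would return one row per ID/location pair rather than one row per ID.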

For the second part, if you group_by the location, you can then use sum and mean with summarise:

    dat %>%
      group_by(location) %>%
      summarise(sum_value = sum(value), mean_value = mean(value))

Output

      location sum_value mean_value
      <chr>        <dbl>      <dbl>
    1 LA              18        4.5
    2 NY              12        6
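
On the meanvalue issue from the question: `mean(5+7)` first adds the numbers and then takes the mean of the single value 12, whereas `mean(c(5, 7))` averages the two observations. The sketch below shows that difference and a base R version of the per-location summary using `tapply()`. The name `by_location` is an illustrative assumption.

```r
# Same toy data as in the question.
dat <- data.frame(
  ID = c(1, 2, 1, 2, 2, 3),
  value = c(5, 5, 7, 8, 3, 2),
  location = c("NY", "LA", "NY", "LA", "LA", "LA")
)

# mean() of a single number is that number; pass a vector to average values.
wrong_mean <- mean(5 + 7)     # mean of the scalar 12
right_mean <- mean(c(5, 7))   # mean of the two observations

# Per-location sum and mean with tapply(); both results are ordered
# by the sorted location names, so they line up row by row.
sums  <- tapply(dat$value, dat$location, sum)
means <- tapply(dat$value, dat$location, mean)

by_location <- data.frame(
  location   = names(sums),
  sum_value  = as.vector(sums),
  mean_value = as.vector(means)
)
by_location
```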