Calculating row averages

Posted on 2024-10-05 14:37:41


I have a data frame called ants with multiple entries per site. It looks like this:

  Site     Date     Time  Temp SpCond Salinity Depth Turbidity Chlorophyll
1   71 6/8/2010 14:50:35 14.32  49.88    32.66 0.397       0.0         1.3
2   71 6/8/2010 14:51:00 14.31  49.94    32.70 1.073       0.0         2.0
3   71 6/8/2010 14:51:16 14.32  49.95    32.71 1.034      -0.1         1.6
4   71 6/8/2010 14:51:29 14.31  49.96    32.71 1.030      -0.2         1.6
5   70 6/8/2010 14:53:55 14.30  50.04    32.77 1.002      -0.2         1.2
6   70 6/8/2010 14:54:09 14.30  50.03    32.77 0.993      -0.5         1.2

Sites have different numbers of entries, usually 3 but sometimes fewer or more. Where both the date and the site number match, I would like to write a new data frame with one entry per site giving the mean reading for each parameter. Empty or "na" cells should be left out of the calculation and of the resulting data frame.

I'm not sure whether this calls for an apply function or some version of rowMeans. I'm very stuck, so any help is much appreciated!


金橙橙 2024-10-12 14:37:41


Nico's answer looks like what mine would have been, except that I would add a named argument to be passed to mean() so that NAs in the aggregated columns do not sabotage the results. (I could not tell whether the OP knew or suspected that the NAs were in the by variables or in the other variables.)

aggregate(df, by=list(df$Site, df$Date), FUN=mean, na.rm=TRUE)

You would probably need to also run aggregate or tapply calls in parallel to count the number of non-NA values.
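As a rough sketch of what running the two calls in parallel could look like, the example below computes the means and the non-NA counts separately and merges them. The toy data frame, the NA placed in Temp, and the ".n" suffix are assumptions made for illustration only; they are not part of the original answer.

```r
# Toy data modelled on the question; the NA in Temp is added for illustration
df <- data.frame(
  Site  = c(71, 71, 71, 70, 70),
  Date  = "6/8/2010",
  Temp  = c(14.32, 14.31, NA, 14.30, 14.30),
  Depth = c(0.397, 1.073, 1.034, 1.002, 0.993),
  stringsAsFactors = FALSE
)
vars <- c("Temp", "Depth")
keys <- list(Site = df$Site, Date = df$Date)

# Group means, ignoring NA values within each column
means <- aggregate(df[vars], by = keys, FUN = mean, na.rm = TRUE)

# Number of non-NA readings that went into each mean
counts <- aggregate(df[vars], by = keys, FUN = function(v) sum(!is.na(v)))

# Suffix the count columns (hypothetical naming) so both tables can be merged
names(counts)[names(counts) %in% vars] <- paste0(vars, ".n")
result <- merge(means, counts, by = c("Site", "Date"))
print(result)
```

Keeping the counts next to the means makes it easy to spot groups whose average rests on only one or two readings.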

The other method, using aggregate's formula interface, might give different results, since na.action=na.omit is the default there and any row containing an NA is dropped before FUN is applied:

aggregate( . ~Site +Date, data=df,  FUN=mean, na.rm=TRUE)
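To make the difference concrete, here is a small sketch on made-up data (the data frame and the NA in Temp are assumptions for illustration). With the default na.action, the whole row holding the NA is discarded, so its Depth reading is lost too; passing na.action = na.pass keeps the row and leaves the NA handling to na.rm = TRUE.

```r
# Toy data; the third row has a missing Temp but a valid Depth
df <- data.frame(
  Site  = c(71, 71, 71, 70, 70),
  Date  = "6/8/2010",
  Temp  = c(14.32, 14.31, NA, 14.30, 14.30),
  Depth = c(0.397, 1.073, 1.034, 1.002, 0.993),
  stringsAsFactors = FALSE
)

# Default na.action = na.omit: row 3 is removed entirely before averaging
omit <- aggregate(. ~ Site + Date, data = df, FUN = mean, na.rm = TRUE)

# na.action = na.pass: row 3 stays, and mean() skips only the missing Temp
pass <- aggregate(. ~ Site + Date, data = df, FUN = mean, na.rm = TRUE,
                  na.action = na.pass)

print(omit)  # Depth for site 71 averages two readings
print(pass)  # Depth for site 71 averages three readings
```

Which behaviour is preferable depends on whether a partly missing row should still contribute its valid readings.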


箜明 2024-10-12 14:37:41


Here is one way using the plyr package and its ddply() function:

R> df
  Site     Date     Time  Temp SpCond Salinity Depth Turbidity Chlorophyll
1   71 6/8/2010 14:50:35 14.32  49.88    32.66 0.397       0.0         1.3
2   71 6/8/2010 14:51:00 14.31  49.94    32.70 1.073       0.0         2.0
3   71 6/8/2010 14:51:16 14.32  49.95    32.71 1.034      -0.1         1.6
4   71 6/8/2010 14:51:29 14.31  49.96    32.71 1.030      -0.2         1.6
5   70 6/8/2010 14:53:55 14.30  50.04    32.77 1.002      -0.2         1.2
6   70 6/8/2010 14:54:09 14.30  50.03    32.77 0.993      -0.5         1.2
R> library(plyr)
R> ddply(df, .(Site,Date), function(x) colMeans(x[,-(1:3)], na.rm=TRUE))
  Site     Date   Temp SpCond Salinity  Depth Turbidity Chlorophyll
1   70 6/8/2010 14.300 50.035   32.770 0.9975    -0.350       1.200
2   71 6/8/2010 14.315 49.933   32.695 0.8835    -0.075       1.625
R>

I used a custom anonymous function to skip the first three columns.
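Skipping columns by position breaks as soon as the column order changes. As an alternative sketch, not the approach used in this answer, the numeric columns can be picked by type and averaged with base R's aggregate(). The variable names keys, is_num and vars are chosen here for illustration.

```r
# Sample data from the question
df <- read.table(text = "
Site Date Time Temp SpCond Salinity Depth Turbidity Chlorophyll
71 6/8/2010 14:50:35 14.32 49.88 32.66 0.397  0.0 1.3
71 6/8/2010 14:51:00 14.31 49.94 32.70 1.073  0.0 2.0
71 6/8/2010 14:51:16 14.32 49.95 32.71 1.034 -0.1 1.6
71 6/8/2010 14:51:29 14.31 49.96 32.71 1.030 -0.2 1.6
70 6/8/2010 14:53:55 14.30 50.04 32.77 1.002 -0.2 1.2
70 6/8/2010 14:54:09 14.30 50.03 32.77 0.993 -0.5 1.2
", header = TRUE, stringsAsFactors = FALSE)

keys <- c("Site", "Date")

# Pick the measurement columns by type instead of by position
is_num <- vapply(df, is.numeric, logical(1))
vars <- setdiff(names(df)[is_num], keys)

# One row per Site/Date combination, NA values ignored
res <- aggregate(df[vars], by = df[keys], FUN = mean, na.rm = TRUE)
print(res)
```

Because Date and Time are read as character columns, they are excluded automatically and never reach mean().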


_失温 2024-10-12 14:37:41


You can also use aggregate():

aggregate(df, by=list(df$Site, df$Date), FUN=mean, na.rm=TRUE)


凤舞天涯 2024-10-12 14:37:41


Here is a complete new answer with a full log, which also covers your new specification:

R> Lines <- "  Site     Date     Time  Temp SpCond Salinity Depth Turbidity Chlorophyll
+ 71 6/8/2010 14:50:35 14.32  49.88    32.66 0.397       0.0         1.3
+ 71 6/8/2010 14:51:00 14.31  49.94    32.70 1.073       0.0         2.0
+ 71 6/8/2010 14:51:16 14.32  49.95    32.71 1.034      -0.1         1.6
+ 71 6/8/2010 14:51:29 14.31  49.96    32.71 1.030      -0.2         1.6
+ 70 6/8/2010 14:53:55 14.30  50.04    32.77 1.002      -0.2         1.2
+ 70 6/8/2010 14:54:09 14.30  50.03    32.77 0.993      -0.5         1.2
+ "
R> con <- textConnection(Lines)
R> df <- read.table(con, sep="", header=TRUE, stringsAsFactors=FALSE)
R> close(con)
R> df$pt <- as.POSIXct(strptime(paste(df$Date, df$Time), "%m/%d/%Y %H:%M:%S"))
R> library(plyr)
R> newdf <- ddply(df, .(Site,Date), function(x) sapply(x[,-(1:3)], function(v) mean(as.numeric(v), na.rm=TRUE)))
R> newdf$pt <- as.POSIXct(newdf$pt, origin="1970-01-01")
R> newdf
  Site     Date  Temp SpCond Salinity  Depth Turbidity Chlorophyll                  pt
1   70 6/8/2010 14.30  50.03    32.77 0.9975    -0.350       1.200 2010-06-08 20:54:02
2   71 6/8/2010 14.32  49.93    32.70 0.8835    -0.075       1.625 2010-06-08 20:51:05
R>
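The timestamp handling in the log above can be isolated in a short base R sketch: convert each timestamp to seconds since the epoch, average per group, and convert back. Fixing tz = "UTC" is an assumption made here so that the result does not depend on the local time zone; the logger's actual zone should be substituted.

```r
# Site and time columns from the question's sample data
df <- data.frame(
  Site = c(71, 71, 71, 71, 70, 70),
  Date = "6/8/2010",
  Time = c("14:50:35", "14:51:00", "14:51:16", "14:51:29",
           "14:53:55", "14:54:09"),
  stringsAsFactors = FALSE
)

# tz = "UTC" is an assumption for reproducibility
df$pt <- as.POSIXct(paste(df$Date, df$Time),
                    format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Average the underlying seconds-since-epoch within each site
secs <- vapply(split(as.numeric(df$pt), df$Site), mean, numeric(1))

# Convert the averaged seconds back to date-times
mean_pt <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
print(format(mean_pt, "%Y-%m-%d %H:%M:%S"))
```

The minutes and seconds agree with the log above; the hour differs only because of the time zone used when printing.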


野味少女 2024-10-12 14:37:41


You were close with rowMeans(), but you need colMeans() instead. The others have shown how to use built-in or add-on functionality and I would certainly recommend you use them. However, it might be useful to see how to do something like this by hand:

## using df from Dirk's answer, we split the data in Site Date combinations
df.sp <- with(df,
              split(data.frame(Temp, SpCond, Salinity, Depth, Turbidity,
                               Chlorophyll),
                    list(Site = Site, Date = Date)))
## The above gives  a list of data frames one per date-site combo,
## to which we apply the colMeans() function
df.mean <- data.frame(t(sapply(df.sp, colMeans, na.rm = TRUE)))

At this point we need to do some extra tidying if you want the output to be nice like the others' answers:

## Process the rownames on df.mean
name.parts <- strsplit(rownames(df.mean), "\\.")
## pull out the Site part (before the '.')
df.mean <- within(df.mean, Site <- as.numeric(sapply(name.parts, `[`, 1)))
## pull out the Date part (after the '.')
df.mean <- within(df.mean, Date <- sapply(name.parts, `[`, 2))
## rearrange the columns
df.mean <- df.mean[, c(7:8,1:6)]

Note again that in most cases you should use the canned functions described in the other answers. Sometimes, however, it is quicker to cook up your own solution, and the above may serve as a guide for doing so.
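A variant of the hand-rolled approach, sketched here under the assumption that carrying the keys along is acceptable, avoids parsing the row names afterwards: split the whole data frame, build one summary row per piece, and bind the rows together. The reduced toy data and the names groups and rows are illustrative only.

```r
# Reduced toy data with two of the measurement columns
df <- data.frame(
  Site  = c(71, 71, 71, 71, 70, 70),
  Date  = "6/8/2010",
  Temp  = c(14.32, 14.31, 14.32, 14.31, 14.30, 14.30),
  Depth = c(0.397, 1.073, 1.034, 1.030, 1.002, 0.993),
  stringsAsFactors = FALSE
)
vars <- c("Temp", "Depth")

# drop = TRUE discards Site/Date combinations that never occur in the data
groups <- split(df, list(df$Site, df$Date), drop = TRUE)

# One single-row data frame per group: the keys plus the column means
rows <- lapply(groups, function(g) {
  data.frame(Site = g$Site[1], Date = g$Date[1],
             t(colMeans(g[vars], na.rm = TRUE)),
             stringsAsFactors = FALSE)
})

df.mean <- do.call(rbind, rows)
rownames(df.mean) <- NULL
print(df.mean)
```

Since Site and Date travel with each piece, the result keeps their original types and needs no column reordering.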
