按列表对 data.frame 进行子集化，并按行在每个部分上应用函数

发布于 2024-08-22 21:26:47 字数 1648 浏览 18 评论 0原文

这可能看起来是一个典型的 plyr 问题，但我有不同的想法。这是我想要优化的函数（跳过 for 循环）。

# dummy data
set.seed(1985)
lst <- list(a=1:10, b=11:15, c=16:20)
m <- matrix(round(runif(200, 1, 7)), 10)
m <- as.data.frame(m)


dfsub <- function(dt, lst, fun) {
    # check whether dt is `data.frame`
    stopifnot (is.data.frame(dt))
    # check if vectors in lst are "whole" / integer
    # vector elements should be column indexes
    is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol
    # fall if any non-integers in list
    idx <- rapply(lst, is.wholenumber)
    stopifnot(idx)
    # check for list length
    stopifnot(ncol(dt) == length(idx))
    # subset the data
    subs <- list()
    for (i in 1:length(lst)) {
            # apply function on each part, by row
            subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
    }
    # preserve names
    names(subs) <- names(lst)
    # convert to data.frame
    subs <- as.data.frame(subs)
    # guess what =)
    return(subs)
}

现在是一个简短的演示......实际上，我将解释我主要打算做什么。我想通过 list 对象中收集的向量对 data.frame 进行子集化。由于这是心理学研究中伴随数据操作的函数的代码的一部分，因此您可以将 m 视为性格调查问卷（10 个受试者，20 个变量）的结果。列表中的向量包含定义调查问卷子量表（例如人格特质）的列索引。每个子量表由多个项目（data.frame 中的列）定义。如果我们假设每个子量表的分数只不过是行值（每个主题的调查问卷该部分的结果）的 sum （或其他函数），您可以运行：

> dfsub(m, lst, sum)
    a  b  c
1  46 20 24
2  41 24 21
3  41 13 12
4  37 14 18
5  57 18 25
6  27 18 18
7  28 17 20
8  31 18 23
9  38 14 15
10 41 14 22

我看了一眼在这个函数中，我必须承认这个小循环根本不会破坏代码......但是，如果有更简单/有效的方法来做到这一点，请告诉我！

原文

This may seem as a typical plyr problem, but I have something different in mind.
Here's the function that I want to optimize (skip the for loop).

# dummy data
set.seed(1985)
lst <- list(a=1:10, b=11:15, c=16:20)
m <- matrix(round(runif(200, 1, 7)), 10)
m <- as.data.frame(m)


dfsub <- function(dt, lst, fun) {
    # check whether dt is `data.frame`
    stopifnot (is.data.frame(dt))
    # check if vectors in lst are "whole" / integer
    # vector elements should be column indexes
    is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol
    # fall if any non-integers in list
    idx <- rapply(lst, is.wholenumber)
    stopifnot(idx)
    # check for list length
    stopifnot(ncol(dt) == length(idx))
    # subset the data
    subs <- list()
    for (i in 1:length(lst)) {
            # apply function on each part, by row
            subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
    }
    # preserve names
    names(subs) <- names(lst)
    # convert to data.frame
    subs <- as.data.frame(subs)
    # guess what =)
    return(subs)
}

And now a short demonstration... actually, I'm about to explain what I primarily intended to do. I wanted to subset a data.frame by vectors gathered in list object. Since this is a part of code from a function that accompanies data manipulation in psychological research, you can consider m as a results from personality questionnaire (10 subjects, 20 vars). Vectors in list hold column indexes that define questionnaire subscales (e.g. personality traits). Each subscale is defined by several items (columns in data.frame). If we presuppose that the score on each subscale is nothing more than sum (or some other function) of row values (results on that part of questionnaire for each subject), you could run:

> dfsub(m, lst, sum)
    a  b  c
1  46 20 24
2  41 24 21
3  41 13 12
4  37 14 18
5  57 18 25
6  27 18 18
7  28 17 20
8  31 18 23
9  38 14 15
10 41 14 22

I took a glance at this function and I must admit that this little loop isn't spoiling the code at all... BUT, if there's an easier/efficient way of doing this, please, let me know!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

战皆罪 2024-08-29 21:26:47

我会采取不同的方法，将所有内容保留为数据帧，以便您可以使用 merge 和 ddply。我想您会发现这种方法更通用一些，并且更容易检查每个步骤是否正确执行。

# Convert everything to long data frames
m$id <- 1:nrow(m)

library(reshape)
obs <- melt(m, id = "id")
obs$variable <- as.numeric(gsub("V", "", obs$variable))

varinfo <- melt(lst)
names(varinfo) <- c("variable", "scale")

# Merge and summarise
obs <- merge(obs, varinfo, by = "variable")

ddply(obs, c("id", "scale"), summarise, 
  mean = mean(value), 
  sum = sum(value))

I'd take a different approach and keep everything as data frames so that you can use merge and ddply. I think you'll find this approach is a little more general, and it's easier to check that each step is performed correctly.

# Convert everything to long data frames
m$id <- 1:nrow(m)

library(reshape)
obs <- melt(m, id = "id")
obs$variable <- as.numeric(gsub("V", "", obs$variable))

varinfo <- melt(lst)
names(varinfo) <- c("variable", "scale")

# Merge and summarise
obs <- merge(obs, varinfo, by = "variable")

ddply(obs, c("id", "scale"), summarise, 
  mean = mean(value), 
  sum = sum(value))

回复收藏 0 原文

虫児飞 2024-08-29 21:26:47

加载plyr包后，替换

subs <- list()
    for (i in 1:length(lst)) {
            # apply function on each part, by row
            subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
    }

为

subs <- llply(lst,function(x) apply(dt[,x],1,fun))

after loading the plyr package, replace

subs <- list()
    for (i in 1:length(lst)) {
            # apply function on each part, by row
            subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
    }

with

subs <- llply(lst,function(x) apply(dt[,x],1,fun))

回复收藏 0 原文

把时间冻结 2024-08-29 21:26:47

@Hadley，我已经检查了您的回复，因为它非常简单且易于记账（除了它是更通用的解决方案之外）。然而，这是我的不太长的脚本，它只需要 base 包（这很简单，因为我安装了 plyr 和 reshape就在安装 R 之后）。现在，这是来源：

dfsub <- function(dt, lst, fun) {
        # check whether dt is `data.frame`
        stopifnot (is.data.frame(dt))
        # convert data.frame factors to numeric
        dt <- as.data.frame(lapply(dt, as.numeric))
        # check if vectors in lst are "whole" / integer
        # vector elements should be column indexes
        is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol
        # fall if any non-integers in list
        idx <- rapply(lst, is.wholenumber)
        stopifnot(idx)
        # check for list length
        stopifnot(ncol(dt) == length(idx))
        # subset the data
        subs <- list()
        for (i in 1:length(lst)) {
                # apply function on each part, by row
                subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
        }
        names(subs) <- names(lst)
        # convert to data.frame
        subs <- as.data.frame(subs)
        # guess what =)
        return(subs)
}

@Hadley, I've checked your response since it's quite straightforward and easy for bookkeeping (besides the fact it's more general-purpose-solution). However, here's my not-so-long script that does the thing and requires only base package (which is trivial since I install plyr and reshape just after installing R). Now, here's the source:

dfsub <- function(dt, lst, fun) {
        # check whether dt is `data.frame`
        stopifnot (is.data.frame(dt))
        # convert data.frame factors to numeric
        dt <- as.data.frame(lapply(dt, as.numeric))
        # check if vectors in lst are "whole" / integer
        # vector elements should be column indexes
        is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol
        # fall if any non-integers in list
        idx <- rapply(lst, is.wholenumber)
        stopifnot(idx)
        # check for list length
        stopifnot(ncol(dt) == length(idx))
        # subset the data
        subs <- list()
        for (i in 1:length(lst)) {
                # apply function on each part, by row
                subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
        }
        names(subs) <- names(lst)
        # convert to data.frame
        subs <- as.data.frame(subs)
        # guess what =)
        return(subs)
}

回复收藏 0 原文