Efficiently updating a data frame column in R using a hashmap-like approach

Posted 2024-12-25 09:55:54


I am new to R and can't figure out what I might be doing wrong in the code below, or how I could speed it up.
I have a dataset and would like to add a column containing an average value calculated from two columns of data. Please take a look at the code below (warning: it could take some time to read my question, but the code runs fine in R):

First, let me define a dataset df (again, I apologize for the long description of the code):

> df<-data.frame(prediction=sample(c(0,1),10,TRUE),subject=sample(c("car","dog","man","tree","book"),10,TRUE))
> df
   prediction subject
1           0     man
2           1     dog
3           0     man
4           1    tree
5           1     car
6           1    tree
7           1     dog
8           0    tree
9           1    tree
10          1    tree

Next, I add a new column called subjectRate to df:

> df$subjectRate <- with(df,ave(prediction,subject))
> df
   prediction subject subjectRate
1           0     man         0.0
2           1     dog         1.0
3           0     man         0.0
4           1    tree         0.8
5           1     car         1.0
6           1    tree         0.8
7           1     dog         1.0
8           0    tree         0.8
9           1    tree         0.8
10          1    tree         0.8

From the new table I generate a rateMap, so that new data can be filled in automatically with a subjectRate column initialized from the previously obtained averages:

> rateMap <- df[!duplicated(df[, c("subjectRate")]), c("subject","subjectRate")]
> rateMap
  subject subjectRate
1     man         0.0
2     dog         1.0
4    tree         0.8
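
One thing worth flagging in the rateMap above: duplicated() is applied to the subjectRate column, so any subject whose rate equals an earlier subject's rate is dropped. Here car shares 1.0 with dog, which is why car is missing from the map. If one row per subject is what is intended (an assumption, since the question does not say so explicitly), de-duplicating on subject might look like the sketch below. The df values are typed in by hand from the printout above so the result is reproducible.

```r
# The df from the question, entered by hand instead of via sample().
df <- data.frame(
  prediction = c(0, 1, 0, 1, 1, 1, 1, 0, 1, 1),
  subject    = c("man", "dog", "man", "tree", "car",
                 "tree", "dog", "tree", "tree", "tree"),
  stringsAsFactors = FALSE
)
df$subjectRate <- with(df, ave(prediction, subject))

# Keep the first row for each distinct subject (not each distinct rate).
rateMap <- df[!duplicated(df$subject), c("subject", "subjectRate")]
rateMap
```

With this version rateMap has four rows (man, dog, tree, car), and car keeps its rate of 1.0.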

Now I define a new dataset containing a mix of the old subjects from df and some new subjects:

> dfNew<-data.frame(prediction=sample(c(0,1),15,TRUE),subject=sample(c("car","dog","man","cat","book","computer"),15,TRUE))
> dfNew
   prediction  subject
1           1      man
2           0      cat
3           1 computer
4           0      dog
5           0     book
6           1      cat
7           1      car
8           0     book
9           0 computer
10          1      dog
11          0      cat
12          0     book
13          1      dog
14          1      man
15          1      dog

My question: how do I create the third column efficiently? Currently I am running the code below, which looks up each subject's rate in the map and uses that value if found, or 0.5 if not.

> all_facts<-levels(factor(rateMap$subject))
> dfNew$subjectRate <-  sapply(dfNew$subject,function(t) ifelse(t %in% all_facts,rateMap[as.character(rateMap$subject) == as.character(t),][1,"subjectRate"],0.5))
> dfNew
   prediction  subject subjectRate
1           1      man         0.0
2           0      cat         0.5
3           1 computer         0.5
4           0      dog         1.0
5           0     book         0.5
6           1      cat         0.5
7           1      car         0.5
8           0     book         0.5
9           0 computer         0.5
10          1      dog         1.0
11          0      cat         0.5
12          0     book         0.5
13          1      dog         1.0
14          1      man         0.0
15          1      dog         1.0

But with a real dataset (more than 200,000 rows) that has multiple columns similar to subject for which to compute the average, the code takes a very long time to run. Can somebody suggest a better way to do what I am trying to achieve? Maybe some kind of merge, but I am out of ideas.
Thank you.


π浅易 2025-01-01 09:55:54

I suspect (but am not sure, since I haven't tested it) that this will be faster:

dfNew$subjectRate <- rateMap$subjectRate[match(dfNew$subject,rateMap$subject)]

since it mostly uses just indexing and match. It is certainly a bit simpler, I think. This will fill in the "new" values with NAs, rather than 0.5, which can then be filled in however you like with:

dfNew$subjectRate[is.na(dfNew$subjectRate)] <- newValue
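
As a worked illustration of the two lines above, here is a minimal sketch that uses the rateMap and dfNew values printed in the question (typed in by hand) and 0.5 as the fill value. The named-vector variant at the end is my own addition, shown only because it behaves like a small hashmap; the names idx, rates and viaNames are illustrative.

```r
# rateMap and dfNew as printed in the question, entered by hand.
rateMap <- data.frame(
  subject     = c("man", "dog", "tree"),
  subjectRate = c(0.0, 1.0, 0.8),
  stringsAsFactors = FALSE
)
dfNew <- data.frame(
  prediction = c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1),
  subject    = c("man", "cat", "computer", "dog", "book", "cat", "car", "book",
                 "computer", "dog", "cat", "book", "dog", "man", "dog"),
  stringsAsFactors = FALSE
)

# match() returns, for every row of dfNew, the position of its subject in
# rateMap, or NA when the subject is not in the map.
idx <- match(dfNew$subject, rateMap$subject)
dfNew$subjectRate <- rateMap$subjectRate[idx]

# Replace the unmatched entries with the 0.5 fallback from the question.
dfNew$subjectRate[is.na(dfNew$subjectRate)] <- 0.5

# Equivalent lookup through a named vector (subject -> rate).
rates <- setNames(rateMap$subjectRate, rateMap$subject)
viaNames <- unname(rates[dfNew$subject])
viaNames[is.na(viaNames)] <- 0.5
```

Both dfNew$subjectRate and viaNames reproduce the subjectRate column shown in the question. The whole lookup is a single vectorised call, rather than one data frame subset per row as in the sapply version.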

If the ave piece is particularly slow, the standard recommendation these days is to use the data.table package:

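library(data.table)        # library() stops with an error if the package is missing
dft <- as.data.table(df)
setkeyv(dft, "subject")    # sort by subject and mark it as the key
dft[, subjectRate := mean(prediction), by = subject]    # add the per-subject mean by reference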

This will probably attract a few comments suggesting ways to eke a bit more speed out of the data.table aggregation in the last line. Indeed, merging or joining using pure data.tables may be even slicker (and faster), so you might want to investigate that option as well. (See the very bottom of ?data.table for a bunch of examples.)
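
For the join option, a minimal sketch (untested on the real 200,000-row data, and assuming the data.table package is installed) could aggregate one rate per subject and then copy the rates into the new table with an update join. The names dtMap and dtNew are illustrative, and the data are the question's df plus the first seven rows of its dfNew, typed in by hand.

```r
library(data.table)

# The question's df and the first seven rows of its dfNew, entered by hand.
dft <- data.table(
  prediction = c(0, 1, 0, 1, 1, 1, 1, 0, 1, 1),
  subject    = c("man", "dog", "man", "tree", "car",
                 "tree", "dog", "tree", "tree", "tree")
)
dtNew <- data.table(
  prediction = c(1, 0, 1, 0, 0, 1, 1),
  subject    = c("man", "cat", "computer", "dog", "book", "cat", "car")
)

# One row per subject holding the mean prediction.
dtMap <- dft[, .(subjectRate = mean(prediction)), by = subject]

# Update join: rows of dtNew whose subject appears in dtMap receive the
# matching rate by reference; all other rows get NA in the new column.
dtNew[dtMap, on = "subject", subjectRate := i.subjectRate]

# Fill the subjects that were not in the map with the 0.5 fallback.
dtNew[is.na(subjectRate), subjectRate := 0.5]
dtNew
```

Because dtMap is aggregated by subject, every subject in df is present in it, so car receives its own rate of 1.0 here rather than the fallback.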
