How to collapse categories or recategorize variables?

Posted on 2024-09-10 09:02:13 · 213 words · 0 views · 0 comments

In R, I have 600,000 categorical variables, each of which is classified as "0", "1", or "2".

What I would like to do is collapse "1" and "2" and leave "0" by itself, so that after recoding, "0" = "0", "1" = "1", and "2" = "1". In the end I only want "0" and "1" as categories for each variable.

Also, if possible, I would rather not create 600,000 new variables; replacing the existing variables with the new values would be great!

What would be the best way to do this?

赠我空喜 2024-09-17 09:02:13

I find factor(new.levels[x]) to be an even more generic approach:

> x <- factor(sample(c("0","1","2"), 10, replace=TRUE)) 
> x
 [1] 0 2 2 2 1 2 2 0 2 1
Levels: 0 1 2
> new.levels<-c(0,1,1)
> x <- factor(new.levels[x])
> x
 [1] 0 1 1 1 1 1 1 0 1 1
Levels: 0 1

The new levels vector must be the same length as the number of levels in x, so you can also do more complicated recodes, for example using strings and NAs (shown here on the original three-level x):

> x <- factor(c("old", "new", NA)[x])
> x
 [1] old    <NA>   <NA>   <NA>   new <NA>   <NA>   old   
 [9] <NA>   new    
Levels: new old
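
For the 600,000-variable case in the question, the same lookup idea can be applied to every column in place. This is a minimal sketch, assuming the variables are factor columns of a data frame (the name dat and its contents are made up for illustration). It uses a named lookup vector instead of a positional one, so a column that happens to lack one of the levels is still recoded correctly:

```r
# Hypothetical stand-in for the real data: three factor columns
dat <- data.frame(
  v1 = factor(c("0", "1", "2", "2")),
  v2 = factor(c("2", "0", "1", "0")),
  v3 = factor(c("1", "1", "0", "1"))   # level "2" never occurs in this column
)

# Named lookup: old value -> new value
lookup <- c("0" = "0", "1" = "1", "2" = "1")

# dat[] <- keeps dat a data frame and overwrites its columns in place
dat[] <- lapply(dat, function(x) factor(unname(lookup[as.character(x)])))

sapply(dat, nlevels)
```

Indexing by as.character(x) matches on the level names, whereas new.levels[x] matches on the position of each level within the factor.
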
风渺 2024-09-17 09:02:13

recode() is a little overkill for this. The right approach depends on how the variable is currently coded. Let's say your variable is x.

If it's numeric

x <- ifelse(x>1, 1, x)

if it's character

x <- ifelse(x=='2', '1', x)

if it's factor with levels 0,1,2

levels(x) <- c(0,1,1)

Any of these can be applied in place to a variable x in a data frame dta. For example:

 dta$x <- ifelse(dta$x > 1, 1, dta$x)

Or, for multiple columns of a data frame:

 df[, c('col1','col2')] <- sapply(df[, c('col1','col2')], FUN = function(x) ifelse(x == 0, x, 1))
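
If every column is numeric, the whole data frame can also be recoded in one step with a logical index. This is a small sketch on a made-up data frame (the values are illustrative only), assuming there are no missing values:

```r
# Hypothetical numeric data coded 0/1/2
dta <- data.frame(a = c(0, 1, 2, 2), b = c(2, 0, 1, 0))

# Every value above 1 becomes 1; 0 and 1 stay unchanged
dta[dta > 1] <- 1
dta
```

This overwrites the existing columns, so no new variables are created.
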
榆西 2024-09-17 09:02:13

There is a function recode in package car (Companion to Applied Regression):

require("car")    
recode(x, "c('1','2')='1'; else='0'")

or for your case in plain R:

> x <- factor(sample(c("0","1","2"), 10, replace=TRUE))
> x
 [1] 1 1 1 0 1 0 2 0 1 0
Levels: 0 1 2
> factor(pmin(as.numeric(x), 2), labels=c("0","1"))
 [1] 1 1 1 0 1 0 1 0 1 0
Levels: 0 1

Update: To recode all categorical columns of a data frame tmp, you can use the following:

recode_fun <- function(x) factor(pmin(as.numeric(x), 2), labels=c("0","1"))
require("plyr")
catcolwise(recode_fun)(tmp)
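
One caveat with the pmin() approach: factor() is given two labels, so it fails on a column in which only "0" is observed, because only one distinct value remains. A slightly more defensive variant, sketched here under the assumption that every column has the levels "0", "1", "2" in that order (recode_fun2 and only_zero are hypothetical names), fixes the levels explicitly:

```r
# Same idea as recode_fun, but with the levels pinned to 1 and 2
recode_fun2 <- function(x) {
  factor(pmin(as.numeric(x), 2), levels = 1:2, labels = c("0", "1"))
}

# A column where only "0" is observed, although all three levels are declared
only_zero <- factor(c("0", "0"), levels = c("0", "1", "2"))
recode_fun2(only_zero)
```
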
好菇凉咱不稀罕他 2024-09-17 09:02:13

I like the recode() function in dplyr, which can quickly recode values:

 library(dplyr)
 df$x <- recode(df$x, old = "new")

Hope this helps :)

千紇 2024-09-17 09:02:13

Note that if you just want the results to be 0-1 binary variables, you can forgo factors altogether:

f <- sapply(your.data.frame, is.factor)
your.data.frame[f] <- lapply(your.data.frame[f], function(x) x != "0")

The second line can also be written more succinctly (but possibly more cryptically) as

your.data.frame[f] <- lapply(your.data.frame[f], `!=`, "0")

This turns your factors into a series of logical variables, with "0" mapping to FALSE and anything else mapping to TRUE. FALSE and TRUE will be treated as 0 and 1 by most code, which in turn should give essentially the same result in an analysis as using a factor with levels "0" and "1". In fact, if it doesn't give the same result, that would cast doubt on the correctness of the analysis....
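
Where downstream code expects literal 0/1 values rather than TRUE/FALSE, the logical result can be wrapped in as.integer(). This is a sketch on a made-up data frame (dat, g1, g2, and id are hypothetical names):

```r
# Hypothetical data: two factor columns plus a non-factor column
dat <- data.frame(
  g1 = factor(c("0", "1", "2", "0")),
  g2 = factor(c("2", "2", "0", "1")),
  id = 1:4
)

f <- sapply(dat, is.factor)   # TRUE for g1 and g2, FALSE for id

# "0" becomes 0L, anything else becomes 1L; non-factor columns are untouched
dat[f] <- lapply(dat[f], function(x) as.integer(x != "0"))
dat
```
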

聽兲甴掵 2024-09-17 09:02:13

You could use the rec function of the sjmisc package, which can recode a complete data frame at once (provided that all variables share the same recode values).

library(sjmisc)
mydf <- data.frame(a = sample(0:2, 10, T),
                   b = sample(0:2, 10, T),
                   c = sample(0:2, 10, T))

> mydf
   a b c
1  1 1 0
2  1 0 1
3  0 2 0
4  0 1 0
5  1 0 0
6  2 1 1
7  0 1 1
8  2 1 2
9  1 1 2
10 2 0 1

mydf <- rec(mydf, "0=0; 1,2=1")

   a b c
1  1 1 0
2  1 0 1
3  0 1 0
4  0 1 0
5  1 0 0
6  1 1 1
7  0 1 1
8  1 1 1
9  1 1 1
10 1 0 1
影子的影子 2024-09-17 09:02:13

A solution using the forcats package from the tidyverse:

library(forcats)

> x <- factor(sample(c("0","1","2"), 10, replace=TRUE))
> x
[1] 1 1 1 0 1 0 2 0 1 0
Levels: 0 1 2
    
> fct_collapse(x, "1" = c("1", "2"))
[1] 1 1 1 0 1 0 1 0 1 0
Levels: 0 1