How do I collapse categories or recategorize variables?
In R, I have 600,000 categorical variables, each of which is classified as "0", "1", or "2".
What I would like to do is collapse "1" and "2" and leave "0" by itself, such that after re-categorizing "0" = "0"; "1" = "1" and "2" = "1". In the end I only want "0" and "1" as categories for each of the variables.
Also, if possible, I would rather not create 600,000 new variables, if I can replace the existing variables with the new values that would be great!
What would be the best way to do this?
Comments (7)
I find this even more generic:
factor(new.levels[x])
The new levels vector must be the same length as the number of levels in x, so you can also do more complicated recodes, for example using strings and NAs.
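A minimal sketch of the indexing idea above, assuming x is a factor whose levels are "0", "1", "2" in that order. The sample data and the result name x2 are illustrative, not from the original comment.

```r
# Illustrative data: a factor with levels "0", "1", "2" (in that order)
x <- factor(c("0", "1", "2", "2", "0"), levels = c("0", "1", "2"))

# One new label per existing level, in level order:
# "0" -> "0", "1" -> "1", "2" -> "1"
new.levels <- c("0", "1", "1")

# Indexing with a factor uses its integer codes,
# so each element picks up the new label of its level
x2 <- factor(new.levels[x])
```

Because new.levels is just a character vector, an entry such as NA or an arbitrary string can stand in for any level.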
recode()'s a little overkill for this. Your case depends on how it's currently coded. Let's say your variable is x.
If it's numeric:
If it's character:
If it's a factor with levels 0, 1, 2:
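The code for these three cases did not survive on this page, so here is one possible sketch of each, not necessarily what the commenter wrote. The names x_num, x_chr and x_fac and the sample values are assumptions made for illustration.

```r
# Numeric coding: anything above 1 becomes 1
x_num <- c(0, 1, 2, 2, 0)
x_num <- ifelse(x_num > 1, 1, x_num)

# Character coding: relabel "2" as "1"
x_chr <- c("0", "1", "2", "2", "0")
x_chr <- ifelse(x_chr == "2", "1", x_chr)

# Factor with levels 0, 1, 2: renaming the third level to "1"
# merges it with the existing level "1"
x_fac <- factor(c(0, 1, 2, 2, 0))
levels(x_fac)[3] <- "1"
```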
Any of those can be applied across a data frame dta to the variable x in place. For example...
Or, multiple columns of a frame
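As a hedged illustration of applying the recode in place, the sketch below assumes numeric columns. The data frame contents and the column names y and z are made up for the example.

```r
# Illustrative data frame with numeric 0/1/2 columns
dta <- data.frame(x = c(0, 1, 2), y = c(2, 0, 1), z = c(1, 2, 0))

# One column, overwritten in place
dta$x <- ifelse(dta$x > 1, 1, dta$x)

# Several columns at once; assigning back into dta[cols]
# replaces the existing columns instead of creating new ones
cols <- c("y", "z")
dta[cols] <- lapply(dta[cols], function(v) ifelse(v > 1, 1, v))
```

Setting cols to names(dta) would cover every column, which avoids creating any new variables.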
There is a function recode in package car (Companion to Applied Regression):
or for your case in plain R:
Update: To recode all categorical columns of a data frame tmp you can use the following:
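The original code for recoding every categorical column is missing from this page. The following plain-R sketch shows one way it could be done, assuming the categorical columns of tmp are factors; the sample columns a, b and id are illustrative.

```r
# Illustrative data frame: two factor columns and one non-factor column
tmp <- data.frame(
  a  = factor(c("0", "1", "2")),
  b  = factor(c("2", "2", "0")),
  id = 1:3
)

# Merge level "2" into "1" for factor columns only;
# other columns are returned unchanged.
# tmp[] <- ... keeps tmp a data frame and overwrites columns in place.
tmp[] <- lapply(tmp, function(col) {
  if (is.factor(col)) levels(col)[levels(col) == "2"] <- "1"
  col
})
```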
I liked the function in dplyr that can quickly recode values.
Hope this helps :)
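The comment does not name the function; one candidate is dplyr's recode(), so the sketch below assumes that is what was meant and that the dplyr package is installed. The data frame df and its columns are illustrative.

```r
library(dplyr)

# Illustrative data frame with character-coded categories
df <- data.frame(
  a = c("0", "1", "2"),
  b = c("2", "0", "1"),
  stringsAsFactors = FALSE
)

# recode() replaces the values named on the left with those on the right
# and leaves unmatched values as they are
df[] <- lapply(df, function(v) recode(v, "2" = "1"))
```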
Note that if you just want the results to be 0-1 binary variables, you can forgo factors altogether:
The second line can also be written more succinctly (but possibly more cryptically) as
This turns your factors into a series of logical variables, with "0" mapping to FALSE and anything else mapping to TRUE. FALSE and TRUE will be treated as 0 and 1 by most code, which in turn should give essentially the same result in an analysis as using a factor with levels "0" and "1". In fact, if it doesn't give the same result, that would cast doubt on the correctness of the analysis.
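The code from this comment was lost, so the following is a hedged reconstruction of the idea rather than the commenter's exact lines. It assumes every column of an illustrative data frame df is a factor coded "0", "1", "2".

```r
# Illustrative data frame of factors
df <- data.frame(
  a = factor(c("0", "1", "2")),
  b = factor(c("2", "0", "0"))
)
df_short <- df  # copy used to show the terser spelling

# Each factor becomes a logical vector: "0" -> FALSE, anything else -> TRUE
df[] <- lapply(df, function(col) col != "0")

# The same thing, passing the comparison operator by name
df_short[] <- lapply(df_short, "!=", "0")

# Logicals behave as 0/1 in arithmetic, e.g. counting the non-zero entries
n_nonzero_a <- sum(df$a)
```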
You could use the rec function of the sjmisc package, which can recode a complete data frame at once (given that all variables have at least the same recode values).
A solution with the forcats package from tidyverse:
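The forcats code itself is missing here. One way to do it, assuming the forcats package is installed and the variables are factors, is fct_collapse(); the sample data below is illustrative and this may not be the function the commenter used.

```r
library(forcats)

# Illustrative factor with levels "0", "1", "2"
x <- factor(c("0", "1", "2", "2", "0"))

# Collapse the old levels "1" and "2" into a single new level "1"
x2 <- fct_collapse(x, "1" = c("1", "2"))

# The same call applied to every column of a data frame of factors
df <- data.frame(a = factor(c("0", "1", "2")), b = factor(c("2", "0", "1")))
df[] <- lapply(df, function(col) fct_collapse(col, "1" = c("1", "2")))
```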