How to convert a factor to integer\numeric without loss of information?

Posted on 2024-09-12 21:49:36

When I convert a factor to a numeric or integer, I get the underlying level codes, not the values as numbers.

f <- factor(sample(runif(5), 20, replace = TRUE))
##  [1] 0.0248644019011408 0.0248644019011408 0.179684827337041 
##  [4] 0.0284090070053935 0.363644931698218  0.363644931698218 
##  [7] 0.179684827337041  0.249704354675487  0.249704354675487 
## [10] 0.0248644019011408 0.249704354675487  0.0284090070053935
## [13] 0.179684827337041  0.0248644019011408 0.179684827337041 
## [16] 0.363644931698218  0.249704354675487  0.363644931698218 
## [19] 0.179684827337041  0.0284090070053935
## 5 Levels: 0.0248644019011408 0.0284090070053935 ... 0.363644931698218

as.numeric(f)
##  [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2

as.integer(f)
##  [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2

I have to resort to paste to get the real values:

as.numeric(paste(f))
##  [1] 0.02486440 0.02486440 0.17968483 0.02840901 0.36364493 0.36364493
##  [7] 0.17968483 0.24970435 0.24970435 0.02486440 0.24970435 0.02840901
## [13] 0.17968483 0.02486440 0.17968483 0.36364493 0.24970435 0.36364493
## [19] 0.17968483 0.02840901

Is there a better way to convert a factor to numeric?

Comments (14)

自由如风 2024-09-19 21:49:36

See the Warning section of ?factor:

In particular, as.numeric applied to
a factor is meaningless, and may
happen by implicit coercion. To
transform a factor f to
approximately its original numeric
values, as.numeric(levels(f))[f] is
recommended and slightly more
efficient than
as.numeric(as.character(f)).

The FAQ on R has similar advice.


Why is as.numeric(levels(f))[f] more efficient than as.numeric(as.character(f))?

as.numeric(as.character(f)) is effectively as.numeric(levels(f)[f]), so you are performing the conversion to numeric on length(f) values, rather than on nlevels(f) values. The speed difference will be most apparent for long vectors with few levels. If the values are mostly unique, there won't be much difference in speed. However you do the conversion, this operation is unlikely to be the bottleneck in your code, so don't worry too much about it.
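
To make the difference concrete, here is a small sketch. The vector length of 1000 and the four level values are arbitrary choices for illustration, not taken from the question. Both routes return the same numbers; the first parses one string per level, the second one string per element.

```r
# Build a long factor with only a few distinct numeric levels.
set.seed(42)
vals <- c(0.25, 1.5, 3, 10)
f <- factor(sample(vals, 1000, replace = TRUE))

# Route 1: convert the few levels once, then index by the factor's codes.
by_levels <- as.numeric(levels(f))[f]

# Route 2: expand to one string per element, then convert each string.
by_chars <- as.numeric(as.character(f))

identical(by_levels, by_chars)   # both routes give the same values
nlevels(f)                       # number of strings parsed by route 1
length(f)                        # number of strings parsed by route 2
```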


Some timings

library(microbenchmark)
microbenchmark(
  as.numeric(levels(f))[f],
  as.numeric(levels(f)[f]),
  as.numeric(as.character(f)),
  paste0(x),
  paste(x),
  times = 1e5
)
## Unit: microseconds
##                         expr   min    lq      mean median     uq      max neval
##     as.numeric(levels(f))[f] 3.982 5.120  6.088624  5.405  5.974 1981.418 1e+05
##     as.numeric(levels(f)[f]) 5.973 7.111  8.352032  7.396  8.250 4256.380 1e+05
##  as.numeric(as.character(f)) 6.827 8.249  9.628264  8.534  9.671 1983.694 1e+05
##                    paste0(x) 7.964 9.387 11.026351  9.956 10.810 2911.257 1e+05
##                     paste(x) 7.965 9.387 11.127308  9.956 11.093 2419.458 1e+05
江湖正好 2024-09-19 21:49:36

R has a number of (undocumented) convenience functions for converting factors:

  • as.character.factor
  • as.data.frame.factor
  • as.Date.factor
  • as.list.factor
  • as.vector.factor
  • ...

But annoyingly, there is nothing to handle the factor -> numeric conversion. As an extension of Joshua Ulrich's answer, I would suggest overcoming this omission by defining your own idiomatic function:

as.double.factor <- function(x) {as.numeric(levels(x))[x]}

You can put it at the beginning of your script or, even better, in your .Rprofile file.
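
A variant of the same idea, sketched under an ordinary function name instead of an S3 method name. The name factor_to_numeric and the error message are assumptions made for illustration; nothing with that name exists in base R. It stops with an error when a level cannot be read as a number, rather than silently returning NA.

```r
# Hypothetical helper (the name is an assumption, not part of base R).
factor_to_numeric <- function(x) {
  stopifnot(is.factor(x))
  # Convert each level once; suppress the coercion warning and check ourselves.
  lev <- suppressWarnings(as.numeric(levels(x)))
  bad <- is.na(lev) & !is.na(levels(x))
  if (any(bad)) {
    stop("some factor levels are not numeric: ",
         paste(levels(x)[bad], collapse = ", "))
  }
  # Index the converted levels by the factor's integer codes.
  lev[x]
}

factor_to_numeric(factor(c("10", "2.5", "10")))
```

Calling it on a factor with levels such as "A" and "B" raises the error instead of returning a vector of NA.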

亢潮 2024-09-19 21:49:36

Note: this particular answer is not for converting numeric-valued factors to numerics, it is for converting categorical factors to their corresponding level numbers.


Every answer in this post failed for me: NAs were generated.

y2 <- factor(c("A", "B", "C", "D", "A"))
as.numeric(levels(y2))[y2]
# [1] NA NA NA NA NA
# Warning message: NAs introduced by coercion

What worked for me is this:

as.integer(y2)
# [1] 1 2 3 4 1
难以启齿的温柔 2024-09-19 21:49:36

The easiest way would be to use the unfactor function from the varhandle package, which accepts a factor vector or even a data frame:

unfactor(your_factor_variable)

This example can be a quick start:

x <- rep(c("a", "b", "c"), 20)
y <- rep(c(1, 1, 0), 20)

class(x)  # -> "character"
class(y)  # -> "numeric"

x <- factor(x)
y <- factor(y)

class(x)  # -> "factor"
class(y)  # -> "factor"

library(varhandle)
x <- unfactor(x)
y <- unfactor(y)

class(x)  # -> "character"
class(y)  # -> "numeric"

You can also use it on a data frame, for example the iris dataset:

sapply(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
# load the package
library("varhandle")
# pass the iris to unfactor
tmp_iris <- unfactor(iris)
# check the classes of the columns
sapply(tmp_iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"  "character"
# check if the last column is correctly converted
tmp_iris$Species
  [1] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
  [6] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [11] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [16] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [21] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [26] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [31] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [36] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [41] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [46] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [51] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [56] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [61] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [66] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [71] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [76] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [81] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [86] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [91] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [96] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
[101] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[106] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[111] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[116] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[121] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[126] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[131] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[136] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[141] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[146] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
墨离汐 2024-09-19 21:49:36

It is possible only when the factor labels match the original values. I will explain it with an example.

Assume the data is vector x:

x <- c(20, 10, 30, 20, 10, 40, 10, 40)

Now I will create a factor with four labels:

f <- factor(x, levels = c(10, 20, 30, 40), labels = c("A", "B", "C", "D"))

1) x has type double, while f has type integer. This is the first unavoidable loss of information: factors are always stored as integers.

> typeof(x)
[1] "double"
> typeof(f)
[1] "integer"

2) It is not possible to recover the original values (10, 20, 30, 40) from f alone. We can see that f holds only the integer codes 1, 2, 3, 4 and two attributes: the levels ("A", "B", "C", "D") and the class attribute "factor". Nothing more.

> str(f)
 Factor w/ 4 levels "A","B","C","D": 2 1 3 2 1 4 1 4
> attributes(f)
$levels
[1] "A" "B" "C" "D"

$class
[1] "factor"

To recover the original values, we have to know the levels used when the factor was created, in this case c(10, 20, 30, 40). If we know the original levels (in the correct order), we can get the original values back.

> orig_levels <- c(10, 20, 30, 40)
> x1 <- orig_levels[f]
> all.equal(x, x1)
[1] TRUE

This works only when labels have been defined for all possible values in the original data.

So if you will need the original values, keep them. Otherwise there is a good chance you will not be able to recover them from the factor alone.
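
One way to "keep them" is to store the mapping next to the factor as a named vector. This is only a sketch: the names lookup and labs are made up for illustration, and the data is the same x as in the example above. Indexing by label text means the result does not depend on the order of the levels.

```r
x <- c(20, 10, 30, 20, 10, 40, 10, 40)
orig_levels <- c(10, 20, 30, 40)
labs <- c("A", "B", "C", "D")
f <- factor(x, levels = orig_levels, labels = labs)

# Named vector mapping each label to its original value.
lookup <- setNames(orig_levels, labs)

# Look up by label text, then drop the names to get a plain numeric vector.
x1 <- unname(lookup[as.character(f)])
identical(x1, x)
```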

李白 2024-09-19 21:49:36

You can use hablar::convert if you have a data frame. The syntax is easy:

Sample df

library(hablar)
library(dplyr)

df <- dplyr::tibble(a = as.factor(c("7", "3")),
                    b = as.factor(c("1.5", "6.3")))

Solution

df %>% 
  convert(num(a, b))

gives you:

# A tibble: 2 x 2
      a     b
  <dbl> <dbl>
1    7.  1.50
2    3.  6.30

Or if you want one column to be integer and one numeric:

df %>% 
  convert(int(a),
          num(b))

results in:

# A tibble: 2 x 2
      a     b
  <int> <dbl>
1     7  1.50
2     3  6.30
月亮坠入山谷 2024-09-19 21:49:36

strtoi() works if your factor levels are integers.
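
A short sketch of what that looks like, with made-up example factors. strtoi() coerces its argument with as.character(), so it reads the level labels rather than the codes; base = 10L is passed explicitly here. Labels that are not whole numbers come back as NA.

```r
f_int <- factor(c("7", "3", "7", "12"))
from_labels <- strtoi(f_int, base = 10L)   # integers read from the level labels
codes <- as.integer(f_int)                 # level codes, for comparison

f_dec <- factor(c("1.5", "6.3"))
from_dec <- strtoi(f_dec, base = 10L)      # labels are not integers, so NA

from_labels
codes
from_dec
```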

§对你不离不弃 2024-09-19 21:49:36

Late to the game: I accidentally found that trimws() converts factor(3:5) to c("3", "4", "5"). You can then call as.numeric() on the result, that is:

as.numeric(trimws(x_factor_var))
困倦 2024-09-19 21:49:36

type.convert(f) on a factor whose levels are completely numeric is another base option.

Performance-wise it's about equivalent to as.numeric(as.character(f)) but not nearly as quick as as.numeric(levels(f))[f].

identical(type.convert(f), as.numeric(levels(f))[f])

[1] TRUE

That said, if the reason the vector was created as a factor in the first instance has not been addressed (i.e. it likely contained some characters that could not be coerced to numeric) then this approach won't work and it will return a factor.

levels(f)[1] <- "some character level"
identical(type.convert(f), as.numeric(levels(f))[f])

[1] FALSE
那小子欠揍 2024-09-19 21:49:36

If you have many factor columns to convert to numeric:

df <- rapply(df, function(x) as.numeric(levels(x))[x], "factor", how =  "replace")

This solution is robust for data.frames containing mixed types, provided all factor levels are numbers.
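
A worked example on a small data frame with mixed column types. The column names and values are invented for illustration. Only the factor columns are touched; character and integer columns pass through unchanged.

```r
df <- data.frame(
  id    = c("a", "b", "c"),
  score = factor(c("1.5", "3", "1.5")),
  count = factor(c("10", "20", "10")),
  n     = c(1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Apply the levels-based conversion to every element of class "factor".
converted <- rapply(df, function(x) as.numeric(levels(x))[x],
                    classes = "factor", how = "replace")

sapply(converted, class)
```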

不疑不惑不回忆 2024-09-19 21:49:36

I found as.numeric(levels(f))[f] difficult to apply across a list of column names using tidyverse syntax. Converting to character first and then to numeric gave me the original numeric values without having to add additional packages. Perhaps not the most performant or elegant solution, but it kept things simple and readable.

library(tidyverse)

tbl_df <- tibble(a = as.factor(c("7", "3")),
                 b = as.factor(c("1.5", "6.3")))

cols <- c("a", "b")

tbl_df %>%
  mutate(across(all_of(cols), as.character)) %>% 
  mutate(across(all_of(cols), as.numeric))
禾厶谷欠 2024-09-19 21:49:36

The collapse package includes wrappers around as.numeric(levels(f))[f] and as.character(levels(f))[f]: as_numeric_factor and as_character_factor.

library(collapse)
set.seed(1)
f <- factor(sample(runif(5), 5, replace = TRUE))

as_numeric_factor(f)
# [1] 0.2016819 0.5728534 0.3721239 0.5728534 0.5728534

as_character_factor(f)
# [1] "0.201681931037456" "0.572853363351896" "0.37212389963679" "0.572853363351896" "0.572853363351896"

Its performance is similar to that of as.numeric(levels(f))[f].

# Unit: milliseconds
#                      expr      min        lq       mean    median        uq      max neval
#  as.numeric(levels(f))[f]   2.6026   3.01305   5.834900   3.54310   8.57450  66.3497   100
#  as.numeric(levels(f)[f]) 317.2509 336.78690 350.215388 349.85620 361.57980 401.1002   100
#      as_numeric_factor(f)   2.5793   2.92970   5.383223   3.23355   4.29355  68.4460   100

Code:

set.seed(1)
f <- factor(sample(runif(5), 1e6, replace = TRUE))
library(microbenchmark)
microbenchmark(
  as.numeric(levels(f))[f],
  as.numeric(levels(f)[f]),
  as_numeric_factor(f),
  times = 100
)
·深蓝 2024-09-19 21:49:36

In the many answers I read, the only approach offered was to expand the number of variables according to the number of factor levels. If you have a variable "pet" with levels "dog" and "cat", you would end up with pet_dog and pet_cat.

In my case I wanted to keep the same number of variables by simply translating each factor variable into a numeric one, in a way that can be applied to many variables with many levels, so that cat=1 and dog=0 for instance.

Please find the corresponding solution below:

crime <- data.frame(city = c("SF", "SF", "NYC"),
                    year = c(1990, 2000, 1990),
                    crime = 1:3,
                    stringsAsFactors = TRUE)  # needed since R 4.0.0, where the default is FALSE

# Logical index of the factor columns
indx <- sapply(crime, is.factor)

crime[indx] <- lapply(crime[indx], function(x) {
  # Number the levels in order of first appearance
  listOri <- unique(x)
  res <- factor(x, levels = listOri)
  as.numeric(res)
})
夜血缘 2024-09-19 21:49:36

It looks like the solution as.numeric(levels(f))[f] no longer works with R 4.0.

Alternative solution:

factor2number <- function(x){
    data.frame(levels(x), 1:length(levels(x)), row.names = 1)[x, 1]
}

factor2number(yourFactor)
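
A possible explanation, offered as an assumption because the answer does not show its data: since R 4.0.0, data.frame() and read.csv() default to stringsAsFactors = FALSE, so a column that used to arrive as a factor may now be a character vector. levels() of a character vector is NULL, and the idiom then returns only NA. On an actual factor the idiom still returns the level values. Note also that factor2number() above, as far as I can tell, returns the level codes rather than the numeric values of the levels. The sketch below uses a hypothetical guard named to_numeric that handles both input types.

```r
# The same data as a factor and as a plain character vector.
f  <- factor(c("1.5", "3", "1.5"))
ch <- c("1.5", "3", "1.5")

# On a factor the idiom returns the numeric level values.
from_factor <- as.numeric(levels(f))[f]

# On a character vector levels() is NULL, so every element comes back NA.
from_char <- as.numeric(levels(ch))[ch]

# Hypothetical guard (the name to_numeric is an assumption): handle both cases.
to_numeric <- function(x) {
  if (is.factor(x)) as.numeric(levels(x))[x] else as.numeric(x)
}

identical(to_numeric(f), to_numeric(ch))
```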