How to convert a factor to integer\numeric without loss of information?
When I convert a factor to a numeric or integer, I get the underlying level codes, not the values as numbers.
f <- factor(sample(runif(5), 20, replace = TRUE))
## [1] 0.0248644019011408 0.0248644019011408 0.179684827337041
## [4] 0.0284090070053935 0.363644931698218 0.363644931698218
## [7] 0.179684827337041 0.249704354675487 0.249704354675487
## [10] 0.0248644019011408 0.249704354675487 0.0284090070053935
## [13] 0.179684827337041 0.0248644019011408 0.179684827337041
## [16] 0.363644931698218 0.249704354675487 0.363644931698218
## [19] 0.179684827337041 0.0284090070053935
## 5 Levels: 0.0248644019011408 0.0284090070053935 ... 0.363644931698218
as.numeric(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
as.integer(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
I have to resort to paste to get the real values:
as.numeric(paste(f))
## [1] 0.02486440 0.02486440 0.17968483 0.02840901 0.36364493 0.36364493
## [7] 0.17968483 0.24970435 0.24970435 0.02486440 0.24970435 0.02840901
## [13] 0.17968483 0.02486440 0.17968483 0.36364493 0.24970435 0.36364493
## [19] 0.17968483 0.02840901
Is there a better way to convert a factor to numeric?
See the Warning section of ?factor, which recommends as.numeric(levels(f))[f] for this conversion. The FAQ on R has similar advice.
Why is as.numeric(levels(f))[f] more efficient than as.numeric(as.character(f))?
as.numeric(as.character(f)) is effectively as.numeric(levels(f)[f]), so you are performing the conversion to numeric on length(f) values, rather than on nlevels(f) values. The speed difference will be most apparent for long vectors with few levels. If the values are mostly unique, there won't be much difference in speed. However you do the conversion, this operation is unlikely to be the bottleneck in your code, so don't worry too much about it.
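To check the equivalence and the speed difference yourself, here is a rough sketch. The seed, the vector length and the repetition count are arbitrary assumptions, and the timings will differ from machine to machine.

```r
set.seed(1)

# A long factor with only five numeric-looking levels
f <- factor(sample(runif(5), 1e5, replace = TRUE))

# Converts nlevels(f) strings to numeric, then indexes by the level codes
via_levels <- as.numeric(levels(f))[f]

# Builds length(f) strings first, then converts every one of them
via_character <- as.numeric(as.character(f))

# Both idioms give the same values
stopifnot(identical(via_levels, via_character))

# Rough timings over 20 repetitions each
system.time(for (i in 1:20) as.numeric(levels(f))[f])
system.time(for (i in 1:20) as.numeric(as.character(f)))
```

With few levels and a long vector, the first timing should come out noticeably smaller, but neither is likely to matter in practice.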
R has a number of (undocumented) convenience functions for converting factors:
as.character.factor
as.data.frame.factor
as.Date.factor
as.list.factor
as.vector.factor
But annoyingly, there is nothing to handle the factor -> numeric conversion. As an extension of Joshua Ulrich's answer, I would suggest overcoming this omission by defining your own idiomatic function, which you can store at the beginning of your script or, even better, in your .Rprofile file.
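A minimal sketch of such a helper, assuming the levels-based idiom from the accepted answer. The name factor_to_numeric is hypothetical and not part of base R.

```r
# Hypothetical helper (the name is an assumption, not a base R function)
factor_to_numeric <- function(x) {
  stopifnot(is.factor(x))
  # Convert each level once, then expand by the factor's integer codes
  as.numeric(levels(x))[x]
}

f <- factor(c("0.5", "10", "0.5", "2"))
res <- factor_to_numeric(f)
res
```

The stopifnot() guard is a design choice: it fails loudly on non-factor input instead of silently returning NA.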
Note: this particular answer is not for converting numeric-valued factors to numerics, it is for converting categorical factors to their corresponding level numbers.
Every answer in this post failed to generate results for me; NAs were generated instead.
What worked for me was converting the factor directly to its level numbers.
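A minimal sketch of that situation, assuming a factor whose labels are not numbers (the vector y is made up for illustration). The value-based idiom gives NA for every element, while as.integer() returns the level codes.

```r
y <- factor(c("A", "B", "C", "D", "A"))

# Labels such as "A" cannot be parsed as numbers, so every value is NA
values <- suppressWarnings(as.numeric(as.character(y)))

# The underlying level codes, in the order of levels(y)
codes <- as.integer(y)

values
codes
```

The codes only reflect the order of levels(y); they carry no numeric meaning beyond that.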
The easiest way would be to use the unfactor function from the varhandle package, which can accept a factor vector or even a data frame, for example the iris dataset.
It is possible only in the case when the factor labels match the original values. I will explain it with an example.
Assume the data is a numeric vector x with the values 10, 20, 30 and 40, and that f is a factor created from x with four labels ("A", "B", "C", "D").
1) x is of type double, f is of type integer. This is the first unavoidable loss of information. Factors are always stored as integers.
2) It is not possible to revert to the original values (10, 20, 30, 40) with only f available. We can see that f holds only the integer values 1, 2, 3, 4 and two attributes: the list of labels ("A", "B", "C", "D") and the class attribute "factor". Nothing more.
To revert to the original values we have to know the values of the levels used in creating the factor, in this case c(10, 20, 30, 40). If we know the original levels (in the correct order), we can revert to the original values. This works only when labels have been defined for all possible values in the original data.
So if you will need the original values, you have to keep them. Otherwise there is a high chance that it will not be possible to get them back from the factor alone.
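The points above can be sketched as follows. The exact contents of x are an assumption; only the four values and the four labels come from the answer.

```r
x <- c(10, 20, 30, 40, 20, 10)  # assumed sample data
f <- factor(x, levels = c(10, 20, 30, 40), labels = c("A", "B", "C", "D"))

typeof(x)      # "double"
typeof(f)      # "integer"
attributes(f)  # only $levels ("A".."D") and $class ("factor")

# The labels are not numbers, so the original values must be supplied
# separately, in the same order as the levels used to create the factor
original_levels <- c(10, 20, 30, 40)
restored <- original_levels[f]
restored
```

Indexing original_levels by f uses the factor's integer codes, which is why the order of the supplied levels matters.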
You can use hablar::convert if you have a data frame. The syntax is easy, and it also lets you convert one column to integer and another to numeric.
strtoi() works if your factor levels are integers.
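A small sketch, assuming integer-valued levels (the sample values are made up). strtoi() coerces its argument to character first, so a factor can be passed directly; a level that is not an integer string comes back as NA.

```r
f <- factor(c("12", "7", "12", "30"))

# Parses the labels as base-10 integers
ints <- strtoi(f)
ints

# A non-integer label cannot be parsed and yields NA
not_int <- strtoi(factor("1.5"))
not_int
```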
Late to the game: by accident, I found that trimws() can convert factor(3:5) to c("3","4","5"). Then you can call as.numeric() on the result.
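As a sketch of those two steps, using the factor from the answer (the variable names are assumptions):

```r
f <- factor(3:5)

# trimws() returns the labels as a character vector
chars <- trimws(f)

# The character labels can then be parsed as numbers
nums <- as.numeric(chars)
nums
```

This relies on trimws() coercing the factor to character, so as.character(f) states the same intent more directly.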
type.convert(f) on a factor whose levels are completely numeric is another base option.
Performance-wise it's about equivalent to as.numeric(as.character(f)) but not nearly as quick as as.numeric(levels(f))[f].
That said, if the reason the vector was created as a factor in the first instance has not been addressed (i.e. it likely contained some characters that could not be coerced to numeric), then this approach won't work and it will return a factor.
If you have many factor columns to convert to numeric, apply the conversion to the factor columns only. This is robust for data.frames containing mixed types, provided all factor levels are numbers.
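A minimal base R sketch of that approach, assuming a small made-up data frame df with factor, character and integer columns. Only the factor columns are touched.

```r
df <- data.frame(
  a = factor(c("1.5", "2", "1.5")),
  b = factor(c("10", "30", "20")),
  c = c("x", "y", "z"),
  d = 1:3,
  stringsAsFactors = FALSE
)

# Find the factor columns, then convert each with the levels-based idiom
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], function(col) as.numeric(levels(col))[col])

str(df)
```

If any factor level is not a number, the affected entries become NA with a warning, so it is worth checking for NA afterwards.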
I found as.numeric(levels(f))[f] difficult to apply across a list of column names using tidyverse syntax. Converting to a character first and then to an integer gave me the original numeric values without having to add additional packages. Perhaps not the most performant or elegant solution, but it kept things simple and readable.
The collapse package includes wrappers around as.numeric(levels(f))[f] and as.character(levels(f))[f] in as_numeric_factor and as_character_factor.
They give similar performance compared to as.numeric(levels(f))[f].
From the many answers I could read, the only given way was to expand the number of variables according to the number of factors. If you have a variable "pet" with levels "dog" and "cat", you would end up with pet_dog and pet_cat.
In my case I wanted to stay with the same number of variables, by just translating the factor variable to a numeric one, in a way that can be applied to many variables with many levels, so that cat=1 and dog=0 for instance.
Please find the corresponding solution below:
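A minimal sketch of this idea, assuming a made-up data frame df and zero-based codes. The level order is set explicitly so that dog maps to 0 and cat to 1; with the default alphabetical order it would be the other way round.

```r
df <- data.frame(
  pet  = factor(c("cat", "dog", "dog", "cat"), levels = c("dog", "cat")),
  size = factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large")),
  age  = c(3, 5, 2, 7)
)

# Replace every factor column by its level code, shifted to start at 0
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], function(col) as.integer(col) - 1L)

df
```

The number of columns stays the same, and non-factor columns such as age are left unchanged.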
Looks like the solution as.numeric(levels(f))[f] no longer works with R 4.0.
Alternative solution:
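A minimal sketch of one alternative, with the helper name to_numeric being hypothetical. One possible explanation for the observation, offered here as an assumption, is that since R 4.0.0 data.frame() and read.table() default to stringsAsFactors = FALSE, so a column may be a character vector rather than a factor. levels() of a character vector is NULL, and the levels-based idiom then returns NA. Going through as.character() works for both kinds of input.

```r
# Hypothetical helper: works for factors and for character vectors
to_numeric <- function(x) as.numeric(as.character(x))

f  <- factor(c("0.5", "2", "0.5"))
ch <- c("0.5", "2", "0.5")

from_factor    <- to_numeric(f)
from_character <- to_numeric(ch)

# A character vector has no levels, so the levels-based idiom yields NA
is.null(levels(ch))
levels_idiom_on_character <- as.numeric(levels(ch))[ch]
```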