Convert data.frame columns from factors to characters
I have a data frame. Let's call him bob:
> head(bob)
                 phenotype                         exclusion
GSM399350 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399351 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399352 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399353 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399354 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399355 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
I'd like to concatenate the rows of this data frame (this will be another question). But look:
> class(bob$phenotype)
[1] "factor"
Bob's columns are factors. So, for example:
> as.character(head(bob))
[1] "c(3, 3, 3, 6, 6, 6)" "c(3, 3, 3, 3, 3, 3)"
[3] "c(29, 29, 29, 30, 30, 30)"
I don't begin to understand this, but I guess these are indices into the levels of the factors of the columns (of the court of king caractacus) of bob? Not what I need.
Strangely, I can go through the columns of bob by hand, and do
bob$phenotype <- as.character(bob$phenotype)
which works fine. And, after some typing, I can get a data.frame whose columns are characters rather than factors. So my question is: how can I do this automatically? How do I convert a data.frame with factor columns into a data.frame with character columns without having to manually go through each column?
Bonus question: why does the manual approach work?
Comments (18)
Just following on Matt and Dirk. If you want to recreate your existing data frame without changing the global option, you can recreate it with an apply statement:
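A minimal sketch of such an apply statement, assuming lapply is the function meant and using a small stand-in for bob whose rows are made up for illustration:

```r
# Stand-in for the question's data frame; the rows are illustrative only.
bob <- data.frame(
  phenotype = c("3- 4- 8- 25- 44+", "3- 4- 8- 25+ 44+"),
  exclusion = c("11b- 11c- 19- NK1.1- Gr1- TER119-",
                "11b- 11c- 19- NK1.1- Gr1- TER119-"),
  stringsAsFactors = TRUE  # force factor columns, as in the question
)

# lapply() returns a plain list of character vectors; data.frame() rebuilds
# the frame, and stringsAsFactors = FALSE keeps it from re-creating factors.
bob_chr <- data.frame(lapply(bob, as.character), stringsAsFactors = FALSE)

sapply(bob_chr, class)  # both columns are now "character"
```

Rebuilding the frame from a list this way does not carry the original row names over, so set them again afterwards if they matter.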
This will convert all variables to class "character"; if you only want to convert factors, see Marek's solution below.
As @hadley points out, the following is more concise.
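A sketch of the more concise form, assuming it is the bob[] assignment discussed in the next paragraph (the stand-in rows and row names are made up):

```r
bob <- data.frame(
  phenotype = c("3- 4- 8- 25- 44+", "3- 4- 8- 25+ 44+"),
  exclusion = c("11b- 11c- 19- NK1.1- Gr1- TER119-",
                "11b- 11c- 19- NK1.1- Gr1- TER119-"),
  row.names = c("GSM399350", "GSM399353"),
  stringsAsFactors = TRUE
)

# Assigning into bob[] replaces every column but keeps the data.frame
# class and the row names of the original object.
bob[] <- lapply(bob, as.character)

class(bob)          # "data.frame"
sapply(bob, class)  # both columns are "character"
rownames(bob)       # "GSM399350" "GSM399353"
```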
In both cases, lapply outputs a list; however, owing to the magical properties of R, the use of [] in the second case keeps the data.frame class of the bob object (assigning into bob[] replaces the columns but keeps the object's attributes), thereby eliminating the need to convert back to a data.frame using as.data.frame with the argument stringsAsFactors = FALSE.
To replace only factors:
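One way to do this in base R, sketched on a made-up data frame with one non-factor column (the names id and is_fct are illustrative):

```r
bob <- data.frame(
  id        = 1:3,
  phenotype = c("25-", "25+", "25-"),
  stringsAsFactors = TRUE
)

# Logical index of the factor columns, then convert only those.
is_fct <- vapply(bob, is.factor, logical(1))
bob[is_fct] <- lapply(bob[is_fct], as.character)

sapply(bob, class)  # id stays "integer", phenotype becomes "character"
```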
In dplyr version 0.5.0, the new function mutate_if was introduced, and in version 1.0.0 it was superseded by across.
Package purrr from RStudio gives another alternative.
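A hedged sketch of the three package-based variants on a made-up data frame. It assumes dplyr 1.0.0 or later and, for the purrr alternative, assumes modify_if is the function meant. The requireNamespace guards are only there so the script still runs when a package is not installed.

```r
bob <- data.frame(
  id        = 1:3,
  phenotype = c("25-", "25+", "25-"),
  stringsAsFactors = TRUE
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  # Scoped verb, available since dplyr 0.5.0 (now superseded).
  bob_if <- mutate_if(bob, is.factor, as.character)
  # across() with a where() predicate, available since dplyr 1.0.0.
  bob_across <- mutate(bob, across(where(is.factor), as.character))
}

if (requireNamespace("purrr", quietly = TRUE)) {
  # modify_if() applies the function to matching columns and keeps the
  # input's type, so a data frame comes back as a data frame.
  bob_purrr <- purrr::modify_if(bob, is.factor, as.character)
}
```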
The global option stringsAsFactors may be something you want to set to FALSE in your startup files (e.g. ~/.Rprofile). Please see help(options).
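A sketch of what that startup setting could look like. Since R 4.0.0 the default for data.frame and read.table is already stringsAsFactors = FALSE, so the line only changes behaviour on older versions.

```r
# Line to place in ~/.Rprofile (only changes behaviour on R < 4.0.0):
options(stringsAsFactors = FALSE)

# With the option set to FALSE, strings stay strings.
df <- data.frame(x = c("a", "b"))
class(df$x)  # "character"
```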
If you understand how factors are stored, you can avoid using apply-based functions to accomplish this. That is not at all to imply that the apply solutions don't work well.
Factors are structured as numeric indices tied to a list of 'levels'. This can be seen if you convert a factor to numeric. So:
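For instance, with a small made-up factor:

```r
fact <- factor(c("a", "b", "a", "d"))

# Converting to numeric exposes the stored integer codes, not the labels.
as.numeric(fact)  # 1 2 1 3
```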
The numbers returned in the last line correspond to the levels of the factor.
Notice that levels() returns a character vector. You can use this fact to easily and compactly convert factors to strings or numerics by indexing the levels with the factor's integer codes.
This also works for numeric values, provided you wrap your expression in as.numeric().
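A short sketch of that indexing trick; the variable names and values are illustrative:

```r
fact <- factor(c("a", "b", "a", "d"))
levels(fact)  # "a" "b" "d"

# Index the character levels by the integer codes to recover the strings.
fact_chr <- levels(fact)[as.integer(fact)]  # "a" "b" "a" "d"

# For a factor that encodes numbers, as.integer() alone returns the codes,
# so wrap the level lookup in as.numeric() to get the values back.
num_fact <- factor(c(10, 20, 10, 40))
as.integer(num_fact)  # 1 2 1 3
num_vals <- as.numeric(levels(num_fact)[as.integer(num_fact)])  # 10 20 10 40
```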
You can build a new data frame bobc in which every factor vector in bobf is converted to a character vector.
If you then want to convert it back, you can create a logical vector of which columns are factors, and use that to selectively apply factor.
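One possible sketch of both steps, assuming rapply is used for the conversion. The data and the names is_fct and bobf2 are made up for illustration.

```r
bobf <- data.frame(
  id = 1:3,
  g  = c("a", "b", "a"),
  h  = c("x", "y", "y"),
  stringsAsFactors = TRUE
)

# Apply as.character to every factor element; everything else is left alone.
bobc <- rapply(bobf, as.character, classes = "factor", how = "replace")
sapply(bobc, class)  # id "integer", g and h "character"

# Remember which columns were factors, then re-apply factor() to those only.
is_fct <- sapply(bobf, is.factor)
bobf2 <- bobc
bobf2[is_fct] <- lapply(bobf2[is_fct], factor)
sapply(bobf2, class)  # g and h are "factor" again
```

Indexing with bobf2[is_fct] rather than bobf2[, is_fct] keeps the selection a list of columns even when only one column is a factor.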
I typically make this function a part of all my projects. Quick and easy.
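A minimal sketch of such a helper; the name unfactorize and its body are assumptions for illustration.

```r
# Hypothetical helper: convert every factor column of df to character.
unfactorize <- function(df) {
  is_fct <- vapply(df, is.factor, logical(1))
  for (col in names(df)[is_fct]) {
    df[[col]] <- as.character(df[[col]])
  }
  df
}

bob <- data.frame(id = 1:2, phenotype = c("25-", "25+"),
                  stringsAsFactors = TRUE)
bob <- unfactorize(bob)
sapply(bob, class)  # id "integer", phenotype "character"
```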
Another way is to convert it using apply
And a better one (the previous is of class 'matrix')
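A sketch contrasting the two variants on a made-up data frame, assuming the second one goes through as.matrix and as.data.frame:

```r
bob <- data.frame(
  phenotype = c("25-", "25+", "25-"),
  exclusion = c("19-", "19-", "Gr1-"),
  stringsAsFactors = TRUE
)

# apply() coerces the frame to a matrix first, so the result is a
# character matrix rather than a data frame.
bob_mat <- apply(bob, 2, as.character)
is.matrix(bob_mat)  # TRUE

# Going through as.matrix() and back gives a data frame of character columns.
bob2 <- as.data.frame(as.matrix(bob), stringsAsFactors = FALSE)
sapply(bob2, class)  # both "character"
```

Both routes pass through a character matrix, so any numeric columns in the frame would be turned into character as well.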
Update: Here's an example of something that doesn't work. I thought it would, but I think that the stringsAsFactors option only works on character strings - it leaves the factors alone.
Try this:
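A sketch of the kind of call the update warns about, assuming it means passing an existing data frame back through data.frame():

```r
bob <- data.frame(phenotype = c("25-", "25+"), stringsAsFactors = TRUE)

# stringsAsFactors only affects character vectors being added to the frame;
# columns that are already factors pass through unchanged.
bob2 <- data.frame(bob, stringsAsFactors = FALSE)
class(bob2$phenotype)  # still "factor"
```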
Generally speaking, whenever you're having problems with factors that should be characters, there's a stringsAsFactors setting somewhere to help you (including a global setting).
Or you can try transform. Just be sure to list every factor you'd like to convert to character.
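A small sketch of the transform approach, using the column names from the question and made-up rows:

```r
bob <- data.frame(
  phenotype = c("25-", "25+"),
  exclusion = c("19-", "Gr1-"),
  stringsAsFactors = TRUE
)

# Each factor to convert has to be named explicitly.
newbob <- transform(bob,
                    phenotype = as.character(phenotype),
                    exclusion = as.character(exclusion))
sapply(newbob, class)  # both "character"
```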
Or you can do something like this and kill all the pests with one blow:
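A sketch of one way to do that, assuming the idea is to split the frame into factor and non-factor columns and bind them back together. Here the sapply test is computed once up front, and the names and data are illustrative.

```r
bob <- data.frame(
  phenotype = c("25-", "25+"),
  id        = 1:2,
  exclusion = c("19-", "Gr1-"),
  stringsAsFactors = TRUE
)

is_fct <- sapply(bob, is.factor)

# Convert the factor columns, keep the rest, then glue the pieces together.
newbob_char <- as.data.frame(lapply(bob[is_fct], as.character),
                             stringsAsFactors = FALSE)
newbob_rest <- bob[!is_fct]
newbob <- cbind(newbob_char, newbob_rest)

names(newbob)                 # "phenotype" "exclusion" "id"
newbob <- newbob[names(bob)]  # restore the original column order
```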
It's not a good idea to shove the data into code like this; I could do the sapply part separately (actually, it's much easier to do it like that), but you get the point... I haven't checked the code, 'cause I'm not at home, so I hope it works! =)
This approach, however, has a downside... you must reorganize columns afterwards, while with transform you can do whatever you like, but at the cost of "pedestrian-style code writing"...
So there... =)
At the point where you create your data frame, include stringsAsFactors = FALSE to avoid all misunderstandings.
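For example, both data.frame() and read.csv() accept the stringsAsFactors argument at creation time (the inline CSV text is made up):

```r
# Building a frame directly:
df <- data.frame(phenotype = c("25-", "25+"), stringsAsFactors = FALSE)
class(df$phenotype)  # "character"

# Reading one from text or a file works the same way:
csv_text <- "phenotype,exclusion\n25-,19-\n25+,Gr1-"
df2 <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sapply(df2, class)  # both "character"
```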
If you use the data.table package for operations on your data, the problem does not arise.
If you already have factor columns in your dataset and want to convert them to character, you can do the following.
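A hedged sketch of a data.table conversion, assuming the package is installed. The column names and the fct_cols helper are illustrative, and the requireNamespace guard only keeps the script runnable when data.table is missing.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  # data.table() keeps strings as character by default.
  dt_plain <- data.table(col1 = c("a", "b", "c"), col2 = 1:3)
  sapply(dt_plain, class)  # col1 "character", col2 "integer"

  # For existing factor columns, update them by reference with :=.
  dt <- data.table(col1 = factor(c("a", "b", "c")), col2 = 1:3)
  fct_cols <- names(dt)[sapply(dt, is.factor)]
  dt[, (fct_cols) := lapply(.SD, as.character), .SDcols = fct_cols]
  sapply(dt, class)  # col1 "character", col2 "integer"
}
```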
This works for me. I finally figured out a one-liner:
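A sketch of what such a one-liner could look like in base R; the data frame df is made up:

```r
df <- data.frame(id = 1:2, phenotype = c("25-", "25+"),
                 stringsAsFactors = TRUE)

# Convert factor columns, pass everything else through unchanged.
df <- as.data.frame(lapply(df, function(y) if (is.factor(y)) as.character(y) else y), stringsAsFactors = FALSE)

sapply(df, class)  # id "integer", phenotype "character"
```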
New function "across" was introduced in dplyr version 1.0.0. The new function will supersede the scoped variants (_if, _at, _all). Here's the official documentation
You should use convert in hablar, which gives readable syntax compatible with tidyverse pipes:
which gives you:
With the dplyr package loaded, you can change only the phenotype column specifically.
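A hedged sketch using mutate, which may differ from the exact call this answer had in mind. The data is made up, and the guard only keeps the script runnable without dplyr.

```r
bob <- data.frame(
  phenotype = c("25-", "25+"),
  exclusion = c("19-", "Gr1-"),
  stringsAsFactors = TRUE
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Only phenotype is converted; exclusion remains a factor.
  bob_one <- dplyr::mutate(bob, phenotype = as.character(phenotype))
  sapply(bob_one, class)
}
```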
This function does the trick
Maybe a newer option?
This works by transforming all columns to character and then the numeric ones back to numeric:
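A minimal sketch of that two-step idea. The function name to_char_then_numeric and the rule for spotting numeric columns (every non-missing value parses as a number) are assumptions for illustration.

```r
# Hypothetical helper: everything to character, then numeric-looking
# columns back to numeric.
to_char_then_numeric <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  df[] <- lapply(df, as.character)

  # A column counts as numeric if it has at least one non-missing value
  # and all of its non-missing values can be parsed by as.numeric().
  looks_numeric <- vapply(df, function(x) {
    x <- x[!is.na(x)]
    length(x) > 0 && !anyNA(suppressWarnings(as.numeric(x)))
  }, logical(1))

  for (col in names(df)[looks_numeric]) {
    df[[col]] <- as.numeric(df[[col]])
  }
  df
}

mixed <- data.frame(n = c("1", "2.5", NA), s = c("a", "2", "b"),
                    stringsAsFactors = TRUE)
out <- to_char_then_numeric(mixed)
sapply(out, class)  # n "numeric", s "character"
```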
Adapted from: Get column types of excel sheet automatically