Factors in R: more than an annoyance?
One of the basic data types in R is the factor. In my experience, factors are basically a pain and I never use them; I always convert to characters. I feel oddly like I'm missing something.
Are there important examples of functions that use factors as grouping variables where the factor data type becomes necessary? Are there specific circumstances in which I should be using factors?
Answers (8)
You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is because in `read.table` and `read.csv` the argument `stringsAsFactors` was `TRUE` by default (most users miss this subtlety; since R 4.0.0 the default is `FALSE`). I say they are useful because model-fitting packages like lme4 use factors and ordered factors to fit models differently and to determine the type of contrasts to use. Graphing packages also use them for grouping. `ggplot` and most model-fitting functions coerce character vectors to factors, so the result is the same. However, you end up with warnings in your code.
One tricky thing is the whole `drop=TRUE` bit. In vectors this works well to remove levels of factors that aren't in the data. With `data.frame`s, however, the behavior of `[.data.frame()` is different: see this email or `?"[.data.frame"`. Using `drop=TRUE` on `data.frame`s does not work as you'd imagine.
Luckily, you can use `droplevels()` to drop unused factor levels for an individual factor or for every factor in a `data.frame` (since R 2.12). This is how to keep levels you've filtered out from showing up in `ggplot` legends.
Internally, `factor`s are integers with a character vector of levels stored as an attribute (see `attributes(iris$Species)` and `class(attributes(iris$Species)$levels)`), which is clean. If you had to change a level name and you were using character strings, this would be a much less efficient operation. And I change level names a lot, especially for `ggplot` legends. If you fake factors with character vectors, there's the risk that you'll change just one element and accidentally create a separate new level.
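The `drop=TRUE` and `droplevels()` points in the answer above can be sketched as follows. This is a minimal illustration on the built-in `iris` data, not the answer's original code; the names `s`, `kept`, `dropped`, `sub`, `sub_clean` and `renamed` are made up for the example.

```r
# Subsetting a factor vector keeps every level unless drop = TRUE is given.
s <- iris$Species
kept    <- s[s == "setosa"]
dropped <- s[s == "setosa", drop = TRUE]
nlevels(kept)     # counts all original levels
nlevels(dropped)  # counts only the level still present

# In `[.data.frame`, drop controls dimension dropping, not factor levels,
# so the unused levels survive here.
sub <- iris[iris$Species == "setosa", , drop = TRUE]
levels(sub$Species)

# droplevels() removes unused levels from every factor column.
sub_clean <- droplevels(sub)
levels(sub_clean$Species)

# A factor is an integer vector plus a character "levels" attribute.
typeof(s)
class(attributes(s)$levels)

# Renaming a level edits one string, however long the vector is.
renamed <- s
levels(renamed)[levels(renamed) == "setosa"] <- "Setosa"
table(renamed)
```

Calling `droplevels()` on the subset before plotting is the step that keeps filtered-out levels from appearing in a legend.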
Ordered factors are awesome. If I happen to love oranges and hate apples but don't mind grapes, I don't need to manage some weird index to say so.
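A small sketch of the fruit preference described above, assuming a made-up vector `fruit` (the data and variable names are illustrative only):

```r
# Hypothetical observations.
fruit <- c("apples", "oranges", "grapes", "oranges", "apples")

# Levels are listed from least liked to most liked.
pref <- factor(fruit,
               levels = c("apples", "grapes", "oranges"),
               ordered = TRUE)

# Comparisons follow the level order, not alphabetical order.
at_least_grapes <- pref >= "grapes"
fruit[at_least_grapes]

# Sorting and max() respect the same order.
sort(pref)
max(pref)
```

The ordering lives in the `levels` argument, so no separate lookup table of ranks has to be maintained.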
A `factor` is most analogous to an enumerated type in other languages. Its appropriate use is for a variable which can only take on one of a prescribed set of values. In these cases, not every allowed value may be present in any particular set of data, and the "empty" levels accurately reflect that.
Consider some examples. For data collected all across the United States, the state should be recorded as a factor. In this case, the fact that no cases were collected from a particular state is relevant. There could have been data from that state, but there happened (for whatever reason, which may itself be of interest) not to be. If hometown was collected, it would not be a factor, because there is no pre-stated set of possible hometowns. If data were collected from three towns rather than nationally, the town would be a factor: three choices were given at the outset, and if no relevant cases were found in one of those three towns, that is relevant.
Other aspects of `factor`s, such as providing a way to give an arbitrary sort order to a set of strings, are useful secondary characteristics of `factor`s, but are not the reason for their existence.
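The three-town example in the enumerated-type answer can be sketched like this. The town names and observations are invented for illustration.

```r
# Hypothetical study design: exactly three towns were sampled.
towns <- c("Ashford", "Brook", "Calder")

# Hypothetical data: no cases happened to come from Brook.
obs <- c("Ashford", "Calder", "Ashford")

# As a factor, the empty level is kept and reported as a zero count.
town_f <- factor(obs, levels = towns)
counts_factor <- table(town_f)
counts_factor

# As plain characters, Brook disappears from the summary entirely.
counts_char <- table(obs)
counts_char

# Like an enum, a factor rejects values outside the prescribed set:
# the assignment yields NA (with a warning, suppressed here).
bad <- town_f
suppressWarnings(bad[1] <- "Dover")
is.na(bad[1])
```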
Factors are fantastic when one is doing statistical analysis and actually exploring the data. Before that, however, when one is reading, cleaning, troubleshooting, merging and generally manipulating the data, factors are a total pain. Over the past few years a lot of functions have improved to handle factors better. For instance, `rbind` plays nicely with them. I still find it a total nuisance to have leftover empty levels after a subset operation.
I know that it is straightforward to recode the levels of a factor and to rejig the labels, and there are also wonderful ways to reorder the levels. My brain just cannot remember them, and I have to relearn them every time. Recoding should be a lot easier than it is.
R's string functions are quite easy and logical to use. So when manipulating data I generally prefer characters over factors.
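For reference, the recoding, reordering and leftover-level chores mentioned above can be sketched with base R alone. The vector `size` and the labels are invented for the example.

```r
# Hypothetical factor; default levels are alphabetical: "l", "m", "s".
size <- factor(c("s", "m", "l", "m", "s"))

# Set the level order and relabel in one call.
size2 <- factor(size,
                levels = c("s", "m", "l"),
                labels = c("small", "medium", "large"))

# Recode a single level by editing levels() in place.
levels(size2)[levels(size2) == "medium"] <- "mid"

# Move one level to the front (for example as a reference category).
size3 <- relevel(size2, ref = "large")
levels(size3)

# Subsetting leaves the empty level behind ...
sub <- size2[size2 != "large"]
table(sub)

# ... until droplevels() removes it.
sub <- droplevels(sub)
levels(sub)
```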
What a snarky title!
I believe many estimation functions allow you to use factors to easily define dummy variables... but I don't use them for that.
I use them when I have very large character vectors with few unique values. This can cut down on memory consumption, especially if the strings in the character vector are longish.
PS - I'm joking about the title. I saw your tweet. ;-)
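The memory claim can be checked with `object.size()`. The sketch below uses invented labels and an arbitrary vector length; the exact saving depends on the R version and platform, so no figures are quoted.

```r
# Illustrative data: 100,000 elements cycling through three long labels.
labels <- c("first rather long category label",
            "second rather long category label",
            "third rather long category label")
ch <- rep(labels, length.out = 1e5)
f  <- factor(ch)

# The factor stores integer codes plus one copy of each label.
size_ch <- object.size(ch)
size_f  <- object.size(f)
print(size_ch, units = "Kb")
print(size_f,  units = "Kb")
```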
Factors are an excellent "unique-cases" badging engine. I've recreated this badly many times, and despite a couple of wrinkles occasionally, they are extremely powerful.
If there's a better way to do this task I'd love to see it; I don't see this capability of `factor` discussed.
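One possible reading of "unique-cases badging" is using a factor's integer codes as compact IDs for distinct values. The sketch below assumes that reading; the answer's own code is not available here, and the data and names are invented.

```r
# Hypothetical values with repeats.
cases <- c("red", "blue", "red", "green", "blue")

# factor() finds the distinct cases; the integer codes act as badges.
f  <- factor(cases)
id <- as.integer(f)
id

# The levels are the lookup table from badge back to value.
levels(f)
restored <- levels(f)[id]
restored
```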
Only with factors can we handle `NA`s by setting them as a factor level. This is handy because many functions leave out `NA` values. Suppose we have some toy data in a data frame `df`, with a value column `x` and a grouping column `g` that contains `NA`s.
If we want the means of `x` grouped by `g`, we can use `aggregate`, but we do not get the mean of `x` for the case where `g` is `NA`. The same problem occurs if we use `by` instead (see `by(df$x, list(df$g), mean)`). There are many other similar examples where functions (by default or in general) do not consider `NA`s.
But we can add `NA` as a factor level with `addNA`, and then we do see the mean of `x` where `g` has `NA`s. One could argue that the same output is possible with `paste0`, which is true (try `aggregate(x ~ paste0(g), df, mean)`). But only with `addNA` can we backtransform the `NA`s to actual missings. So we can first transform `g` with `addNA` and then backtransform it into a new column `g_back`. The `NA`s in `g_back` are then actual missings: `any(is.na(df$g_back))` returns `TRUE`.
This even works in strange situations where `"NA"` was a value in the original vector! For example, the vector `vec <- c("a", "NA", NA)` can be transformed using `vec_addNA <- addNA(vec)`, and we can actually backtransform this. On the other hand, to my knowledge we cannot backtransform `vec_paste0 <- paste0(vec)`, because in `vec_paste0` the `"NA"` and the `NA` are the same!
I started the answer with "Only with factors can we handle NAs by setting them as a factor level." In fact, I would be careful using `addNA`, but regardless of the risk associated with `addNA`, the fact stands that there is no similar option for characters.
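The `addNA` workflow described above can be sketched as follows. The toy data are invented, and `as.character()` is used for the backtransformation as one plausible choice; the answer's original code is not available here.

```r
# Hypothetical toy data: two rows have a missing group label.
df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                 g = c("a", "a", "b", "b", NA, NA))

# Default grouping drops the rows where g is NA.
by_g <- aggregate(x ~ g, df, mean)
by_g

# With NA promoted to a level, the missing group gets its own row.
by_g_na <- aggregate(x ~ addNA(g), df, mean)
by_g_na

# Round trip: as.character() turns the NA level back into a real NA.
df$g_addNA <- addNA(df$g)
df$g_back  <- as.character(df$g_addNA)
any(is.na(df$g_back))

# The literal string "NA" and a missing value stay distinct via addNA ...
vec       <- c("a", "NA", NA)
vec_addNA <- addNA(vec)
vec_back  <- as.character(vec_addNA)
is.na(vec_back)

# ... but paste0() collapses both into the same string.
vec_paste0 <- paste0(vec)
vec_paste0
```

One reason for the caution the answer mentions: `is.na()` applied to the factor `vec_addNA` itself reports no missing values, because `NA` is now an ordinary level.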
`tapply` (and `aggregate`) rely on factors. The information-to-effort ratio of these functions is very high.
For instance, in a single line of code (one call to `tapply`) you can get the mean price of diamonds by cut and color.
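A minimal sketch of that one-line summary. To stay self-contained it uses a small invented data frame `d` as a stand-in for the diamonds data; the column names `price`, `cut` and `color` follow the `diamonds` dataset shipped with ggplot2.

```r
# Hypothetical stand-in for the diamonds data (values are made up).
d <- data.frame(
  price = c(300, 500, 400, 700, 650, 350),
  cut   = factor(c("Good", "Ideal", "Good", "Ideal", "Ideal", "Good")),
  color = factor(c("D", "D", "E", "E", "D", "E"))
)

# One call: mean price for every cut-by-color combination.
m <- with(d, tapply(price, list(cut, color), mean))
m

# aggregate() gives the same means in long format.
long <- aggregate(price ~ cut + color, d, mean)
long
```

With ggplot2 installed, the same call should work on the real data: `with(ggplot2::diamonds, tapply(price, list(cut, color), mean))`.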