Factors in R: more than an annoyance?

Posted on 2024-09-14 07:13:44

One of the basic data types in R is factors. In my experience factors are basically a pain and I never use them. I always convert to characters. I feel oddly like I'm missing something.

Are there some important examples of functions that use factors as grouping variables where the factor data type becomes necessary? Are there specific circumstances when I should be using factors?

Comments (8)

温柔戏命师 2024-09-21 07:13:44

You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is that in read.table and read.csv the argument stringsAsFactors was TRUE by default (most users missed this subtlety; since R 4.0.0 the default is FALSE). They are useful because model-fitting packages like lme4 use factors and ordered factors to fit models differently and to determine the type of contrasts to use. Graphing packages also use them for grouping. ggplot and most model-fitting functions coerce character vectors to factors, so the result is the same. However, you end up with warnings in your code:

lm(Petal.Length ~ -1 + Species, data=iris)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)

# Coefficients:
#     Speciessetosa  Speciesversicolor   Speciesvirginica  
#             1.462              4.260              5.552  

iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)

# Coefficients:
#     Speciessetosa  Speciesversicolor   Speciesvirginica  
#             1.462              4.260              5.552  

# Warning message:
# In model.matrix.default(mt, mf, contrasts) :
#   variable Species converted to a factor
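
The stringsAsFactors point can be checked without any file on disk. The following is a minimal sketch, assuming a small made-up CSV string (csv_text and its column names are hypothetical); the argument is passed explicitly so the result does not depend on the R version's default.

```r
# Hypothetical three-row CSV, read from a string instead of a file
csv_text <- "id,group\n1,a\n2,b\n3,a"

# Same data, read with the argument set each way
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_fct$group)   # "factor"
class(as_chr$group)   # "character"
levels(as_fct$group)  # "a" "b"
```

Setting the argument explicitly in every read call is one way to avoid being surprised by whichever default a given R version uses.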

One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:

s <- iris$Species
s[s == 'setosa', drop=TRUE]
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

However, with data.frames, the behavior of [.data.frame() is different: see this email or ?"[.data.frame". Using drop=TRUE on data.frames does not work as you'd imagine:

x <- subset(iris, Species == 'setosa', drop=TRUE)  # subsetting with [ behaves the same way
x$Species
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

Luckily you can drop unused factor levels easily with droplevels(), either for an individual factor or for every factor in a data.frame (since R 2.12):

x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa"     "versicolor" "virginica" 
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"

This is how to keep levels you've filtered out from showing up in ggplot legends.

Internally, factors are integer vectors with a levels attribute that holds a character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name while using character strings, it would be a much less efficient operation, because every matching element has to be rewritten instead of one level. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element and accidentally create a separate new level.
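
The internal representation and the renaming risk described above can be sketched on a toy vector. The values ("lo", "hi", "mid") are made up for illustration.

```r
f <- factor(c("lo", "hi", "lo", "mid"))

as.integer(f)  # integer codes: 2 1 2 3
levels(f)      # lookup table: "hi" "lo" "mid"

# Renaming a level touches one entry of the lookup table,
# and every element with that code follows along
levels(f)[levels(f) == "lo"] <- "low"
as.character(f)  # "low" "hi" "low" "mid"

# With a plain character vector, changing a single element
# silently leaves the old spelling behind as a separate value
ch <- c("lo", "hi", "lo", "mid")
ch[1] <- "low"
unique(ch)  # "low" "hi" "lo" "mid"
```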

不奢求什么 2024-09-21 07:13:44

Ordered factors are awesome. If I happen to love oranges and hate apples but don't mind grapes, I don't need to manage some weird index to say so:

d <- data.frame(x = rnorm(20), f = sample(c("apples", "oranges", "grapes"), 20, replace = TRUE, prob = c(0.5, 0.25, 0.25)))
d$f <- ordered(d$f, c("apples", "grapes", "oranges"))
d[d$f >= "grapes", ]
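
Because the fruit order above happens to be alphabetical as well, a second sketch may make the point clearer. It assumes a made-up rating scale (low < medium < high) where level order and alphabetical order disagree.

```r
ratings <- c("low", "high", "medium", "low")

# Ordered factor: comparisons follow the declared level order
r <- ordered(ratings, levels = c("low", "medium", "high"))
r >= "medium"        # FALSE TRUE TRUE FALSE
as.character(r[r >= "medium"])

# Plain strings compare alphabetically, so "high" sorts before "medium"
"high" >= "medium"   # FALSE
```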
深空失忆 2024-09-21 07:13:44

A factor is most analogous to an enumerated type in other languages. Its appropriate use is for a variable that can only take on one of a prescribed set of values. In these cases, not every allowed value may be present in any particular set of data, and the "empty" levels accurately reflect that.

Consider some examples. For some data which was collected all across the United States, the state should be recorded as a factor. In this case, the fact that no cases were collected from a particular state is relevant. There could have been data from that state, but there happened (for whatever reason, which may be a reason of interest) to not be. If hometown was collected, it would not be a factor. There is not a pre-stated set of possible hometowns. If data were collected from three towns rather than nationally, the town would be a factor: there are three choices that were given at the outset and if no relevant cases/data were found in one of those three towns, that is relevant.

Other aspects of factors, such as providing a way to give an arbitrary sort order to a set of strings, are useful secondary characteristics of factors, but are not the reason for their existence.
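
The three-town case can be sketched as follows. The town names are hypothetical; the point is only that a declared level with no observations still shows up with a count of zero.

```r
# Three towns were surveyed, but no cases came back from Carroll
towns <- factor(c("Ames", "Boone", "Ames"),
                levels = c("Ames", "Boone", "Carroll"))

table(towns)                # Carroll is listed with a count of 0
table(as.character(towns))  # as plain strings, Carroll disappears
```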

小糖芽 2024-09-21 07:13:44

Factors are fantastic when one is doing statistical analysis and actually exploring the data. However, prior to that, when one is reading, cleaning, troubleshooting, merging and generally manipulating the data, factors are a total pain. In the past few years a lot of functions have improved to handle factors better. For instance, rbind plays nicely with them. I still find it a total nuisance to have leftover empty levels after a subset operation.

# drop unused levels from every factor column of a data frame using gdata
library(gdata)
dataframe <- drop.levels(dataframe)  # dataframe stands for your own data frame

I know that it is straightforward to recode levels of a factor and to rejig the labels and there are also wonderful ways to reorder the levels. My brain just cannot remember them and I have to relearn it every time I use it. Recoding should just be a lot easier than it is.

R's string functions are quite easy and logical to use. So when manipulating I generally prefer characters over factors.
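
For reference, the recoding and reordering operations mentioned above can be sketched in base R. This is one possible set of idioms on made-up levels, not the only way to do it.

```r
f <- factor(c("a", "b", "c", "a"))

# Recode: new level names on the left, old levels on the right
# (here "b" and "c" are merged into "other")
levels(f) <- list(first = "a", other = c("b", "c"))
as.character(f)  # "first" "other" "other" "first"

# Reorder: move one level to the front ...
f_ref <- relevel(f, ref = "other")
levels(f_ref)    # "other" "first"

# ... or spell out the full order explicitly
f_ord <- factor(f, levels = c("other", "first"))
levels(f_ord)    # "other" "first"
```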

入怼 2024-09-21 07:13:44

What a snarky title!

I believe many estimation functions allow you to use factors to easily define dummy variables... but I don't use them for that.

I use them when I have very large character vectors with few unique observations. This can cut down on memory consumption, especially if the strings in the character vector are longer-ish.

PS - I'm joking about the title. I saw your tweet. ;-)
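
The memory claim can be checked with object.size(). This is a rough sketch, assuming a 64-bit build of R and three made-up labels; exact sizes vary by platform and version, so no figures are quoted here.

```r
set.seed(1)

# Hypothetical long-ish labels with only three unique values
labels <- c("a fairly long group label, number one",
            "a fairly long group label, number two",
            "a fairly long group label, number three")

ch <- sample(labels, 1e5, replace = TRUE)  # character vector
f  <- factor(ch)                           # same data as a factor

object.size(ch)  # one pointer per element
object.size(f)   # one integer code per element plus three levels
```

R already caches identical strings internally, so the saving comes from storing integer codes instead of pointers rather than from avoiding repeated copies of the text.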

别靠近我心 2024-09-21 07:13:44

Factors are an excellent "unique-cases" badging engine. I've recreated this badly many times, and despite a couple of wrinkles occasionally, they are extremely powerful.

library(dplyr)
d <- tibble(x = sample(letters[1:10], 20, replace = TRUE))

## normalize this table into an indexed value across two tables
id <- tibble(x_u = sort(unique(d$x))) %>% mutate(x_i = row_number())
di <- tibble(x_i = as.integer(factor(d$x)))


## reconstruct d$x when needed
d2 <- inner_join(di, id, by = "x_i") %>% transmute(x = x_u)
identical(d, d2)
## [1] TRUE

If there's a better way to do this task I'd love to see it. I don't see this capability of factor discussed much.
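
The same code-plus-lookup idea can be sketched in base R without any joins. This is a sketch on a small made-up vector, not a claim that it is the better way the answer asks about.

```r
x <- c("b", "a", "c", "a", "b")

f      <- factor(x)
codes  <- as.integer(f)  # index into the lookup table: 2 1 3 1 2
lookup <- levels(f)      # unique values, sorted: "a" "b" "c"

# Reconstruct the original vector from codes and lookup table
identical(lookup[codes], x)  # TRUE

# match() against the sorted unique values yields the same codes
identical(match(x, sort(unique(x))), codes)  # TRUE
```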

妖妓 2024-09-21 07:13:44

Only with factors can we handle NAs by setting them as a factor level. This is handy because many functions leave out NA values. Let's generate some toy data:

df <- data.frame(x= rnorm(10), g= c(sample(1:2, 9, replace= TRUE), NA))

If we want means of x grouped by g we can use

aggregate(x ~ g, df, mean)
  g          x
1 1  1.0415156
2 2 -0.3071171

As you can see, we do not get the mean of x for the case where g is NA. The same problem occurs if we use by instead (see by(df$x, list(df$g), mean)). There are many other similar examples where functions (by default or in general) do not consider NAs.

But we can add NA as a factor level. See here:

aggregate(x ~ addNA(g), df, mean)
  addNA(g)          x
1        1 -0.2907772
2        2 -0.2647040
3     <NA>  1.1647002

Now we see the mean of x where g is NA. One could argue that the same output is possible with paste0, which is true (try aggregate(x ~ paste0(g), df, mean)). But only with addNA can we backtransform the NAs to actual missing values. So let's first transform g with addNA and then backtransform it:

df$g_addNA <- addNA(df$g)
df$g_back <- factor(as.character(df$g_addNA))
df$g_back
 [1] 2    2    1    1    1    2    2    1    1    <NA>
Levels: 1 2

Now the NAs in g_back are actual missing values. See any(is.na(df$g_back)), which returns TRUE.

This even works in strange situations where "NA" was a value in the original vector! For example, the vector vec <- c("a", "NA", NA) can be transformed using vec_addNA <- addNA(vec) and we can actually backtransform this with

as.character(vec_addNA)
[1] "a"  "NA" NA

On the other hand, to my knowledge we cannot backtransform vec_paste0 <- paste0(vec), because in vec_paste0 the "NA" and the NA are the same! See

vec_paste0
[1] "a"  "NA" "NA"

I started the answer with "Only with factors can we handle NAs by setting them as a factor level." In fact, I would be careful using addNA, but regardless of the risk associated with addNA, the fact stands that there is no similar option for characters.
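
Since the toy data above is random, here is a deterministic sketch of the same idea using tapply. The numbers are made up so that the group means are easy to verify by hand.

```r
x <- c(1, 2, 3, 4, 10)
g <- c(1, 1, 2, 2, NA)

# The observation with a missing group is silently left out
without_na <- tapply(x, g, mean)
without_na  # two groups: 1.5 and 3.5

# addNA() makes NA an explicit level, so it gets its own group
with_na <- tapply(x, addNA(g), mean)
with_na     # three groups: 1.5, 3.5 and 10
```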

未蓝澄海的烟 2024-09-21 07:13:44

tapply (and aggregate) rely on factors. The information-to-effort ratio of these functions is very high.

For instance, in a single line of code (the call to tapply below) you can get the mean price of diamonds by Cut and Color:

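> data(diamonds, package="ggplot2")

> # assumption: dm is diamonds with a few columns renamed; the original does not show how dm was built
> dm = with(diamonds, data.frame(Carat=carat, Cut=cut, Clarity=clarity, Price=price, Color=color))

> head(dm, 3)

   Carat     Cut    Clarity Price Color
1  0.23     Ideal     SI2   326     E
2  0.21   Premium     SI1   326     E
3  0.23      Good     VS1   327     E


> tx = with(dm, tapply(X=Price, INDEX=list(Cut=Cut, Color=Color), FUN=mean))

> a = sort(1:dim(tx)[2], decreasing=TRUE)  # reverse columns for readability

> tx[,a]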

         Color
Cut         J    I    H    G    F    E    D
Fair      4976 4685 5136 4239 3827 3682 4291
Good      4574 5079 4276 4123 3496 3424 3405
Very Good 5104 5256 4535 3873 3779 3215 3470
Premium   6295 5946 5217 4501 4325 3539 3631
Ideal     4918 4452 3889 3721 3375 2598 2629
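
The same pattern can be tried without installing ggplot2. This is a sketch on the built-in mtcars data, where the numeric grouping columns are converted to factors by tapply itself, and a combination with no observations comes back as NA.

```r
# Mean miles per gallon by number of cylinders and number of gears
tx <- with(mtcars, tapply(mpg, list(cyl = cyl, gear = gear), mean))

dim(tx)       # 3 cylinder groups by 3 gear groups
tx["8", "4"]  # NA: mtcars has no 8-cylinder, 4-gear cars

# Reverse the column order, as in the answer above
tx[, rev(seq_len(ncol(tx)))]
```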