Drop unused factor levels in a subsetted data frame

Posted on 2024-07-30 07:42:38

I have a data frame containing a factor. When I create a subset of this dataframe using subset or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels, even when/if they do not exist in the new dataframe.

This causes problems when doing faceted plotting or using functions that rely on factor levels.

What is the most succinct way to remove levels from a factor in the new dataframe?

Here's an example:

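df <- data.frame(letters=letters[1:5],
                 numbers=1:5,
                 stringsAsFactors=TRUE)  # needed in R >= 4.0, where strings are no longer converted to factors by default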

levels(df$letters)
## [1] "a" "b" "c" "d" "e"

subdf <- subset(df, numbers <= 3)
##   letters numbers
## 1       a       1
## 2       b       2
## 3       c       3    

# all levels are still there!
levels(subdf$letters)
## [1] "a" "b" "c" "d" "e"
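As a minimal sketch of the symptom (the column is built with factor() explicitly, an assumption made so the example behaves the same on any R version), the leftover levels show up as zero counts in table():

```r
# Build the factor explicitly so the example does not depend on the
# stringsAsFactors default of the running R version.
df <- data.frame(letters = factor(letters[1:5]), numbers = 1:5)
subdf <- subset(df, numbers <= 3)

# table() reports every level, including those with no rows left
counts <- table(subdf$letters)
print(counts)

# nlevels() still counts the levels of the original data frame
nlevels(subdf$letters)
```

Functions that iterate over levels (faceting, tabulation, model matrices) see all five levels here even though only three occur in the subset.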

Comments (16)

执笔绘流年 2024-08-06 07:42:39

If you don't want this behaviour, don't use factors, use character vectors instead. I think this makes more sense than patching things up afterwards. Try the following before loading your data with read.table or read.csv:

options(stringsAsFactors = FALSE)

The disadvantage is that you're restricted to alphabetical ordering. (reorder is your friend for plots)
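A small sketch of this idea, assuming the column is kept as character and only turned into a factor with explicit levels where a particular order is needed (the level order below is made up for illustration):

```r
df <- data.frame(letters = letters[1:5], numbers = 1:5,
                 stringsAsFactors = FALSE)
subdf <- subset(df, numbers <= 3)

# A character column carries no levels, so nothing is left over
is.character(subdf$letters)
unique(subdf$letters)

# When a specific order is needed (e.g. for a plot), build the factor
# at that point; the order used here is hypothetical
ordered_letters <- factor(subdf$letters, levels = c("c", "a", "b"))
levels(ordered_letters)
```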

疯了 2024-08-06 07:42:39

It is a known issue, and one possible remedy is provided by drop.levels() in the gdata package where your example becomes

> drop.levels(subdf)
  letters numbers
1       a       1
2       b       2
3       c       3
> levels(drop.levels(subdf)$letters)
[1] "a" "b" "c"

There is also the dropUnusedLevels function in the Hmisc package. However, it only works by altering the subset operator [ and is not applicable here.

As a corollary, a direct approach on a per-column basis is a simple as.factor(as.character(data)):

> levels(subdf$letters)
[1] "a" "b" "c" "d" "e"
> subdf$letters <- as.factor(as.character(subdf$letters))
> levels(subdf$letters)
[1] "a" "b" "c"
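
One caveat worth checking with the character round trip: it rebuilds the levels in sorted order, so a custom level order is lost. A minimal sketch with made-up data, compared against base R's droplevels():

```r
sizes <- factor(c("low", "mid", "high", "low"),
                levels = c("low", "mid", "high"))
kept <- sizes[sizes != "mid"]

# Round trip through character: levels come back sorted alphabetically
round_trip <- as.factor(as.character(kept))
levels(round_trip)

# droplevels() removes the unused level but keeps the original order
dropped <- droplevels(kept)
levels(dropped)
```

If the level order matters (for example for plot axes or contrasts), the second form is probably the safer choice.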
素衣风尘叹 2024-08-06 07:42:39

Another way of doing the same, but with dplyr:

library(dplyr)
subdf <- df %>% filter(numbers <= 3) %>% droplevels()
str(subdf)

Edit:

This also works without the parentheses (thanks to agenis):

subdf <- df %>% filter(numbers <= 3) %>% droplevels
levels(subdf$letters)
夢归不見 2024-08-06 07:42:39

For the sake of completeness, now there is also fct_drop in the forcats package http://forcats.tidyverse.org/reference/fct_drop.html.

It differs from droplevels in the way it deals with NA:

f <- factor(c("a", "b", NA), exclude = NULL)

droplevels(f)
# [1] a    b    <NA>
# Levels: a b <NA>

forcats::fct_drop(f)
# [1] a    b    <NA>
# Levels: a b
七色彩虹 2024-08-06 07:42:39

Here's another way, which I believe is equivalent to the factor(..) approach:

> df <- data.frame(let=letters[1:5], num=1:5, stringsAsFactors=TRUE)
> subdf <- df[df$num <= 3, ]

> subdf$let <- subdf$let[ , drop=TRUE]

> levels(subdf$let)
[1] "a" "b" "c"
妄司 2024-08-06 07:42:39

This is obnoxious. This is how I usually do it, to avoid loading other packages:

levels(subdf$letters)<-c("a","b","c",NA,NA)

which gets you:

> subdf$letters
[1] a b c
Levels: a b c

Note that the new levels will replace whatever occupies their index in the old levels(subdf$letters), so something like:

levels(subdf$letters)<-c(NA,"a","c",NA,"b")

won't work.

This is obviously not ideal when you have lots of levels, but for a few, it's quick and easy.
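
The same assignment can be generated rather than typed by hand. A sketch, assuming the factor is set up as in the question: every level that no longer occurs in the data is set to NA, which drops it.

```r
df <- data.frame(letters = factor(letters[1:5]), numbers = 1:5)
subdf <- subset(df, numbers <= 3)

# Mark levels that do not occur in the data as NA; assigning NA to a
# level removes it from the factor
lv <- levels(subdf$letters)
lv[!(lv %in% as.character(subdf$letters))] <- NA
levels(subdf$letters) <- lv

levels(subdf$letters)
```

Because the vector keeps the original positions, this avoids the index mix-up described above.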

贪恋 2024-08-06 07:42:39

Looking at the code of the droplevels methods in the R source, you can see that it wraps the factor function. That means you can basically recreate the column with the factor function.
Below is the data.table way to drop levels from all the factor columns.

library(data.table)
dt = data.table(letters=factor(letters[1:5]), numbers=seq(1:5))
levels(dt$letters)
#[1] "a" "b" "c" "d" "e"
subdt = dt[numbers <= 3]
levels(subdt$letters)
#[1] "a" "b" "c" "d" "e"

upd.cols = sapply(subdt, is.factor)
subdt[, names(subdt)[upd.cols] := lapply(.SD, factor), .SDcols = upd.cols]
levels(subdt$letters)
#[1] "a" "b" "c"
故事与诗 2024-08-06 07:42:39

Very interesting thread. I especially liked the idea of simply applying factor() to the subselection again. I had a similar problem before, and I just converted to character and then back to factor.

   df <- data.frame(letters=letters[1:5], numbers=1:5, stringsAsFactors=TRUE)
   levels(df$letters)
   ## [1] "a" "b" "c" "d" "e"
   subdf <- df[df$numbers <= 3, ]   # the comma is required to select rows
   subdf$letters <- factor(as.character(subdf$letters))
心凉 2024-08-06 07:42:39

Here is a way of doing that:

varFactor <- factor(letters[1:15])
varFactor <- varFactor[1:5]
varFactor <- varFactor[drop=TRUE]
不一样的天空 2024-08-06 07:42:39

I wrote utility functions to do this. Now that I know about gdata's drop.levels, it looks pretty similar. Here they are:

present_levels <- function(x) intersect(levels(x), x)

trim_levels <- function(...) UseMethod("trim_levels")

trim_levels.factor <- function(x)  factor(x, levels=present_levels(x))

trim_levels.data.frame <- function(x) {
  for (n in names(x))
    if (is.factor(x[,n]))
      x[,n] = trim_levels(x[,n])
  x
}
时光磨忆 2024-08-06 07:42:39

Unfortunately factor() doesn't seem to work when using rxDataStep of RevoScaleR. I do it in two steps:
1) Convert to character and store in temporary external data frame (.xdf).
2) Convert back to factor and store in definitive external data frame. This eliminates any unused factor levels, without loading all the data into memory.

# Step 1) Converts to character, in temporary xdf file:
rxDataStep(inData = "input.xdf", outFile = "temp.xdf", transforms = list(VAR_X = as.character(VAR_X)), overwrite = T)
# Step 2) Converts back to factor:
rxDataStep(inData = "temp.xdf", outFile = "output.xdf", transforms = list(VAR_X = as.factor(VAR_X)), overwrite = T)
墨小墨 2024-08-06 07:42:39

Thank you for posting this question. However, none of the above solutions worked for me. I made a workaround for this problem and am sharing it in case someone else stumbles upon it:

For all factor columns that contain levels with no remaining values, you can first convert those columns to character type and then convert them back into factors.

For the above-posted question, just add the following lines of code:

# Convert into character
subdf$letters = as.character(subdf$letters)

# Convert back into factor
subdf$letters = as.factor(subdf$letters)

# Verify the levels in the subset
levels(subdf$letters)
一生独一 2024-08-06 07:42:39

A genuine droplevels function that is much faster than droplevels and does not perform any kind of unnecessary matching or tabulation of values is collapse::fdroplevels. Example:

library(collapse)
library(microbenchmark)

# wlddev data supplied in collapse, iso3c is a factor
data <- fsubset(wlddev, iso3c %!in% "USA")

microbenchmark(fdroplevels(data), droplevels(data), unit = "relative")
## Unit: relative
##               expr  min       lq     mean   median       uq      max neval cld
##  fdroplevels(data)  1.0  1.00000  1.00000  1.00000  1.00000  1.00000   100  a 
##   droplevels(data) 30.2 29.15873 24.54175 24.86147 22.11553 14.23274   100   b
迷途知返 2024-08-06 07:42:39

I have tried most, if not all, of the examples here, but none seem to work in my case.
After struggling for quite some time, I tried using as.character() on the factor column to change it to a column of strings, which seems to work just fine.

I am not sure about the performance implications.

无戏配角 2024-08-06 07:42:38

Since R version 2.12, there's a droplevels() function.

levels(droplevels(subdf$letters))
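
droplevels() also has a data frame method, so the whole subset can be cleaned in one call. A short sketch using the question's data plus a hypothetical second factor column `group`; the `except` argument gives the indices of columns whose levels should be left alone:

```r
df <- data.frame(letters = factor(letters[1:5]),
                 group   = factor(c("x", "x", "y", "y", "z")),
                 numbers = 1:5)
subdf <- subset(df, numbers <= 3)

# Drop unused levels from every factor column
clean <- droplevels(subdf)
levels(clean$letters)
levels(clean$group)

# Keep all levels of column 2 (group), drop unused ones elsewhere
partial <- droplevels(subdf, except = 2)
levels(partial$group)
```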
能怎样 2024-08-06 07:42:38

All you should have to do is to apply factor() to your variable again after subsetting:

> subdf$letters
[1] a b c
Levels: a b c d e
> subdf$letters <- factor(subdf$letters)
> subdf$letters
[1] a b c
Levels: a b c

EDIT

From the factor page example:

factor(ff)      # drops the levels that do not occur

For dropping levels from all factor columns in a dataframe, you can use:

subdf <- subset(df, numbers <= 3)
subdf[] <- lapply(subdf, function(x) if(is.factor(x)) factor(x) else x)