R 中的等效案例陈述

发布于 2024-10-10 23:01:31 字数 153 浏览 3 评论 0原文

我在数据框中有一个变量,其中一个字段通常有 7-8 个值。我想将它们折叠到数据框中的新变量中的 3 或 4 个新类别。最好的方法是什么?

如果我使用类似 SQL 的工具,但不确定如何在 R 中攻击它,我会使用 CASE 语句。

我们将非常感谢您提供的任何帮助!

I have a variable in a dataframe where one of the fields typically has 7-8 values. I want to collpase them 3 or 4 new categories within a new variable within the dataframe. What is the best approach?

I would use a CASE statement if I were in a SQL-like tool but not sure how to attack this in R.

Any help you can provide will be much appreciated!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(17

夜还是长夜 2024-10-17 23:01:31

case_when() 于 2016 年 5 月添加到 dplyr,以类似于 memisc::cases() 的方式解决了这个问题。

从 dplyr 0.7.0 开始,例如:

mtcars %>% 
  mutate(category = case_when(
    cyl == 4 & disp < median(disp) ~ "4 cylinders, small displacement",
    cyl == 8 & disp > median(disp) ~ "8 cylinders, large displacement",
    TRUE ~ "other"
  )
)

原始答案

library(dplyr)
mtcars %>% 
  mutate(category = case_when(
    .$cyl == 4 & .$disp < median(.$disp) ~ "4 cylinders, small displacement",
    .$cyl == 8 & .$disp > median(.$disp) ~ "8 cylinders, large displacement",
    TRUE ~ "other"
  )
)

case_when(), which was added to dplyr in May 2016, solves this problem in a manner similar to memisc::cases().

As of dplyr 0.7.0, for example:

mtcars %>% 
  mutate(category = case_when(
    cyl == 4 & disp < median(disp) ~ "4 cylinders, small displacement",
    cyl == 8 & disp > median(disp) ~ "8 cylinders, large displacement",
    TRUE ~ "other"
  )
)

Original answer

library(dplyr)
mtcars %>% 
  mutate(category = case_when(
    .$cyl == 4 & .$disp < median(.$disp) ~ "4 cylinders, small displacement",
    .$cyl == 8 & .$disp > median(.$disp) ~ "8 cylinders, large displacement",
    TRUE ~ "other"
  )
)
我很OK 2024-10-17 23:01:31

这是使用 switch 语句的一种方法:

df <- data.frame(name = c('cow','pig','eagle','pigeon'), 
                 stringsAsFactors = FALSE)
df$type <- sapply(df$name, switch, 
                  cow = 'animal', 
                  pig = 'animal', 
                  eagle = 'bird', 
                  pigeon = 'bird')

> df
    name   type
1    cow animal
2    pig animal
3  eagle   bird
4 pigeon   bird

这种方法的一个缺点是您必须不断地为每个项目编写类别名称(animal 等)。从语法上讲,能够如下定义我们的类别更方便(请参阅非常相似的问题如何在 R 的数据框中添加列

myMap <- list(animal = c('cow', 'pig'), bird = c('eagle', 'pigeon'))

,我们希望以某种方式“反转”此映射。我编写了自己的 invMap 函数:

invMap <- function(map) {
  items <- as.character( unlist(map) )
  nams <- unlist(Map(rep, names(map), sapply(map, length)))
  names(nams) <- items
  nams
}

然后按如下方式反转上面的映射:

> invMap(myMap)
     cow      pig    eagle   pigeon 
"animal" "animal"   "bird"   "bird" 

然后很容易使用它在数据框中添加 type 列:

df <- transform(df, type = invMap(myMap)[name])

> df
    name   type
1    cow animal
2    pig animal
3  eagle   bird
4 pigeon   bird

Here's a way using the switch statement:

df <- data.frame(name = c('cow','pig','eagle','pigeon'), 
                 stringsAsFactors = FALSE)
df$type <- sapply(df$name, switch, 
                  cow = 'animal', 
                  pig = 'animal', 
                  eagle = 'bird', 
                  pigeon = 'bird')

> df
    name   type
1    cow animal
2    pig animal
3  eagle   bird
4 pigeon   bird

The one downside of this is that you have to keep writing the category name (animal, etc) for each item. It is syntactically more convenient to be able to define our categories as below (see the very similar question How do add a column in a data frame in R )

myMap <- list(animal = c('cow', 'pig'), bird = c('eagle', 'pigeon'))

and we want to somehow "invert" this mapping. I write my own invMap function:

invMap <- function(map) {
  items <- as.character( unlist(map) )
  nams <- unlist(Map(rep, names(map), sapply(map, length)))
  names(nams) <- items
  nams
}

and then invert the above map as follows:

> invMap(myMap)
     cow      pig    eagle   pigeon 
"animal" "animal"   "bird"   "bird" 

And then it's easy to use this to add the type column in the data-frame:

df <- transform(df, type = invMap(myMap)[name])

> df
    name   type
1    cow animal
2    pig animal
3  eagle   bird
4 pigeon   bird
初吻给了烟 2024-10-17 23:01:31

查看 memisc 包中的 cases 函数。它通过两种不同的使用方式来实现案例功能。
从包中的示例来看:

z1=cases(
    "Condition 1"=x<0,
    "Condition 2"=y<0,# only applies if x >= 0
    "Condition 3"=TRUE
    )

其中 x 和 y 是两个向量。

参考文献: memisc 包案例示例

Have a look at the cases function from the memisc package. It implements case-functionality with two different ways to use it.
From the examples in the package:

z1=cases(
    "Condition 1"=x<0,
    "Condition 2"=y<0,# only applies if x >= 0
    "Condition 3"=TRUE
    )

where x and y are two vectors.

References: memisc package, cases example

魔法唧唧 2024-10-17 23:01:31

我没有看到任何关于“切换”的建议。代码示例(运行它):

x <- "three"
y <- 0
switch(x,
       one = {y <- 5},
       two = {y <- 12},
       three = {y <- 432})
y

I see no proposal for 'switch'. Code example (run it):

x <- "three"
y <- 0
switch(x,
       one = {y <- 5},
       two = {y <- 12},
       three = {y <- 432})
y
心病无药医 2024-10-17 23:01:31

如果你有 factor 那么你可以通过标准方法改变级别:

df <- data.frame(name = c('cow','pig','eagle','pigeon'), 
             stringsAsFactors = FALSE)
df$type <- factor(df$name) # First step: copy vector and make it factor
# Change levels:
levels(df$type) <- list(
    animal = c("cow", "pig"),
    bird = c("eagle", "pigeon")
)
df
#     name   type
# 1    cow animal
# 2    pig animal
# 3  eagle   bird
# 4 pigeon   bird

你可以编写简单的函数作为包装器:

changelevels <- function(f, ...) {
    f <- as.factor(f)
    levels(f) <- list(...)
    f
}

df <- data.frame(name = c('cow','pig','eagle','pigeon'), 
                 stringsAsFactors = TRUE)

df$type <- changelevels(df$name, animal=c("cow", "pig"), bird=c("eagle", "pigeon"))

If you got factor then you could change levels by standard method:

df <- data.frame(name = c('cow','pig','eagle','pigeon'), 
             stringsAsFactors = FALSE)
df$type <- factor(df$name) # First step: copy vector and make it factor
# Change levels:
levels(df$type) <- list(
    animal = c("cow", "pig"),
    bird = c("eagle", "pigeon")
)
df
#     name   type
# 1    cow animal
# 2    pig animal
# 3  eagle   bird
# 4 pigeon   bird

You could write simple function as a wrapper:

changelevels <- function(f, ...) {
    f <- as.factor(f)
    levels(f) <- list(...)
    f
}

df <- data.frame(name = c('cow','pig','eagle','pigeon'), 
                 stringsAsFactors = TRUE)

df$type <- changelevels(df$name, animal=c("cow", "pig"), bird=c("eagle", "pigeon"))
从﹋此江山别 2024-10-17 23:01:31

恕我直言,最简单和通用的代码:

dft=data.frame(x = sample(letters[1:8], 20, replace=TRUE))
dft=within(dft,{
    y=NA
    y[x %in% c('a','b','c')]='abc'
    y[x %in% c('d','e','f')]='def'
    y[x %in% 'g']='g'
    y[x %in% 'h']='h'
})

Imho, most straightforward and universal code:

dft=data.frame(x = sample(letters[1:8], 20, replace=TRUE))
dft=within(dft,{
    y=NA
    y[x %in% c('a','b','c')]='abc'
    y[x %in% c('d','e','f')]='def'
    y[x %in% 'g']='g'
    y[x %in% 'h']='h'
})
鹿! 2024-10-17 23:01:31

有一个 switch 语句,但我似乎永远无法让它按照我认为应该的方式工作。由于您没有提供示例,我将使用因子变量制作一个示例:

 dft <-data.frame(x = sample(letters[1:8], 20, replace=TRUE))
 levels(dft$x)
[1] "a" "b" "c" "d" "e" "f" "g" "h"

如果您按照适合重新分配的顺序指定所需的类别,则可以使用因子或数字变量作为索引:

c("abc", "abc", "abc", "def", "def", "def", "g", "h")[dft$x]
 [1] "def" "h"   "g"   "def" "def" "abc" "h"   "h"   "def" "abc" "abc" "abc" "h"   "h"   "abc"
[16] "def" "abc" "abc" "def" "def"

dft$y <- c("abc", "abc", "abc", "def", "def", "def", "g", "h")[dft$x] str(dft)
'data.frame':   20 obs. of  2 variables:
 $ x: Factor w/ 8 levels "a","b","c","d",..: 4 8 7 4 6 1 8 8 5 2 ...
 $ y: chr  "def" "h" "g" "def" ...

我后来了解到确实有两个不同的开关功能。它不是通用函数,但您应该将其视为 switch.numericswitch.character。如果您的第一个参数是 R '因子',您会得到 switch.numeric 行为,这可能会导致问题,因为大多数人看到因子显示为字符,并错误地假设所有函数都会处理他们本身。

There is a switch statement but I can never seem to get it to work the way I think it should. Since you have not provided an example I will make one using a factor variable:

 dft <-data.frame(x = sample(letters[1:8], 20, replace=TRUE))
 levels(dft$x)
[1] "a" "b" "c" "d" "e" "f" "g" "h"

If you specify the categories you want in an order appropriate to the reassignment you can use the factor or numeric variables as an index:

c("abc", "abc", "abc", "def", "def", "def", "g", "h")[dft$x]
 [1] "def" "h"   "g"   "def" "def" "abc" "h"   "h"   "def" "abc" "abc" "abc" "h"   "h"   "abc"
[16] "def" "abc" "abc" "def" "def"

dft$y <- c("abc", "abc", "abc", "def", "def", "def", "g", "h")[dft$x] str(dft)
'data.frame':   20 obs. of  2 variables:
 $ x: Factor w/ 8 levels "a","b","c","d",..: 4 8 7 4 6 1 8 8 5 2 ...
 $ y: chr  "def" "h" "g" "def" ...

I later learned that there really are two different switch functions. It's not generic function but you should think about it as either switch.numeric or switch.character. If your first argument is an R 'factor', you get switch.numeric behavior, which is likely to cause problems, since most people see factors displayed as character and make the incorrect assumption that all functions will process them as such.

迟到的我 2024-10-17 23:01:31

我在您提到的那些情况下使用 switch()。它看起来像一个控制语句,但实际上它是一个函数。计算表达式并根据该值返回列表中的相应项目。

switch 以两种不同的方式工作,具体取决于第一个参数的计算结果是字符串还是数字。

下面是一个简单的字符串示例,它解决了将旧类别折叠为新类别的问题。

对于字符串形式,在命名值后面有一个未命名参数作为默认参数。

newCat <- switch(EXPR = category,
       cat1   = catX,
       cat2   = catX,
       cat3   = catY,
       cat4   = catY,
       cat5   = catZ,
       cat6   = catZ,
       "not available")

I am using in those cases you are referring switch(). It looks like a control statement but actually, it is a function. The expression is evaluated and based on this value, the corresponding item in the list is returned.

switch works in two distinct ways depending whether the first argument evaluates to a character string or a number.

What follows is a simple string example which solves your problem to collapse old categories to new ones.

For the character-string form, have a single unnamed argument as the default after the named values.

newCat <- switch(EXPR = category,
       cat1   = catX,
       cat2   = catX,
       cat3   = catY,
       cat4   = catY,
       cat5   = catZ,
       cat6   = catZ,
       "not available")
偷得浮生 2024-10-17 23:01:31

您可以使用汽车包中的重新编码:

library(ggplot2) #get data
library(car)
daimons$new_var <- recode(diamonds$clarity , "'I1' = 'low';'SI2' = 'low';else = 'high';")[1:10]

You can use recode from the car package:

library(ggplot2) #get data
library(car)
daimons$new_var <- recode(diamonds$clarity , "'I1' = 'low';'SI2' = 'low';else = 'high';")[1:10]
左岸枫 2024-10-17 23:01:31

我不喜欢其中任何一个,读者或潜在用户不清楚它们。我只是使用一个匿名函数,语法不像 case 语句那么流畅,但是评估类似于 case 语句,并且没有那么痛苦。这还假设您在定义变量的地方对其进行评估。

result <- ( function() { if (x==10 | y< 5) return('foo') 
                         if (x==11 & y== 5) return('bar')
                        })()

所有这些 () 都是封装和评估匿名函数所必需的。

i dont like any of these, they are not clear to the reader or the potential user. I just use an anonymous function, the syntax is not as slick as a case statement, but the evaluation is similar to a case statement and not that painful. this also assumes your evaluating it within where your variables are defined.

result <- ( function() { if (x==10 | y< 5) return('foo') 
                         if (x==11 & y== 5) return('bar')
                        })()

all of those () are necessary to enclose and evaluate the anonymous function.

孤独患者 2024-10-17 23:01:31

从 data.table 开始v1.13.0 您可以使用函数 fcase() (fast-case) 执行类似 SQL 的 CASE 操作(也类似于 dplyr: :case_when()):

require(data.table)

dt <- data.table(name = c('cow','pig','eagle','pigeon','cow','eagle'))
dt[ , category := fcase(name %in% c('cow', 'pig'), 'mammal',
                        name %in% c('eagle', 'pigeon'), 'bird') ]

As of data.table v1.13.0 you can use the function fcase() (fast-case) to do SQL-like CASE operations (also similar to dplyr::case_when()):

require(data.table)

dt <- data.table(name = c('cow','pig','eagle','pigeon','cow','eagle'))
dt[ , category := fcase(name %in% c('cow', 'pig'), 'mammal',
                        name %in% c('eagle', 'pigeon'), 'bird') ]
明明#如月 2024-10-17 23:01:31

如果你想要类似 sql 的语法,你可以使用 sqldf 包。要使用的函数也名为sqldf,语法如下

sqldf(<your query in quotation marks>)

If you want to have sql-like syntax you can just make use of sqldf package. Tthe function to be used is also names sqldf and the syntax is as follows

sqldf(<your query in quotation marks>)
国产ˉ祖宗 2024-10-17 23:01:31

案例陈述实际上可能不是正确的方法。如果这是一个因素(很可能是),只需适当设置该因素的水平即可。

假设您有一个由字母 A 到 E 组成的因子,如下所示。

> a <- factor(rep(LETTERS[1:5],2))
> a
 [1] A B C D E A B C D E
Levels: A B C D E

要连接级别 B 和 C 并将其命名为 BC,只需将这些级别的名称更改为 BC 即可。

> levels(a) <- c("A","BC","BC","D","E")
> a
 [1] A  BC BC D  E  A  BC BC D  E 
Levels: A BC D E

结果如所愿。

A case statement actually might not be the right approach here. If this is a factor, which is likely is, just set the levels of the factor appropriately.

Say you have a factor with the letters A to E, like this.

> a <- factor(rep(LETTERS[1:5],2))
> a
 [1] A B C D E A B C D E
Levels: A B C D E

To join levels B and C and name it BC, just change the names of those levels to BC.

> levels(a) <- c("A","BC","BC","D","E")
> a
 [1] A  BC BC D  E  A  BC BC D  E 
Levels: A BC D E

The result is as desired.

人生戏 2024-10-17 23:01:31

混合 plyr::mutatedplyr::case_when 对我有用并且可读。

iris %>%
plyr::mutate(coolness =
     dplyr::case_when(Species  == "setosa"     ~ "not cool",
                      Species  == "versicolor" ~ "not cool",
                      Species  == "virginica"  ~ "super awesome",
                      TRUE                     ~ "undetermined"
       )) -> testIris
head(testIris)
levels(testIris$coolness)  ## NULL
testIris$coolness <- as.factor(testIris$coolness)
levels(testIris$coolness)  ## ok now
testIris[97:103,4:6]

如果列可以从 mutate 中出来而不是 char 作为一个因子,那就加分了! case_when 语句的最后一行捕获所有不匹配的行,这一点非常重要。

     Petal.Width    Species      coolness
 97         1.3  versicolor      not cool
 98         1.3  versicolor      not cool  
 99         1.1  versicolor      not cool
100         1.3  versicolor      not cool
101         2.5  virginica     super awesome
102         1.9  virginica     super awesome
103         2.1  virginica     super awesome

Mixing plyr::mutate and dplyr::case_when works for me and is readable.

iris %>%
plyr::mutate(coolness =
     dplyr::case_when(Species  == "setosa"     ~ "not cool",
                      Species  == "versicolor" ~ "not cool",
                      Species  == "virginica"  ~ "super awesome",
                      TRUE                     ~ "undetermined"
       )) -> testIris
head(testIris)
levels(testIris$coolness)  ## NULL
testIris$coolness <- as.factor(testIris$coolness)
levels(testIris$coolness)  ## ok now
testIris[97:103,4:6]

Bonus points if the column can come out of mutate as a factor instead of char! The last line of the case_when statement, which catches all un-matched rows is very important.

     Petal.Width    Species      coolness
 97         1.3  versicolor      not cool
 98         1.3  versicolor      not cool  
 99         1.1  versicolor      not cool
100         1.3  versicolor      not cool
101         2.5  virginica     super awesome
102         1.9  virginica     super awesome
103         2.1  virginica     super awesome
看轻我的陪伴 2024-10-17 23:01:31

您可以使用 base 函数 merge 执行 case 样式的重新映射任务:

df <- data.frame(name = c('cow','pig','eagle','pigeon','cow','eagle'), 
                 stringsAsFactors = FALSE)

mapping <- data.frame(
  name=c('cow','pig','eagle','pigeon'),
  category=c('mammal','mammal','bird','bird')
)

merge(df,mapping)
# name category
# 1    cow   mammal
# 2    cow   mammal
# 3  eagle     bird
# 4  eagle     bird
# 5    pig   mammal
# 6 pigeon     bird

You can use the base function merge for case-style remapping tasks:

df <- data.frame(name = c('cow','pig','eagle','pigeon','cow','eagle'), 
                 stringsAsFactors = FALSE)

mapping <- data.frame(
  name=c('cow','pig','eagle','pigeon'),
  category=c('mammal','mammal','bird','bird')
)

merge(df,mapping)
# name category
# 1    cow   mammal
# 2    cow   mammal
# 3  eagle     bird
# 4  eagle     bird
# 5    pig   mammal
# 6 pigeon     bird
顾挽 2024-10-17 23:01:31
com = '102'
switch (com,
    '110' = (com= '23279'),
    '101' = (com='23276'),
    '102'= (com = '23277'),
    '111' = (com = '23281'),
    '112' = (com = '23283')
)

print(com)
com = '102'
switch (com,
    '110' = (com= '23279'),
    '101' = (com='23276'),
    '102'= (com = '23277'),
    '111' = (com = '23281'),
    '112' = (com = '23283')
)

print(com)
み格子的夏天 2024-10-17 23:01:31

使用(小)参考表来映射分类通常更容易,并且使(非 R)人员更容易遵循。您甚至可以设置一个表来映射串联变量。显然,您也可以设置多个列。

It is often easier to use a (small) reference table to map the classifications and makes it easier for (non R) people to follow. You can even set up a table to map concatenated variables. You can set up multiple columns as well, obviously.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文