Creating a categorical variable from ranges in R

Posted on 2024-08-29 12:31:45

I have a data frame with a column of integers that I would like to use as a reference to create a new categorical variable. I want to divide the variable into three groups and set the ranges myself (i.e., 0-5, 6-10, etc.). I tried cut, but it seemed to divide the variable into groups based on a normal distribution, and my data is right-skewed. I also tried if/then statements, but those output TRUE/FALSE values, and I would like to keep my original variable. I am sure there is a simple way to do this, but I cannot seem to figure it out. Any advice on a simple way to do this quickly?

I had something like this in mind:

x   x.range
3   0-5
4   0-5
6   6-10
12  11-15

Comments (3)

云淡风轻 2024-09-05 12:31:45
x <- rnorm(100, 10, 10)
# pass the break points explicitly; cut() then uses exactly these intervals
cut(x, c(-Inf, 0, 5, 6, 10, Inf))
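
As a minimal sketch of how the same call could be adapted to the ranges in the question, the example below passes explicit breaks and labels to cut and stores the result next to the original column. The vector x, the break points, and the label strings are taken from the question's example table; the data frame name df is only illustrative, and the sketch assumes the values are integers between 0 and 15.

```r
# Example values from the question
x <- c(3, 4, 6, 12)

# Intervals are [0,5], (5,10], (10,15]; for integer data these
# correspond to 0-5, 6-10 and 11-15.
x.range <- cut(x,
               breaks = c(0, 5, 10, 15),
               labels = c("0-5", "6-10", "11-15"),
               include.lowest = TRUE)  # so that 0 itself lands in the first bin

# Keep the original variable alongside the new factor
df <- data.frame(x, x.range)
df$x.range  # values: 0-5 0-5 6-10 11-15
```

Values outside 0 to 15 would become NA with these breaks, so the outer break points may need widening (for example with -Inf and Inf, as in the call above).
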
苦妄 2024-09-05 12:31:45

Ian's answer (cut) is the most common way to do this, as far as I know.

I prefer to use shingle, from the lattice package; the argument that specifies the binning intervals seems a little more intuitive to me.

You use shingle like so:

library(lattice)  # shingle() comes from the lattice package

# mock some data
data <- sample(0:40, 200, replace = TRUE)

a <- c(0, 5); b <- c(5, 9); c <- c(9, 19); d <- c(19, 33); e <- c(33, 41)

# one row per interval: column 1 is the lower bound, column 2 the upper bound
my_bins <- matrix(rbind(a, b, c, d, e), ncol = 2)

# my_bins returns (the binning intervals I've set):
#      [,1] [,2]
# [1,]    0    5
# [2,]    5    9
# [3,]    9   19
# [4,]   19   33
# [5,]   33   41

shx <- shingle(data, intervals = my_bins)

# typing 'shx' at the interactive prompt gives a nice frequency table:
# Intervals:
#   min max count
# 1   0   5    23
# 2   5   9    17
# 3   9  19    56
# 4  19  33    76
# 5  33  41    46
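
One thing to keep in mind: the counts in the table above add up to 218 rather than 200. As far as I understand, this is because shingle intervals may overlap, so a value sitting on a shared boundary (5, 9, 19 or 33) is counted in both neighbouring intervals. If strictly non-overlapping categories are needed, a rough comparison using cut with the same edges could look like the sketch below. The seed and the names edges, binned and counts are illustrative assumptions, not part of the original example.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
data <- sample(0:40, 200, replace = TRUE)

# Same boundaries as my_bins, given as a single vector of edges
edges <- c(0, 5, 9, 19, 33, 41)

# Right-closed, non-overlapping bins: [0,5], (5,9], (9,19], (19,33], (33,41]
binned <- cut(data, breaks = edges, include.lowest = TRUE)

counts <- table(binned)
sum(counts)  # each observation is counted exactly once, so this is 200
```
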
青萝楚歌 2024-09-05 12:31:45

We can use smart_cut from the cutr package:

devtools::install_github("moodymudskipper/cutr")
library(cutr)

x <- c(3,4,6,12)

To cut with intervals of length 5 starting at 1:

smart_cut(x, list(5, 1), "width", simplify = FALSE)
# [1] [1,6)   [1,6)   [6,11)  [11,16]
# Levels: [1,6) < [6,11) < [11,16]

To get exactly your requested output:

smart_cut(x, c(0, 6, 11, 16), labels = ~paste0(.y[1], '-', .y[2] - 1), simplify = FALSE, open_end = TRUE)
# [1]   0-5   0-5  6-10 11-15
# Levels:   0-5 <  6-10 < 11-15

For more on cutr and smart_cut, see the moodymudskipper/cutr repository on GitHub.
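
If installing a package from GitHub is not an option, roughly the same labels can be produced with base R. The sketch below uses findInterval to look up which range each value falls into; it is only an illustration, not how smart_cut works internally. The names df, edges, labels and idx are hypothetical, and the sketch assumes non-negative values.

```r
df <- data.frame(x = c(3, 4, 6, 12))

# Lower edge of each range, plus one extra edge marking the end of the last range
edges  <- c(0, 6, 11, 16)
labels <- c("0-5", "6-10", "11-15")

# findInterval() returns, for each value, the index of the last edge <= value
idx <- findInterval(df$x, edges)

# Values of 16 or more get index 4, which has no label and so becomes NA
df$x.range <- factor(labels[idx], levels = labels)
df$x.range  # values: 0-5 0-5 6-10 11-15
```
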
