How do I do median splits within factor levels in R?
Here I make a new column to indicate whether myData is above or below its median:
### MedianSplits based on Whole Data
#create some test data
myDataFrame=data.frame(myData=runif(15),myFactor=rep(c("A","B","C"),5))
#create column showing median split
myBreaks= quantile(myDataFrame$myData,c(0,.5,1))
myDataFrame$MedianSplitWholeData = cut(
myDataFrame$myData,
breaks=myBreaks,
include.lowest=TRUE,
labels=c("Below","Above"))
#Check if it's correct
myDataFrame$AboveWholeMedian = myDataFrame$myData > median(myDataFrame$myData)
myDataFrame
Works fine. Now I want to do the same thing, but compute the median splits within each level of myFactor.
I've come up with this:
#Median splits within factor levels
byOutput=by(myDataFrame$myData,myDataFrame$myFactor, function (x) {
myBreaks= quantile(x,c(0,.5,1))
MedianSplitByGroup=cut(x,
breaks=myBreaks,
include.lowest=TRUE,
labels=c("Below","Above"))
MedianSplitByGroup
})
byOutput contains what I want. It categorizes each element of factors A, B, and C correctly. However I'd like to create a new column, myDataFrame$FactorLevelMedianSplit, that shows the newly-computed median split.
How do you convert the output of the "by" command into a useful data-frame column?
I think perhaps the "by" command is not the R-like way to do this...
Update:
With Thierry's example of how to use factor() cleverly, and upon discovering the "ave" function in Spector's book, I've found this solution, which requires no additional packages.
myDataFrame$MediansByFactor=ave(
myDataFrame$myData,
myDataFrame$myFactor,
FUN=median)
myDataFrame$FactorLevelMedianSplit = factor(
myDataFrame$myData>myDataFrame$MediansByFactor,
levels = c(TRUE, FALSE),
labels = c("Above", "Below"))
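The intermediate MediansByFactor column is optional. As a minimal sketch (the seed, the helper vector aboveGroupMedian, and the tapply() cross-check are illustrative assumptions, not part of the original post), the comparison can be done inside ave() itself. Because ave() writes its results back into a numeric vector, the logical result comes back as 0/1:

```r
# Sketch only: set.seed() and the helper names are illustrative assumptions.
set.seed(42)
myDataFrame <- data.frame(myData = runif(15),
                          myFactor = rep(c("A", "B", "C"), 5))

# ave() applies FUN within each level of myFactor and returns the results
# in the original row order; the logical result is coerced to 0/1.
aboveGroupMedian <- ave(myDataFrame$myData, myDataFrame$myFactor,
                        FUN = function(x) x > median(x))

myDataFrame$FactorLevelMedianSplit <- factor(
  aboveGroupMedian == 1,
  levels = c(TRUE, FALSE),
  labels = c("Above", "Below"))

# Cross-check against group medians computed separately with tapply()
groupMedians <- tapply(myDataFrame$myData, myDataFrame$myFactor, median)
expected <- unname(
  myDataFrame$myData > groupMedians[as.character(myDataFrame$myFactor)])
stopifnot(all((myDataFrame$FactorLevelMedianSplit == "Above") == expected))
```

Note that a value exactly equal to its group median is labelled "Below" here, which matches how the cut() version above treats it.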
Comments (3)
Here is a solution using the plyr package.
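The code for this answer is not shown here. As a rough sketch of what a plyr-based approach could look like (assuming the plyr package is installed; the seed and the name result are illustrative, and this is not necessarily what the answerer wrote), ddply() can apply transform() within each level of myFactor:

```r
# Sketch only: assumes the plyr package is installed; set.seed() and the
# name `result` are illustrative.
library(plyr)

set.seed(42)
myDataFrame <- data.frame(myData = runif(15),
                          myFactor = rep(c("A", "B", "C"), 5))

# ddply() splits the data frame by myFactor, applies transform() to each
# piece, and binds the pieces back together, grouped by myFactor.
result <- ddply(myDataFrame, .(myFactor), transform,
                FactorLevelMedianSplit = factor(
                  myData > median(myData),
                  levels = c(TRUE, FALSE),
                  labels = c("Above", "Below")))
result
```

The rows of result come back grouped by myFactor rather than in the original row order, so it is safer to work with result as a whole than to copy a single column back into myDataFrame.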
Here is a hack-ish way. Hadley may come up with something more elegant:
To start, we simply concatenate the by output:
What matters is that we get the factor levels 1 and 2 here, which we can use to re-index a new factor with those levels:
We can then assign that into the data.frame you wanted to modify:
Update: Never mind, we'd also need to reindex myDataFrame so that it is sorted A A ... A B ... B C ... C before we add the new column. Left as an exercise...
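One way around the ordering problem, sketched here under stated assumptions, is to swap by() for split() plus lapply() and let unsplit() put each group's result back on the rows it came from. The seed and the name splitByGroup are illustrative, and this is a variation on the idea above, not the answer's own code:

```r
# Sketch only: split()/lapply() stand in for by(); names are illustrative.
set.seed(42)
myDataFrame <- data.frame(myData = runif(15),
                          myFactor = rep(c("A", "B", "C"), 5))

# Same per-group cut() as in the question, applied to each level's values
splitByGroup <- lapply(
  split(myDataFrame$myData, myDataFrame$myFactor),
  function(x) {
    cut(x,
        breaks = quantile(x, c(0, .5, 1)),
        include.lowest = TRUE,
        labels = c("Below", "Above"))
  })

# unsplit() reverses split(): each group's values are returned to their
# original row positions, so no re-sorting of myDataFrame is needed.
myDataFrame$FactorLevelMedianSplit <- unsplit(splitByGroup,
                                              myDataFrame$myFactor)
myDataFrame
```

Because unsplit() uses the same grouping vector that split() did, the interleaved A, B, C row order is preserved.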
You weren't looking for something like this, were you?