Sort (order) data frame rows by multiple columns

Posted on 2024-08-03 02:29:15

I want to sort a data frame by multiple columns. For example, with the data frame below I would like to sort by column 'z' (descending) then by column 'b' (ascending):

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
dd
    b x y z
1  Hi A 8 1
2 Med D 3 1
3  Hi A 9 1
4 Low C 9 2

Comments (22)

潦草背影 2024-08-10 02:29:15

You can use the order() function directly without resorting to add-on tools -- see this simpler answer which uses a trick right from the top of the example(order) code:

R> dd[with(dd, order(-z, b)), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function:

R> dd[order(-dd[,4], dd[,1]), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1
R> 

rather than using the name of the column (and with() for easier/more direct access).
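
When a sort key cannot be negated, or simply to avoid the unary minus, order() in base R also accepts one decreasing flag per key if method = "radix" is requested. A minimal sketch on the question's dd, assuming R 3.3.0 or later (the names idx and sorted are illustrative only):

```r
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))

# One TRUE/FALSE per sort key: z descending, b ascending
idx <- order(dd$z, dd$b, decreasing = c(TRUE, FALSE), method = "radix")
sorted <- dd[idx, ]
rownames(sorted)  # "4" "2" "1" "3", the same row order as the output above
```

The per-key decreasing vector is only honoured by the radix method, so the method argument should not be dropped.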

小忆控 2024-08-10 02:29:15

Your choices

  • order from base
  • arrange from dplyr
  • setorder and setorderv from data.table
  • arrange from plyr
  • sort from taRifx
  • orderBy from doBy
  • sortData from Deducer

Most of the time you should use the dplyr or data.table solutions, unless having no-dependencies is important, in which case use base::order.


I recently added sort.data.frame to a CRAN package, making it class compatible as discussed here:
Best way to create generic/method consistency for sort.data.frame?

Therefore, given the data.frame dd, you can sort as follows:

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
library(taRifx)
sort(dd, f= ~ -z + b )

If you are one of the original authors of this function, please contact me. Discussion as to public domaininess is here: https://chat.stackoverflow.com/transcript/message/1094290#1094290


You can also use the arrange() function from plyr as Hadley pointed out in the above thread:

library(plyr)
arrange(dd,desc(z),b)

Benchmarks: Note that I loaded each package in a new R session since there were a lot of conflicts. In particular loading the doBy package causes sort to return "The following object(s) are masked from 'x (position 17)': b, x, y, z", and loading the Deducer package overwrites sort.data.frame from Kevin Wright or the taRifx package.

#Load each time
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
library(microbenchmark)

# Reload R between benchmarks
microbenchmark(dd[with(dd, order(-z, b)), ] ,
    dd[order(-dd$z, dd$b),],
    times=1000
)

Median times:

dd[with(dd, order(-z, b)), ] 778

dd[order(-dd$z, dd$b),] 788

library(taRifx)
microbenchmark(sort(dd, f= ~-z+b ),times=1000)

Median time: 1,567

library(plyr)
microbenchmark(arrange(dd,desc(z),b),times=1000)

Median time: 862

library(doBy)
microbenchmark(orderBy(~-z+b, data=dd),times=1000)

Median time: 1,694

Note that doBy takes a good bit of time to load the package.

library(Deducer)
microbenchmark(sortData(dd,c("z","b"),increasing= c(FALSE,TRUE)),times=1000)

Couldn't make Deducer load. Needs JGR console.

esort <- function(x, sortvar, ...) {
attach(x)
x <- x[with(x,order(sortvar,...)),]
return(x)
detach(x)
}

microbenchmark(esort(dd, -z, b),times=1000)

Doesn't appear to be compatible with microbenchmark due to the attach/detach.


m <- microbenchmark(
  arrange(dd,desc(z),b),
  sort(dd, f= ~-z+b ),
  dd[with(dd, order(-z, b)), ] ,
  dd[order(-dd$z, dd$b),],
  times=1000
  )

uq <- function(x) { fivenum(x)[4]}  
lq <- function(x) { fivenum(x)[2]}

y_min <- 0 # min(by(m$time,m$expr,lq))
y_max <- max(by(m$time,m$expr,uq)) * 1.05
  
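library(ggplot2)  # ggplot() and stat_summary() come from ggplot2
p <- ggplot(m,aes(x=expr,y=time)) + coord_cartesian(ylim = c( y_min , y_max )) 
# ggplot2 >= 3.3.0 names these arguments fun, fun.min and fun.max (formerly fun.y, fun.ymin, fun.ymax)
p + stat_summary(fun=median,fun.min = lq, fun.max = uq, aes(fill=expr))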

microbenchmark plot

(lines extend from lower quartile to upper quartile, dot is the median)


Given these results and weighing simplicity vs. speed, I'd have to give the nod to arrange in the plyr package. It has a simple syntax and yet is almost as speedy as the base R commands with their convoluted machinations. Typically brilliant Hadley Wickham work. My only gripe with it is that it breaks the standard R nomenclature where sorting objects get called by sort(object), but I understand why Hadley did it that way due to issues discussed in the question linked above.

め七分饶幸 2024-08-10 02:29:15

Dirk's answer is great. It also highlights a key difference in the syntax used for indexing data.frames and data.tables:

## The data.frame way
dd[with(dd, order(-z, b)), ]

## The data.table way: (7 fewer characters, but that's not the important bit)
dd[order(-z, b)]

The difference between the two calls is small, but it can have important consequences. Especially if you write production code and/or are concerned with correctness in your research, it's best to avoid unnecessary repetition of variable names. data.table helps you do this.

Here's an example of how repetition of variable names might get you into trouble:

Let's change the context from Dirk's answer, and say this is part of a bigger project where there are a lot of object names and they are long and meaningful; instead of dd it's called quarterlyreport. It becomes :

quarterlyreport[with(quarterlyreport,order(-z,b)),]

Ok, fine. Nothing wrong with that. Next your boss asks you to include last quarter's report in the report. You go through your code, adding an object lastquarterlyreport in various places and somehow (how on earth?) you end up with this :

quarterlyreport[with(lastquarterlyreport,order(-z,b)),]

That isn't what you meant, but you didn't spot it because you did it fast and it's nestled on a page of similar code. The code doesn't fall over (no warning and no error) because R thinks it is what you meant. You'd hope whoever reads your report spots it, but maybe they don't. If you work with programming languages a lot then this situation may be all too familiar. It was a "typo", you'll say. I'll fix the "typo", you'll say to your boss.

In data.table we're concerned about tiny details like this. So we've done something simple to avoid typing variable names twice. Something very simple. i is evaluated within the frame of dd already, automatically. You don't need with() at all.

Instead of

dd[with(dd, order(-z, b)), ]

it's just

dd[order(-z, b)]

And instead of

quarterlyreport[with(lastquarterlyreport,order(-z,b)),]

it's just

quarterlyreport[order(-z,b)]

It's a very small difference, but it might just save your neck one day. When weighing up the different answers to this question, consider counting the repetitions of variable names as one of your criteria in deciding. Some answers have quite a few repeats, others have none.
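
The silent failure described above is easy to reproduce in base R. A small sketch with made-up data (the two data frames and their values are invented for illustration; only the object names come from the answer):

```r
# Two hypothetical reports with the same shape but different values
quarterlyreport     <- data.frame(z = c(1, 3, 2), b = c("a", "b", "c"))
lastquarterlyreport <- data.frame(z = c(3, 1, 2), b = c("a", "b", "c"))

# Intended: the ordering is computed from the data frame being subset
intended <- quarterlyreport[with(quarterlyreport, order(-z, b)), ]

# Mistaken: the ordering is computed from the other data frame.
# R raises no warning and no error because both have the same number of rows.
mistaken <- quarterlyreport[with(lastquarterlyreport, order(-z, b)), ]

intended$z  # 3 2 1, sorted by z descending
mistaken$z  # 1 2 3, not what was asked for
```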

旧情勿念 2024-08-10 02:29:15

There are a lot of excellent answers here, but dplyr gives the only syntax that I can quickly and easily remember (and so now use very often):

library(dplyr)
# sort mtcars by mpg, ascending... use desc(mpg) for descending
arrange(mtcars, mpg)
# sort mtcars first by mpg, then by cyl, then by wt
arrange(mtcars , mpg, cyl, wt)

For the OP's problem:

arrange(dd, desc(z),  b)

    b x y z
1 Low C 9 2
2 Med D 3 1
3  Hi A 8 1
4  Hi A 9 1
骄傲 2024-08-10 02:29:15

The R package data.table provides both fast and memory efficient ordering of data.tables with a straightforward syntax (a part of which Matt has highlighted quite nicely in his answer). There have been quite a lot of improvements since then, along with a new function, setorder(). From v1.9.5+, setorder() also works with data.frames.

First, we'll create a dataset big enough and benchmark the different methods mentioned from other answers and then list the features of data.table.

Data:

require(plyr)
require(doBy)
require(data.table)
require(dplyr)
require(taRifx)

set.seed(45L)
dat = data.frame(b = as.factor(sample(c("Hi", "Med", "Low"), 1e8, TRUE)),
                 x = sample(c("A", "D", "C"), 1e8, TRUE),
                 y = sample(100, 1e8, TRUE),
                 z = sample(5, 1e8, TRUE), 
                 stringsAsFactors = FALSE)

Benchmarks:

The timings reported are from running system.time(...) on these functions shown below. The timings are tabulated below (in the order of slowest to fastest).

orderBy( ~ -z + b, data = dat)     ## doBy
plyr::arrange(dat, desc(z), b)     ## plyr
arrange(dat, desc(z), b)           ## dplyr
sort(dat, f = ~ -z + b)            ## taRifx
dat[with(dat, order(-z, b)), ]     ## base R

# convert to data.table, by reference
setDT(dat)

dat[order(-z, b)]                  ## data.table, base R like syntax
setorder(dat, -z, b)               ## data.table, using setorder()
                                   ## setorder() now also works with data.frames 

# R-session memory usage (BEFORE) = ~2GB (size of 'dat')
# ------------------------------------------------------------
# Package      function    Time (s)  Peak memory   Memory used
# ------------------------------------------------------------
# doBy          orderBy      409.7        6.7 GB        4.7 GB
# taRifx           sort      400.8        6.7 GB        4.7 GB
# plyr          arrange      318.8        5.6 GB        3.6 GB 
# base R          order      299.0        5.6 GB        3.6 GB
# dplyr         arrange       62.7        4.2 GB        2.2 GB
# ------------------------------------------------------------
# data.table      order        6.2        4.2 GB        2.2 GB
# data.table   setorder        4.5        2.4 GB        0.4 GB
# ------------------------------------------------------------
  • data.table's DT[order(...)] syntax was ~10x faster than the fastest of other methods (dplyr), while consuming the same amount of memory as dplyr.

  • data.table's setorder() was ~14x faster than the fastest of other methods (dplyr), while taking just 0.4GB extra memory. dat is now in the order we require (as it is updated by reference).

data.table features:

Speed:

  • data.table's ordering is extremely fast because it implements radix ordering.

  • The syntax DT[order(...)] is optimised internally to use data.table's fast ordering as well. You can keep using the familiar base R syntax but speed up the process (and use less memory).

Memory:

  • Most of the time, we don't require the original data.frame or data.table after reordering. That is, we usually assign the result back to the same object, for example:

    DF <- DF[order(...)]
    

    The issue is that this requires at least twice (2x) the memory of the original object. To be memory efficient, data.table therefore also provides a function setorder().

    setorder() reorders data.tables by reference (in-place), without making any additional copies. It only uses extra memory equal to the size of one column.

Other features:

  1. It supports integer, logical, numeric, character and even bit64::integer64 types.

    Note that classes such as factor, Date and POSIXct are all integer/numeric types underneath with additional attributes, and are therefore supported as well.

  2. In base R, we can not use - on a character vector to sort by that column in decreasing order. Instead we have to use -xtfrm(.).

    However, in data.table, we can just do, for example, dat[order(-x)] or setorder(dat, -x).
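
For the base R side of point 2, a minimal sketch of the -xtfrm() idiom on the question's dd (stringsAsFactors = FALSE is set explicitly here so that x is a character column on any R version; the name sorted is illustrative):

```r
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2), stringsAsFactors = FALSE)

# -dd$x would fail for a character vector; xtfrm() maps it to sortable numbers first
sorted <- dd[order(-xtfrm(dd$x), dd$y), ]
sorted$x  # "D" "C" "A" "A": x descending, ties broken by y ascending
```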

[浮城] 2024-08-10 02:29:15

With this (very helpful) function by Kevin Wright, posted in the tips section of the R wiki, this is easily achieved.

sort(dd,by = ~ -z + b)
#     b x y z
# 4 Low C 9 2
# 2 Med D 3 1
# 1  Hi A 8 1
# 3  Hi A 9 1
念﹏祤嫣 2024-08-10 02:29:15

Suppose you have a data.frame A and you want to sort it by a column called x in descending order. Call the sorted data.frame newdata:

newdata <- A[order(-A$x),]

If you want ascending order then replace "-" with nothing. You can have something like

newdata <- A[order(-A$x, A$y, -A$z),]

where x, y and z are columns in data.frame A. This sorts data.frame A by x descending, then y ascending, then z descending.
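
A small worked example of the three-key call, using an invented data.frame A (the values are made up so that each key is needed to break a tie):

```r
A <- data.frame(x = c(1, 2, 2, 1),
                y = c(5, 6, 6, 5),
                z = c(1, 1, 2, 2))

# x descending, then y ascending, then z descending
newdata <- A[order(-A$x, A$y, -A$z), ]
rownames(newdata)  # "3" "2" "4" "1"
```

Rows 2 and 3 tie on x and y, so z decides between them; the same holds for rows 1 and 4.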

勿挽旧人 2024-08-10 02:29:15

Or you can use the doBy package:

library(doBy)
dd <- orderBy(~-z+b, data=dd)
执着的年纪 2024-08-10 02:29:15

If SQL comes naturally to you, the sqldf package handles ORDER BY as Codd intended.

我不在是我 2024-08-10 02:29:15

Alternatively, using the package Deducer

library(Deducer)
dd<- sortData(dd,c("z","b"),increasing= c(FALSE,TRUE))
〆凄凉。 2024-08-10 02:29:15

In response to a comment added in the OP for how to sort programmatically:

Using dplyr and data.table

library(dplyr)
library(data.table)

dplyr

Just use arrange_, which is the Standard Evaluation version for arrange.

df1 <- tbl_df(iris)
#using strings or formula
arrange_(df1, c('Petal.Length', 'Petal.Width'))
arrange_(df1, ~Petal.Length, ~Petal.Width)
    Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           4.6         3.6          1.0         0.2  setosa
2           4.3         3.0          1.1         0.1  setosa
3           5.8         4.0          1.2         0.2  setosa
4           5.0         3.2          1.2         0.2  setosa
5           4.7         3.2          1.3         0.2  setosa
6           5.4         3.9          1.3         0.4  setosa
7           5.5         3.5          1.3         0.2  setosa
8           4.4         3.0          1.3         0.2  setosa
9           5.0         3.5          1.3         0.3  setosa
10          4.5         2.3          1.3         0.3  setosa
..          ...         ...          ...         ...     ...


#Or using a variable
sortBy <- c('Petal.Length', 'Petal.Width')
arrange_(df1, .dots = sortBy)
    Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           4.6         3.6          1.0         0.2  setosa
2           4.3         3.0          1.1         0.1  setosa
3           5.8         4.0          1.2         0.2  setosa
4           5.0         3.2          1.2         0.2  setosa
5           4.7         3.2          1.3         0.2  setosa
6           5.5         3.5          1.3         0.2  setosa
7           4.4         3.0          1.3         0.2  setosa
8           4.4         3.2          1.3         0.2  setosa
9           5.0         3.5          1.3         0.3  setosa
10          4.5         2.3          1.3         0.3  setosa
..          ...         ...          ...         ...     ...

#Doing the same operation except sorting Petal.Length in descending order
sortByDesc <- c('desc(Petal.Length)', 'Petal.Width')
arrange_(df1, .dots = sortByDesc)

more info here: https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html

It is better to use a formula, as it also captures the environment in which to evaluate the expression.
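
Note that arrange_() and tbl_df() have since been deprecated in dplyr. A hedged sketch of the same programmatic sort with current tools, assuming dplyr 1.0 or later (sort_by, asc and mixed are illustrative names, not part of the original answer):

```r
library(dplyr)

sort_by <- c("Petal.Length", "Petal.Width")  # column names held in a variable

# All keys ascending: across(all_of()) selects the columns named in a character vector
asc <- arrange(iris, across(all_of(sort_by)))

# Mixed directions: .data[[name]] looks a column up by its string name
mixed <- arrange(iris, desc(.data[[sort_by[1]]]), .data[[sort_by[2]]])

head(asc)
head(mixed)
```

The .data pronoun keeps the column lookup explicit, which avoids accidentally picking up a variable of the same name from the calling environment.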

data.table

dt1 <- data.table(iris) #not really required, as you can work directly on your data.frame
sortBy <- c('Petal.Length', 'Petal.Width')
sortType <- c(-1, 1)
setorderv(dt1, sortBy, sortType)
dt1
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          7.7         2.6          6.9         2.3 virginica
  2:          7.7         2.8          6.7         2.0 virginica
  3:          7.7         3.8          6.7         2.2 virginica
  4:          7.6         3.0          6.6         2.1 virginica
  5:          7.9         3.8          6.4         2.0 virginica
 ---                                                            
146:          5.4         3.9          1.3         0.4    setosa
147:          5.8         4.0          1.2         0.2    setosa
148:          5.0         3.2          1.2         0.2    setosa
149:          4.3         3.0          1.1         0.1    setosa
150:          4.6         3.6          1.0         0.2    setosa
々眼睛长脚气 2024-08-10 02:29:15

The arrange() in dplyr is my favorite option. Use the pipe operator and go from the least important to the most important sort key:

dd1 <- dd %>%
    arrange(z) %>%
    arrange(desc(x))
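
Sorting in successive passes like this only gives the intended result if each later pass leaves tied rows in the order the earlier pass produced. The idea can be checked in base R, where order() leaves unresolved ties in their original order. A small sketch on the question's dd (step1, step2 and direct are illustrative names):

```r
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2), stringsAsFactors = FALSE)

# Two passes: least important key (z) first, most important key (x, descending) last
step1 <- dd[order(dd$z), ]
step2 <- step1[order(-xtfrm(step1$x)), ]

# One pass with both keys, most important first
direct <- dd[order(-xtfrm(dd$x), dd$z), ]

identical(rownames(step2), rownames(direct))  # TRUE
```

Passing all keys in a single call states the priority directly and does not depend on the stability of the sort.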
烟花肆意 2024-08-10 02:29:15

I learned about order with the following example which then confused me for a long time:

set.seed(1234)

ID        = 1:10
Age       = round(rnorm(10, 50, 1))
diag      = c("Depression", "Bipolar")
Diagnosis = sample(diag, 10, replace=TRUE)

data = data.frame(ID, Age, Diagnosis)

databyAge = data[order(Age),]
databyAge

The only reason this example works is because order is sorting by the vector Age, not by the column named Age in the data frame data.

To see this, create an identical data frame using read.table with slightly different column names and without making use of any of the above vectors:

my.data <- read.table(text = '

  id age  diagnosis
   1  49 Depression
   2  50 Depression
   3  51 Depression
   4  48 Depression
   5  50 Depression
   6  51    Bipolar
   7  49    Bipolar
   8  49    Bipolar
   9  49    Bipolar
  10  49 Depression

', header = TRUE)

The above line structure for order no longer works because there is no vector named age:

databyage = my.data[order(age),]

The following line works because order sorts on the column age in my.data.

databyage = my.data[order(my.data$age),]

I thought this was worth posting given how confused I was by this example for so long. If this post is not deemed appropriate for the thread I can remove it.
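
The lookup behaviour can be made visible with a short sketch: a bare column name inside order() is searched for in the calling environment, while with() evaluates it inside the data frame. The data below is a cut-down, invented version of my.data, and the sketch assumes no object called age exists in the session:

```r
my.data <- data.frame(id = 1:4, age = c(49, 50, 51, 48))

# A bare `age` is not a column lookup; with no such vector around, this is an error
res <- tryCatch(my.data[order(age), ],
                error = function(e) conditionMessage(e))
res  # an error message saying the object was not found

# with() evaluates `age` inside my.data, so the column is found
by.age <- my.data[with(my.data, order(age)), ]
by.age$age  # 48 49 50 51
```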

EDIT: May 13, 2014

Below is a generalized way of sorting a data frame by every column without specifying column names. The code below shows how to sort from left to right or from right to left. This works if every column is numeric. I have not tried with a character column added.

I found the do.call code a month or two ago in an old post on a different site, but only after extensive and difficult searching. I am not sure I could relocate that post now. The present thread is the first hit for ordering a data.frame in R. So, I thought my expanded version of that original do.call code might be useful.

set.seed(1234)

v1  <- c(0,0,0,0, 0,0,0,0, 1,1,1,1, 1,1,1,1)
v2  <- c(0,0,0,0, 1,1,1,1, 0,0,0,0, 1,1,1,1)
v3  <- c(0,0,1,1, 0,0,1,1, 0,0,1,1, 0,0,1,1)
v4  <- c(0,1,0,1, 0,1,0,1, 0,1,0,1, 0,1,0,1)

df.1 <- data.frame(v1, v2, v3, v4) 
df.1

rdf.1 <- df.1[sample(nrow(df.1), nrow(df.1), replace = FALSE),]
rdf.1

order.rdf.1 <- rdf.1[do.call(order, as.list(rdf.1)),]
order.rdf.1

order.rdf.2 <- rdf.1[do.call(order, rev(as.list(rdf.1))),]
order.rdf.2

rdf.3 <- data.frame(rdf.1$v2, rdf.1$v4, rdf.1$v3, rdf.1$v1) 
rdf.3

order.rdf.3 <- rdf.1[do.call(order, as.list(rdf.3)),]
order.rdf.3
小巷里的女流氓 2024-08-10 02:29:15

Dirk's answer is good, but if you need the sort to persist you'll want to assign the sorted result back to the name of that data frame. Using the example code:

dd <- dd[with(dd, order(-z, b)), ] 
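
To make the difference concrete, here is a small sketch using the question's dd. The final row-name reset is an optional extra that the answer above does not mention.

```r
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# Without assignment the sorted copy is only printed; dd itself is unchanged.
dd[with(dd, order(-z, b)), ]

# Assign the result back to make the new order stick.
dd <- dd[with(dd, order(-z, b)), ]

# Optional: renumber the row names 1..n after sorting.
rownames(dd) <- NULL
```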
骷髅 2024-08-10 02:29:15

Just for the sake of completeness, since not much has been said about sorting by column numbers: it can surely be argued that this is often not desirable (because the order of the columns could change, paving the way to errors), but in some specific situations (for instance, when you need a quick job done and there is no risk of the columns changing order) it might be the most sensible thing to do, especially when dealing with large numbers of columns.

In that case, do.call() comes to the rescue:

ind <- do.call(what = "order", args = iris[,c(5,1,2,3)])
iris[ind, ]

##        Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##    14           4.3         3.0          1.1         0.1     setosa
##    9            4.4         2.9          1.4         0.2     setosa
##    39           4.4         3.0          1.3         0.2     setosa
##    43           4.4         3.2          1.3         0.2     setosa
##    42           4.5         2.3          1.3         0.3     setosa
##    4            4.6         3.1          1.5         0.2     setosa
##    48           4.6         3.2          1.4         0.2     setosa
##    7            4.6         3.4          1.4         0.3     setosa
##    (...)
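
The same idiom can be sketched with column names held in a character vector, which sidesteps the column-position risk mentioned above. The variable names sort.cols and sorted.iris are illustrative only, and unname() is an added precaution so that no column name is mistaken for an argument of order().

```r
# Columns to sort by, given by name (iris ships with base R).
sort.cols <- c("Species", "Sepal.Length", "Sepal.Width")

# unname() drops the column names so that none of them can be mistaken
# for an argument of order() such as `decreasing` or `na.last`.
ind <- do.call(order, unname(as.list(iris[sort.cols])))
sorted.iris <- iris[ind, ]
head(sorted.iris)
```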
你是暖光i 2024-08-10 02:29:15

For the sake of completeness: you can also use the sortByCol() function from the BBmisc package:

library(BBmisc)
sortByCol(dd, c("z", "b"), asc = c(FALSE, TRUE))
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

Performance comparison:

library(microbenchmark)
microbenchmark(sortByCol(dd, c("z", "b"), asc = c(FALSE, TRUE)), times = 100000)
median 202.878

library(plyr)
microbenchmark(arrange(dd,desc(z),b),times=100000)
median 148.758

microbenchmark(dd[with(dd, order(-z, b)), ], times = 100000)
median 115.872
謌踐踏愛綪 2024-08-10 02:29:15

Just like the mechanical card sorters of long ago, first sort by the least significant key, then the next most significant, etc. No library required, works with any number of keys and any combination of ascending and descending keys.

dd <- dd[order(dd$b, decreasing = FALSE),]

Now we're ready to do the most significant key. The sort is stable, and any ties in the most significant key have already been resolved.

dd <- dd[order(dd$z, decreasing = TRUE),]

This may not be the fastest, but it is certainly simple and reliable.
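
The two passes above can be wrapped in a loop. The following is only a sketch: multi_pass_sort is a hypothetical helper name, and it relies on order() leaving ties in their original positions, as the answer describes.

```r
# Hypothetical helper: sort a data frame by several keys using repeated
# stable passes, starting with the least significant key.
multi_pass_sort <- function(df, keys, decreasing) {
  for (i in rev(seq_along(keys))) {
    df <- df[order(df[[keys[i]]], decreasing = decreasing[i]), , drop = FALSE]
  }
  df
}

dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# Most significant key first: z descending, then b ascending.
res <- multi_pass_sort(dd, keys = c("z", "b"), decreasing = c(TRUE, FALSE))
res
```

Keys are listed most significant first and processed in reverse, so the last pass is the one on the primary key.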

调妓 2024-08-10 02:29:15

Another alternative, using the rgr package:

> library(rgr)
> gx.sort.df(dd, ~ -z+b)
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1
柳絮泡泡 2024-08-10 02:29:15

I was struggling with the above solutions when I wanted to automate my ordering process for n columns, whose column names could be different each time. I found a super helpful function from the psych package to do this in a straightforward manner:

dfOrder(myDf, columnIndices)

where columnIndices are indices of one or more columns, in the order in which you want to sort them. More information here:

dfOrder function from 'psych' package
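
For comparison, a base-R sketch of the same idea (sorting by column indices, with a direction per column) might look like the following. order_by_index is a hypothetical helper, not the psych implementation; it assumes method = "radix", which is what allows decreasing to be a vector with one entry per key.

```r
# Hypothetical base-R helper: sort by column indices, one direction per key.
order_by_index <- function(df, idx, decreasing = FALSE) {
  # unname() keeps column names from being read as arguments of order().
  keys <- unname(as.list(df[idx]))
  # A vector-valued `decreasing` is only supported by the radix method.
  ord <- do.call(order, c(keys, list(decreasing = decreasing, method = "radix")))
  df[ord, , drop = FALSE]
}

dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# Column 4 (z) descending, then column 1 (b) ascending.
by.index <- order_by_index(dd, c(4, 1), decreasing = c(TRUE, FALSE))
by.index
```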

°如果伤别离去 2024-08-10 02:29:15

For even more completeness, R 4.4.0 (see here) now includes the function sort_by() (so has the advantage of not needing an external package):

New generic function sort_by(), primarily useful for the data.frame
method which can be used to sort rows of a data frame by one or more
columns.

dd |>
  sort_by(~ list(-z, b)) 
#     b x y z
# 4 Low C 9 2
# 2 Med D 3 1
# 1  Hi A 8 1
# 3  Hi A 9 1

Or:

sort_by(dd, list(-dd$z, dd$b))
终止放荡 2024-08-10 02:29:15

For the sake of completeness, {collapse} offers a very fast function named roworder, which also handles quite a few optional arguments.

> collapse::roworder(dd, -z, b)
    b x y z
1 Low C 9 2
2 Med D 3 1
3  Hi A 8 1
4  Hi A 9 1

Be aware of the following description from the help file (?roworder):

A fast substitute for dplyr::arrange. It returns a sorted copy of the data frame, unless the data is already sorted in which case no copy is made. In addition, rows can be manually re-ordered. Use data.table::setorder to sort a data frame without creating a copy.

帅的被狗咬 2024-08-10 02:29:15

I would recommend using arrange from dplyr:

install.packages("dplyr")
library(dplyr)

You would want to sort by 'z' (descending) and then by 'b' (ascending):

# dd is the data frame from the question
dd <- dd %>%
  arrange(desc(z), b)

To summarize: the rows are sorted by the z column in descending order, and rows that share the same value of z are then sorted by the b column in ascending order.
