Sort (order) data frame rows by multiple columns
I want to sort a data frame by multiple columns. For example, with the data frame below I would like to sort by column 'z' (descending) then by column 'b' (ascending):
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))
dd
    b x y z
1  Hi A 8 1
2 Med D 3 1
3  Hi A 9 1
4 Low C 9 2
You can use the order() function directly without resorting to add-on tools; see this simpler answer, which uses a trick right from the top of the example(order) code.
Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function, rather than using the name of the column (and with() for easier/more direct access).
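A minimal sketch of both forms described above, using the dd data frame from the question. The original code block did not survive on this page, so the exact calls here are a reconstruction from the description and should be read as illustrative:

```r
# Example data from the question
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))

# By column name: z descending, then b ascending.
# with() lets order() see the columns without repeating dd$.
by_name <- dd[with(dd, order(-z, b)), ]

# By column index: z is column 4 and b is column 1 in this data frame.
by_index <- dd[order(-dd[, 4], dd[, 1]), ]

print(by_name)
```

Both calls return the same rows; the index form simply trades readability for not needing the column names.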
Your choices
order from base
arrange from dplyr
setorder and setorderv from data.table
arrange from plyr
sort from taRifx
orderBy from doBy
sortData from Deducer
Most of the time you should use the dplyr or data.table solutions, unless having no dependencies is important, in which case use base::order.
I recently added sort.data.frame to a CRAN package, making it class compatible as discussed here:
Best way to create generic/method consistency for sort.data.frame?
Therefore, given the data.frame dd, you can sort as follows:
If you are one of the original authors of this function, please contact me. Discussion as to public domaininess is here: https://chat.stackoverflow.com/transcript/message/1094290#1094290
You can also use the arrange() function from plyr, as Hadley pointed out in the above thread.
Benchmarks: Note that I loaded each package in a new R session since there were a lot of conflicts. In particular, loading the doBy package causes sort to return "The following object(s) are masked from 'x (position 17)': b, x, y, z", and loading the Deducer package overwrites sort.data.frame from Kevin Wright or the taRifx package.
Median times:
dd[with(dd, order(-z, b)), ]   778
dd[order(-dd$z, dd$b), ]   788
Median time: 1,567
Median time: 862
Median time: 1,694
Note that doBy takes a good bit of time to load the package.
Couldn't make Deducer load. Needs JGR console.
Doesn't appear to be compatible with microbenchmark due to the attach/detach.
(lines extend from lower quartile to upper quartile, dot is the median)
Given these results and weighing simplicity vs. speed, I'd have to give the nod to arrange in the plyr package. It has a simple syntax and yet is almost as speedy as the base R commands with their convoluted machinations. Typically brilliant Hadley Wickham work. My only gripe with it is that it breaks the standard R nomenclature where sorting objects get called by sort(object), but I understand why Hadley did it that way due to issues discussed in the question linked above.
Dirk's answer is great. It also highlights a key difference in the syntax used for indexing data.frames and data.tables. The difference between the two calls is small, but it can have important consequences. Especially if you write production code and/or are concerned with correctness in your research, it's best to avoid unnecessary repetition of variable names. data.table helps you do this.
Here's an example of how repetition of variable names might get you into trouble:
Let's change the context from Dirk's answer, and say this is part of a bigger project where there are a lot of object names and they are long and meaningful; instead of dd it's called quarterlyreport, so that name now appears twice in the call. Ok, fine. Nothing wrong with that. Next your boss asks you to include last quarter's report in the report. You go through your code, adding an object lastquarterlyreport in various places and somehow (how on earth?) you end up with a call in which the two names no longer match. That isn't what you meant but you didn't spot it because you did it fast and it's nestled on a page of similar code. The code doesn't fall over (no warning and no error) because R thinks it is what you meant. You'd hope whoever reads your report spots it, but maybe they don't. If you work with programming languages a lot then this situation may be all too familiar. It was a "typo", you'll say. I'll fix the "typo", you'll say to your boss.
In data.table we're concerned about tiny details like this. So we've done something simple to avoid typing variable names twice. Something very simple. i is evaluated within the frame of dd already, automatically. You don't need with() at all. Instead of wrapping the order() call in with(dd, ...), it's just the order() call inside the brackets, and the same goes for quarterlyreport.
It's a very small difference, but it might just save your neck one day. When weighing up the different answers to this question, consider counting the repetitions of variable names as one of your criteria in deciding. Some answers have quite a few repeats, others have none.
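To make the failure mode concrete, here is a small base R sketch. The two data frames and their values are hypothetical, invented only to show that a mismatched name inside with() runs silently and returns a different row order:

```r
# Hypothetical reports with the same columns but different values
quarterlyreport     <- data.frame(b = c("a", "b", "c"), z = c(1, 2, 3))
lastquarterlyreport <- data.frame(b = c("a", "b", "c"), z = c(3, 2, 1))

# Intended: order quarterlyreport by its own columns
intended <- quarterlyreport[with(quarterlyreport, order(-z, b)), ]

# The "typo": rows of quarterlyreport, ordered by lastquarterlyreport's columns.
# No warning and no error, just a different (wrong) row order.
typo <- quarterlyreport[with(lastquarterlyreport, order(-z, b)), ]

print(intended)
print(typo)
```

With a data.table the sort is written with the object name once, in the DT[order(-z, b)] form, so there is no second name to get wrong.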
There are a lot of excellent answers here, but dplyr gives the only syntax that I can quickly and easily remember (and so now use very often):
For the OP's problem:
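A sketch of what the dplyr call for the OP's problem looks like, assuming dplyr is installed (the guard below just skips the example if it is not). arrange() and desc() are dplyr functions; the object name sorted is only for illustration:

```r
# Example data from the question
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))

if (requireNamespace("dplyr", quietly = TRUE)) {
  # z descending, then b ascending (factor levels: Low < Med < Hi)
  sorted <- dplyr::arrange(dd, dplyr::desc(z), b)
  print(sorted)
}
```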
The R package data.table provides both fast and memory efficient ordering of data.tables with a straightforward syntax (a part of which Matt has highlighted quite nicely in his answer). There have been quite a lot of improvements, and also a new function setorder(), since then. From v1.9.5+, setorder() also works with data.frames.
First, we'll create a dataset big enough and benchmark the different methods mentioned from other answers and then list the features of data.table.
Data:
Benchmarks:
The timings reported are from running system.time(...) on these functions shown below. The timings are tabulated below (in the order of slowest to fastest).
data.table's DT[order(...)] syntax was ~10x faster than the fastest of other methods (dplyr), while consuming the same amount of memory as dplyr.
data.table's setorder() was ~14x faster than the fastest of other methods (dplyr), while taking just 0.4GB extra memory. dat is now in the order we require (as it is updated by reference).
data.table features:
Speed:
data.table's ordering is extremely fast because it implements radix ordering.
The syntax DT[order(...)] is optimised internally to use data.table's fast ordering as well. You can keep using the familiar base R syntax but speed up the process (and use less memory).
Memory:
Most of the time, we don't require the original data.frame or data.table after reordering. That is, we usually assign the result back to the same object.
The issue is that this requires at least twice (2x) the memory of the original object. To be memory efficient, data.table therefore also provides a function setorder().
setorder() reorders data.tables by reference (in-place), without making any additional copies. It only uses extra memory equal to the size of one column.
Other features:
It supports integer, logical, numeric, character and even bit64::integer64 types.
Note that factor, Date, POSIXct etc. classes are all integer/numeric types underneath with additional attributes, and are therefore supported as well.
In base R, we cannot use - on a character vector to sort by that column in decreasing order. Instead we have to use -xtfrm(.).
However, in data.table, we can just do, for example, dat[order(-x)] or setorder(dat, -x).
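The last point can be sketched as follows. The three-row dat is a hypothetical example, not the benchmark dataset, and the data.table part only runs if the package is installed:

```r
# Hypothetical data with a character column
dat <- data.frame(x = c("b", "a", "c"), y = c(2, 1, 3),
                  stringsAsFactors = FALSE)

# Base R: unary minus is not defined for character vectors,
# so convert to sortable numbers with xtfrm() first.
base_desc <- dat[order(-xtfrm(dat$x)), ]
print(base_desc)

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(dat)
  # setorder() reorders dt in place (by reference); -x means descending
  data.table::setorder(dt, -x)
  print(dt)
}
```

Note that setorder() needs no assignment: the object itself is reordered, which is where the memory saving described above comes from.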
With this (very helpful) function by Kevin Wright, posted in the tips section of the R wiki, this is easily achieved.
Suppose you have a data.frame A and you want to sort it by a column called x in descending order. Call the sorted data.frame newdata.
If you want ascending order, replace "-" with nothing. You can also sort on several columns at once, where x, y and z are some columns in data.frame A: for example, sort data.frame A by x descending, y ascending and z descending.
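A minimal sketch of the two cases just described. The contents of A are made up for illustration; only the names A, x, y, z and newdata come from the answer (newdata2 is a hypothetical name for the multi-column result):

```r
# Hypothetical data frame with numeric columns x, y, z
A <- data.frame(x = c(1, 3, 2, 3), y = c(2, 1, 5, 0), z = c(7, 8, 9, 9))

# Sort by x in descending order; drop the "-" for ascending order
newdata <- A[order(-A$x), ]

# Sort by x descending, then y ascending, then z descending
newdata2 <- A[order(-A$x, A$y, -A$z), ]

print(newdata)
print(newdata2)
```

The unary minus only works on numeric columns; rows that tie on every key keep their original relative order.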
Or you can use the doBy package.
If SQL comes naturally to you, the sqldf package handles ORDER BY as Codd intended.
Alternatively, using the package Deducer
In response to a comment added in the OP for how to sort programmatically:
Using dplyr and data.table
dplyr
Just use arrange_, which is the Standard Evaluation version of arrange.
More info here: https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
It is better to use a formula, as it also captures the environment in which to evaluate an expression.
data.table
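For the data.table case, one possible sketch uses setorderv(), which takes the column names and sort directions as ordinary vectors. The variable names sort_cols and sort_dirs are hypothetical, the data.table part runs only if the package is installed, and a base R equivalent is included for comparison:

```r
# Example data from the question
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))

# Sort specification held in variables (hypothetical names)
sort_cols <- c("z", "b")
sort_dirs <- c(-1L, 1L)   # -1 = descending, 1 = ascending

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(dd)
  data.table::setorderv(dt, cols = sort_cols, order = sort_dirs)
  print(dt)
}

# Base R equivalent: turn each column into numeric sort keys with xtfrm(),
# flip the sign for descending columns, and hand the keys to order().
keys <- Map(function(col, dir) dir * xtfrm(dd[[col]]), sort_cols, sort_dirs)
base_sorted <- dd[do.call(order, unname(keys)), ]
print(base_sorted)
```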
The arrange() in dplyr is my favorite option. Use the pipe operator and go from least important to most important aspect
I learned about order with the following example which then confused me for a long time:
The only reason this example works is because order is sorting by the vector Age, not by the column named Age in the data frame data.
To see this create an identical data frame using read.table with slightly different column names and without making use of any of the above vectors:
The above line structure for order no longer works because there is no vector named age:
The following line works because order sorts on the column age in my.data.
I thought this was worth posting given how confused I was by this example for so long. If this post is not deemed appropriate for the thread I can remove it.
EDIT: May 13, 2014
Below is a generalized way of sorting a data frame by every column without specifying column names. The code below shows how to sort from left to right or from right to left. This works if every column is numeric. I have not tried with a character column added.
I found the do.call code a month or two ago in an old post on a different site, but only after extensive and difficult searching. I am not sure I could relocate that post now. The present thread is the first hit for ordering a data.frame in R. So, I thought my expanded version of that original do.call code might be useful.
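The original code for this answer is not preserved here, so the following is only a sketch of the two ideas under assumed data: the names Age, data, my.data and age follow the text, while the values and the data frame num are invented for illustration.

```r
# A free-standing vector and a data frame column that share the name Age
Age  <- c(30, 25, 35)
data <- data.frame(Name = c("a", "b", "c"), Age = Age)
data[order(Age), ]        # works only because the vector Age exists

# The same data built with read.table, lower-case names, no free-standing vectors
my.data <- read.table(text = "name age
a 30
b 25
c 35", header = TRUE)
# my.data[order(age), ]   # would fail: object 'age' not found
by_age <- my.data[order(my.data$age), ]

# Sorting by every column without naming any of them (all columns numeric)
num <- data.frame(p = c(2, 1, 2, 1), q = c(1, 2, 0, 1))
left_to_right <- num[do.call(order, unname(as.list(num))), ]
right_to_left <- num[do.call(order, unname(rev(as.list(num)))), ]

print(left_to_right)
print(right_to_left)
```

do.call() passes each column to order() as a separate argument, so the first column is the primary key, the second breaks ties, and so on; rev() simply reverses that priority.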
Dirk's answer is good but if you need the sort to persist you'll want to apply the sort back onto the name of that data frame. Using the example code:
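A minimal sketch of what that looks like with the question's dd, assuming the with()/order() form of the sort:

```r
# Example data from the question
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))

# Indexing alone only returns a sorted copy; assign it back to keep it
dd <- dd[with(dd, order(-z, b)), ]
print(dd)
```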
Just for the sake of completeness, since not much has been said about sorting by column numbers: it can surely be argued that it is often not desirable (because the order of the columns could change, paving the way to errors), but in some specific situations (when, for instance, you need a quick job done and there is no risk of the columns changing order), it might be the most sensible thing to do, especially when dealing with large numbers of columns.
In that case, do.call() comes to the rescue:
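A small sketch of the idea, using the question's dd rather than whatever dataset the original answer used (that code is not preserved here). The vector cols is a hypothetical name holding the column positions to sort by, all ascending:

```r
# Example data from the question
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))

# Column positions to sort by, most significant first: z (4), then b (1)
cols <- c(4, 1)

# do.call() passes each selected column to order() as a separate argument
ind <- do.call(order, unname(as.list(dd[, cols])))
by_position <- dd[ind, ]
print(by_position)
```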
For the sake of completeness: you can also use the sortByCol() function from the BBmisc package:
Performance comparison:
Just like the mechanical card sorters of long ago, first sort by the least significant key, then the next most significant, etc. No library required, works with any number of keys and any combination of ascending and descending keys.
Now we're ready to do the most significant key. The sort is stable, and any ties in the most significant key have already been resolved.
This may not be the fastest, but it is certainly simple and reliable
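The card-sorter approach can be sketched like this for the question's dd. The intermediate names step1 and step2 are hypothetical; the approach relies on order() leaving tied rows in their existing order:

```r
# Example data from the question
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))

# Pass 1: least significant key, b ascending
step1 <- dd[order(dd$b), ]

# Pass 2: most significant key, z descending.
# Rows that tie on z stay in the b order established by pass 1.
step2 <- step1[order(-step1$z), ]
print(step2)
```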
Another alternative, using the rgr package:
I was struggling with the above solutions when I wanted to automate my ordering process for n columns, whose column names could be different each time. I found a super helpful function from the psych package to do this in a straightforward manner:
where columnIndices are indices of one or more columns, in the order in which you want to sort them. More information here: dfOrder function from 'psych' package
For even more completeness, R 4.4.0 now includes the function sort_by() (so it has the advantage of not needing an external package):
Or:
For the sake of completeness, {collapse} offers a very fast function named roworder, which also handles quite a few optional arguments.
Be aware of
from the help file (?roworder).
I would recommend using arrange from dplyr.
You would want to sort by 'z' (descending) and then by 'b' (ascending).
To summarize: the rows are sorted by the z column in descending order, and rows that have the same value for z are then sorted by the b column in ascending order.