Drop data frame columns by name
I have a number of columns that I would like to remove from a data frame. I know that we can delete them individually using something like:
df$x <- NULL
But I was hoping to do this with fewer commands.
Also, I know that I could drop columns using integer indexing like this:
df <- df[ -c(1, 3:6, 12) ]
But I am concerned that the relative position of my variables may change.
Given how powerful R is, I figured there might be a better way than dropping each column one by one.
You can use a simple list of names and drop every column whose name is in it.
Or, alternatively, you can make a list of those to keep and refer to them by name.
EDIT:
For those still not acquainted with the drop argument of the indexing function: if you want to keep one column as a data frame, pass drop=FALSE. drop=TRUE (or not mentioning it) will drop unnecessary dimensions, and hence return a vector with the values of column y.
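The code for this answer did not survive on this page. A minimal sketch of the three cases it describes, assuming a small example data frame DF whose column names (x, y, z, a) are made up for illustration:

```r
# Example data (hypothetical columns, for illustration only)
DF <- data.frame(x = 1:10, y = 10:1, z = rep(5, 10), a = 11:20)

# Drop by name: keep every column whose name is NOT in `drops`
drops <- c("x", "z")
dropped <- DF[, !(names(DF) %in% drops)]

# Keep by name: list the columns you want and index with them
keeps <- c("y", "a")
kept <- DF[keeps]

# Single column: drop = FALSE keeps the data frame structure,
# the default (drop = TRUE) collapses the result to a vector
one_col_df  <- DF[, "y", drop = FALSE]
one_col_vec <- DF[, "y"]
```

Both `dropped` and `kept` end up with the columns y and a, in that order.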
There's also the subset command, useful if you know which columns you want.
UPDATED after comment by @hadley: To drop columns a and c, you can put a minus sign in front of them in the select argument.
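A short sketch of both uses of subset(), assuming a three-column data frame with columns a, b and c (the data itself is invented for the example):

```r
# Example data (assumed for illustration)
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)

# Keep only the columns you name
kept <- subset(df, select = c(a, b))

# Drop columns a and c by negating the selection
dropped <- subset(df, select = -c(a, c))
```

Note that subset() returns a data frame even when only one column is left.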
Using within() with rm() is probably easiest, for a single variable or for multiple variables.
Or if you're dealing with data.tables (per "How do you delete a column by name in data.table?"), you can assign NULL with :=, again for one variable or for multiple variables.
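The snippets for this answer are missing here. One base R form that fits the wording is within() combined with rm(); treat the exact call as an assumption about what the answer showed. The column names x, y and z are placeholders:

```r
# Example data (hypothetical columns)
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# Remove a single column by name
one_gone <- within(df, rm(x))

# Remove several columns in one call
two_gone <- within(df, rm(x, y))
```

The original df is left untouched; within() returns a modified copy.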
You could use %in% on the column names to keep only the columns that are not in your drop list.
list(NULL) also works:
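A sketch of the list(NULL) assignment, using the built-in mtcars data; the choice of columns to delete is arbitrary:

```r
# Work on a copy of a built-in data set
dat <- mtcars

# Assigning list(NULL) to a set of named columns deletes them
dat[, c("mpg", "cyl", "wt")] <- list(NULL)
```

mtcars has 11 columns, so dat is left with 8.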
If you want to remove the columns by reference and avoid the internal copying associated with data.frames, then you can use the data.table package and the function :=
You can pass a character vector of names to the left-hand side of the := operator, and NULL as the RHS.
If you want to predefine the names as a character vector outside the call to [, wrap the name of the object in () or {} to force the LHS to be evaluated in the calling scope, not as a name within the scope of DT.
You can also use set, which avoids the overhead of [.data.table, and also works for data.frames!
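A minimal sketch of the three variants described above, assuming the data.table package is installed. The table contents and the variable name drop_cols are made up for the example:

```r
library(data.table)

# Example table (hypothetical columns)
DT <- data.table(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Character vector on the LHS of :=, NULL on the RHS; modifies DT in place
DT[, c("a", "b") := NULL]

# Names predefined outside the call: wrap in () so the LHS is evaluated
# in the calling scope instead of being read as a column called drop_cols
drop_cols <- "c"
DT[, (drop_cols) := NULL]

# set() skips the [.data.table overhead and also accepts a plain data.frame
DF <- data.frame(x = 1:3, y = 4:6)
set(DF, j = "x", value = NULL)
```

None of these calls needs a reassignment such as DT <- ..., since the columns are removed by reference.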
There is a potentially more powerful strategy based on the fact that grep() returns a numeric vector. If you have a long list of variables, as I do in one of my datasets, with some variables that end in ".A" and others that end in ".B", and you only want the ones that end in ".A" (along with all the variables that don't match either pattern), negate the indices that grep() returns for the ".B" pattern.
For the case at hand, using Joris Meys' example, it might not be as compact, but the same approach works with a pattern built from the names to drop.
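A sketch of the grep() approach. The data frames and column names below are invented; only the technique (negative indices from grep()) comes from the answer:

```r
# Columns ending in .A or .B plus one that matches neither (assumed data)
dfrm <- data.frame(id = 1:3, score.A = 4:6, score.B = 7:9,
                   time.A = 1:3, time.B = 3:1)

# grep() returns the positions of the matching names; negate them to drop
b_cols <- grep("\\.B$", names(dfrm))
dfrm2 <- dfrm[, -b_cols]

# Same idea for an explicit list of names: build an anchored pattern
DF <- data.frame(x = 1:3, y = 4:6, z = 7:9, a = 10:12)
drops <- c("x", "z")
pattern <- paste("^", drops, "$", sep = "", collapse = "|")
DF <- DF[, -grep(pattern, names(DF))]

# Caveat: if grep() matches nothing it returns integer(0), and indexing
# with -integer(0) selects no columns, so check length() first in real code
```

Anchoring each name with ^ and $ keeps "x" from also matching a column such as "xy".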
Another dplyr answer.
Use select(-column).
If your variables have some common naming structure, you might try starts_with().
If you want to drop a sequence of variables in the data frame, you can use the : operator. For example, if you dropped var2, var3, and all variables in between, you'd be left with just var1.
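A sketch of these select() forms, assuming dplyr is installed. Both data frames are made up so that each helper has something to act on:

```r
library(dplyr)

# Three var* columns only (assumed data)
df <- data.frame(var1 = 1:5, var2 = 6:10, var3 = 11:15)

# Drop a single column by name
no_var2 <- df %>% select(-var2)

# Drop a contiguous range of columns with `:`
only_var1 <- df %>% select(-(var2:var3))

# A second example with a shared prefix and some other columns
df2 <- data.frame(var1 = 1:5, var2 = 6:10,
                  char1 = letters[1:5], char2 = LETTERS[1:5])

# Drop every column whose name starts with "var"
no_vars <- df2 %>% select(-starts_with("var"))
```

The range form depends on column order, so it is less robust to reordering than listing names.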
Dplyr Solution
I doubt this will get much attention down here, but if you have a list of columns that you want to remove, and you want to do it in a dplyr chain, I use one_of() in the select clause.
Documentation can be found by running ?one_of or here: http://genomicsclass.github.io/book/pages/dplyr_tutorial.html
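The reproducible example from this answer is missing on this page. A small stand-in, assuming dplyr is installed and using the built-in iris data; drop_cols is a name chosen for the example. In current versions of dplyr, one_of() is superseded by any_of() and all_of(), so the last line shows that spelling as well (it assumes dplyr 1.0.0 or later):

```r
library(dplyr)

# A few rows of a built-in data set
iris2 <- head(iris)

# Column names held in a character vector
drop_cols <- c("Sepal.Length", "Sepal.Width")

# Drop them inside a pipe with one_of()
out <- iris2 %>% select(-one_of(drop_cols))

# Newer equivalent; any_of() silently ignores names that are not present
out_new <- iris2 %>% select(-any_of(drop_cols))
```

Both calls leave Petal.Length, Petal.Width and Species.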
Another possibility:
or
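The two snippets of this answer were lost. As an assumption about the kind of thing it showed, here is one more base R possibility that keeps the columns left over after setdiff(), with an invented data frame:

```r
# Example data (hypothetical columns)
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# setdiff() returns the names that are in names(df) but not in the drop list
df <- df[, setdiff(names(df), c("a", "c"))]
```

Names in the drop list that do not exist in df are simply ignored.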
Out of interest, this flags up one of R's weird syntax inconsistencies. For example, given a two-column data frame, one way of dropping a column gives a data frame, but another gives a vector.
This is all explained in ?[ but it's not exactly expected behaviour. Well, at least not to me...
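A sketch of the inconsistency, assuming a two-column data frame with columns x and y (the values are arbitrary):

```r
# Two-column data frame (assumed for illustration)
df <- data.frame(x = 1, y = 2)

# subset() keeps the data frame structure
as_df <- subset(df, select = -y)

# Matrix-style indexing drops to a vector when one column remains
as_vec <- df[, -2]

# drop = FALSE restores the data frame result
still_df <- df[, -2, drop = FALSE]
```

List-style indexing, df[-2], also returns a data frame.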
Here is a dplyr way to go about it: select() with a minus sign in front of each column to drop. I like this because it's intuitive to read and understand without annotation, and robust to columns changing position within the data frame. It also follows the vectorized idiom of using - to remove elements.
I keep thinking there must be a better idiom, but for subtraction of columns by name, I tend to do the following:
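The snippet is missing here. One idiom that fits "subtraction of columns by name" is match() on the names with a minus sign; that it is the idiom the answer meant is an assumption. The data frame is invented:

```r
# Example data (hypothetical columns)
df <- data.frame(a = 1:10, b = 1:10, c = 1:10, d = 1:10)

# match() gives the positions of the named columns; negate them to drop
res <- df[, -match(c("a", "c"), names(df))]

# Caveat: a name that is not in names(df) gives NA from match(),
# so check the names first if they might be absent
```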
There's a function called dropNamed() in Bernd Bischl's BBmisc package that does exactly this. The advantage is that it avoids repeating the data frame argument and is thus suitable for piping with magrittr (just like the dplyr approaches).
Beyond select(-one_of(drop_col_names)) demonstrated in earlier answers, there are a couple of other dplyr options for dropping columns using select() that do not involve defining all the specific column names (using the dplyr starwars sample data for some variety in column names).
If you need to drop a column that may or may not exist in the data frame, here's a slight twist using select_if() that, unlike one_of(), will not throw an "Unknown columns:" warning if the column name does not exist. In this example, 'bad_column' is not a column in the data frame.
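A sketch of these helpers on the starwars data that ships with dplyr (assumes dplyr is installed). The particular patterns are chosen for illustration:

```r
library(dplyr)

out <- starwars %>%
  select(-(name:mass)) %>%        # the range of columns from name to mass
  select(-contains("color")) %>%  # any column name containing "color"
  select(-starts_with("bi")) %>%  # any column name starting with "bi"
  select(-ends_with("er")) %>%    # any column name ending in "er"
  select(-matches("^f.+s$")) %>%  # any column name matching the regex
  select_if(~ !is.list(.))        # by data type instead of by name

# Dropping a column that may not exist: a logical vector built from the
# names never complains about 'bad_column'
maybe <- starwars %>%
  select_if(!names(.) %in% c("height", "mass", "bad_column"))
```

In recent dplyr versions, select(-any_of(c("height", "mass", "bad_column"))) is another way to get the same tolerance for missing names.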
Another solution if you don't want to use @hadley's above: If "COLUMN_NAME" is the name of the column you want to drop:
Provide the data frame and a string of comma separated names to remove:
Usage:
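The function itself is missing on this page. A minimal sketch of what such a helper could look like; the name drop_columns_by_string and its argument names are hypothetical, not taken from the answer:

```r
# Hypothetical helper: drop the columns named in a comma-separated string
drop_columns_by_string <- function(df, features) {
  # Split on commas and strip surrounding whitespace from each name
  rem_vec <- trimws(unlist(strsplit(features, ",", fixed = TRUE)))
  # Keep the columns not listed; drop = FALSE so the result stays a data frame
  df[, !(names(df) %in% rem_vec), drop = FALSE]
}

# Usage with a built-in data set
out <- drop_columns_by_string(iris, "Sepal.Length, Petal.Width")
```

Passing names as one string is convenient interactively, but a character vector is usually easier to build programmatically.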
There are a lot of ways you can do this...
Option-1:
Option-2:
Option-3:
Drop and delete columns by column name in a data frame.
Find the index of the columns you want to drop using which. Give these indexes a negative sign (*-1). Then subset on those values, which will remove them from the data frame. This is an example.
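The example did not survive here. The three steps above can be sketched as follows, with an invented data frame:

```r
# Example data (hypothetical columns)
DF <- data.frame(one = c("a", "b"), two = c("c", "d"),
                 three = c("e", "f"), four = c("g", "h"))

# Step 1: positions of the columns to drop
idx <- which(names(DF) %in% c("two", "three"))

# Steps 2 and 3: flip the sign and subset
res <- DF[idx * -1]
```

As with grep(), an empty idx would select no columns at all, so it is worth checking length(idx) first.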
If you have a large data.frame and are low on memory, use [ . . . . or rm and within to remove columns of a data.frame, as subset currently (R 3.6.2) uses more memory, quite apart from the manual's hint that subset is meant for interactive use.
Another option is the function fselect from the collapse package. Here is a reproducible example:
Created on 2022-08-26 with reprex v2.0.2
Another data.table option which hasn't been posted yet is using the special symbol .SD, which stands for subset of data. Together with the .SDcols argument you can select/drop columns by name or index.
A summary of all the options for such a task in data.table can be found here
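A sketch of .SD with .SDcols, assuming data.table is installed; the table is made up for the example:

```r
library(data.table)

# Example table (hypothetical columns)
DT <- data.table(a = 1:3, b = 4:6, c = 7:9)

# Select by name
kept <- DT[, .SD, .SDcols = c("a", "b")]

# Drop by name: prefix the names with ! to invert the selection
dropped <- DT[, .SD, .SDcols = !c("a", "b")]

# Select by position
by_index <- DT[, .SD, .SDcols = 2:3]
```

Unlike :=, these calls return a new data.table and leave DT unchanged.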