Apply a function to each column in a data frame, observing each column's existing data type
I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:
apply(t,2,max,na.rm=1)
It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as " -99.5".
I then tried this:
sapply(t,max,na.rm=1)
but it complains about max not being meaningful for factors. (lapply is the same.) What is confusing me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.
BTW, I took a look at Using sapply on vector of POSIXct and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.
Comments (8)
If it were an "ordered factor", things would be different. Which is not to say I like ordered factors (I don't), only that some relationships are defined for ordered factors that are not defined for plain factors. Factors are thought of as ordinary categorical variables. You are seeing the natural sort order of factors, which is alphabetical (lexical) order for your locale. If you want to get an automatic coercion to "numeric" for every column, ... dates and factors and all, then try:
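A minimal sketch of such a blanket coercion, assuming a small made-up data frame `df` (its name and columns are illustrative only). Note that `as.numeric` on a factor returns the internal level codes, and on a date the day count, so the results for those columns are codes rather than labels:

```r
# Hypothetical example data; `df` and its column names are assumptions.
df <- data.frame(
  animal = factor(c("ANT", "ZEBRA", "CAT")),
  score  = c(-99.5, 3, 12),
  day    = as.Date(c("2020-01-01", "2020-06-15", "2019-12-31"))
)

# Coerce every column to numeric, then take the max.
# Factors become integer level codes; dates become days since 1970-01-01.
res <- sapply(df, function(x) max(as.numeric(x), na.rm = TRUE))
res
```

Here the `animal` entry is the code of the last level, not the string "ZEBRA", which may or may not be what is wanted.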
Or if you want to test for factors first and return as you expect then:
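One way this factor test could look, as a sketch only: the helper name `col_max` and the example data are assumptions, and it presumes the factor labels are themselves numbers stored as text:

```r
# Hypothetical example data: `code` holds numbers stored as factor labels.
df <- data.frame(
  code  = factor(c("10", "9", "100")),
  score = c(-99.5, 3, 12)
)

# For factors, convert labels (not level codes) to numbers first;
# for every other column, call max() directly.
col_max <- function(x) {
  if (is.factor(x)) {
    max(as.numeric(as.character(x)), na.rm = TRUE)
  } else {
    max(x, na.rm = TRUE)
  }
}

res <- sapply(df, col_max)
res
```

Going through `as.character` matters: `as.numeric` applied directly to the factor would return level codes instead of 10, 9 and 100.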
@Darren's comment does work better:
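A sketch of the character-based variant, assuming the idea is to convert each column with `as.character` before calling `max` (the data frame is made up for illustration):

```r
# Hypothetical example data.
df <- data.frame(
  animal = factor(c("ANT", "ZEBRA", "CAT")),
  score  = c(-99.5, 3, 12)
)

# Convert each column to character, then take the max.
# The comparison is a string comparison, so numeric columns
# are ordered as text, not by value.
res <- sapply(df, function(x) max(as.character(x), na.rm = TRUE))
res
```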
max does succeed with character vectors.
The reason that max works with apply is that apply coerces your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. sapply is just a wrapper for lapply, so it is not surprising that both yield the same error.
The default behavior when you create a data frame is for categorical columns to be stored as factors (this was the default before R 4.0.0). Unless you specify that it is an ordered factor, operations like max and min will be undefined, since R assumes that you've created an unordered factor.
You can change this behavior by specifying options(stringsAsFactors = FALSE), which will change the default for the entire session, or you can pass stringsAsFactors = FALSE in the data.frame() construction call itself. Note that this just means that min and max will assume "alphabetical" ordering by default.
Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
Regardless, sapply will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:
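As a sketch of both points (not necessarily the code the answer originally showed): the matrix coercion can be seen directly with `as.matrix`, and `lapply` returns a list, so each column's result keeps its own type. The data frame below is made up for illustration:

```r
# Hypothetical example data with character and numeric columns.
df <- data.frame(
  animal = c("ANT", "ZEBRA", "CAT"),
  score  = c(-100.5, -99.5, 3),
  stringsAsFactors = FALSE
)

# apply() converts a data frame with as.matrix() first, so every
# cell becomes a string (numbers are formatted to a common width).
m <- as.matrix(df)
mode(m)

# lapply() returns a list: one element per column, no shared type.
col_max <- lapply(df, max, na.rm = TRUE)
str(col_max)
```

A list is less compact to print than a vector, but it avoids forcing a number and a string into one atomic type.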
If you want to learn about your data, summary(df) provides the min, 1st quartile, median and mean, 3rd quartile and max of numerical columns and the frequency of the top levels of the factor columns.
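A short illustration of that call on a made-up data frame (the name `df` and its columns are assumptions):

```r
# Hypothetical example data: one factor column, one numeric column.
df <- data.frame(
  animal = factor(c("ANT", "ZEBRA", "CAT", "ANT")),
  score  = c(-99.5, 3, 12, 7)
)

# summary() on a data frame returns a printable table:
# level counts for factors, six-number summaries for numerics.
s <- summary(df)
s
```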
The best way to do this is to avoid the base *apply functions, which coerce the entire data frame to an array, possibly losing information.
If you wanted to apply a function such as as.numeric to every column, a simple way is using mutate_all from dplyr:
Alternatively, use colwise from plyr, which will "turn a function that operates on a vector into a function that operates column-wise on a data.frame."
In the special case of reading in a data table of character vectors and coercing columns into the correct data type, use type.convert (base R) or type_convert from readr.
Less interesting answer: we can apply on each column with a for-loop:
I don't know of a good way of doing assignment with *apply while preserving data frame structure.
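A base-R sketch of the column-wise loop, on a made-up data frame of numbers stored as text. For comparison it also shows a commonly used idiom, `df[] <- lapply(df, f)`, which assigns the `lapply` result back while keeping the data frame structure; whether that counts as a "good way" is a matter of taste:

```r
# Hypothetical example data: numbers stored as character strings.
df <- data.frame(
  a = c("1", "2", "3"),
  b = c("4.5", "5.5", "6.5"),
  stringsAsFactors = FALSE
)

# Variant 1: explicit loop, converting one column at a time.
df_loop <- df
for (col in names(df_loop)) {
  df_loop[[col]] <- as.numeric(df_loop[[col]])
}

# Variant 2: the empty-bracket assignment keeps the data.frame
# class, names and row names, and only replaces the columns.
df_apply <- df
df_apply[] <- lapply(df_apply, as.numeric)

str(df_apply)
```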
Building on @ltamar's answer:
Use summary and munge the output into something useful!
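A rough sketch of that idea, assuming it is enough to run `summary` on each numeric column and keep only the "Min." and "Max." entries (the data and the name `min_max` are made up for illustration):

```r
# Hypothetical example data.
df <- data.frame(
  animal = factor(c("ANT", "ZEBRA", "CAT")),
  score  = c(-99.5, 3, 12),
  weight = c(1.5, 20, 4)
)

# Keep only the numeric columns, summarise each one, and pull
# the "Min." and "Max." entries out of the summary vector.
num_cols <- Filter(is.numeric, df)
min_max <- sapply(num_cols, function(x) {
  s <- summary(x)
  c(min = s[["Min."]], max = s[["Max."]])
})
min_max
```

The result is a small matrix with one column per numeric variable and rows named min and max.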
It's not pretty and it is certainly not fast but it gets the job done!
These days loops are just as fast, so this is more than sufficient:
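A sketch of such a loop, assuming the goal is the per-column maximum collected in a list so that each result keeps its own type (the data frame is made up for illustration):

```r
# Hypothetical example data.
df <- data.frame(
  name  = c("ANT", "ZEBRA", "CAT"),
  score = c(-99.5, 3, 12),
  stringsAsFactors = FALSE
)

# Pre-allocate a named list, then fill it column by column.
col_max <- vector("list", ncol(df))
names(col_max) <- names(df)
for (col in names(df)) {
  col_max[[col]] <- max(df[[col]], na.rm = TRUE)
}
str(col_max)
```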
A solution using retype() from hablar to coerce factors to character or numeric type depending on feasibility. I'd use dplyr for applying max to each column.
Code
Result
Note the new column types.
Data
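As a rough base-R stand-in for the same idea (this is not hablar's `retype()` and not the answer's own code), factor columns can be converted to character and then passed through `type.convert`, which yields numbers where every label parses as one. The data below is made up for illustration:

```r
# Hypothetical example data: both columns stored as factors.
df <- data.frame(
  animal = factor(c("ANT", "ZEBRA", "CAT")),
  score  = factor(c("-99.5", "3", "12"))
)

# Re-type factor columns: text stays character, numeric-looking
# labels become numbers. Other column types are left untouched.
retyped <- df
retyped[] <- lapply(df, function(x) {
  if (is.factor(x)) type.convert(as.character(x), as.is = TRUE) else x
})
sapply(retyped, class)

# With sensible types in place, max() works on every column.
col_max <- lapply(retyped, max, na.rm = TRUE)
str(col_max)
```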