Apply a function to each column in a data frame, observing each column's existing data type

Posted 2024-12-03 10:38:42

I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:

apply(t,2,max,na.rm=1)

It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as " -99.5".

I then tried this:

sapply(t,max,na.rm=1)

but it complains about max not meaningful for factors. (lapply is the same.) What is confusing me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.

BTW, I took a look at Using sapply on vector of POSIXct and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.

神回复 2024-12-10 10:38:42

If it were an "ordered factor", things would be different. That is not to say I like ordered factors (I don't), only that some relationships are defined for ordered factors that are not defined for plain factors. Factors are treated as ordinary categorical variables. What you are seeing is the natural sort order of the factor levels, which is alphabetical (lexical) order for your locale. If you want automatic coercion to numeric for every column, dates and factors and all, then try:

sapply(df, function(x) max(as.numeric(x)) )   # factors become integer level codes; not generally a useful result

Or if you want to test for factors first and return as you expect then:

sapply( df, function(x) if("factor" %in% class(x) ) { 
            max(as.numeric(as.character(x)))
            } else { max(x) } )

@Darren's comment does work better:

 sapply(df, function(x) max(as.character(x)) )  

max does succeed with character vectors.

深巷少女 2024-12-10 10:38:42

The reason that max works with apply is that apply is coercing your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. sapply is just a wrapper for lapply, so it is not surprising that both yield the same error.
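
The coercion is easy to see on a small example. The sketch below uses a made-up data frame (`toy` and its column names are illustrative, not taken from the question) and shows that `as.matrix()`, which is what `apply()` works on when given a data frame, produces a character matrix as soon as one column is non-numeric, whereas `sapply()`/`lapply()` see each column with its own class.

```r
# Hypothetical toy data: one factor column, one numeric column
toy <- data.frame(animal = factor(c("ANT", "ZEBRA", "CAT")),
                  score  = c(-99.5, 3, 12))

# apply() converts the data frame to a matrix first
m <- as.matrix(toy)
matrix_type <- typeof(m)          # "character"

# Every column is therefore compared as text
apply_max <- apply(toy, 2, max)

# lapply()/sapply() hand each column over unchanged, so classes survive
col_classes <- sapply(toy, class)
```

Because the numeric column has been turned into text, its "maximum" from `apply()` is a string chosen by text comparison, which is why values such as " -99.5" can show up.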

In versions of R before 4.0.0, the default behavior when you create a data frame is for character columns to be stored as factors (since R 4.0.0, stringsAsFactors defaults to FALSE). Unless you specify that it is an ordered factor, operations like max and min are undefined, since R assumes that you have created an unordered factor.

You can change this behavior by specifying options(stringsAsFactors = FALSE), which will change the default for the entire session, or you can pass stringsAsFactors = FALSE in the data.frame() construction call itself. Note that this just means that min and max will assume "alphabetical" ordering by default.

Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
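
As a minimal sketch of the difference between the two kinds of factor (the `sizes` vector and its level order are invented for illustration):

```r
sizes <- c("small", "large", "medium", "small")

unordered     <- factor(sizes)
ordered_sizes <- factor(sizes,
                        levels  = c("small", "medium", "large"),
                        ordered = TRUE)

# max() on a plain factor raises an error ("not meaningful for factors")
unordered_fails <- inherits(try(max(unordered), silent = TRUE), "try-error")

# On an ordered factor, max() follows the declared level order
largest <- as.character(max(ordered_sizes))   # "large"
```

The ordering comes entirely from the `levels` argument, so it has to be supplied per column; nothing here is inferred from the data.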

Regardless, sapply will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:

#Some test data
d <- data.frame(v1 = runif(10), v2 = letters[1:10], 
                v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)

d[4,] <- NA

# Similar function to DWin's answer
fun <- function(x){
    if(is.numeric(x)){max(x, na.rm = TRUE)}
    else{max(as.character(x), na.rm = TRUE)}
}

# Use colwise from the plyr package
library(plyr)
colwise(fun)(d)
         v1 v2       v3 v4
1 0.8478983  j 1.999435  J

雄赳赳气昂昂 2024-12-10 10:38:42

If you want to get to know your data, summary(df) provides the min, 1st quartile, median, mean, 3rd quartile, and max of numeric columns, and the frequencies of the top levels of factor columns.
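
For instance, a quick sketch using the built-in `iris` data set (chosen here only because it ships with R and mixes numeric and factor columns):

```r
# iris has four numeric columns and one factor column (Species)
s <- summary(iris)

# The result is a table with one column per variable of the data frame
summary_class <- class(s)
n_columns     <- ncol(s)

print(s)
```

Note that the result is a formatted table of text, which is convenient to read but awkward to compute with.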

睫毛上残留的泪 2024-12-10 10:38:42

The best way to do this is to avoid base apply(), which coerces the entire data frame to a matrix and may lose type information.

If you want to apply a function such as as.numeric to every column, a simple way is to use mutate_all from dplyr:

t %>% mutate_all(as.numeric)

Alternatively use colwise from plyr, which will "turn a function that operates on a vector into a function that operates column-wise on a data.frame."

t %>% (colwise(as.numeric))

In the special case of reading in a data table of character vectors and coercing columns into the correct data type, use type.convert (from the utils package that ships with R) or type_convert from readr.
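
A minimal base-R sketch of that special case, assuming a recent R version in which `type.convert()` accepts a data frame; the `raw` data frame is made up for illustration:

```r
# Hypothetical all-character input, as you might get from a raw import
raw <- data.frame(id    = c("1", "2", "3"),
                  value = c("0.5", "-99.5", "12"),
                  name  = c("ANT", "ZEBRA", "CAT"),
                  stringsAsFactors = FALSE)

# as.is = TRUE keeps non-numeric columns as character instead of factor
typed <- type.convert(raw, as.is = TRUE)
col_classes <- sapply(typed, class)

# With proper types in place, max can be taken on the numeric columns only
numeric_max <- sapply(Filter(is.numeric, typed), max)
```

Once the columns carry their real types, the original min/max question reduces to picking the numeric columns with `Filter(is.numeric, ...)`.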


Less interesting answer: we can apply on each column with a for-loop:

for (i in seq_along(t)) { t[[i]] <- readr::parse_guess(t[[i]]) }

I don't know of a good way of doing assignment with *apply while preserving data frame structure.
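
One base-R idiom that is commonly used for this is to assign the result of `lapply()` into `df[]`, which replaces the columns while keeping the data frame itself. The sketch below uses an invented data frame (`df_chr`) rather than the `t` from the question:

```r
# Hypothetical data frame whose columns are all stored as text
df_chr <- data.frame(a = c("1", "2", "3"),
                     b = c("1", "3", "3"),
                     stringsAsFactors = FALSE)

df_num <- df_chr

# Empty brackets on the left keep the data frame structure and names;
# lapply() supplies one converted column per original column
df_num[] <- lapply(df_num, as.numeric)

col_max <- sapply(df_num, max)
```

Without the empty brackets, `df_num <- lapply(df_num, as.numeric)` would leave a plain list instead of a data frame.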

酸甜透明夹心 2024-12-10 10:38:42

Building on @ltamar's answer:

Use summary and munge the output into something useful!

library(tidyr)
library(dplyr)

df %>% 
  summary %>% 
  data.frame %>%
  select(-Var1) %>%
  separate(data=.,col=Freq,into = c('metric','value'),sep = ':') %>%
  rename(column_name=Var2) %>%
  mutate(value=as.numeric(value),
         metric = trimws(metric,'both') 
  ) %>%  
  filter(!is.na(value)) -> metrics

It's not pretty and it is certainly not fast but it gets the job done!

↙温凉少女 2024-12-10 10:38:42

These days loops are just as fast, so this is more than sufficient:

df <- data.frame(a = c("1","2","3"), b = c("1","3","3"), stringsAsFactors = FALSE)
col_max <- numeric(ncol(df))
for (i in seq_along(df)) {
    # convert each column to numeric before taking its max
    col_max[i] <- max(as.numeric(df[[i]]))
}

陌上青苔 2024-12-10 10:38:42

A solution using retype() from hablar to coerce factors to character or numeric type, depending on feasibility. I'd use dplyr for applying max to each column.

Code

library(dplyr)
library(hablar)

# retype() simplifies each column's type, e.g. always removes factors
d <- d %>% retype()

# Check max for each column
d %>% summarise_all(max)

Result

Note the new column types.

     v1 v2       v3 v4   
  <dbl> <chr> <dbl> <chr>
1 0.974 j      1.09 J   

Data

# Sample data borrowed from @joran
d <- data.frame(v1 = runif(10), v2 = letters[1:10], 
                v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)

谁的年少不轻狂 2024-12-10 10:38:42

df <- head(mtcars)
df$string <- c("a","b", "c", "d","e", "f"); df

my.min <- unlist(lapply(df, min))
my.max <- unlist(lapply(df, max))
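
One caveat with this approach: because the data frame contains a character column, `unlist()` has to pick a single type and turns every result into text. A hedged sketch of two ways around that, reusing the same sample data (the names `max_list` and `num_max` are illustrative):

```r
df <- head(mtcars)
df$string <- c("a", "b", "c", "d", "e", "f")

# unlist() flattens to one atomic type, so the numbers become strings
max_flat <- unlist(lapply(df, max))

# Keeping the list preserves each column's own type
max_list <- lapply(df, max)

# Or restrict the computation to numeric columns and get a numeric vector
num_max <- vapply(Filter(is.numeric, df), max, numeric(1))
```

`vapply()` is used in the last line because it checks that every result is a single number, which makes an unexpected non-numeric column fail loudly.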