Stop read.zoo from reading data as factors by default

Posted 2024-12-17 09:40:17

I am using the zoo package in R to analyse time series of data. I have the following data file:

Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1640,AOT_1020,AOT_870,AOT_675,AOT_667,AOT_555,AOT_551,AOT_532,AOT_531,AOT_500,AOT_490,AOT_443,AOT_440,AOT_412,AOT_380,AOT_340,Water(cm),%TripletVar_1640,%TripletVar_1020,%TripletVar_870,%TripletVar_675,%TripletVar_667,%TripletVar_555,%TripletVar_551,%TripletVar_532,%TripletVar_531,%TripletVar_500,%TripletVar_490,%TripletVar_443,%TripletVar_440,%TripletVar_412,%TripletVar_380,%TripletVar_340,%WaterError,440-870Angstrom,380-500Angstrom,440-675Angstrom,500-870Angstrom,340-440Angstrom,440-675Angstrom(Polar),Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
29:03:2011,09:26:28,88.393380,N/A,0.490230,0.553836,0.707512,N/A,N/A,N/A,N/A,N/A,0.911939,N/A,N/A,0.984430,N/A,1.046517,1.081283,1.632430,N/A,4.597345,4.551429,3.216097,N/A,N/A,N/A,N/A,N/A,2.587552,N/A,N/A,2.694179,N/A,2.085042,2.522511,2.309844,0.851964,0.497006,0.789257,0.898093,0.362423,N/A,13/04/2011,58.822462
29:03:2011,09:41:28,88.403796,N/A,0.440362,0.513093,0.676703,N/A,N/A,N/A,N/A,N/A,0.893867,N/A,N/A,0.965588,N/A,1.034943,1.079975,1.654521,N/A,12.867837,12.687550,11.037238,N/A,N/A,N/A,N/A,N/A,9.345739,N/A,N/A,8.423888,N/A,8.421787,9.334135,1.622026,0.937815,0.529939,0.852553,0.999260,0.431102,N/A,13/04/2011,57.070624
29:03:2011,10:11:29,88.424641,N/A,0.565148,0.654724,0.842142,N/A,N/A,N/A,N/A,N/A,1.070556,N/A,N/A,1.144966,N/A,1.208759,1.242663,1.666760,N/A,9.933505,9.499251,8.327355,N/A,N/A,N/A,N/A,N/A,6.781617,N/A,N/A,6.612952,N/A,5.600500,5.630695,1.302058,0.826713,0.438445,0.736362,0.884554,0.316539,N/A,13/04/2011,53.916620
29:03:2011,10:17:46,88.429005,N/A,0.593881,0.681572,0.866620,N/A,N/A,N/A,N/A,N/A,1.095508,N/A,N/A,1.168008,N/A,1.233022,1.268572,1.704882,N/A,4.072782,3.752197,3.210935,N/A,N/A,N/A,N/A,N/A,2.389567,N/A,N/A,2.385582,N/A,1.653326,1.015620,0.728711,0.798185,0.427272,0.716165,0.853963,0.319100,N/A,13/04/2011,53.323057
29:03:2011,10:26:27,88.435035,N/A,0.636627,0.714175,0.884887,N/A,N/A,N/A,N/A,N/A,1.092220,N/A,N/A,1.167024,N/A,1.224264,1.271774,1.626393,N/A,16.400200,10.585139,6.513873,N/A,N/A,N/A,N/A,N/A,3.169704,N/A,N/A,4.085949,N/A,3.963741,8.663229,10.035231,0.724581,0.411533,0.659996,0.764539,0.329073,N/A,13/04/2011,52.544475

I am trying to read it using the following code:

f <- function(d, t) as.chron(paste(as.Date(chron(d, format='d:m:y')), t))

z = read.zoo("110329_110329_Chilbolton.lev10", sep=',', header=T, index = 1:2, FUN=f, as.is=F, dec=".")

But all of the columns of the dataset are being read as factors - so, when I do summary(z) I get output like:

X.TripletVar_340    X.WaterError X440.870Angstrom X380.500Angstrom X440.675Angstrom X500.870Angstrom
 1.015620:1        0.728711:1     0.724581:1       0.411533:1       0.659996:1       0.764539:1      
 2.522511:1        1.302058:1     0.798185:1       0.427272:1       0.716165:1       0.853963:1      
 5.630695:1        1.622026:1     0.826713:1       0.438445:1       0.736362:1       0.884554:1      
 8.663229:1        2.309844:1     0.851964:1       0.497006:1       0.789257:1       0.898093:1      
 9.334135:1       10.035231:1     0.937815:1       0.529939:1       0.852553:1       0.999260:1      

How can I stop it reading the data as factors by default? The data is read fine by read.table without any extra parameters to tell it to make sure everything stays as numbers not factors - so why is read.zoo behaving differently?

I suppose I could use colClasses to specify the type of each column, but I'd rather not do this in case the order of the columns in the dataset is changed - getting it to convert to numbers by default, and then try factors if that doesn't work would be far better.

Any ideas?

Comments (4)

笨死的猪 2024-12-24 09:40:17

This has been diagnosed already but let us add this so that we have an example of a read.zoo statement that could be used here.

There are two problems: (1) the NAs are represented as N/A rather than NA so we must tell it that. (2) the second last column is not numeric. zoo represents the data as a matrix so it must all be numeric (factor zoo objects are supported too but they can't be mixed).

Try this (where we have added a second data line to the example for good measure). Be sure to use the most recent version of zoo to run the example data since the text= argument (which specifies the text of the data itself rather than the filename) was only added recently. Also note that from within R ?read.zoo gives help and vignette("zoo-read") gives a document entirely devoted to read.zoo examples.

Lines <- "Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1640,AOT_1020,AOT_870,AOT_675,AOT_667,AOT_555,AOT_551,AOT_532,AOT_531,AOT_500,AOT_490,AOT_443,AOT_440,AOT_412,AOT_380,AOT_340,Water(cm),%TripletVar_1640,%TripletVar_1020,%TripletVar_870,%TripletVar_675,%TripletVar_667,%TripletVar_555,%TripletVar_551,%TripletVar_532,%TripletVar_531,%TripletVar_500,%TripletVar_490,%TripletVar_443,%TripletVar_440,%TripletVar_412,%TripletVar_380,%TripletVar_340,%WaterError,440-870Angstrom,380-500Angstrom,440-675Angstrom,500-870Angstrom,340-440Angstrom,440-675Angstrom(Polar),Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
29:03:2011,09:26:28,88.393380,N/A,0.490230,0.553836,0.707512,N/A,N/A,N/A,N/A,N/A,0.911939,N/A,N/A,0.984430,N/A,1.046517,1.081283,1.632430,N/A,4.597345,4.551429,3.216097,N/A,N/A,N/A,N/A,N/A,2.587552,N/A,N/A,2.694179,N/A,2.085042,2.522511,2.309844,0.851964,0.497006,0.789257,0.898093,0.362423,N/A,13/04/2011,58.822462
29:03:2012,09:26:28,88.393380,N/A,0.490230,0.553836,0.707512,N/A,N/A,N/A,N/A,N/A,0.911939,N/A,N/A,0.984430,N/A,1.046517,1.081283,1.632430,N/A,4.597345,4.551429,3.216097,N/A,N/A,N/A,N/A,N/A,2.587552,N/A,N/A,2.694179,N/A,2.085042,2.522511,2.309844,0.851964,0.497006,0.789257,0.898093,0.362423,N/A,13/04/2011,58.822462"

library(chron)
library(zoo)
colClasses <- c("character", "character", rep("numeric", 43))
colClasses[44] <- "NULL" # zap the non-numeric column
z <- read.zoo(text = Lines, header = TRUE, sep = ",", na.strings = "N/A",
    index = 1:2, colClasses = colClasses, FUN = function(d, t)
        as.chron(paste(d, t), "%d:%m:%Y %H:%M:%S"))
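
The question asks for something that does not break if the column order changes. One possible way, sketched here with base R only, is to build the colClasses vector from the header names instead of hard-coding position 44. The miniature data in txt and its shortened column set are assumptions made for illustration; the resulting colClasses vector has the same form as the one passed to read.zoo above.

```r
# Hypothetical miniature of the data file (fewer columns, for illustration only)
txt <- "Date(dd-mm-yy),Time(hh:mm:ss),AOT_1020,AOT_870,Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
29:03:2011,09:26:28,0.490230,N/A,13/04/2011,58.822462
29:03:2011,09:41:28,0.440362,0.513093,13/04/2011,57.070624"

# Read just the header line to learn the column names
con <- textConnection(txt)
hdr <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]
close(con)

# Default every column to numeric, then adjust by name rather than by position
colClasses <- rep("numeric", length(hdr))
colClasses[1:2] <- "character"                              # date and time index columns
colClasses[grepl("^Last_Processing_Date", hdr)] <- "NULL"   # drop the non-numeric column

dat <- read.table(text = txt, sep = ",", header = TRUE,
                  na.strings = "N/A", colClasses = colClasses)
str(dat)
```

This still assumes the date and time columns come first; only the position of the processing-date column is looked up by name.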

乱了心跳 2024-12-24 09:40:17

Your data file poses two problems for read.zoo.

First, it uses N/A to denote missing values, rather than the string NA, which read.table() expects by default. This can be fixed by setting na.strings="N/A".
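
As a small illustration of this first point, the sketch below reads a made-up two-column extract (the txt string is an assumption, not the real file) with and without na.strings, using plain read.table:

```r
# Hypothetical two-column extract with one missing value written as N/A
txt <- "AOT_1020,AOT_870
0.490230,N/A
0.440362,0.513093"

default_read <- read.table(text = txt, sep = ",", header = TRUE)
fixed_read   <- read.table(text = txt, sep = ",", header = TRUE,
                           na.strings = "N/A")

class(default_read$AOT_870)  # factor or character, depending on the R version
class(fixed_read$AOT_870)    # numeric, with N/A read as NA
```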

The second problem is that the data file's next-to-last column, which read.table() names Last_Processing_Date.dd.mm.yyyy. (with a trailing dot), contains character strings.

But, according to the zoo FAQ document (warning, PDF):

A "zoo" object may be (1) a numeric vector, (2) a numeric matrix or (3) a factor but may
not contain both a numeric vector and factor.

When 'asked' to read in a bunch of columns that contain both numeric and character values, converting everything to factors is the only way that read.zoo() can produce an object fitting one of those three criteria.
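
The same kind of constraint can be seen in base R, independently of zoo: a matrix holds a single type, so one character column changes the type of the whole matrix. The sketch below is only an analogy under that assumption and does not reproduce zoo's internals; the data frame and its column names are made up.

```r
# Hypothetical data frame with one numeric and one character column
mixed <- data.frame(aot = c(0.49, 0.44),
                    processed = c("13/04/2011", "13/04/2011"),
                    stringsAsFactors = FALSE)

m_mixed <- as.matrix(mixed)          # every cell becomes a character string
m_num   <- as.matrix(mixed["aot"])   # numeric column alone stays numeric

typeof(m_mixed)  # "character"
typeof(m_num)    # "double"
```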

If you remove the offending column, and specify your missing value string, everything works without a hitch. If you do need both numeric and factor columns, the FAQ linked above suggests several possible approaches.

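# f() is the date/time-to-chron helper defined in the question
z <- read.table("110329_110329_Chilbolton.lev10", sep=",", header=TRUE,
                stringsAsFactors=FALSE, na.strings="N/A")
z$Last_Processing_Date.dd.mm.yyyy. <- NULL  # drop the non-numeric column
z <- zoo(x=z[,-1:-2], order.by=f(z[[1]], z[[2]]))  # columns 1 and 2 form the index
summary(z)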

     Index                       Julian_Day       AOT_1640      AOT_1020     
 Min.   :(03/29/11 09:26:28)   Min.   :88.39   Min.   : NA   Min.   :0.4404  
 1st Qu.:(03/29/11 09:41:28)   1st Qu.:88.40   1st Qu.: NA   1st Qu.:0.4902  
 Median :(03/29/11 10:11:29)   Median :88.42   Median : NA   Median :0.5651  
 Mean   :(03/29/11 10:00:44)   Mean   :88.42   Mean   :NaN   Mean   :0.5452  
 3rd Qu.:(03/29/11 10:17:46)   3rd Qu.:88.43   3rd Qu.: NA   3rd Qu.:0.5939  
 Max.   :(03/29/11 10:26:27)   Max.   :88.44   Max.   : NA   Max.   :0.6366  

黎夕旧梦 2024-12-24 09:40:17

The problem appears to be that you are importing from an Excel file and not taking the time to make the "N/A" values into proper NA values. That results in the columns being considered non-numeric. The zoo package needs the coredata to be a matrix, and that severely constrains the options available for processing. Everything needs to be numeric. Even if you put in stringsAsFactors = FALSE you would still get character columns where you expected numeric.

If you read in with read.table and set as.is=TRUE, you can overcome the factor problem. You then need to coerce the columns that you want to be numeric, and keep the trailing date column, which will come in with a name of "Last_Processing_Date.dd.mm.yyyy.", out of the zoo object.

I would do this first:

z = read.table(file.choose(), sep=',', header=T,  as.is=TRUE, dec=".")

And then choose the columns to coerce to numeric:

z[ , 3:43] <- sapply(z[ , 3:43], as.numeric)

This leaves the date column in the 44th column intact. Then decide which columns should go into the zoo object.
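
If the column positions might change, a variation on the coercion step is to convert by type rather than by index. The sketch below is one possible approach, not part of the original answer: it re-runs type.convert() on the character columns with "N/A" treated as missing, then picks out whatever ended up numeric. The txt data and its shortened column names are assumptions for illustration.

```r
# Hypothetical miniature of the data file (column names shortened for illustration)
txt <- "Date,Time,AOT_1020,AOT_870,Last_Processing_Date,Solar_Zenith_Angle
29:03:2011,09:26:28,0.490230,N/A,13/04/2011,58.822462
29:03:2011,09:41:28,0.440362,0.513093,13/04/2011,57.070624"

z <- read.table(text = txt, sep = ",", header = TRUE, as.is = TRUE)

# Re-convert only the character columns, treating "N/A" as missing;
# columns that are not number-like (dates, times) stay character
chr <- vapply(z, is.character, logical(1))
z[chr] <- lapply(z[chr], type.convert, na.strings = "N/A", as.is = TRUE)

# Columns that are now numeric, found by type rather than by position
num_cols <- names(z)[vapply(z, is.numeric, logical(1))]
num_cols
```

One caveat: a column consisting entirely of N/A comes back from type.convert() as logical rather than numeric, so it would not be picked up by the is.numeric() check.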

Edit: I see Gabor Grothendieck has addressed these problems as well which is as it should be since he is one of the authors of the package.

鹊巢 2024-12-24 09:40:17

The ... in read.zoo() will let you pass a stringsAsFactors = F on to read.table(). That should do the trick.
