Stop read.zoo from reading data as factors by default
I am using the zoo package in R to analyse time series of data. I have the following data file:
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1640,AOT_1020,AOT_870,AOT_675,AOT_667,AOT_555,AOT_551,AOT_532,AOT_531,AOT_500,AOT_490,AOT_443,AOT_440,AOT_412,AOT_380,AOT_340,Water(cm),%TripletVar_1640,%TripletVar_1020,%TripletVar_870,%TripletVar_675,%TripletVar_667,%TripletVar_555,%TripletVar_551,%TripletVar_532,%TripletVar_531,%TripletVar_500,%TripletVar_490,%TripletVar_443,%TripletVar_440,%TripletVar_412,%TripletVar_380,%TripletVar_340,%WaterError,440-870Angstrom,380-500Angstrom,440-675Angstrom,500-870Angstrom,340-440Angstrom,440-675Angstrom(Polar),Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
29:03:2011,09:26:28,88.393380,N/A,0.490230,0.553836,0.707512,N/A,N/A,N/A,N/A,N/A,0.911939,N/A,N/A,0.984430,N/A,1.046517,1.081283,1.632430,N/A,4.597345,4.551429,3.216097,N/A,N/A,N/A,N/A,N/A,2.587552,N/A,N/A,2.694179,N/A,2.085042,2.522511,2.309844,0.851964,0.497006,0.789257,0.898093,0.362423,N/A,13/04/2011,58.822462
29:03:2011,09:41:28,88.403796,N/A,0.440362,0.513093,0.676703,N/A,N/A,N/A,N/A,N/A,0.893867,N/A,N/A,0.965588,N/A,1.034943,1.079975,1.654521,N/A,12.867837,12.687550,11.037238,N/A,N/A,N/A,N/A,N/A,9.345739,N/A,N/A,8.423888,N/A,8.421787,9.334135,1.622026,0.937815,0.529939,0.852553,0.999260,0.431102,N/A,13/04/2011,57.070624
29:03:2011,10:11:29,88.424641,N/A,0.565148,0.654724,0.842142,N/A,N/A,N/A,N/A,N/A,1.070556,N/A,N/A,1.144966,N/A,1.208759,1.242663,1.666760,N/A,9.933505,9.499251,8.327355,N/A,N/A,N/A,N/A,N/A,6.781617,N/A,N/A,6.612952,N/A,5.600500,5.630695,1.302058,0.826713,0.438445,0.736362,0.884554,0.316539,N/A,13/04/2011,53.916620
29:03:2011,10:17:46,88.429005,N/A,0.593881,0.681572,0.866620,N/A,N/A,N/A,N/A,N/A,1.095508,N/A,N/A,1.168008,N/A,1.233022,1.268572,1.704882,N/A,4.072782,3.752197,3.210935,N/A,N/A,N/A,N/A,N/A,2.389567,N/A,N/A,2.385582,N/A,1.653326,1.015620,0.728711,0.798185,0.427272,0.716165,0.853963,0.319100,N/A,13/04/2011,53.323057
29:03:2011,10:26:27,88.435035,N/A,0.636627,0.714175,0.884887,N/A,N/A,N/A,N/A,N/A,1.092220,N/A,N/A,1.167024,N/A,1.224264,1.271774,1.626393,N/A,16.400200,10.585139,6.513873,N/A,N/A,N/A,N/A,N/A,3.169704,N/A,N/A,4.085949,N/A,3.963741,8.663229,10.035231,0.724581,0.411533,0.659996,0.764539,0.329073,N/A,13/04/2011,52.544475
I am trying to read it using the following code:
f <- function(d, t) as.chron(paste(as.Date(chron(d, format='d:m:y')), t))
z = read.zoo("110329_110329_Chilbolton.lev10", sep=',', header=T, index = 1:2, FUN=f, as.is=F, dec=".")
But all of the columns of the dataset are being read as factors, so when I do summary(z) I get output like:
X.TripletVar_340 X.WaterError X440.870Angstrom X380.500Angstrom X440.675Angstrom X500.870Angstrom
1.015620:1 0.728711:1 0.724581:1 0.411533:1 0.659996:1 0.764539:1
2.522511:1 1.302058:1 0.798185:1 0.427272:1 0.716165:1 0.853963:1
5.630695:1 1.622026:1 0.826713:1 0.438445:1 0.736362:1 0.884554:1
8.663229:1 2.309844:1 0.851964:1 0.497006:1 0.789257:1 0.898093:1
9.334135:1 10.035231:1 0.937815:1 0.529939:1 0.852553:1 0.999260:1
How can I stop it reading the data as factors by default? The data is read fine by read.table without any extra parameters to tell it to keep everything as numbers rather than factors, so why is read.zoo behaving differently?
I suppose I could use colClasses to specify the type of each column, but I'd rather not, in case the order of the columns in the dataset changes. Having it convert to numbers by default, and fall back to factors only if that doesn't work, would be far better.
Any ideas?
Comments (4)
This has been diagnosed already, but let us add this so that we have an example of a read.zoo statement that could be used here.
There are two problems: (1) the NAs are represented as N/A rather than NA, so we must tell it that. (2) The second-to-last column is not numeric. zoo represents the data as a matrix, so it must all be numeric (factor zoo objects are supported too, but the two can't be mixed).
Try this (where we have added a second data line to the example for good measure). Be sure to use the most recent version of zoo to run the example data, since the text= argument (which specifies the text of the data itself rather than the filename) was only added recently. Also note that from within R, ?read.zoo gives help and vignette("zoo-read") gives a document entirely devoted to read.zoo examples.
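The example code from this answer did not survive on this page, so what follows is only a minimal sketch of the kind of read.zoo call described above, not the answer's own code. It assumes a trimmed-down sample (a hypothetical subset of the real columns) and a base-R helper, to_time, that stands in for the chron-based f from the question. It finds the date column by name in the header line, so a change in column order should not matter.

```r
library(zoo)

# Trimmed-down sample in the same layout as the real file (assumption:
# only a handful of the 45 columns are kept so the example stays readable).
Lines <- "Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1640,AOT_1020,Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
29:03:2011,09:26:28,88.393380,N/A,0.490230,13/04/2011,58.822462
29:03:2011,09:41:28,88.403796,N/A,0.440362,13/04/2011,57.070624"

# Find the non-numeric date column by name and mark it "NULL" so that
# read.table() skips it; NA lets read.table() guess the remaining types.
hdr <- strsplit(strsplit(Lines, "\n")[[1]][1], ",")[[1]]
cc <- ifelse(grepl("^Last_Processing_Date", hdr), "NULL", NA)

# Hypothetical helper: combine the date and time columns into a POSIXct index.
to_time <- function(d, t) {
  as.POSIXct(paste(d, t), format = "%d:%m:%Y %H:%M:%S", tz = "UTC")
}

# index.column given as a list passes one argument to FUN per index column;
# na.strings and colClasses are handed on to read.table().
z <- read.zoo(text = Lines, header = TRUE, sep = ",",
              index.column = list(1, 2), FUN = to_time,
              na.strings = "N/A", colClasses = cc)

str(coredata(z))  # the data part should now be numeric rather than factor
```

With the real file, the same arguments would be passed with the file name in place of text=, and hdr would be read from the first line of the file instead.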
Your data file poses two problems for read.zoo.
First, it uses N/A to denote missing values, rather than the string NA, which read.table() expects by default. This can be fixed by setting na.strings="N/A".
The second problem is that the data file's next-to-last column, Last_Processing_Date.dd.mm.yyyy, contains character strings.
But, according to the zoo FAQ document (warning, PDF):
When 'asked' to read in a bunch of columns that contain both numeric and character values, converting everything to factors is the only way that read.zoo() can produce an object fitting one of those three criteria.
If you remove the offending column, and specify your missing value string, everything works without a hitch. If you do need both numeric and factor columns, the FAQ linked above suggests several possible approaches.
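The constraint described above can be seen without zoo at all. The sketch below uses only base R and a small made-up data frame (the column names and values are illustrative, not taken from the FAQ). As soon as one column holds strings, coercing the data frame to a matrix turns every cell into a string, which is why a single date column keeps the rest from staying numeric.

```r
# Small made-up data frame: two numeric columns and one string column.
df <- data.frame(
  Julian_Day = c(88.393380, 88.403796),
  AOT_1020   = c(0.490230, 0.440362),
  Last_Processing_Date = c("13/04/2011", "13/04/2011"),
  stringsAsFactors = FALSE
)

# A matrix has a single type, so the string column drags everything to character.
m_all <- as.matrix(df)
typeof(m_all)

# Dropping the offending column first leaves a numeric matrix.
m_num <- as.matrix(df[, names(df) != "Last_Processing_Date"])
typeof(m_num)
```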
The problem appears to be that you are importing from an Excel file and not taking the time to make the "N/A" values into proper NA values. That results in the columns being considered non-numeric. The zoo package needs the coredata to be a matrix, and that severely constrains the options available for processing. Everything needs to be numeric. Even if you put in stringsAsFactors = FALSE, you would still get character columns where you expected numeric.
If you read in with read.table and set as.is=TRUE, you can overcome the factor problem. You then need to coerce the columns that you want to be numeric and drop the trailing date column, which will come in with a name of "Last_Processing_Date.dd.mm.yyyy." I would do this first:
And then choose the columns to coerce to numeric:
Keeping that date column intact in the 44th column. Then decide which columns should go into the zoo object.
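The two code snippets referred to above are missing from this page. As a rough sketch of the same read.table route (not the answer's original code), the example below uses base R only, on a trimmed-down sample with a hypothetical subset of the columns. The names dat, num_cols, and idx are illustrative. The final conversion to a zoo object is left as a comment so that the block runs without extra packages.

```r
# Trimmed-down sample in the same layout as the real file (assumption:
# only a handful of the 45 columns are kept so the example stays readable).
Lines <- "Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1640,AOT_1020,Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
29:03:2011,09:26:28,88.393380,N/A,0.490230,13/04/2011,58.822462
29:03:2011,09:41:28,88.403796,N/A,0.440362,13/04/2011,57.070624"

# Step 1: read everything with as.is = TRUE so strings stay character,
# and declare "N/A" as the missing-value marker.
dat <- read.table(text = Lines, header = TRUE, sep = ",",
                  as.is = TRUE, na.strings = "N/A")

# Step 2: pick the columns to coerce by name rather than by position:
# everything except the date/time index columns and the processing date.
skip <- grepl("^(Date|Time|Last_Processing_Date)", names(dat))
num_cols <- names(dat)[!skip]
dat[num_cols] <- lapply(dat[num_cols], as.numeric)

# Build a date-time index from the first two columns.
idx <- as.POSIXct(paste(dat[[1]], dat[[2]]),
                  format = "%d:%m:%Y %H:%M:%S", tz = "UTC")

# With the zoo package loaded, the numeric columns could then become
# the zoo object, e.g.: z <- zoo(as.matrix(dat[num_cols]), order.by = idx)
str(dat[num_cols])
```

Selecting columns by name pattern instead of by index is a design choice made here so that the sketch does not depend on the column order, which was the concern raised in the question.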
Edit: I see Gabor Grothendieck has addressed these problems as well which is as it should be since he is one of the authors of the package.
The ... in read.zoo() will let you pass stringsAsFactors = F on to read.table(). That should do the trick.