What's a good way to read line-by-line in R?
I have a file where each line is a set of results collected in a specific replicate of an experiment. The number of results in each experiment (i.e. the number of columns in each row) may differ. The order of the results within a row also carries no meaning (the first result in row 1 and the first result in row 2 are no more related than any other pair; these are sets of results).
The file looks something like this:
2141 0 5328 5180 357 5335 1 5453 5325 5226 7 4880 5486 0
2650 0 5280 4980 5243 5301 4244 5106 5228 5068 5448 3915 4971 5585 4818 4388 5497 4914 5364 4849 4820 4370
2069 2595 2478 4941
2627 3319 5192 5106 32 4666 3999 5503 5085 4855 4135 4383 4770
2005 2117 2803 2722 2281 2248 2580 2697 2897 4417 4094 4722 5138 5004 4551 5758 5468 17361
1914 1977 2414 100 2711 2171 3041 5561 4870 4281 4691 4461 5298 3849 5166 5578 5520 4634 4836 4905 5105 5089
2539 2326 0 4617 3735 0 5122 5439 5238 1
25 5316 21173 4492 5038 5944 5576 5424 5139 5184 5 5096 4963 2771 2808 2592 2
4963 9428 17152 5467 5202 6038 5094 5221 5469 5079 3753 5080 5141 4097 5173 11338 4693 5273 5283 5110 4503 51
2024 2 2822 5097 5239 5296 4561
except that each line is much longer (up to a few thousand values). As can be seen, all values are non-negative integers.
In short, this is not a normal table where the columns have meanings. It's just a bunch of results, one set per line.
I would like to read all the results, then do some operations on each experiment (row), such as calculating the ecdf. I would also like to calculate the average ecdf over all the replicates.
My problem: how should I read this strange-looking file? I'm so used to read.table that I'm not sure I ever tried anything else... Do I have to use something low-level like readLines? I guess the preferred output would be a list (or vector?) of vectors. I looked at scan, but it seems all vectors must be of the same length there.
Any suggestions will be appreciated.
UPDATE: Following the suggestions below, I now do something like this:
con <- file('myfile')
open(con)
results.list <- list()
current.line <- 1
# Read one line at a time; readLines() returns character(0) at end of file
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # Split the line on spaces and convert the pieces to an integer vector
  results.list[[current.line]] <- as.integer(unlist(strsplit(line, split = " ")))
  current.line <- current.line + 1
}
close(con)
Seems to work. Does it look OK?
When I run summary(results.list) I get:
     Length Class  Mode
[1,] 1091   -none- numeric
[2,] 1070   -none- numeric
....
Shouldn't the class be integer? And what is the mode?
Comments (6)
The example Josh linked to is one that I use all the time.
I edited the example to create two lists from your example data. dataList is a list where each item in the list is a vector of numeric values from each line in your text file. ecdfList is a list where each element is an ecdf for each line in your text file.
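The two lists described above could be built roughly as follows. This is only a sketch under assumptions: the temporary file and its three short lines are made-up stand-ins for the real data, and the tryCatch() wrapper is one possible way to handle lines for which ecdf() fails.

```r
# Hypothetical sample file standing in for the real data
path <- tempfile(fileext = ".txt")
writeLines(c("2141 0 5328 5180",
             "2069 2595 2478 4941 17361",
             "2024 2 2822"), path)

# dataList: one numeric vector per line of the file
dataList <- lapply(readLines(path), function(line) {
  as.numeric(strsplit(line, " ", fixed = TRUE)[[1]])
})

# ecdfList: one ecdf per line; lines where ecdf() raises an error
# (for example an empty line) yield NULL instead of stopping the loop
ecdfList <- lapply(dataList, function(x) {
  tryCatch(ecdf(x), error = function(e) NULL)
})
```

Each element of ecdfList is a function, so ecdfList[[1]](5000) returns the fraction of values in the first line that are at most 5000.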
You should probably add some try() or tryCatch() logic in there to properly handle situations where the ecdf can't be created because of nulls or some such. But the above example should get you pretty close. Good luck!
Yes, you can use readLines. JD Long has a good example, which I've edited slightly and provided below.
Why bother with line-by-line reading? Reading all lines at once and splitting them gives you a list of integer vectors.
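A whole-file read of this kind might look like the sketch below. The temporary file and its contents are assumptions for illustration, and the code assumes values are separated by single spaces, as in the question.

```r
# Hypothetical sample file standing in for the real data
path <- tempfile(fileext = ".txt")
writeLines(c("2141 0 5328 5180",
             "2069 2595 2478 4941 17361",
             "2024 2 2822"), path)

# readLines() returns one string per line; strsplit() is vectorised over
# those strings, so no explicit loop over lines is needed
results.list <- lapply(strsplit(readLines(path), " ", fixed = TRUE), as.integer)

# Number of values on each line
line.lengths <- lengths(results.list)
```

The vectors may have different lengths, which is why the result is a list rather than a matrix.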
About your additional questions: take a look at ?mode (in short, mode is numeric for numbers, typeof can be integer or double, and class is numeric or integer). To see whether you have integers, check str(results.list) or lapply(results.list, class).
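The difference between the three functions can be seen on a small example. The vectors x and y below are made up for illustration. As far as I can tell, the "-none-" in the Class column of summary() only means that the elements carry no explicit class attribute, not that they fail to be integers.

```r
x <- as.integer(c("2141", "0", "5328"))  # integer vector, as in the question
y <- c(2141, 0, 5328)                    # plain numeric literals are doubles

mode(x)      # "numeric"
typeof(x)    # "integer"
class(x)     # "integer"

mode(y)      # "numeric"
typeof(y)    # "double"
class(y)     # "numeric"

# Both vectors lack an explicit class attribute
oldClass(x)  # NULL

# Check the class of every element of a list of results
results.list <- list(x, y)
element.classes <- sapply(results.list, class)
```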
Or:
Use readLines(con, n = 1) to read one line from connection con, which can be as simple as con <- file(filename, "r").
If you know that the values in the file are integers, you can use scan() instead of readLines(), again in a loop. You will get a list of numeric vectors.