How to skip extra lines before the header of a tab-delimited file in R
The software I am using produces log files with a variable number of lines of summary information followed by a large amount of tab-delimited data. I am trying to write a function that reads the data from these log files into a data frame while ignoring the summary information. The summary information never contains a tab, so the following function works:
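read.parameters <- function(file.name, ...) {
  # First pass: read every line; keep blank lines so line numbers match the file
  lines <- scan(file.name, what = "character", sep = "\n",
                blank.lines.skip = FALSE, quiet = TRUE)
  # The first line containing a tab is the header of the data
  first.line <- min(grep("\t", lines, fixed = TRUE))
  # Second pass: parse the file, skipping the summary lines
  return(read.delim(file.name, skip = first.line - 1, ...))
}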
However, these log files are quite big, so reading each file twice is very slow. Surely there is a better way?
Edited to add:
Marek suggested using a textConnection object. The approach suggested in his answer fails on a big file, but the following works:
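read.parameters <- function(file.name, ...) {
  conn <- file(file.name, "r")
  on.exit(close(conn))
  repeat {
    line <- readLines(conn, 1)
    # At end of file readLines() returns character(0); stop instead of looping forever
    if (length(line) == 0) stop("no tab-delimited line found in ", file.name)
    if (grepl("\t", line, fixed = TRUE)) {
      # Push the header line back so read.delim() sees it first
      pushBack(line, conn)
      break
    }
  }
  df <- read.delim(conn, ...)
  return(df)
}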
Edited again: Thanks Marek for further improvement to the above function.
Comments (2)
You don't need to read the file twice. Use textConnection on the result of the first read.

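The textConnection suggestion could be sketched roughly as follows. This is only an illustration, not the answerer's exact code: the function name read_parameters_once is made up here, and readLines() stands in for the scan() call in the question. Note that the question reports this style of approach failing on a big file, so it is probably best suited to logs that fit comfortably in memory.

```r
# Hypothetical helper: read the file once, then parse the lines from memory.
read_parameters_once <- function(file.name, ...) {
  lines <- readLines(file.name)
  # fixed = TRUE matches a literal tab character
  tab.lines <- grep("\t", lines, fixed = TRUE)
  if (length(tab.lines) == 0) {
    stop("no tab-delimited line found in ", file.name)
  }
  # Wrap everything from the header line onwards in a text connection
  conn <- textConnection(lines[tab.lines[1]:length(lines)])
  on.exit(close(conn))
  read.delim(conn, ...)
}
```

Any extra arguments are passed through to read.delim(), as in the question's read.parameters().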
If you can be sure that the header info won't be more than N lines, e.g. N = 200, then try:
scan(..., nlines = N)
That way you won't re-read more than N lines.
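A minimal sketch of this capped first pass, under the assumption that the header always appears within the first n.max lines. The function name read_parameters_capped and the n.max argument are hypothetical, and readLines(n = ...) is used here in place of scan(..., nlines = N) because it keeps blank lines, so the line count stays aligned with the skip argument of read.delim().

```r
# Hypothetical helper: look for the header in the first n.max lines only.
read_parameters_capped <- function(file.name, n.max = 200, ...) {
  # Read at most n.max lines instead of the whole file
  head.lines <- readLines(file.name, n = n.max)
  tab.lines <- grep("\t", head.lines, fixed = TRUE)
  if (length(tab.lines) == 0) {
    stop("no tab-delimited line found in the first ", n.max, " lines")
  }
  # Parse the file, skipping the summary lines before the header
  read.delim(file.name, skip = tab.lines[1] - 1, ...)
}
```

Only the first n.max lines are read twice; if the header can sit deeper in the file than that, the function stops with an error rather than guessing.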