How to skip extra lines before the header of a tab-delimited file in R

Posted on 2024-09-05 12:19:19

The software I am using produces log files with a variable number of lines of summary information followed by a large amount of tab-delimited data. I am trying to write a function that reads the data from these log files into a data frame while ignoring the summary information. The summary information never contains a tab, so the following function works:

read.parameters <- function(file.name, ...){
  lines <- scan(file.name, what="character", sep="\n")
  first.line <- min(grep("\\t", lines))
  return(read.delim(file.name, skip=first.line-1, ...))
}

However, these log files are quite big, so reading the file twice is very slow. Surely there is a better way?

Edited to add:

Marek suggested using a textConnection object. The way he suggested in the answer fails on a big file, but the following works:

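read.parameters <- function(file.name, ...){
  conn <- file(file.name, "r")
  on.exit(close(conn))
  repeat{
    line <- readLines(conn, 1)
    # Stop at end of file instead of looping forever when no line has a tab
    if (length(line) == 0) stop("no tab-delimited line found")
    if (length(grep("\\t", line))) {
      # Put the header line back so read.delim sees it first
      pushBack(line, conn)
      break
    }
  }
  # read.delim continues from the pushed-back header line
  read.delim(conn, ...)
}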

Edited again: Thanks to Marek for a further improvement to the above function.
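
For what it's worth, the same connection-based idea can also be sketched with the header search done in chunks rather than one line at a time, which may help if the summary block is long. This is only a sketch under assumptions: the function name read_parameters_chunked and the chunk.size argument are made up for illustration, and it relies on pushBack accepting several lines at once.

```r
# Hypothetical variant: scan for the first tab-containing line in chunks,
# then let read.delim read the rest from the same open connection.
read_parameters_chunked <- function(file.name, chunk.size = 100L, ...) {
  conn <- file(file.name, "r")
  on.exit(close(conn))
  repeat {
    chunk <- readLines(conn, n = chunk.size)
    # End of file reached without finding any tab
    if (length(chunk) == 0L) stop("no tab-delimited line found in ", file.name)
    hit <- grep("\t", chunk, fixed = TRUE)
    if (length(hit) > 0L) {
      # Return the header line and the rest of this chunk to the connection
      pushBack(chunk[hit[1L]:length(chunk)], conn)
      break
    }
  }
  read.delim(conn, ...)
}

# Small demo file (made-up contents): three summary lines, then the data
tmp <- tempfile(fileext = ".log")
writeLines(c("Run summary", "Samples: 2", "Operator: n/a",
             "id\tvalue", "a\t1", "b\t2"), tmp)
df <- read_parameters_chunked(tmp, chunk.size = 2L)
unlink(tmp)
print(df)
```

The file is still read only once; the chunk size just trades a few extra pushed-back lines for fewer readLines calls.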

燕归巢 2024-09-12 12:19:19

You don't need to read the file twice. Use textConnection on the result of the first read.

read.parameters <- function(file.name, ...){
  lines <- scan(file.name, what="character", sep="\n") # the question had "tmp.log" here; I suppose file.name was intended
  first.line <- min(grep("\\t", lines))
  return(read.delim(textConnection(lines), skip=first.line-1, ...))
}
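
A slightly different way to write the same read-once idea, as a rough sketch: read the lines with readLines, drop everything before the first line containing a tab, and pass the remainder through the text argument of read.delim. The function name read_parameters_once is made up for illustration. As far as I know, text = goes through a text connection internally, so this is not expected to be any faster on very large files than calling textConnection directly.

```r
# Hypothetical helper: read the file once, then parse the in-memory lines.
read_parameters_once <- function(file.name, ...) {
  lines <- readLines(file.name)
  tab.lines <- grep("\t", lines, fixed = TRUE)
  # Fail clearly if the file has no tab-delimited part at all
  if (length(tab.lines) == 0L) stop("no tab-delimited line found in ", file.name)
  # Keep the header line and everything after it
  read.delim(text = lines[tab.lines[1L]:length(lines)], ...)
}

# Small demo file (made-up contents): summary lines, a blank line, then data
tmp <- tempfile(fileext = ".log")
writeLines(c("Run summary", "Samples: 2", "",
             "id\tvalue", "a\t1", "b\t2"), tmp)
df <- read_parameters_once(tmp)
unlink(tmp)
print(df)
```

Slicing the vector instead of using skip keeps the line arithmetic in one place and makes the missing-header case explicit.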
