Automate zip file reading in R

Posted on 2024-12-28 19:04:44

I need to automate reading a CSV data file that is inside a zip file in R.

For example, I would type:

read.zip(file = "myfile.zip")

And internally, what would be done is:

Unzip myfile.zip to a temporary folder
Read the only file contained in it using read.csv

If there is more than one file inside the zip file, an error should be thrown.

My problem is getting the name of the file contained in the zip file, so that I can pass it to the read.csv command. Does anyone know how to do it?
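
One way to get the file name without extracting anything is to ask unzip() for a listing. The following is only a minimal sketch: the helper name zip_single_name is hypothetical (it is not part of the original question or of R), and it assumes the archive should contain exactly one file.

```r
# Hypothetical helper: return the name of the single file inside a zip archive.
zip_single_name <- function(zipfile) {
  # list = TRUE returns a data frame (Name, Length, Date) without extracting
  info <- utils::unzip(zipfile, list = TRUE)
  if (nrow(info) != 1) {
    stop("Expected exactly one file inside zip, found ", nrow(info))
  }
  info$Name
}

# Usage (assuming "myfile.zip" exists in the working directory):
# zip_single_name("myfile.zip")
```

The returned name can then be combined with the extraction directory, or passed to unz(), to read the file.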

UPDATE

Here's the function I wrote based on @Paul's answer:

read.zip <- function(zipfile, row.names=NULL, dec=".") {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir=zipdir)
    # List the files extracted into the dir
    files <- list.files(zipdir)
    # Throw an error if there's more than one
    if(length(files)>1) stop("More than one data file inside zip")
    # Get the full name of the file
    file <- file.path(zipdir, files[1])
    # Read the file, passing the arguments by name
    # (positionally, row.names would land in read.csv's header argument)
    read.csv(file, row.names=row.names, dec=dec)
}

Since I'll be working with more files inside tempdir(), I created a new directory inside it so the extracted files don't get mixed up with the others. I hope it's useful!

转身以后 2025-01-04 19:04:44

Another solution using unz:

read.zip <- function(file, ...) {
  zipFileInfo <- unzip(file, list=TRUE)
  if(nrow(zipFileInfo) > 1)
    stop("More than one data file inside zip")
  else
    read.csv(unz(file, as.character(zipFileInfo$Name)), ...)
}

安静被遗忘 2025-01-04 19:04:44

You can use unzip to unzip the file. I mention this only because it is not clear from your question whether you knew that. As for reading the file: once you have extracted it to a temporary directory (see ?tempdir), just use list.files to find the files that were dumped into that directory. In your case this is just one file, the file you need. Reading it using read.csv is then quite straightforward:

# full.names = TRUE returns paths that read.csv can open from any working directory
l <- list.files(temp_path, full.names = TRUE)
read.csv(l[1])

assuming the location of your temporary directory is stored in temp_path.

诠释孤独 2025-01-04 19:04:44

I found this thread as I was trying to automate reading multiple csv files from a zip. I adapted the solution to the broader case. I haven't tested it for weird filenames or the like, but this is what worked for me so I thought I'd share:

read.csv.zip <- function(zipfile, ...) {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir=zipdir)
    # Get a list of csv files in the dir
    files <- list.files(zipdir)
    files <- files[grep("\\.csv$", files)]
    # Create a named list of the imported csv files
    # (simplify=FALSE keeps sapply from collapsing the data frames)
    csv.data <- sapply(files, function(f) {
        fp <- file.path(zipdir, f)
        read.csv(fp, ...)
    }, simplify=FALSE)
    return(csv.data)
}

南…巷孤猫 2025-01-04 19:04:44

If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin), you could also use:

zipfile<-"test.zip"
myData <- read.delim(pipe(paste("zcat", zipfile)))

This solution also has the advantage that no temporary files are created.
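
A similar pipe-based sketch can be written with the unzip command-line tool, whose -p option writes the extracted contents to standard output. This is only an illustration under stated assumptions: the helper name read_zip_pipe is hypothetical, the unzip program is assumed to be on the PATH, and the archive is assumed to hold a single comma-separated file (with several files, their contents would be concatenated).

```r
# Hypothetical helper: read a csv from a zip through a pipe, no temporary files.
# Assumption: the `unzip` command-line tool is installed and on the PATH.
read_zip_pipe <- function(zipfile, ...) {
  # shQuote protects paths that contain spaces or shell metacharacters
  cmd <- paste("unzip -p", shQuote(zipfile))
  # read.csv opens the pipe connection and closes it when done
  read.csv(pipe(cmd), ...)
}

# Usage (assuming "test.zip" contains one csv file):
# myData <- read_zip_pipe("test.zip")
```

Note that read.delim in the snippet above expects tab-separated data, so read.csv is used here for comma-separated files.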

很酷又爱笑 2025-01-04 19:04:44

Here is an approach I am using that is based heavily on @Corned Beef Hash Map's answer. Here are some of the changes I made:

My approach makes use of the data.table package's fread(), which can be fast (if the data is zipped, the files are probably large, so you stand to gain a lot of speed here!).
I also adjusted the output format so that it is a named list, where each element of the list is named after the file. For me, this was a very useful addition.
Instead of using regular expressions to sift through the files grabbed by list.files(), I make use of list.files()'s pattern argument.
Finally, by relying on fread() and by making pattern an argument to which you could supply something like "" or NULL or ".", you can use this to read in many types of data files; in fact, you can read in multiple types at once (e.g., if your .zip contains both .csv and .txt files and you want both). If there are only some types of files you want, you can specify the pattern to only use those, too.

Here is the actual function:

read.csv.zip <- function(zipfile, pattern="\\.csv$", ...){

    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()

    # Create the dir using that name
    dir.create(zipdir)

    # Unzip the file into the dir
    unzip(zipfile, exdir=zipdir)

    # Get a list of matching files in the dir, including subdirectories
    files <- list.files(zipdir, recursive=TRUE, pattern=pattern)

    # Create a list of the imported files (requires the data.table package)
    # simplify=FALSE keeps sapply from collapsing the tables into a matrix
    csv.data <- sapply(files,
        function(f){
            fp <- file.path(zipdir, f)
            data.table::fread(fp, ...)
        },
        simplify=FALSE
    )

    # Use file names to name list elements
    names(csv.data) <- basename(files)

    # Return data
    return(csv.data)
}
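
To make the effect of the pattern argument concrete, here is a small sketch that is not part of the original answer. It creates a throwaway directory with made-up file names (a.csv, b.txt, notes.csv.bak are purely illustrative) and shows which of them list.files() returns for a few regular expressions.

```r
# Build a scratch directory with a few empty, illustrative files
demo_dir <- tempfile()
dir.create(demo_dir)
file.create(file.path(demo_dir, c("a.csv", "b.txt", "notes.csv.bak")))

# pattern is a regular expression matched against each file name
csv_only   <- list.files(demo_dir, pattern = "\\.csv$")        # "a.csv"
csv_or_txt <- list.files(demo_dir, pattern = "\\.(csv|txt)$")  # "a.csv" "b.txt"
everything <- list.files(demo_dir)                             # all three files

# Clean up the scratch directory
unlink(demo_dir, recursive = TRUE)
```

Anchoring the expression with $ matters: without it, "\\.csv" would also match notes.csv.bak.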

不喜欢何必死缠烂打 2025-01-04 19:04:44

Location for the unzipped files:

outDir <- "~/Documents/unzipFolder"

Get all the zip files:

zipF <- list.files(path = "~/Documents/", pattern = "\\.zip$", full.names = TRUE)

Unzip all the files:

purrr::map(.x = zipF, .f = unzip, exdir = outDir)

佼人 2025-01-04 19:04:44

The following refines the above answers. FUN could be read.csv, cat, or anything you like, provided that its first argument accepts a file path. For example:

head(read.zip.url("http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/Downloads/ICD-9-CM-v32-master-descriptions.zip", filename = "CMS32_DESC_LONG_DX.txt"))

read.zip.url <- function(url, filename = NULL, FUN = readLines, ...) {
  zipfile <- tempfile()
  # mode = "wb" keeps the binary zip intact on Windows
  download.file(url = url, destfile = zipfile, mode = "wb", quiet = TRUE)
  zipdir <- tempfile()
  dir.create(zipdir)
  unzip(zipfile, exdir = zipdir) # files = NULL (the default), so extract all
  files <- list.files(zipdir)
  if (is.null(filename)) {
    if (length(files) == 1) {
      filename <- files
    } else {
      stop("multiple files in zip, but no filename specified: ", paste(files, collapse = ", "))
    }
  } else { # filename specified
    stopifnot(length(filename) == 1)
    stopifnot(filename %in% files)
  }
  # Call FUN with the file path first, then any extra arguments
  do.call(FUN, args = c(list(file.path(zipdir, filename)), list(...)))
}

背叛残局 2025-01-04 19:04:44

Another approach that uses fread from the data.table package

fread.zip <- function(zipfile, ...) {
  # Function reads data from a zipped csv file
  # Uses fread from the data.table package

  ## Create the temporary directory or flush CSVs if it exists already
  if (!file.exists(tempdir())) {
    dir.create(tempdir())
  } else {
    file.remove(list.files(tempdir(), full.names = TRUE, pattern = "\\.csv$"))
  }

  ## Unzip the file into the dir
  unzip(zipfile, exdir = tempdir())

  ## Get path to file (pattern is a regular expression, not a glob)
  file <- list.files(tempdir(), pattern = "\\.csv$", full.names = TRUE)

  ## Throw an error if there's more than one
  if (length(file) > 1) stop("More than one data file inside zip")

  ## Read the file
  data.table::fread(file,
     na.strings = c(""), # read empty strings as NA
     ...
  )
}

Based on the answer/update by @joão-daniel
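
Because the function above deletes every csv file in the shared tempdir(), a variant that extracts into its own fresh directory and removes only that directory may be safer. The following is a sketch under stated assumptions, not the answer author's code: the name read_zip_clean and its reader argument are hypothetical, and read.csv is used as the default reader so the example runs without data.table.

```r
# Hypothetical helper: extract into a private temp dir and always clean it up.
read_zip_clean <- function(zipfile, reader = utils::read.csv, ...) {
  # A fresh directory avoids touching other files in tempdir()
  zipdir <- tempfile("unzipped_")
  dir.create(zipdir)
  # Remove the directory and its contents when the function exits,
  # whether it returns normally or stops with an error
  on.exit(unlink(zipdir, recursive = TRUE), add = TRUE)

  utils::unzip(zipfile, exdir = zipdir)
  files <- list.files(zipdir, pattern = "\\.csv$",
                      full.names = TRUE, recursive = TRUE)
  if (length(files) != 1) {
    stop("Expected exactly one csv file inside zip, found ", length(files))
  }
  # The data is read fully into memory before on.exit deletes the file
  reader(files, ...)
}

# Usage (assuming data.table is installed and "myfile.zip" exists):
# dt <- read_zip_clean("myfile.zip", reader = data.table::fread)
```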

梦幻的心爱 2025-01-04 19:04:44

I just wrote a function based on the read.zip function at the top of this thread that may help:

read.zip <- function(zipfile, internalfile=NA, read.function=read.delim, verbose=TRUE, ...) {
    # function based on http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r

    # check the files within zip
    unzfiles <- unzip(zipfile, list=TRUE)
    # default to the first file; a number selects a file by position
    if (is.na(internalfile) || is.numeric(internalfile)) {
        internalfile <- unzfiles$Name[ifelse(is.na(internalfile),1,internalfile[1])]
    }
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    if (verbose) cat("Directory created:",zipdir,"\n")
    dir.create(zipdir)
    # Remove the temporary dir and its contents when the function exits
    on.exit({
        if (verbose) cat("Done!\nRemoving temporary files:",zipdir,"\n")
        unlink(zipdir, recursive=TRUE)
    })
    # Unzip the file into the dir
    if (verbose) cat("Unzipping file:",internalfile,"...")
    unzip(zipfile, files=internalfile, exdir=zipdir)
    if (verbose) cat("Done!\n")
    # Get the full name of the file
    file <- file.path(zipdir, internalfile)
    # Read the file
    if (verbose) cat("Reading File...")
    read.function(file, ...)
}
