Automatically reading a zip file in R
I need to automate R to read a CSV data file that is inside a zip file.
For example, I would type:
read.zip(file = "myfile.zip")
And internally, what would be done is:
- Unzip myfile.zip to a temporary folder
- Read the only file it contains using read.csv
If there is more than one file in the zip file, an error is thrown.
My problem is getting the name of the file contained in the zip file, in order to pass it to the read.csv command. Does anyone know how to do it?
UPDATE
Here's the function I wrote based on @Paul's answer:
read.zip <- function(zipfile, row.names=NULL, dec=".") {
  # Create a name for the dir where we'll unzip
  zipdir <- tempfile()
  # Create the dir using that name
  dir.create(zipdir)
  # Unzip the file into the dir
  unzip(zipfile, exdir=zipdir)
  # Get the names of the files in the dir
  files <- list.files(zipdir)
  # Throw an error if there's more than one
  if(length(files)>1) stop("More than one data file inside zip")
  # Get the full name of the file
  file <- file.path(zipdir, files[1])
  # Read the file; pass the arguments by name so that they are not
  # matched positionally to header and sep
  read.csv(file, row.names=row.names, dec=dec)
}
Since I'll be working with more files inside tempdir(), I created a new directory inside it so that the files don't get mixed up. I hope it may be useful!
Comments (9)
Another solution using unz:
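The code from this answer did not survive on this page. A minimal sketch of the unz approach, assuming the archive holds exactly one CSV file; the function name read_zip_unz is hypothetical and not taken from the original answer:

```r
# Sketch: read the only file in a zip archive through a unz() connection,
# without extracting it to disk. read_zip_unz is a hypothetical name.
read_zip_unz <- function(zipfile, ...) {
  # unzip(list = TRUE) only lists the archive contents; nothing is extracted
  contents <- unzip(zipfile, list = TRUE)
  if (nrow(contents) != 1) stop("Expected exactly one file inside zip")
  # unz() opens a connection to the named member of the archive
  read.csv(unz(zipfile, contents$Name[1]), ...)
}
```

Listing the archive with unzip(list = TRUE) is also one way to get the name of the file inside the zip, which is what the question asks for.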
You can use unzip to unzip the file. I only mention this because it is not clear from your question whether you knew that. As for reading the file: once you have extracted it to a temporary directory (?tempdir), just use list.files to find the files that were dumped into that directory. In your case this is just one file, the file you need. Reading it using read.csv is then quite straightforward, assuming your tempdir location is stored in temp_path.
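The snippet that followed is missing here. The steps described above could be sketched roughly as follows; temp_path is the name used in the answer, while the wrapper name read_zip_tmp and the use of a fresh subdirectory are assumptions:

```r
# Sketch of the unzip() + list.files() + read.csv() steps.
# read_zip_tmp is a hypothetical wrapper name.
read_zip_tmp <- function(zipfile, ...) {
  # Use a fresh subdirectory so list.files() only sees the extracted files
  temp_path <- tempfile()
  dir.create(temp_path)
  # Remove the extracted files once the data has been read
  on.exit(unlink(temp_path, recursive = TRUE))
  unzip(zipfile, exdir = temp_path)
  files <- list.files(temp_path, full.names = TRUE)
  if (length(files) != 1) stop("Expected exactly one file inside zip")
  read.csv(files[1], ...)
}
```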
I found this thread as I was trying to automate reading multiple csv files from a zip. I adapted the solution to the broader case. I haven't tested it for weird filenames or the like, but this is what worked for me so I thought I'd share:
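The poster's code is not preserved on this page. One way the multi-file case might look, as a sketch only; the name read_zip_all and the choice to keep only files ending in .csv are assumptions:

```r
# Sketch: read every CSV file found in a zip archive into a list of
# data frames. read_zip_all is a hypothetical name.
read_zip_all <- function(zipfile, ...) {
  zipdir <- tempfile()
  dir.create(zipdir)
  on.exit(unlink(zipdir, recursive = TRUE))
  unzip(zipfile, exdir = zipdir)
  # recursive = TRUE also finds files stored in subfolders of the archive
  files <- list.files(zipdir, full.names = TRUE, recursive = TRUE)
  # Keep only the files whose name ends in .csv (case-insensitive)
  csv_files <- files[grepl("\\.csv$", files, ignore.case = TRUE)]
  lapply(csv_files, read.csv, ...)
}
```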
If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin) you could also use:
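The original command is missing from this page. A possible form, as a sketch: it assumes that the zcat on your system accepts a zip archive with a single entry (GNU gzip documents this behaviour), and the function name read_zip_zcat is hypothetical:

```r
# Sketch: stream a single-entry zip archive through zcat and read the
# output directly. read_zip_zcat is a hypothetical name.
read_zip_zcat <- function(zipfile, ...) {
  # Feeding the file to zcat's stdin avoids any checks on the file suffix;
  # path.expand() is needed because shQuote() prevents tilde expansion
  cmd <- paste("zcat <", shQuote(path.expand(zipfile)))
  # read.csv() opens the pipe, reads from it, and closes it again
  read.csv(pipe(cmd), ...)
}
```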
This solution also has the advantage that no temporary files are created.
Here is an approach I am using that is based heavily on @Corned Beef Hash Map's answer. Here are some of the changes I made:
- My approach makes use of the data.table package's fread(), which can be fast (generally, if the data is zipped, the files may be large, so you stand to gain a lot of speed here!).
- I also adjusted the output format so that it is a named list, where each element of the list is named after its file. For me, this was a very useful addition.
- Instead of using regular expressions to sift through the files grabbed by list.files, I make use of the pattern argument of list.files().
- Finally, by relying on fread() and by making pattern an argument to which you can supply something like "", NULL, or ".", you can use this to read in many types of data files; in fact, you can read in multiple types at once (e.g. if your .zip contains both .csv and .txt files and you want both). If you only want certain types of files, you can specify a pattern that matches only those, too.
Here is the actual function:
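The function itself is not preserved on this page. The points listed above could be combined roughly as in the sketch below, which requires the data.table package to be installed; the name read_zip_list and the default pattern are assumptions:

```r
# Sketch: read all matching files in a zip archive with data.table::fread()
# and return them as a named list. read_zip_list is a hypothetical name.
read_zip_list <- function(zipfile, pattern = "\\.csv$", ...) {
  zipdir <- tempfile()
  dir.create(zipdir)
  on.exit(unlink(zipdir, recursive = TRUE))
  unzip(zipfile, exdir = zipdir)
  # pattern is passed straight to list.files(); NULL matches every file
  files <- list.files(zipdir, pattern = pattern, full.names = TRUE)
  # Name each element after the file it is read from; lapply() keeps names
  names(files) <- basename(files)
  lapply(files, data.table::fread, ...)
}
```

For example, read_zip_list("myfile.zip", pattern = NULL) would try to read every file in the archive, whatever its extension.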
# Location for the unzipped files
outDir <- "~/Documents/unzipFolder"
# Get all the zip files (pattern is a regular expression, not a glob)
zipF <- list.files(path = "~/Documents/", pattern = "\\.zip$", full.names = TRUE)
# Unzip all the files into outDir
purrr::map(.x = zipF, .f = unzip, exdir = outDir)
The following refines the above answers. FUN can be read.csv, cat, or anything you like, provided its first argument accepts a file path. E.g.
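The example from this answer is missing here. A sketch of what such a refinement might look like, assuming the archive contains a single file; the name read_zip_with is hypothetical:

```r
# Sketch: unzip to a temporary directory and hand the extracted file to
# FUN, whose first argument must accept a file path.
# read_zip_with is a hypothetical name.
read_zip_with <- function(zipfile, FUN = read.csv, ...) {
  zipdir <- tempfile()
  dir.create(zipdir)
  on.exit(unlink(zipdir, recursive = TRUE))
  unzip(zipfile, exdir = zipdir)
  files <- list.files(zipdir, full.names = TRUE)
  if (length(files) != 1) stop("Expected exactly one file inside zip")
  # match.fun() lets FUN be given as a function or as its name
  match.fun(FUN)(files[1], ...)
}
```

With this shape, read_zip_with("myfile.zip") would return a data frame, while read_zip_with("myfile.zip", FUN = readLines) would return the raw lines.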
Another approach that uses fread from the data.table package, based on the answer/update by @joão-daniel:
I just wrote a function based on top read.zip that may help...