Using R to download zipped data file, extract, and import data
@EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"
I was also trying to do this today, but ended up just downloading the zip file manually.
I tried something like:
fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")
but I feel as if I'm a long way off.
Any thoughts?
Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to

1. Create a temp. file name (e.g. with tempfile())
2. Use download.file() to fetch the file into the temp. file
3. Use unz() to extract the target file from the temp. file
4. Remove the temp. file via unlink()

which in code (thanks for the basic example, but this is simpler) looks like
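A minimal sketch of those four steps, wrapped in a function so the temp. file is always cleaned up. The function name `read_zipped_table` is made up for illustration, and the `read.table()` call assumes the archive member is delimited text, which the thread does not confirm for `a1.dat`.

```r
# Sketch only: download a zip archive, read one member, clean up.
read_zipped_table <- function(url, member, ...) {
  temp <- tempfile(fileext = ".zip")                   # 1. temp. file name
  on.exit(unlink(temp), add = TRUE)                    # 4. remove it when done
  download.file(url, temp, mode = "wb", quiet = TRUE)  # 2. fetch (binary mode)
  read.table(unz(temp, member), ...)                   # 3. read the target member
}

# With the URL and file name from the question:
# data <- read_zipped_table("http://www.newcl.org/data/zipfiles/a1.zip", "a1.dat")
```

`mode = "wb"` matters on Windows, where a text-mode download would corrupt the archive.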
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file, and those you can read directly from a connection. So get the data provider to use that instead :)
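To illustrate the difference, here is a small self-contained sketch that writes a gzipped CSV to a temporary path and reads it straight back through a connection. The file and its contents are invented for the example; nothing here comes from the original data.

```r
# Write a tiny gzip-compressed CSV (hypothetical data) to a temp. path.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(data.frame(id = 1:3, value = c(10.5, 20.25, 30)), con, row.names = FALSE)
close(con)

# Read it back without any explicit decompression step.
from_gz   <- read.csv(gzfile(gz_path))  # explicit gzfile() connection
from_path <- read.csv(gz_path)          # file() detects gzip on read as well
```

No temp. file bookkeeping or member name is needed, because a gzipped file holds exactly one stream.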
Just for the record, I tried translating Dirk's answer into code :-P
I used the CRAN package "downloader", found at http://cran.r-project.org/web/packages/downloader/index.html. Much easier.
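A rough sketch of how that might look, assuming `downloader::download()` forwards its extra arguments to `download.file()`. The wrapper name `fetch_and_unzip` is hypothetical, not part of the package.

```r
# Sketch only: download an archive with the downloader package, then unzip it.
fetch_and_unzip <- function(url, exdir = tempdir()) {
  zip_path <- tempfile(fileext = ".zip")
  on.exit(unlink(zip_path), add = TRUE)
  # mode = "wb" keeps the archive intact on Windows
  downloader::download(url, destfile = zip_path, mode = "wb", quiet = TRUE)
  unzip(zip_path, exdir = exdir)  # returns the paths of the extracted files
}

# files <- fetch_and_unzip("http://www.newcl.org/data/zipfiles/a1.zip")
```

The main convenience of the package is that it handles https URLs on platforms where base `download.file()` historically needed extra configuration.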
For Mac (and I assume Linux)...
If the zip archive contains a single file, you can use the bash command funzip, in conjunction with fread from the data.table package.
In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout.
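Both variants can be sketched as shell pipelines handed to `fread()` through its `cmd` argument. The helper names below are invented for illustration, `curl` and `funzip` are assumed to be on the PATH, and the `tar` variant assumes a libarchive-based tar (bsdtar, as shipped on macOS), since GNU tar does not read zip archives as far as I know.

```r
# Build the shell pipelines; nothing is downloaded until fread() runs them.
funzip_cmd <- function(url) {
  # funzip writes the first member of the archive to stdout
  sprintf("curl -sL %s | funzip", shQuote(url))
}

tar_member_cmd <- function(url, member) {
  # -x extract, -O write to stdout, -f - read the archive from stdin
  sprintf("curl -sL %s | tar -xOf - %s", shQuote(url), shQuote(member))
}

# Single-file archive:
# dt <- data.table::fread(cmd = funzip_cmd("http://www.newcl.org/data/zipfiles/a1.zip"))
# One member of a multi-file archive:
# dt <- data.table::fread(cmd = tar_member_cmd("http://www.newcl.org/data/zipfiles/a1.zip", "a1.dat"))
```

Quoting the URL and member name with `shQuote()` avoids surprises when either contains spaces or shell metacharacters.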
Here is an example that works for files which cannot be read in with the read.table function. This example reads a .xls file.
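One way this could look, as a sketch rather than the answer's original code: spreadsheet readers need a real file path rather than a connection, so the member is extracted to disk first. The function name `read_zipped_excel` is made up, and the `readxl` package is an assumption; the thread does not say which reader was used.

```r
# Sketch only: download a zip, extract one spreadsheet, read it with readxl.
read_zipped_excel <- function(url, member, sheet = 1) {
  zip_path <- tempfile(fileext = ".zip")
  out_dir  <- tempfile("unzipped_")
  on.exit(unlink(c(zip_path, out_dir), recursive = TRUE), add = TRUE)
  download.file(url, zip_path, mode = "wb", quiet = TRUE)
  # unzip() creates out_dir if needed and returns the extracted path
  xls_path <- unzip(zip_path, files = member, exdir = out_dir)
  readxl::read_excel(xls_path, sheet = sheet)
}

# Hypothetical call (URL and member name are placeholders):
# dd <- read_zipped_excel("http://example.org/data.zip", "data.xls")
```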
Using library(archive), one can also read in a particular csv file within the archive without having to unzip it first (read_csv and cols come from the readr package):
read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())
which I find more convenient, and it is faster.
It also supports all major archive formats and is quite a bit faster than the base R untar or unz: it supports tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz and uuencoded files.
To unzip everything, one can use
archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)
This works on all platforms and, given the superior performance, would be my preferred option.
To do this using data.table, I found that the following works. Unfortunately, the link does not work anymore, so I used a link for another data set.
I know this is possible in a single line, since you can pass bash scripts to fread, but I am not sure how to download a .zip file, extract it, and pass a single file from it to fread.
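A sketch of the multi-step data.table approach, under the assumption that the member is extracted to disk and its path handed to `fread()`. The function name `fread_from_zip` is hypothetical, and the URL in the usage line is a placeholder because the replacement link is not given in the thread.

```r
# Sketch only: download a zip, extract one member, read it with fread().
fread_from_zip <- function(url, member, ...) {
  zip_path <- tempfile(fileext = ".zip")
  out_dir  <- tempfile("unzipped_")
  on.exit(unlink(c(zip_path, out_dir), recursive = TRUE), add = TRUE)
  download.file(url, zip_path, mode = "wb", quiet = TRUE)
  # unzip() returns the path of the extracted file, which fread() can read
  data.table::fread(unzip(zip_path, files = member, exdir = out_dir), ...)
}

# dt <- fread_from_zip("http://example.org/some_data.zip", "some_data.dat")
```

`fread()` reads the whole file into memory before the function returns, so deleting the extracted copy on exit is safe.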
Try this code. It works for me:
Example:
The rio package would be very suitable for this: its import() function uses the file extension of a file name to determine what kind of file it is, so it will work with a large variety of file types. I've also used unzip() to list the file names within the zip file, so it's not necessary to specify the file name(s) manually.
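A minimal sketch of that idea, assuming the zip file has already been downloaded to a local path. The function name `import_all_from_zip` is invented for illustration; `unzip(list = TRUE)` and `rio::import()` are the calls the answer refers to.

```r
# Sketch only: import every file in a local zip archive with rio.
import_all_from_zip <- function(zip_path) {
  out_dir <- tempfile("unzipped_")
  on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)
  # List the archive contents and drop directory entries
  members <- unzip(zip_path, list = TRUE)$Name
  members <- members[!grepl("/$", members)]
  unzip(zip_path, files = members, exdir = out_dir)
  # rio::import() picks a reader from each file's extension
  setNames(lapply(file.path(out_dir, members), rio::import), members)
}

# tables <- import_all_from_zip("a1.zip")  # hypothetical local file
```

The result is a named list with one data frame per archive member, so members with extensions rio does not recognise will raise an error rather than be skipped.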
I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R: