Using R to download a zipped data file, extract it, and import the data

Posted on 2024-09-06 13:00:00

@EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"

I was also trying to do this today, but ended up just downloading the zip file manually.

I tried something like:

fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")

but I feel as if I'm a long way off.
Any thoughts?

Comments (10)

神经大条 2024-09-13 13:00:00

A zip archive is really more like a 'filesystem' with content metadata and so on. See help(unzip) for details. So to do what you sketch out above, you need to:

  1. Create a temporary file name (e.g. with tempfile())
  2. Use download.file() to fetch the archive into the temporary file
  3. Use unz() to extract the target file from the temporary file
  4. Remove the temporary file via unlink()

In code (thanks for the basic example, but this is simpler), that looks like:

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
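
The same four steps can also be wrapped in a small helper so that the temporary file is cleaned up even if reading fails. This is only a sketch: the name read_zipped_table and its arguments are made up for illustration, and mode = "wb" is added as a precaution so the archive is always written in binary mode.

```r
# Hypothetical helper (not part of base R): download a zip archive,
# read one member with read.table(), and always remove the temporary file.
read_zipped_table <- function(url, member, ...) {
  temp <- tempfile(fileext = ".zip")
  # on.exit() runs when the function returns or fails, so the file never lingers
  on.exit(unlink(temp), add = TRUE)
  download.file(url, temp, mode = "wb", quiet = TRUE)
  # read.table() opens the unz() connection itself and closes it when done
  read.table(unz(temp, member), ...)
}
```

With the URL from the question it would be called as read_zipped_table("http://www.newcl.org/data/zipfiles/a1.zip", "a1.dat").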

Compressed (.z), gzipped (.gz), or bzip2ed (.bz2) files are just the file itself, and those you can read directly from a connection. So get the data provider to use one of those instead :)
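
As a rough illustration of reading straight from a compressed connection, the sketch below writes a tiny gzip-compressed CSV to a temporary file and reads it back through gzfile(). The data frame and the variable names are invented for the example; only the gzip case is shown.

```r
# Write a small, made-up data frame to a gzip-compressed CSV in a temporary file.
gz_path <- tempfile(fileext = ".csv.gz")
original <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))
con <- gzfile(gz_path, open = "wt")
write.csv(original, con, row.names = FALSE)
close(con)

# Read it back directly from a gzfile() connection; there is no separate
# "extract a member" step as there is with a zip archive.
roundtrip <- read.csv(gzfile(gz_path))
unlink(gz_path)
```

Unlike the zip case, there is no member name to pass, because the compressed file is the data file.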

深爱不及久伴 2024-09-13 13:00:00

Just for the record, I tried translating Dirk's answer into code :-P

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)
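
To see what the scan() plus matrix() step does without downloading anything, here is a small sketch on inline text. The numbers are invented stand-ins for the contents of a1.dat; the only assumption carried over from the code above is that each row of the file has four columns.

```r
# Made-up stand-in for the file contents: three rows of four numbers each.
raw_text <- "1 2 3 4\n5 6 7 8\n9 10 11 12"

# scan() returns one flat numeric vector, reading left to right, top to bottom.
values <- scan(text = raw_text, quiet = TRUE)

# byrow = TRUE fills the matrix row by row, restoring the layout of the file;
# the default (column by column) would scramble it.
m <- matrix(values, ncol = 4, byrow = TRUE)
```

If the file has a header or mixed column types, read.table() on the same connection is usually the simpler choice.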

美胚控场 2024-09-13 13:00:00

I used the CRAN package "downloader", found at http://cran.r-project.org/web/packages/downloader/index.html. Much easier.

library(downloader)
# url holds the address of the zip archive
download(url, destfile = "dataset.zip", mode = "wb")
unzip("dataset.zip", exdir = "./")

冬天旳寂寞 2024-09-13 13:00:00

For Mac (and I assume Linux)...

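If the zip archive contains a single file, you can use the command-line utility funzip in conjunction with fread from the data.table package:

library(data.table)
# cmd = tells fread explicitly that the string is a shell command
dt <- fread(cmd = "curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")

In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout (this needs a tar that can read zip archives, such as bsdtar on macOS; GNU tar cannot):

# the pattern is quoted so the shell does not expand it against local files
dt <- fread(cmd = "curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout '*a1.dat'")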

巾帼英雄 2024-09-13 13:00:00

Here is an example that works for files that cannot be read with the read.table function. This example reads an .xls file.

library(readxl)  # provides read_xls()

url <- "https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"

temp <- tempfile()   # path for the downloaded zip
temp2 <- tempfile()  # directory to extract into

download.file(url, temp, mode = "wb")
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))

# temp2 is a directory, so recursive = TRUE is needed to remove it
unlink(c(temp, temp2), recursive = TRUE)

最偏执的依靠 2024-09-13 13:00:00

Using library(archive), one can also read a particular csv file within the archive without having to unzip it first:

read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())

I find this more convenient, and it is faster. (read_csv() and cols() come from the readr package.)

It also supports all major archive formats and is quite a bit faster than the base R untar or unz. It supports tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz, and uuencoded files.

To unzip everything, one can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)

This works on all platforms and, given its superior performance, would be my preferred option.

念﹏祤嫣 2024-09-13 13:00:00

To do this using data.table, I found that the following works. Unfortunately, the link from the question does not work anymore, so I used a link to another data set.

library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
# unzip() extracts the file into the working directory and returns its path
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
# unlink() deletes the downloaded archive; rm(temp) would only drop the variable
unlink(temp)

I know this is possible in a single line, since you can pass shell commands to fread, but I am not sure how to download a .zip file, extract it, and pass a single file from it to fread.

你げ笑在眉眼 2024-09-13 13:00:00

Try this code. It works for me:

unzip(zipfile="<directory and filename>",
      exdir="<directory where the content will be extracted>")

Example:

unzip(zipfile="./data/Data.zip",exdir="./data")

情丝乱 2024-09-13 13:00:00

The rio package would be very suitable for this: its import() function uses the file extension to determine what kind of file it is, so it works with a large variety of file types. I've also used unzip() to list the file names within the zip file, so it's not necessary to specify the file name(s) manually.

library(rio)

# use the session's temporary directory (tempdir() does not create a new one)
td <- tempdir()

# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")

# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)

# list zip archive
file_names <- unzip(tf, list=TRUE)

# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)

# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))

# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))

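# delete the downloaded zip and the extracted files; td itself is the
# session's temporary directory, which R removes when the session ends
unlink(c(tf, file.path(td, file_names$Name)))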
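
If an archive mixes data files with other material, the listing can be filtered before importing. The sketch below uses a made-up listing shaped like the data frame that unzip(tf, list = TRUE) returns; the file names and sizes are invented, and csv_members is a hypothetical variable name.

```r
# Hypothetical listing, in the same shape as the result of unzip(tf, list = TRUE).
file_names <- data.frame(
  Name = c("readme.txt", "part1.csv", "part2.csv"),
  Length = c(120, 2048, 4096),
  stringsAsFactors = FALSE
)

# Keep only members whose extension is csv (compared case-insensitively).
csv_members <- file_names$Name[tolower(tools::file_ext(file_names$Name)) == "csv"]
```

The resulting vector could then replace file_names$Name in the lapply() call above.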

情话已封尘 2024-09-13 13:00:00

I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:

zip.url <- "url_address.zip"

dir <- getwd()

zip.file <- "file_name.zip"

# file.path() builds the full destination path
zip.combine <- file.path(dir, zip.file)

download.file(zip.url, destfile = zip.combine, mode = "wb")

unzip(zip.combine)