直接从 R 脚本读取 Excel 文件

发布于 2024-11-08 20:14:00 字数 67 浏览 0 评论 0原文

如何将 Excel 文件直接读取到 R 中?或者我应该首先将数据导出到文本或 CSV 文件,然后将该文件导入到 R 中?

How can I read an Excel file directly into R? Or should I first export the data to a text- or CSV file and import that file into R?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(13

浅听莫相离 2024-11-15 20:14:00

让我重申 @Chase 的建议:使用 XLConnect

我认为使用 XLConnect 的原因是:

  1. 跨平台。 XLConnect 是用 Java 编写的,因此可以在 Win、Linux、Mac 上运行,无需更改 R 代码(可能的路径字符串除外)
  2. 无需加载其他内容。只需安装 XLConnect 即可继续生活。
  3. 您只提到读取 Excel 文件,但 XLConnect 也会写入 Excel 文件,包括更改单元格格式。它可以在 Linux 或 Mac 上执行此操作,而不仅仅是 Win。

与其他解决方案相比,XLConnect 有点新,因此在博客文章和参考文档中很少提及。对我来说这非常有用。

Let me reiterate what @Chase recommended: Use XLConnect.

The reasons for using XLConnect are, in my opinion:

  1. Cross platform. XLConnect is written in Java and, thus, will run on Win, Linux, Mac with no change of your R code (except possibly path strings)
  2. Nothing else to load. Just install XLConnect and get on with life.
  3. You only mentioned reading Excel files, but XLConnect will also write Excel files, including changing cell formatting. And it will do this from Linux or Mac, not just Win.

XLConnect is somewhat new compared to other solutions so it is less frequently mentioned in blog posts and reference docs. For me it's been very useful.

つ可否回来 2024-11-15 20:14:00

现在有 readxl

readxl 包可以轻松地将数据从 Excel 中取出并输入 R。
与现有包(例如 gdata、xlsx、xlsReadWrite 等)相比
readxl 没有外部依赖项,因此很容易安装和使用
所有操作系统。它旨在处理存储的表格数据
在一张纸中。

readxl 构建在 libxls C 库之上,该库抽象了
底层二进制格式的许多复杂性。

它支持旧版 .xls 格式和 .xlsx

readxl 可从 CRAN 获取,或者您可以使用以下命令从 github 安装它:

# install.packages("devtools")
devtools::install_github("hadley/readxl")

用法

library(readxl)

# read_excel reads both xls and xlsx files
read_excel("my-old-spreadsheet.xls")
read_excel("my-new-spreadsheet.xlsx")

# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)

# If NAs are represented by something other than blank cells,
# set the na argument
read_excel("my-spreadsheet.xls", na = "NA")

请注意,虽然描述说“无外部依赖项”,但它确实需要 Rcpp,这又需要 Rtools(对于 Windows)或 Xcode(对于OSX),它们是 R 外部的依赖项。尽管许多人出于其他原因安装了它们。

And now there is readxl:

The readxl package makes it easy to get data out of Excel and into R.
Compared to the existing packages (e.g. gdata, xlsx, xlsReadWrite etc)
readxl has no external dependencies so it's easy to install and use on
all operating systems. It is designed to work with tabular data stored
in a single sheet.

readxl is built on top of the libxls C library, which abstracts away
many of the complexities of the underlying binary format.

It supports both the legacy .xls format and .xlsx

readxl is available from CRAN, or you can install it from github with:

# install.packages("devtools")
devtools::install_github("hadley/readxl")

Usage

library(readxl)

# read_excel reads both xls and xlsx files
read_excel("my-old-spreadsheet.xls")
read_excel("my-new-spreadsheet.xlsx")

# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)

# If NAs are represented by something other than blank cells,
# set the na argument
read_excel("my-spreadsheet.xls", na = "NA")

Note that while the description says 'no external dependencies', it does require the Rcpp package, which in turn requires Rtools (for Windows) or Xcode (for OSX), which are dependencies external to R. Though many people have them installed for other reasons.

影子是时光的心 2024-11-15 20:14:00

是的。请参阅R wiki 上的相关页面。简短的回答:来自 gdata 包的 read.xls 在大多数情况下都有效(尽管您需要在系统上安装 Perl——通常在 MacOS 和 Linux 上已经如此,但在 Windows 上需要采取额外的步骤,即参见 http://strawberryperl.com/)。 R wiki 页面上列出了各种警告和替代方案。

我认为不直接执行此操作的唯一原因是您可能想要检查电子表格以查看它是否有故障(奇怪的标题,多个工作表[您一次只能读取一个,尽管您显然可以循环遍历所有工作表] ,包括绘图等)。但是对于一个格式良好、带有普通数字和字符数据的矩形电子表格(即,不是逗号格式的数字、日期、被零除错误的公式、缺失值等......)我通常没有问题随着这个过程。

Yes. See the relevant page on the R wiki. Short answer: read.xls from the gdata package works most of the time (although you need to have Perl installed on your system -- usually already true on MacOS and Linux, but takes an extra step on Windows, i.e. see http://strawberryperl.com/). There are various caveats, and alternatives, listed on the R wiki page.

The only reason I see not to do this directly is that you may want to examine the spreadsheet to see if it has glitches (weird headers, multiple worksheets [you can only read one at a time, although you can obviously loop over them all], included plots, etc.). But for a well-formed, rectangular spreadsheet with plain numbers and character data (i.e., not comma-formatted numbers, dates, formulas with divide-by-zero errors, missing values, etc. etc. ..) I generally have no problem with this process.

无人问我粥可暖 2024-11-15 20:14:00

编辑 2015 年 10 月: 正如其他人在这里评论的那样 openxlsxreadxl 软件包比 xlsx 软件包快得多,并且实际上能够打开更大的 Excel 文件(>1500 行 > ; 120 列)。 @MichaelChirico 演示了当速度优先时 readxl 更好,并且 openxlsx 取代了 xlsx 包提供的功能。如果您正在寻找 2015 年读取、写入和修改 Excel 文件的软件包,请选择 openxlsx 而不是 xlsx

2015 年之前: 我使用过 xlsx。它改变了我使用 Excel 和 R 的工作流程。不再有烦人的弹出窗口询问我是否确定要以 .txt 格式保存 Excel 工作表。该包还写入 Excel 文件。

但是,我发现打开大型 Excel 文件时 read.xlsx 函数速度很慢。 read.xlsx2 函数速度相当快,但不会查询 data.frame 列的向量类。如果您使用read.xlsx2函数,则必须使用colClasses命令来指定所需的列类。下面是一个实际示例:

read.xlsx("filename.xlsx", 1) 读取文件并使 data.frame 列类几乎有用,但对于大型数据集来说速度非常慢。也适用于 .xls 文件。

read.xlsx2("filename.xlsx", 1) 速度更快,但您必须手动定义列类。快捷方式是运行该命令两次(请参见下面的示例)。 character 规范将您的列转换为因子。使用 DatePOSIXct 选项来设置时间。

coln <- function(x){y <- rbind(seq(1,ncol(x))); colnames(y) <- colnames(x)
rownames(y) <- "col.number"; return(y)} # A function to see column numbers

data <- read.xlsx2("filename.xlsx", 1) # Open the file 

coln(data)    # Check the column numbers you want to have as factors

x <- 3 # Say you want columns 1-3 as factors, the rest numeric

data <- read.xlsx2("filename.xlsx", 1, colClasses= c(rep("character", x),
rep("numeric", ncol(data)-x+1)))

EDIT 2015-October: As others have commented here the openxlsx and readxl packages are by far faster than the xlsx package and actually manage to open larger Excel files (>1500 rows & > 120 columns). @MichaelChirico demonstrates that readxl is better when speed is preferred and openxlsx replaces the functionality provided by the xlsx package. If you are looking for a package to read, write, and modify Excel files in 2015, pick the openxlsx instead of xlsx.

Pre-2015: I have used xlsxpackage. It changed my workflow with Excel and R. No more annoying pop-ups asking, if I am sure that I want to save my Excel sheet in .txt format. The package also writes Excel files.

However, I find read.xlsx function slow, when opening large Excel files. read.xlsx2 function is considerably faster, but does not quess the vector class of data.frame columns. You have to use colClasses command to specify desired column classes, if you use read.xlsx2 function. Here is a practical example:

read.xlsx("filename.xlsx", 1) reads your file and makes the data.frame column classes nearly useful, but is very slow for large data sets. Works also for .xls files.

read.xlsx2("filename.xlsx", 1) is faster, but you will have to define column classes manually. A shortcut is to run the command twice (see the example below). character specification converts your columns to factors. Use Dateand POSIXct options for time.

coln <- function(x){y <- rbind(seq(1,ncol(x))); colnames(y) <- colnames(x)
rownames(y) <- "col.number"; return(y)} # A function to see column numbers

data <- read.xlsx2("filename.xlsx", 1) # Open the file 

coln(data)    # Check the column numbers you want to have as factors

x <- 3 # Say you want columns 1-3 as factors, the rest numeric

data <- read.xlsx2("filename.xlsx", 1, colClasses= c(rep("character", x),
rep("numeric", ncol(data)-x+1)))
春庭雪 2024-11-15 20:14:00

鉴于在 R 中读取 Excel 文件的不同方法的激增以及这里的大量答案,我想我应该尝试阐明这里提到的哪个选项表现最好(在一些简单的情况)。

自从我开始使用 R 以来,我自己就一直在使用 xlsx ,如果没有别的原因,只是为了惯性,而且我最近注意到似乎没有任何关于哪个包有效的客观信息更好的。

任何基准测试工作都充满困难,因为某些软件包肯定比其他软件包能更好地处理某些情况,并且还有大量其他警告。

也就是说,我正在使用一个(可重现的)数据集,我认为该数据集的格式非常常见(8 个字符串字段、3 个数字、1 个整数、3 个日期):

set.seed(51423)
data.frame(
  str1 = sample(sprintf("%010d", 1:NN)), #ID field 1
  str2 = sample(sprintf("%09d", 1:NN)),  #ID field 2
  #varying length string field--think names/addresses, etc.
  str3 = 
    replicate(NN, paste0(sample(LETTERS, sample(10:30, 1L), TRUE),
                         collapse = "")),
  #factor-like string field with 50 "levels"
  str4 = sprintf("%05d", sample(sample(1e5, 50L), NN, TRUE)),
  #factor-like string field with 17 levels, varying length
  str5 = 
    sample(replicate(17L, paste0(sample(LETTERS, sample(15:25, 1L), TRUE),
                                 collapse = "")), NN, TRUE),
  #lognormally distributed numeric
  num1 = round(exp(rnorm(NN, mean = 6.5, sd = 1.5)), 2L),
  #3 binary strings
  str6 = sample(c("Y","N"), NN, TRUE),
  str7 = sample(c("M","F"), NN, TRUE),
  str8 = sample(c("B","W"), NN, TRUE),
  #right-skewed integer
  int1 = ceiling(rexp(NN)),
  #dates by month
  dat1 = 
    sample(seq(from = as.Date("2005-12-31"), 
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  dat2 = 
    sample(seq(from = as.Date("2005-12-31"), 
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  num2 = round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L),
  #date by day
  dat3 = 
    sample(seq(from = as.Date("2015-06-01"), 
               to = as.Date("2015-07-15"), by = "day"),
           NN, TRUE),
  #lognormal numeric that can be positive or negative
  num3 = 
    (-1) ^ sample(2, NN, TRUE) * round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L)
)

然后我将其写入 csv 并在 LibreOffice 中打开并保存作为 .xlsx 文件,然后对本线程中提到的 4 个软件包进行基准测试:xlsxopenxlsxreadxlgdata< /code>,使用默认选项(我也尝试了是否指定列类型的版本,但这并没有改变排名)。

我排除了 RODBC 因为我使用的是 Linux; XLConnect 因为它的主要目的似乎不是读取单个 Excel 工作表,而是导入整个 Excel 工作簿,因此仅依靠其读取功能来竞争似乎不公平;和 xlsReadWrite ,因为它不再与我的 R 版本兼容(似乎已被淘汰)。

然后,我使用 NN=1000LNN=25000L 运行基准测试(在上面的 data.frame 每个声明之前重置种子)以允许Excel 文件大小的差异。 gc 主要用于 xlsx,我发现它有时会造成内存堵塞。言归正传,以下是我发现的结果:

1,000 行 Excel 文件

benchmark1k <-
  microbenchmark(times = 100L,
                 xlsx = {xlsx::read.xlsx2(fl, sheetIndex=1); invisible(gc())},
                 openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
                 readxl = {readxl::read_excel(fl); invisible(gc())},
                 gdata = {gdata::read.xls(fl); invisible(gc())})

# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval
#      xlsx  194.1958  199.2662  214.1512  201.9063  212.7563  354.0327   100
#  openxlsx  142.2074  142.9028  151.9127  143.7239  148.0940  255.0124   100
#    readxl  122.0238  122.8448  132.4021  123.6964  130.2881  214.5138   100
#     gdata 2004.4745 2042.0732 2087.8724 2062.5259 2116.7795 2425.6345   100

因此,readxl 是获胜者,openxlsx 具有竞争力,而 gdata 则明显失败者。相对于列最小值进行每个测量:

#       expr   min    lq  mean median    uq   max
# 1     xlsx  1.59  1.62  1.62   1.63  1.63  1.65
# 2 openxlsx  1.17  1.16  1.15   1.16  1.14  1.19
# 3   readxl  1.00  1.00  1.00   1.00  1.00  1.00
# 4    gdata 16.43 16.62 15.77  16.67 16.25 11.31

我们看到我自己最喜欢的 xlsxreadxl 慢 60%。

25,000 行 Excel 文件

由于需要花费大量时间,我只对较大的文件重复了 20 次,否则命令是相同的。这是原始数据:

# Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#      xlsx  4451.9553  4539.4599  4738.6366  4762.1768  4941.2331  5091.0057    20
#  openxlsx   962.1579   981.0613   988.5006   986.1091   992.6017  1040.4158    20
#    readxl   341.0006   344.8904   347.0779   346.4518   348.9273   360.1808    20
#     gdata 43860.4013 44375.6340 44848.7797 44991.2208 45251.4441 45652.0826    20

这是相对数据:

#       expr    min     lq   mean median     uq    max
# 1     xlsx  13.06  13.16  13.65  13.75  14.16  14.13
# 2 openxlsx   2.82   2.84   2.85   2.85   2.84   2.89
# 3   readxl   1.00   1.00   1.00   1.00   1.00   1.00
# 4    gdata 128.62 128.67 129.22 129.86 129.69 126.75

因此,在速度方面,readxl 是明显的赢家。 gdata 最好有其他办法,因为读取 Excel 文件的速度非常慢,而且对于较大的表,这个问题只会加剧。

openxlsx 的两个优点是 1) 其广泛的其他方法(readxl 被设计为做一件事,这可能是它如此重要的部分原因快),尤其是它的 write.xlsx 函数,以及 2)(对于 readxl 来说更多的缺点)readxl 中的 col_types 参数 仅(截至撰写本文时)接受一些非标准 R"text" 而不是 "character""日期” 而不是“日期”

Given the proliferation of different ways to read an Excel file in R and the plethora of answers here, I thought I'd try to shed some light on which of the options mentioned here perform the best (in a few simple situations).

I myself have been using xlsx since I started using R, for inertia if nothing else, and I recently noticed there doesn't seem to be any objective information about which package works better.

Any benchmarking exercise is fraught with difficulties as some packages are sure to handle certain situations better than others, and a waterfall of other caveats.

That said, I'm using a (reproducible) data set that I think is in a pretty common format (8 string fields, 3 numeric, 1 integer, 3 dates):

set.seed(51423)
data.frame(
  str1 = sample(sprintf("%010d", 1:NN)), #ID field 1
  str2 = sample(sprintf("%09d", 1:NN)),  #ID field 2
  #varying length string field--think names/addresses, etc.
  str3 = 
    replicate(NN, paste0(sample(LETTERS, sample(10:30, 1L), TRUE),
                         collapse = "")),
  #factor-like string field with 50 "levels"
  str4 = sprintf("%05d", sample(sample(1e5, 50L), NN, TRUE)),
  #factor-like string field with 17 levels, varying length
  str5 = 
    sample(replicate(17L, paste0(sample(LETTERS, sample(15:25, 1L), TRUE),
                                 collapse = "")), NN, TRUE),
  #lognormally distributed numeric
  num1 = round(exp(rnorm(NN, mean = 6.5, sd = 1.5)), 2L),
  #3 binary strings
  str6 = sample(c("Y","N"), NN, TRUE),
  str7 = sample(c("M","F"), NN, TRUE),
  str8 = sample(c("B","W"), NN, TRUE),
  #right-skewed integer
  int1 = ceiling(rexp(NN)),
  #dates by month
  dat1 = 
    sample(seq(from = as.Date("2005-12-31"), 
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  dat2 = 
    sample(seq(from = as.Date("2005-12-31"), 
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  num2 = round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L),
  #date by day
  dat3 = 
    sample(seq(from = as.Date("2015-06-01"), 
               to = as.Date("2015-07-15"), by = "day"),
           NN, TRUE),
  #lognormal numeric that can be positive or negative
  num3 = 
    (-1) ^ sample(2, NN, TRUE) * round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L)
)

I then wrote this to csv and opened in LibreOffice and saved it as an .xlsx file, then benchmarked 4 of the packages mentioned in this thread: xlsx, openxlsx, readxl, and gdata, using the default options (I also tried a version of whether or not I specify column types, but this didn't change the rankings).

I'm excluding RODBC because I'm on Linux; XLConnect because it seems its primary purpose is not reading in single Excel sheets but importing entire Excel workbooks, so to put its horse in the race on only its reading capabilities seems unfair; and xlsReadWrite because it is no longer compatible with my version of R (seems to have been phased out).

I then ran benchmarks with NN=1000L and NN=25000L (resetting the seed before each declaration of the data.frame above) to allow for differences with respect to Excel file size. gc is primarily for xlsx, which I've found at times can create memory clogs. Without further ado, here are the results I found:

1,000-Row Excel File

benchmark1k <-
  microbenchmark(times = 100L,
                 xlsx = {xlsx::read.xlsx2(fl, sheetIndex=1); invisible(gc())},
                 openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
                 readxl = {readxl::read_excel(fl); invisible(gc())},
                 gdata = {gdata::read.xls(fl); invisible(gc())})

# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval
#      xlsx  194.1958  199.2662  214.1512  201.9063  212.7563  354.0327   100
#  openxlsx  142.2074  142.9028  151.9127  143.7239  148.0940  255.0124   100
#    readxl  122.0238  122.8448  132.4021  123.6964  130.2881  214.5138   100
#     gdata 2004.4745 2042.0732 2087.8724 2062.5259 2116.7795 2425.6345   100

So readxl is the winner, with openxlsx competitive and gdata a clear loser. Taking each measure relative to the column minimum:

#       expr   min    lq  mean median    uq   max
# 1     xlsx  1.59  1.62  1.62   1.63  1.63  1.65
# 2 openxlsx  1.17  1.16  1.15   1.16  1.14  1.19
# 3   readxl  1.00  1.00  1.00   1.00  1.00  1.00
# 4    gdata 16.43 16.62 15.77  16.67 16.25 11.31

We see my own favorite, xlsx is 60% slower than readxl.

25,000-Row Excel File

Due to the amount of time it takes, I only did 20 repetitions on the larger file, otherwise the commands were identical. Here's the raw data:

# Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#      xlsx  4451.9553  4539.4599  4738.6366  4762.1768  4941.2331  5091.0057    20
#  openxlsx   962.1579   981.0613   988.5006   986.1091   992.6017  1040.4158    20
#    readxl   341.0006   344.8904   347.0779   346.4518   348.9273   360.1808    20
#     gdata 43860.4013 44375.6340 44848.7797 44991.2208 45251.4441 45652.0826    20

Here's the relative data:

#       expr    min     lq   mean median     uq    max
# 1     xlsx  13.06  13.16  13.65  13.75  14.16  14.13
# 2 openxlsx   2.82   2.84   2.85   2.85   2.84   2.89
# 3   readxl   1.00   1.00   1.00   1.00   1.00   1.00
# 4    gdata 128.62 128.67 129.22 129.86 129.69 126.75

So readxl is the clear winner when it comes to speed. gdata better have something else going for it, as it's painfully slow in reading Excel files, and this problem is only exacerbated for larger tables.

Two draws of openxlsx are 1) its extensive other methods (readxl is designed to do only one thing, which is probably part of why it's so fast), especially its write.xlsx function, and 2) (more of a drawback for readxl) the col_types argument in readxl only (as of this writing) accepts some nonstandard R: "text" instead of "character" and "date" instead of "Date".

往事风中埋 2024-11-15 20:14:00
library(RODBC)
file.name <- "file.xls"
sheet.name <- "Sheet Name"

## Connect to Excel File Pull and Format Data
excel.connect <- odbcConnectExcel(file.name)
dat <- sqlFetch(excel.connect, sheet.name, na.strings=c("","-"))
odbcClose(excel.connect)

就我个人而言,我喜欢 RODBC 并且可以推荐它。

library(RODBC)
file.name <- "file.xls"
sheet.name <- "Sheet Name"

## Connect to Excel File Pull and Format Data
excel.connect <- odbcConnectExcel(file.name)
dat <- sqlFetch(excel.connect, sheet.name, na.strings=c("","-"))
odbcClose(excel.connect)

Personally, I like RODBC and can recommend it.

从来不烧饼 2024-11-15 20:14:00

今天刚刚尝试了 openxlsx 包。它运行得非常好(而且速度很快)。

http://cran.r-project.org/web/packages/openxlsx /index.html

Just gave the package openxlsx a try today. It worked really well (and fast).

http://cran.r-project.org/web/packages/openxlsx/index.html

笑着哭最痛 2024-11-15 20:14:00

另一个解决方案是 xlsReadWrite 包,它不需要额外的安装,但需要您在第一次使用它之前下载额外的 shlib:

require(xlsReadWrite)
xls.getshlib()

忘记这一点可能会导致彻底的挫败感。去过那里等等...

旁注:您可能需要考虑转换为基于文本的格式(例如 csv)并从那里读取。这有多种原因:

  • 无论您使用什么解决方案(RODBC、gdata、xlsReadWrite),当您的数据转换时,都可能会发生一些奇怪的事情。尤其是日期可能相当麻烦。 HFWutils 包有一些工具可以处理 EXCEL 日期(根据 @Ben Bolker 的评论)。

  • 如果您有大工作表,则在文本文件中读取比从 EXCEL 中读取速度快。

  • 对于 .xls 和 .xlsx 文件,可能需要不同的解决方案。 EG xlsReadWrite 包目前不支持 .xlsx AFAIK。 gdata 要求您安装额外的 perl 库以支持 .xlsx。 xlsx 包可以处理以下扩展相同的名称。

Another solution is the xlsReadWrite package, which doesn't require additional installs but does require you download the additional shlib before you use it the first time by :

require(xlsReadWrite)
xls.getshlib()

Forgetting this can cause utter frustration. Been there and all that...

On a sidenote : You might want to consider converting to a text-based format (eg csv) and read in from there. This for a number of reasons :

  • whatever your solution (RODBC, gdata, xlsReadWrite) some strange things can happen when your data gets converted. Especially dates can be rather cumbersome. The HFWutils package has some tools to deal with EXCEL dates (per @Ben Bolker's comment).

  • if you have large sheets, reading in text files is faster than reading in from EXCEL.

  • for .xls and .xlsx files, different solutions might be necessary. EG the xlsReadWrite package currently does not support .xlsx AFAIK. gdata requires you to install additional perl libraries for .xlsx support. xlsx package can handle extensions of the same name.

生生不灭 2024-11-15 20:14:00

正如上面许多其他答案中所述,有许多好的软件包可以连接到 XLS/X 文件并以合理的方式获取数据。但是,应该警告您,在任何情况下都不应使用剪贴板(或 .csv)文件从 Excel 检索数据。要了解原因,请在 Excel 的单元格中输入 =1/3。现在,将您可见的小数位数减少到两位。然后将数据复制并粘贴到 R 中。现在保存 CSV。您会注意到,在这两种情况下,Excel 仅保留您通过界面可见的数据,而您已经失去了实际源数据中的所有精度。

As noted above in many of the other answers, there are many good packages that connect to the XLS/X file and get the data in a reasonable way. However, you should be warned that under no circumstances should you use the clipboard (or a .csv) file to retrieve data from Excel. To see why, enter =1/3 into a cell in excel. Now, reduce the number of decimal points visible to you to two. Then copy and paste the data into R. Now save the CSV. You'll notice in both cases Excel has helpfully only kept the data that was visible to you through the interface and you've lost all of the precision in your actual source data.

痴者 2024-11-15 20:14:00

扩展@Mikko提供的答案,您可以使用一个巧妙的技巧来加快速度,而不必提前“知道”您的列类。只需使用 read.xlsx 获取有限数量的记录来确定类,然后使用 read.xlsx2

示例

# just the first 50 rows should do...
df.temp <- read.xlsx("filename.xlsx", 1, startRow=1, endRow=50) 
df.real <- read.xlsx2("filename.xlsx", 1, 
                      colClasses=as.vector(sapply(df.temp, mode)))

Expanding on the answer provided by @Mikko you can use a neat trick to speed things up without having to "know" your column classes ahead of time. Simply use read.xlsx to grab a limited number of records to determine the classes and then followed it up with read.xlsx2

Example

# just the first 50 rows should do...
df.temp <- read.xlsx("filename.xlsx", 1, startRow=1, endRow=50) 
df.real <- read.xlsx2("filename.xlsx", 1, 
                      colClasses=as.vector(sapply(df.temp, mode)))
魄砕の薆 2024-11-15 20:14:00

Excel 文件可以直接读入 R,如下所示:

my_data <- read.table(file = "xxxxxx.xls", sep = "\t", header=TRUE)

使用 readxl 包读取 xls 和 xlxs 文件

library("readxl")
my_data <- read_excel("xxxxx.xls")
my_data <- read_excel("xxxxx.xlsx")

An Excel file can be read directly into R as follows:

my_data <- read.table(file = "xxxxxx.xls", sep = "\t", header=TRUE)

Reading xls and xlxs files using readxl package

library("readxl")
my_data <- read_excel("xxxxx.xls")
my_data <- read_excel("xxxxx.xlsx")
清醇 2024-11-15 20:14:00

openxlsx2

对于 xlsx 文件,您可以使用 openxlsx2 包,它旨在简化/增强原始 openxlsx 包。它于 2022 年添加到 CRAN:

library(openxlsx2)

df <- read_xlsx("Book1.xlsx")

# OR 

df <- wb_read("Book1.xlsx")

您可以使用 wb_load() 保留工作簿格式、样式等:

wb <- wb_load("Book1.xlsx")
df <- wb_to_df(wb) # specify importing options

read_xlsx()wb_read()< /code> 是 wb_to_df() 的包装器

openxlsx2

For just xlsx files you can use the openxlsx2 package, which is meant to simplify/enhance the original openxlsx package. It was added to the CRAN in 2022:

library(openxlsx2)

df <- read_xlsx("Book1.xlsx")

# OR 

df <- wb_read("Book1.xlsx")

You can preserve workbook formatting, styles, etc. by using wb_load():

wb <- wb_load("Book1.xlsx")
df <- wb_to_df(wb) # specify importing options

read_xlsx() and wb_read() are wrappers of wb_to_df()

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文