直接从 R 脚本读取 Excel 文件
如何将 Excel 文件直接读取到 R 中?或者我应该首先将数据导出到文本或 CSV 文件,然后将该文件导入到 R 中?
How can I read an Excel file directly into R? Or should I first export the data to a text- or CSV file and import that file into R?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(13)
让我重申 @Chase 的建议:使用 XLConnect。
我认为使用 XLConnect 的原因是:
与其他解决方案相比,XLConnect 有点新,因此在博客文章和参考文档中很少提及。对我来说这非常有用。
Let me reiterate what @Chase recommended: Use XLConnect.
The reasons for using XLConnect are, in my opinion:
XLConnect is somewhat new compared to other solutions so it is less frequently mentioned in blog posts and reference docs. For me it's been very useful.
现在有 readxl:
用法
请注意,虽然描述说“无外部依赖项”,但它确实需要
Rcpp
包,这又需要 Rtools(对于 Windows)或 Xcode(对于OSX),它们是 R 外部的依赖项。尽管许多人出于其他原因安装了它们。And now there is readxl:
Usage
Note that while the description says 'no external dependencies', it does require the
Rcpp
package, which in turn requires Rtools (for Windows) or Xcode (for OSX), which are dependencies external to R. Though many people have them installed for other reasons.是的。请参阅R wiki 上的相关页面。简短的回答:来自
gdata
包的read.xls
在大多数情况下都有效(尽管您需要在系统上安装 Perl——通常在 MacOS 和 Linux 上已经如此,但在 Windows 上需要采取额外的步骤,即参见 http://strawberryperl.com/)。 R wiki 页面上列出了各种警告和替代方案。我认为不直接执行此操作的唯一原因是您可能想要检查电子表格以查看它是否有故障(奇怪的标题,多个工作表[您一次只能读取一个,尽管您显然可以循环遍历所有工作表] ,包括绘图等)。但是对于一个格式良好、带有普通数字和字符数据的矩形电子表格(即,不是逗号格式的数字、日期、被零除错误的公式、缺失值等......)我通常没有问题随着这个过程。
Yes. See the relevant page on the R wiki. Short answer:
read.xls
from thegdata
package works most of the time (although you need to have Perl installed on your system -- usually already true on MacOS and Linux, but takes an extra step on Windows, i.e. see http://strawberryperl.com/). There are various caveats, and alternatives, listed on the R wiki page.The only reason I see not to do this directly is that you may want to examine the spreadsheet to see if it has glitches (weird headers, multiple worksheets [you can only read one at a time, although you can obviously loop over them all], included plots, etc.). But for a well-formed, rectangular spreadsheet with plain numbers and character data (i.e., not comma-formatted numbers, dates, formulas with divide-by-zero errors, missing values, etc. etc. ..) I generally have no problem with this process.
编辑 2015 年 10 月: 正如其他人在这里评论的那样
openxlsx
和readxl
软件包比xlsx
软件包快得多,并且实际上能够打开更大的 Excel 文件(>1500 行 > ; 120 列)。 @MichaelChirico 演示了当速度优先时readxl
更好,并且openxlsx
取代了xlsx
包提供的功能。如果您正在寻找 2015 年读取、写入和修改 Excel 文件的软件包,请选择openxlsx
而不是xlsx
。2015 年之前: 我使用过
xlsx
包。它改变了我使用 Excel 和 R 的工作流程。不再有烦人的弹出窗口询问我是否确定要以 .txt 格式保存 Excel 工作表。该包还写入 Excel 文件。但是,我发现打开大型 Excel 文件时
read.xlsx
函数速度很慢。read.xlsx2
函数速度相当快,但不会查询 data.frame 列的向量类。如果您使用read.xlsx2
函数,则必须使用colClasses
命令来指定所需的列类。下面是一个实际示例:read.xlsx("filename.xlsx", 1)
读取文件并使 data.frame 列类几乎有用,但对于大型数据集来说速度非常慢。也适用于.xls
文件。read.xlsx2("filename.xlsx", 1)
速度更快,但您必须手动定义列类。快捷方式是运行该命令两次(请参见下面的示例)。character
规范将您的列转换为因子。使用Date
和POSIXct
选项来设置时间。EDIT 2015-October: As others have commented here the
openxlsx
andreadxl
packages are by far faster than thexlsx
package and actually manage to open larger Excel files (>1500 rows & > 120 columns). @MichaelChirico demonstrates thatreadxl
is better when speed is preferred andopenxlsx
replaces the functionality provided by thexlsx
package. If you are looking for a package to read, write, and modify Excel files in 2015, pick theopenxlsx
instead ofxlsx
.Pre-2015: I have used
xlsx
package. It changed my workflow with Excel and R. No more annoying pop-ups asking, if I am sure that I want to save my Excel sheet in .txt format. The package also writes Excel files.However, I find
read.xlsx
function slow, when opening large Excel files.read.xlsx2
function is considerably faster, but does not quess the vector class of data.frame columns. You have to usecolClasses
command to specify desired column classes, if you useread.xlsx2
function. Here is a practical example:read.xlsx("filename.xlsx", 1)
reads your file and makes the data.frame column classes nearly useful, but is very slow for large data sets. Works also for.xls
files.read.xlsx2("filename.xlsx", 1)
is faster, but you will have to define column classes manually. A shortcut is to run the command twice (see the example below).character
specification converts your columns to factors. UseDate
andPOSIXct
options for time.鉴于在
R
中读取 Excel 文件的不同方法的激增以及这里的大量答案,我想我应该尝试阐明这里提到的哪个选项表现最好(在一些简单的情况)。自从我开始使用
R
以来,我自己就一直在使用xlsx
,如果没有别的原因,只是为了惯性,而且我最近注意到似乎没有任何关于哪个包有效的客观信息更好的。任何基准测试工作都充满困难,因为某些软件包肯定比其他软件包能更好地处理某些情况,并且还有大量其他警告。
也就是说,我正在使用一个(可重现的)数据集,我认为该数据集的格式非常常见(8 个字符串字段、3 个数字、1 个整数、3 个日期):
然后我将其写入 csv 并在 LibreOffice 中打开并保存作为 .xlsx 文件,然后对本线程中提到的 4 个软件包进行基准测试:
xlsx
、openxlsx
、readxl
和gdata< /code>,使用默认选项(我也尝试了是否指定列类型的版本,但这并没有改变排名)。
我排除了
RODBC
因为我使用的是 Linux;XLConnect
因为它的主要目的似乎不是读取单个 Excel 工作表,而是导入整个 Excel 工作簿,因此仅依靠其读取功能来竞争似乎不公平;和 xlsReadWrite ,因为它不再与我的 R 版本兼容(似乎已被淘汰)。然后,我使用
NN=1000L
和NN=25000L
运行基准测试(在上面的data.frame
每个声明之前重置种子)以允许Excel 文件大小的差异。gc
主要用于xlsx
,我发现它有时会造成内存堵塞。言归正传,以下是我发现的结果:1,000 行 Excel 文件
因此,
readxl
是获胜者,openxlsx
具有竞争力,而gdata
则明显失败者。相对于列最小值进行每个测量:我们看到我自己最喜欢的
xlsx
比readxl
慢 60%。25,000 行 Excel 文件
由于需要花费大量时间,我只对较大的文件重复了 20 次,否则命令是相同的。这是原始数据:
这是相对数据:
因此,在速度方面,
readxl
是明显的赢家。gdata
最好有其他办法,因为读取 Excel 文件的速度非常慢,而且对于较大的表,这个问题只会加剧。openxlsx
的两个优点是 1) 其广泛的其他方法(readxl
被设计为只做一件事,这可能是它如此重要的部分原因快),尤其是它的 write.xlsx 函数,以及 2)(对于 readxl 来说更多的缺点)readxl 中的 col_types 参数 仅(截至撰写本文时)接受一些非标准R
:"text"
而不是"character"
和"日期”
而不是“日期”
。Given the proliferation of different ways to read an Excel file in
R
and the plethora of answers here, I thought I'd try to shed some light on which of the options mentioned here perform the best (in a few simple situations).I myself have been using
xlsx
since I started usingR
, for inertia if nothing else, and I recently noticed there doesn't seem to be any objective information about which package works better.Any benchmarking exercise is fraught with difficulties as some packages are sure to handle certain situations better than others, and a waterfall of other caveats.
That said, I'm using a (reproducible) data set that I think is in a pretty common format (8 string fields, 3 numeric, 1 integer, 3 dates):
I then wrote this to csv and opened in LibreOffice and saved it as an .xlsx file, then benchmarked 4 of the packages mentioned in this thread:
xlsx
,openxlsx
,readxl
, andgdata
, using the default options (I also tried a version of whether or not I specify column types, but this didn't change the rankings).I'm excluding
RODBC
because I'm on Linux;XLConnect
because it seems its primary purpose is not reading in single Excel sheets but importing entire Excel workbooks, so to put its horse in the race on only its reading capabilities seems unfair; andxlsReadWrite
because it is no longer compatible with my version ofR
(seems to have been phased out).I then ran benchmarks with
NN=1000L
andNN=25000L
(resetting the seed before each declaration of thedata.frame
above) to allow for differences with respect to Excel file size.gc
is primarily forxlsx
, which I've found at times can create memory clogs. Without further ado, here are the results I found:1,000-Row Excel File
So
readxl
is the winner, withopenxlsx
competitive andgdata
a clear loser. Taking each measure relative to the column minimum:We see my own favorite,
xlsx
is 60% slower thanreadxl
.25,000-Row Excel File
Due to the amount of time it takes, I only did 20 repetitions on the larger file, otherwise the commands were identical. Here's the raw data:
Here's the relative data:
So
readxl
is the clear winner when it comes to speed.gdata
better have something else going for it, as it's painfully slow in reading Excel files, and this problem is only exacerbated for larger tables.Two draws of
openxlsx
are 1) its extensive other methods (readxl
is designed to do only one thing, which is probably part of why it's so fast), especially itswrite.xlsx
function, and 2) (more of a drawback forreadxl
) thecol_types
argument inreadxl
only (as of this writing) accepts some nonstandardR
:"text"
instead of"character"
and"date"
instead of"Date"
.我在
XLConnect
方面运气不错:http://cran. r-project.org/web/packages/XLConnect/index.htmlI've had good luck with
XLConnect
: http://cran.r-project.org/web/packages/XLConnect/index.html就我个人而言,我喜欢 RODBC 并且可以推荐它。
Personally, I like RODBC and can recommend it.
今天刚刚尝试了
openxlsx
包。它运行得非常好(而且速度很快)。http://cran.r-project.org/web/packages/openxlsx /index.html
Just gave the package
openxlsx
a try today. It worked really well (and fast).http://cran.r-project.org/web/packages/openxlsx/index.html
另一个解决方案是
xlsReadWrite
包,它不需要额外的安装,但需要您在第一次使用它之前下载额外的 shlib:忘记这一点可能会导致彻底的挫败感。去过那里等等...
旁注:您可能需要考虑转换为基于文本的格式(例如 csv)并从那里读取。这有多种原因:
无论您使用什么解决方案(RODBC、gdata、xlsReadWrite),当您的数据转换时,都可能会发生一些奇怪的事情。尤其是日期可能相当麻烦。
HFWutils
包有一些工具可以处理 EXCEL 日期(根据 @Ben Bolker 的评论)。如果您有大工作表,则在文本文件中读取比从 EXCEL 中读取速度快。
对于 .xls 和 .xlsx 文件,可能需要不同的解决方案。 EG xlsReadWrite 包目前不支持 .xlsx AFAIK。
gdata
要求您安装额外的 perl 库以支持 .xlsx。xlsx
包可以处理以下扩展相同的名称。Another solution is the
xlsReadWrite
package, which doesn't require additional installs but does require you download the additional shlib before you use it the first time by :Forgetting this can cause utter frustration. Been there and all that...
On a sidenote : You might want to consider converting to a text-based format (eg csv) and read in from there. This for a number of reasons :
whatever your solution (RODBC, gdata, xlsReadWrite) some strange things can happen when your data gets converted. Especially dates can be rather cumbersome. The
HFWutils
package has some tools to deal with EXCEL dates (per @Ben Bolker's comment).if you have large sheets, reading in text files is faster than reading in from EXCEL.
for .xls and .xlsx files, different solutions might be necessary. EG the xlsReadWrite package currently does not support .xlsx AFAIK.
gdata
requires you to install additional perl libraries for .xlsx support.xlsx
package can handle extensions of the same name.正如上面许多其他答案中所述,有许多好的软件包可以连接到 XLS/X 文件并以合理的方式获取数据。但是,应该警告您,在任何情况下都不应使用剪贴板(或 .csv)文件从 Excel 检索数据。要了解原因,请在 Excel 的单元格中输入
=1/3
。现在,将您可见的小数位数减少到两位。然后将数据复制并粘贴到 R 中。现在保存 CSV。您会注意到,在这两种情况下,Excel 仅保留您通过界面可见的数据,而您已经失去了实际源数据中的所有精度。As noted above in many of the other answers, there are many good packages that connect to the XLS/X file and get the data in a reasonable way. However, you should be warned that under no circumstances should you use the clipboard (or a .csv) file to retrieve data from Excel. To see why, enter
=1/3
into a cell in excel. Now, reduce the number of decimal points visible to you to two. Then copy and paste the data into R. Now save the CSV. You'll notice in both cases Excel has helpfully only kept the data that was visible to you through the interface and you've lost all of the precision in your actual source data.扩展@Mikko提供的答案,您可以使用一个巧妙的技巧来加快速度,而不必提前“知道”您的列类。只需使用
read.xlsx
获取有限数量的记录来确定类,然后使用read.xlsx2
示例
Expanding on the answer provided by @Mikko you can use a neat trick to speed things up without having to "know" your column classes ahead of time. Simply use
read.xlsx
to grab a limited number of records to determine the classes and then followed it up withread.xlsx2
Example
Excel 文件可以直接读入 R,如下所示:
使用 readxl 包读取 xls 和 xlxs 文件
An Excel file can be read directly into R as follows:
Reading xls and xlxs files using readxl package
openxlsx2
对于 xlsx 文件,您可以使用 openxlsx2 包,它旨在简化/增强原始 openxlsx 包。它于 2022 年添加到 CRAN:
您可以使用
wb_load()
保留工作簿格式、样式等:read_xlsx()
和wb_read()< /code> 是
wb_to_df()
的包装器openxlsx2
For just xlsx files you can use the openxlsx2 package, which is meant to simplify/enhance the original openxlsx package. It was added to the CRAN in 2022:
You can preserve workbook formatting, styles, etc. by using
wb_load()
:read_xlsx()
andwb_read()
are wrappers ofwb_to_df()