Handling large xlsx files
I need to auto-fit all rows in a large (30k+ rows) xlsx file.
The following code, using Apache POI, works on small files but fails with an OutOfMemoryError on large ones:
Workbook workbook = WorkbookFactory.create(inputStream);
Sheet sheet = workbook.getSheetAt(0);
for (Row row : sheet) {
    row.setHeight((short) -1);
}
workbook.write(outputStream);
Update: Unfortunately, increasing the heap size is not an option - the OutOfMemoryError already appears at -Xmx1024m, and 30k rows is not an upper limit.
Try using the event API. See Event API (HSSF only) and XSSF and SAX (Event API) in the POI documentation for details.
For output, one possible approach is described in the blog post Streaming xlsx files. (Basically, use XSSF to generate a container XML file, then stream the actual content as plain text into the appropriate xml part of the xlsx zip archive.)
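The event-driven style is easy to demonstrate even without POI: the sheet parts inside an xlsx are ordinary XML, so a SAX handler that reacts to row events never holds the whole sheet in memory. A stdlib-only sketch of the idea (POI's XSSFReader hands you the same per-sheet streams; the XML sample in the test is simplified):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Counts rows in a spreadsheetML sheet part without ever building a DOM,
// which is the property that keeps memory flat for huge sheets.
class RowCounter extends DefaultHandler {
    private int rows = 0;

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        if ("row".equals(qName)) { // one <row> element per spreadsheet row
            rows++;
        }
    }

    static int countRows(String sheetXml) {
        try {
            RowCounter handler = new RowCounter();
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.rows;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

In real code the handler would also collect cell values, but the shape stays the same: react to events, keep only per-row state.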
A dramatic improvement in memory usage can be achieved by using a File instead of a Stream.
(It is better to use a streaming API, but the streaming APIs have limitations; see http://poi.apache.org/spreadsheet/index.html)
So instead of
Workbook workbook = WorkbookFactory.create(inputStream);
do
Workbook workbook = WorkbookFactory.create(new File("MyExcel.xlsx"));
This is according to http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream
Files vs InputStreams
"When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file."
I was having the same problem with far fewer rows, but large strings.
Since I don't have to keep my data loaded, I found out that I can use SXSSF instead of XSSF.
They have similar interfaces, which helps if you have a lot of code already written. But with SXSSF it is possible to set the number of rows you keep loaded.
Here is the link:
http://poi.apache.org/spreadsheet/how-to.html#sxssf
If you want to auto-fit or set styles or write all rows in a large (30k+ rows) xlsx file, use SXSSFWorkbook. Here is the sample code that helps you...
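The sample code itself did not survive in this answer. A sketch of the usual SXSSFWorkbook pattern (not the author's original code; the window size, sheet name, and file name are placeholders):

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

class SxssfExample {
    public static void main(String[] args) throws Exception {
        // Keep only 100 rows in memory; older rows are flushed to a temp file
        // and can no longer be modified, so set heights while the row is live.
        SXSSFWorkbook workbook = new SXSSFWorkbook(100);
        Sheet sheet = workbook.createSheet("data");
        for (int r = 0; r < 30_000; r++) {
            Row row = sheet.createRow(r);
            row.setHeight((short) -1); // auto-fit, as in the question
            Cell cell = row.createCell(0);
            cell.setCellValue("row " + r);
        }
        FileOutputStream out = new FileOutputStream("large.xlsx");
        workbook.write(out);
        out.close();
        workbook.dispose(); // delete the temporary backing files
    }
}
```

Note the constraint this implies: anything you want to do to a row (heights, styles) must happen before it slides out of the access window.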
I used the Event API for an HSSF file (.xls), and I discovered a terrible lack of documentation about the order of records.
If you are writing to XLSX, I found an improvement by writing to different sheets of the same Excel file. You might also find an improvement by writing to different Excel files, but first try writing to different sheets.
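The bookkeeping for spreading rows across sheets is simple; a sketch of the row-to-sheet routing, independent of any POI types (the 10,000-rows-per-sheet threshold is an arbitrary placeholder):

```java
// Routes a global row index to (sheet index, row index within that sheet),
// so a large data set can be spread across several smaller sheets.
class SheetSplitter {
    static final int ROWS_PER_SHEET = 10_000;

    static int sheetIndex(int globalRow) {
        return globalRow / ROWS_PER_SHEET;
    }

    static int rowInSheet(int globalRow) {
        return globalRow % ROWS_PER_SHEET;
    }
}
```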
The best example for this is described in the following stack overflow thread:
Error While Reading Large Excel Files (xlsx) Via Apache POI
The code snippet in the main answer in that topic illustrates the Apache POI wrappings around SAX xml parsing, and how you can trivially loop over all the sheets and then over each individual cell.
The code is slightly stale with respect to the current Apache POI API; the endRow() callback now provides the number of the row that has just finished processing.
With that code snippet it should be trivial for you to parse a big XLSX file cell by cell, e.g. handling the "for each sheet", "for each row cell", and "row has ended" events.
You could trivially create app logic where, at the end of each row, you build a Map of columnName to cellValue.
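The columnName-to-cellValue mapping needs one small piece of glue: converting the A1-style reference each SAX cell event carries (e.g. "C5") into a column position that can be matched against the header row. A stdlib sketch (POI also ships this as CellReference, so treat this as illustration only):

```java
// Converts the column letters of an A1-style cell reference ("C5" -> 2)
// so values seen by the SAX handler can be matched to header names.
class CellRef {
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (c < 'A' || c > 'Z') {
                break; // digits start the row part of the reference
            }
            col = col * 26 + (c - 'A' + 1);
        }
        return col - 1; // zero-based
    }
}
```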
I had the same problem with 800,000 cells and 3M characters, where XSSF allocated 1GB of heap!
I used Python with openpyxl and numpy to read the xlsx file (invoked from Java code) and first convert it to plain text. Then I loaded the text file in Java. It may seem like a lot of overhead, but it is indeed fast. The Python script looks like
Then in my Java code, I wrote
After that, you will get foo.txt, which is similar to foo.xlsx, but in text format.
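Neither script survived in this answer. The Java half is the simpler one; a sketch of loading the converted file, assuming the Python side wrote tab-separated lines (foo.txt and the tab delimiter are assumptions, not the author's code):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Loads a tab-separated text file (the xlsx converted to text by the
// Python script) row by row; far cheaper than parsing the xlsx in Java.
class TsvLoader {
    // Core: split already-read lines into cells.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split("\t", -1)); // -1 keeps trailing empty cells
        }
        return rows;
    }

    // Convenience: read the converted text file, e.g. Path.of("foo.txt").
    static List<String[]> load(Path file) {
        try {
            return parse(Files.readAllLines(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```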
I used a SAX parser to process the XML structure. It works for XLSX files.
https://stackoverflow.com/a/44969009/4587961