处理大型 xlsx 文件

发布于 2024-10-15 14:46:44 字数 444 浏览 8 评论 0原文

我需要自动调整大型(30k+ 行)xlsx 文件中的所有行。

以下代码通过 apache poi 适用于小文件,但在大文件上会出现 OutOfMemoryError

Workbook workbook = WorkbookFactory.create(inputStream);
Sheet sheet = workbook.getSheetAt(0);

for (Row row : sheet) {
    row.setHeight((short) -1);
}

workbook.write(outputStream);

更新: 不幸的是,增加堆大小不是一个选项 - OutOfMemoryError 出现在 -Xmx1024m 处,并且 30k 行不是上限。

I need to auto-fit all rows in large (30k+ rows) xlsx file.

The following code via apache poi works on small files, but goes out with OutOfMemoryError on large ones:

Workbook workbook = WorkbookFactory.create(inputStream);
Sheet sheet = workbook.getSheetAt(0);

for (Row row : sheet) {
    row.setHeight((short) -1);
}

workbook.write(outputStream);

Update: Unfortunately, increasing heap size is not an option - OutOfMemoryError appears at -Xmx1024m and 30k rows is not an upper limit.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

惟欲睡 2024-10-22 14:46:44

尝试使用事件 API。请参阅事件 API(仅限 HSSF)XSSF 和 SAX(事件 API) 了解详细信息。该页面的几句话:

HSSF:

事件 API 比用户 API 新。它适用于愿意学习一些低级 API 结构的中级开发人员。它使用起来相对简单,但需要对 Excel 文件的各个部分有基本的了解(或愿意学习)。提供的优点是您可以读取内存占用相对较小的 XLS。

XSSF:

如果内存占用是一个问题,那么对于 XSSF,您可以获取底层 XML 数据,并自行处理它。这适用于愿意学习一点 .xlsx 文件的低级结构并且乐于在 java 中处理 XML 的中级开发人员。它使用起来相对简单,但需要对文件结构有基本的了解。提供的优点是您可以读取内存占用相对较小的 XLSX 文件。

对于输出,博客文章 流式传输 xlsx 文件。 (基本上,使用 XSSF 生成容器 XML 文件,然后将实际内容作为纯文本流式传输到 xlsx zip 存档的相应 xml 部分。)

Try using the event API. See Event API (HSSF only) and XSSF and SAX (Event API) in the POI documentation for details. A couple of quotes from that page:

HSSF:

The event API is newer than the User API. It is intended for intermediate developers who are willing to learn a little bit of the low level API structures. Its relatively simple to use, but requires a basic understanding of the parts of an Excel file (or willingness to learn). The advantage provided is that you can read an XLS with a relatively small memory footprint.

XSSF:

If memory footprint is an issue, then for XSSF, you can get at the underlying XML data, and process it yourself. This is intended for intermediate developers who are willing to learn a little bit of low level structure of .xlsx files, and who are happy processing XML in java. Its relatively simple to use, but requires a basic understanding of the file structure. The advantage provided is that you can read a XLSX file with a relatively small memory footprint.

For output, one possible approach is described in the blog post Streaming xlsx files. (Basically, use XSSF to generate a container XML file, then stream the actual content as plain text into the appropriate xml part of the xlsx zip archive.)

樱娆 2024-10-22 14:46:44

通过使用文件而不是流可以显着提高内存使用率。
(最好使用流 API,但流 API 有限制,请参阅 http://poi.apache.org/电子表格/index.html

所以而不是这样

Workbook workbook = WorkbookFactory.create(inputStream);

Workbook workbook = WorkbookFactory.create(new File("yourfile.xlsx"));

这是根据:http://poi .apache.org/spreadsheet/quick-guide.html#FileInputStream

文件与 InputStreams

“打开工作簿(无论是 .xls HSSFWorkbook 或 .xlsx XSSFWorkbook)时,工作簿可以是从文件或输入流加载。使用文件对象可以降低内存消耗,而输入流则需要更多内存,因为它必须缓冲整个文件。”

A dramatic improvement in memory usage can be done by using a File instead of a Stream.
(It is better to use a streaming API, but the Streaming API's have limitations, see http://poi.apache.org/spreadsheet/index.html)

So instead of

Workbook workbook = WorkbookFactory.create(inputStream);

do

Workbook workbook = WorkbookFactory.create(new File("yourfile.xlsx"));

This is according to : http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream

Files vs InputStreams

"When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file."

爱本泡沫多脆弱 2024-10-22 14:46:44

我遇到了同样的问题,行数少了很多,但字符串很大。

由于我不必保持数据加载,因此我发现可以使用 SXSSF 而不是 XSSF。

它们具有相似的界面,如果您已经编写了大量代码,这会有所帮助。但使用 SXSSF 可以设置保持加载的行数。

这是链接。
http://poi.apache.org/spreadsheet/how-to.html#sxssf

I was having the same problem with a lot less of row, but large strings.

Since I don't have to keep my data loaded, I found out that I can use SXSSF instead of XSSF.

They have similar interfaces, which helps if you have a lot of code already writen. But with SXSSF it is possible to set the amount of rows you keep loaded.

Here is the link.
http://poi.apache.org/spreadsheet/how-to.html#sxssf

一紙繁鸢 2024-10-22 14:46:44

如果您想自动调整或设置样式或将所有行写入大型(30k+ 行)xlsx 文件中,请使用 SXSSFWorkbook。这是可以帮助您的示例代码...

SXSSFWorkbook wb = new SXSSFWorkbook();
            SXSSFSheet sheet = (SXSSFSheet) wb.createSheet("writetoexcel");
            Font font = wb.createFont();
                font.setBoldweight((short) 700);
                // Create Styles for sheet.
                XSSFCellStyle Style = (XSSFCellStyle) wb.createCellStyle();
                Style.setFillForegroundColor(new XSSFColor(java.awt.Color.LIGHT_GRAY));
                Style.setFillPattern(XSSFCellStyle.SOLID_FOREGROUND);
                Style.setFont(font);
                //iterating r number of rows
            for (int r=0;r < 30000; r++ )
            {
                Row row = sheet.createRow(r);
                //iterating c number of columns
                for (int c=0;c < 75; c++ )
                {
                    Cell cell = row.createCell(c);
                    cell.setCellValue("Hello"); 
                    cell.setCellStyle(Style);
                }
    }
            FileOutputStream fileOut = new FileOutputStream("E:" + File.separator + "NewTest.xlsx");

If you want to auto-fit or set styles or write all rows in large (30k+ rows) xlsx file,use SXSSFWorkbook.Here is the sample code that helps you...

SXSSFWorkbook wb = new SXSSFWorkbook();
            SXSSFSheet sheet = (SXSSFSheet) wb.createSheet("writetoexcel");
            Font font = wb.createFont();
                font.setBoldweight((short) 700);
                // Create Styles for sheet.
                XSSFCellStyle Style = (XSSFCellStyle) wb.createCellStyle();
                Style.setFillForegroundColor(new XSSFColor(java.awt.Color.LIGHT_GRAY));
                Style.setFillPattern(XSSFCellStyle.SOLID_FOREGROUND);
                Style.setFont(font);
                //iterating r number of rows
            for (int r=0;r < 30000; r++ )
            {
                Row row = sheet.createRow(r);
                //iterating c number of columns
                for (int c=0;c < 75; c++ )
                {
                    Cell cell = row.createCell(c);
                    cell.setCellValue("Hello"); 
                    cell.setCellStyle(Style);
                }
    }
            FileOutputStream fileOut = new FileOutputStream("E:" + File.separator + "NewTest.xlsx");
鹿港小镇 2024-10-22 14:46:44

我对 HSSF 文件 (.xls) 使用了事件 API,并且发现有关记录顺序的文档严重缺乏。

I used Event API for a HSSF file (.xls), and I discovered terrible lack of documentation about order of records.

久夏青 2024-10-22 14:46:44

如果您写入到 XLSX,我发现通过写入同一 Excel 文件的不同工作表可以得到改进。您还可能会通过写入不同的 Excel 文件来发现改进。但首先尝试写入不同的纸张。

If you are writing to XLSX, I found an improvement by writing to different sheets of the same Excel file. You also might find an improvement by writing to different Excel files. But first try writing to different sheets.

北笙凉宸 2024-10-22 14:46:44

以下堆栈溢出线程描述了最好的示例:
通过 Apache POI 读取大型 Excel 文件 (xlsx) 时出错

该主题主要答案中的代码片段说明了 Apache POI 围绕 SAX xml 解析的包装,以及如何简单地循环遍历所有工作表,然后循环遍历每个单独的单元格。

对于 Apache POI API 的当前实现,该代码已过时,因为 endRow() api 提供已完成处理的当前行号。

使用该代码片段,您可以轻松地逐个单元地解析大型 XLSX 文件。例如对于每张纸;对于每个行单元格;行已结束事件。
您可以简单地创建应用程序逻辑,在每行的位置创建一个从 columneName 到 cellValue 的映射。

The best example for this is described in the following stack overflow thread:
Error While Reading Large Excel Files (xlsx) Via Apache POI

The code snippet in the main answer in that topic illustrates the Apache POI wrappings around SAX xml parsing, and how you can trivially loop over all the sheets and then over each individual cell.

The code is stale with current implementation of the Apache POI API, as the endRow() api provides the current row number that has finished to be processing.

With that code snippet it should be trivial for your to parse a big XLSX file cell by cell. E.g. for each sheet; for each row cell; row has ended event.
You could trivial create app logic where at the of each row you create a Map of columneName to cellValue.

诺曦 2024-10-22 14:46:44

我对 800,000 个单元和 3M 字符也遇到了同样的问题,其中 XSSF 分配了 1GB 堆!

我使用 Python 以及 openpyxlnumpy 来读取 xlsx 文件(来自 Java 代码),并首先将其转换为普通文本。然后我用java加载了文本文件。它可能看起来有很大的开销,但它确实很快。

python脚本看起来像

import openpyxl as px
import numpy as np

# xlsx file is given through command line foo.xlsx
fname = sys.argv[1]
W = px.load_workbook(fname, read_only = True)
p = W.get_sheet_by_name(name = 'Sheet1')

a=[]
# number of rows and columns
m = p.max_row
n = p.max_column

for row in p.iter_rows():
    for k in row:
        a.append(k.value)

# convert list a to matrix (for example maxRows*maxColumns)
aa= np.resize(a, [m, n])

# output file is also given in the command line foo.txt
oname = sys.argv[2]
print (oname)
file = open(oname,"w")
mm = m-1
for i in range(mm):
    for j in range(n):
        file.write( "%s " %aa[i,j]  )
    file.write ("\n")

# to prevent extra newline in the text file
for j in range(n):
    file.write("%s " %aa[m-1,j])

file.close()

然后在我的java代码中,我写了

try {
  // `pwd`\python_script  foo.xlsx  foo.txt
  String pythonScript =  System.getProperty("user.dir") + "\\exread.py ";
  String cmdline = "python " + pythonScript +
                    workingDirectoryPath + "\\" + fullFileName + " " + 
                    workingDirectoryPath + "\\" + shortFileName + ".txt";
  Process p = Runtime.getRuntime().exec(cmdline);
  int exitCode = p.waitFor();
  if (exitCode != 0) {
    throw new IOException("Python command exited with " + exitCode);
  }
} catch (IOException e) {
  System.out.println( e.getMessage() );
} catch (InterruptedException e) {
  ReadInfo.append(e.getMessage() );
}

之后,你会得到foo.txt,它类似于foo.xlsx,但是是文本格式。

I had the same problem with 800,000 cells and 3M characters where XSSF allocates 1GB of heap!

I used Python with openpyxl and numpy to read the xlsx file (from Java code) and first convert it to a normal text. Then I loaded the text file in java. It may seems to have large overhead, but it is indeed fast.

The python script looks like

import openpyxl as px
import numpy as np

# xlsx file is given through command line foo.xlsx
fname = sys.argv[1]
W = px.load_workbook(fname, read_only = True)
p = W.get_sheet_by_name(name = 'Sheet1')

a=[]
# number of rows and columns
m = p.max_row
n = p.max_column

for row in p.iter_rows():
    for k in row:
        a.append(k.value)

# convert list a to matrix (for example maxRows*maxColumns)
aa= np.resize(a, [m, n])

# output file is also given in the command line foo.txt
oname = sys.argv[2]
print (oname)
file = open(oname,"w")
mm = m-1
for i in range(mm):
    for j in range(n):
        file.write( "%s " %aa[i,j]  )
    file.write ("\n")

# to prevent extra newline in the text file
for j in range(n):
    file.write("%s " %aa[m-1,j])

file.close()

Then in my java code, I wrote

try {
  // `pwd`\python_script  foo.xlsx  foo.txt
  String pythonScript =  System.getProperty("user.dir") + "\\exread.py ";
  String cmdline = "python " + pythonScript +
                    workingDirectoryPath + "\\" + fullFileName + " " + 
                    workingDirectoryPath + "\\" + shortFileName + ".txt";
  Process p = Runtime.getRuntime().exec(cmdline);
  int exitCode = p.waitFor();
  if (exitCode != 0) {
    throw new IOException("Python command exited with " + exitCode);
  }
} catch (IOException e) {
  System.out.println( e.getMessage() );
} catch (InterruptedException e) {
  ReadInfo.append(e.getMessage() );
}

After that, you will get foo.txt which is similar to foo.xlsx, but in text format.

孤檠 2024-10-22 14:46:44

我使用 SAX 解析器来处理 XML 结构。它适用于 XLSX 文件。

https://stackoverflow.com/a/44969009/4587961

I used SAX parser to process XML structure. It works for XLSX files.

https://stackoverflow.com/a/44969009/4587961

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文