Handling large xlsx files
I need to auto-fit all rows in a large (30k+ rows) xlsx file.
The following code, using Apache POI, works on small files but fails with an OutOfMemoryError on large ones:
Workbook workbook = WorkbookFactory.create(inputStream);
Sheet sheet = workbook.getSheetAt(0);
for (Row row : sheet) {
    row.setHeight((short) -1);
}
workbook.write(outputStream);
Update: Unfortunately, increasing the heap size is not an option - the OutOfMemoryError already appears at -Xmx1024m, and 30k rows is not an upper limit.
Try using the event API. See Event API (HSSF only) and XSSF and SAX (Event API) in the POI documentation for details.
For output, one possible approach is described in the blog post Streaming xlsx files. (Basically, use XSSF to generate a container XML file, then stream the actual content as plain text into the appropriate xml part of the xlsx zip archive.)
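The event-driven style is easy to demonstrate even without POI: the sheet parts inside an xlsx are ordinary XML, so a SAX handler that reacts to row events never holds the whole sheet in memory. A stdlib-only sketch of the idea (POI's XSSFReader hands you the same per-sheet streams; the XML sample in the test is simplified):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Counts rows in a spreadsheetML sheet part without ever building a DOM,
// which is the property that keeps memory flat for huge sheets.
class RowCounter extends DefaultHandler {
    private int rows = 0;

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        if ("row".equals(qName)) { // one <row> element per spreadsheet row
            rows++;
        }
    }

    static int countRows(String sheetXml) {
        try {
            RowCounter handler = new RowCounter();
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.rows;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

In real code the handler would also collect cell values, but the shape stays the same: react to events, keep only per-row state.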
A dramatic improvement in memory usage can be achieved by using a File instead of a Stream.
(It is better to use a streaming API, but the streaming APIs have limitations; see http://poi.apache.org/spreadsheet/index.html)
So instead of
Workbook workbook = WorkbookFactory.create(inputStream);
do
Workbook workbook = WorkbookFactory.create(new File("MyExcel.xlsx"));
This is according to http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream
Files vs InputStreams
"When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file."
I was having the same problem with far fewer rows, but large strings.
Since I don't have to keep my data loaded, I found out that I can use SXSSF instead of XSSF.
They have similar interfaces, which helps if you have a lot of code already written. But with SXSSF it is possible to set the number of rows you keep loaded.
Here is the link:
http://poi.apache.org/spreadsheet/how-to.html#sxssf
If you want to auto-fit or set styles or write all rows in a large (30k+ rows) xlsx file, use SXSSFWorkbook. Here is the sample code that helps you...
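The sample code itself did not survive in this answer. A sketch of the usual SXSSFWorkbook pattern (not the author's original code; the window size, sheet name, and file name are placeholders):

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

class SxssfExample {
    public static void main(String[] args) throws Exception {
        // Keep only 100 rows in memory; older rows are flushed to a temp file
        // and can no longer be modified, so set heights while the row is live.
        SXSSFWorkbook workbook = new SXSSFWorkbook(100);
        Sheet sheet = workbook.createSheet("data");
        for (int r = 0; r < 30_000; r++) {
            Row row = sheet.createRow(r);
            row.setHeight((short) -1); // auto-fit, as in the question
            Cell cell = row.createCell(0);
            cell.setCellValue("row " + r);
        }
        FileOutputStream out = new FileOutputStream("large.xlsx");
        workbook.write(out);
        out.close();
        workbook.dispose(); // delete the temporary backing files
    }
}
```

Note the constraint this implies: anything you want to do to a row (heights, styles) must happen before it slides out of the access window.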
I used the Event API for an HSSF file (.xls), and I discovered a terrible lack of documentation about the order of records.
If you are writing to XLSX, I found an improvement by writing to different sheets of the same Excel file. You might also find an improvement by writing to different Excel files, but first try writing to different sheets.
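The bookkeeping for spreading rows across sheets is simple; a sketch of the row-to-sheet routing, independent of any POI types (the 10,000-rows-per-sheet threshold is an arbitrary placeholder):

```java
// Routes a global row index to (sheet index, row index within that sheet),
// so a large data set can be spread across several smaller sheets.
class SheetSplitter {
    static final int ROWS_PER_SHEET = 10_000;

    static int sheetIndex(int globalRow) {
        return globalRow / ROWS_PER_SHEET;
    }

    static int rowInSheet(int globalRow) {
        return globalRow % ROWS_PER_SHEET;
    }
}
```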
The best example for this is described in the following stack overflow thread:
Error While Reading Large Excel Files (xlsx) Via Apache POI
The code snippet in the main answer in that topic illustrates the Apache POI wrappings around SAX xml parsing, and how you can trivially loop over all the sheets and then over each individual cell.
The code is slightly stale with respect to the current Apache POI API; the endRow() callback now provides the number of the row that has just finished processing.
With that code snippet it should be trivial for you to parse a big XLSX file cell by cell, e.g. handling the "for each sheet", "for each row cell", and "row has ended" events.
You could trivially create app logic where, at the end of each row, you build a Map of columnName to cellValue.
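The columnName-to-cellValue mapping needs one small piece of glue: converting the A1-style reference each SAX cell event carries (e.g. "C5") into a column position that can be matched against the header row. A stdlib sketch (POI also ships this as CellReference, so treat this as illustration only):

```java
// Converts the column letters of an A1-style cell reference ("C5" -> 2)
// so values seen by the SAX handler can be matched to header names.
class CellRef {
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (c < 'A' || c > 'Z') {
                break; // digits start the row part of the reference
            }
            col = col * 26 + (c - 'A' + 1);
        }
        return col - 1; // zero-based
    }
}
```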
I had the same problem with 800,000 cells and 3M characters, where XSSF allocated 1GB of heap!
I used Python with openpyxl and numpy to read the xlsx file (invoked from Java code) and first convert it to plain text. Then I loaded the text file in Java. It may seem like a lot of overhead, but it is indeed fast. The Python script looks like
Then in my Java code, I wrote
After that, you will get foo.txt, which is similar to foo.xlsx, but in text format.
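Neither script survived in this answer. The Java half is the simpler one; a sketch of loading the converted file, assuming the Python side wrote tab-separated lines (foo.txt and the tab delimiter are assumptions, not the author's code):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Loads a tab-separated text file (the xlsx converted to text by the
// Python script) row by row; far cheaper than parsing the xlsx in Java.
class TsvLoader {
    // Core: split already-read lines into cells.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split("\t", -1)); // -1 keeps trailing empty cells
        }
        return rows;
    }

    // Convenience: read the converted text file, e.g. Path.of("foo.txt").
    static List<String[]> load(Path file) {
        try {
            return parse(Files.readAllLines(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```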
I used a SAX parser to process the XML structure. It works for XLSX files.
https://stackoverflow.com/a/44969009/4587961