大型电子表格的 Apache POI Java Excel 性能
我有一个电子表格,我试图用 POI 读取(我有 xls 和 xlsx 格式),但在这种情况下,问题出在 xls 文件上。我的电子表格大约有 10,000 行和 75 列,读取它可能需要几分钟的时间(尽管 Excel 在几秒钟内打开)。我使用基于事件的读取,而不是将整个文件读入内存。我的代码的核心内容如下。现在有点混乱,但它实际上只是一个很长的 switch 语句,大部分是从 POI 示例中复制的。
使用事件模型的 POI 性能是否通常如此缓慢?我可以做些什么来加快速度吗?我认为几分钟对于我的申请来说是不可接受的。
POIFSFileSystem poifs = new POIFSFileSystem(fis);
InputStream din = poifs.createDocumentInputStream("Workbook");
try
{
HSSFRequest req = new HSSFRequest();
listener = new FormatTrackingHSSFListener(new HSSFListener() {
@Override
public void processRecord(Record rec)
{
thisString = null;
int sid = rec.getSid();
switch (sid)
{
case SSTRecord.sid:
strTable = (SSTRecord) rec;
break;
case LabelSSTRecord.sid:
LabelSSTRecord labelSstRec = (LabelSSTRecord) rec;
thisString = strTable.getString(labelSstRec
.getSSTIndex()).getString();
row = labelSstRec.getRow();
col = labelSstRec.getColumn();
break;
case RKRecord.sid:
RKRecord rrk = (RKRecord) rec;
thisString = "";
row = rrk.getRow();
col = rrk.getColumn();
break;
case LabelRecord.sid:
LabelRecord lrec = (LabelRecord) rec;
thisString = lrec.getValue();
row = lrec.getRow();
col = lrec.getColumn();
break;
case BlankRecord.sid:
BlankRecord blrec = (BlankRecord) rec;
thisString = "";
row = blrec.getRow();
col = blrec.getColumn();
break;
case BoolErrRecord.sid:
BoolErrRecord berec = (BoolErrRecord) rec;
row = berec.getRow();
col = berec.getColumn();
byte errVal = berec.getErrorValue();
thisString = errVal == 0 ? Boolean.toString(berec
.getBooleanValue()) : ErrorConstants
.getText(errVal);
break;
case FormulaRecord.sid:
FormulaRecord frec = (FormulaRecord) rec;
switch (frec.getCachedResultType())
{
case Cell.CELL_TYPE_NUMERIC:
double num = frec.getValue();
if (Double.isNaN(num))
{
// Formula result is a string
// This is stored in the next record
outputNextStringRecord = true;
}
else
{
thisString = formatNumericValue(frec, num);
}
break;
case Cell.CELL_TYPE_BOOLEAN:
thisString = Boolean.toString(frec
.getCachedBooleanValue());
break;
case Cell.CELL_TYPE_ERROR:
thisString = HSSFErrorConstants
.getText(frec.getCachedErrorValue());
break;
case Cell.CELL_TYPE_STRING:
outputNextStringRecord = true;
break;
}
row = frec.getRow();
col = frec.getColumn();
break;
case StringRecord.sid:
if (outputNextStringRecord)
{
// String for formula
StringRecord srec = (StringRecord) rec;
thisString = srec.getString();
outputNextStringRecord = false;
}
break;
case NumberRecord.sid:
NumberRecord numRec = (NumberRecord) rec;
row = numRec.getRow();
col = numRec.getColumn();
thisString = formatNumericValue(numRec, numRec
.getValue());
break;
case NoteRecord.sid:
NoteRecord noteRec = (NoteRecord) rec;
row = noteRec.getRow();
col = noteRec.getColumn();
thisString = "";
break;
case EOFRecord.sid:
inSheet = false;
}
if (thisString != null)
{
// do something with the cell value
}
}
});
req.addListenerForAllRecords(listener);
HSSFEventFactory factory = new HSSFEventFactory();
factory.processEvents(req, din);
I have a spreadsheet I'm trying to read with POI (I have both xls and xlsx formats), but in this case, the problem is with the xls file. My spreadsheet has about 10,000 rows and 75 columns, and reading it in can take several minutes (though Excel opens in a few seconds). I'm using the event based reading, rather than reading the whole file into memory. The meat of my code is below. It's a bit messy right now, but it's really just a long switch statement that was mostly copied from the POI examples.
Is it typical for POI performance using the event model to be so slow? Is there anything I an do to speed this up? I think several minutes will be unacceptable for my application.
POIFSFileSystem poifs = new POIFSFileSystem(fis);
InputStream din = poifs.createDocumentInputStream("Workbook");
try
{
HSSFRequest req = new HSSFRequest();
listener = new FormatTrackingHSSFListener(new HSSFListener() {
@Override
public void processRecord(Record rec)
{
thisString = null;
int sid = rec.getSid();
switch (sid)
{
case SSTRecord.sid:
strTable = (SSTRecord) rec;
break;
case LabelSSTRecord.sid:
LabelSSTRecord labelSstRec = (LabelSSTRecord) rec;
thisString = strTable.getString(labelSstRec
.getSSTIndex()).getString();
row = labelSstRec.getRow();
col = labelSstRec.getColumn();
break;
case RKRecord.sid:
RKRecord rrk = (RKRecord) rec;
thisString = "";
row = rrk.getRow();
col = rrk.getColumn();
break;
case LabelRecord.sid:
LabelRecord lrec = (LabelRecord) rec;
thisString = lrec.getValue();
row = lrec.getRow();
col = lrec.getColumn();
break;
case BlankRecord.sid:
BlankRecord blrec = (BlankRecord) rec;
thisString = "";
row = blrec.getRow();
col = blrec.getColumn();
break;
case BoolErrRecord.sid:
BoolErrRecord berec = (BoolErrRecord) rec;
row = berec.getRow();
col = berec.getColumn();
byte errVal = berec.getErrorValue();
thisString = errVal == 0 ? Boolean.toString(berec
.getBooleanValue()) : ErrorConstants
.getText(errVal);
break;
case FormulaRecord.sid:
FormulaRecord frec = (FormulaRecord) rec;
switch (frec.getCachedResultType())
{
case Cell.CELL_TYPE_NUMERIC:
double num = frec.getValue();
if (Double.isNaN(num))
{
// Formula result is a string
// This is stored in the next record
outputNextStringRecord = true;
}
else
{
thisString = formatNumericValue(frec, num);
}
break;
case Cell.CELL_TYPE_BOOLEAN:
thisString = Boolean.toString(frec
.getCachedBooleanValue());
break;
case Cell.CELL_TYPE_ERROR:
thisString = HSSFErrorConstants
.getText(frec.getCachedErrorValue());
break;
case Cell.CELL_TYPE_STRING:
outputNextStringRecord = true;
break;
}
row = frec.getRow();
col = frec.getColumn();
break;
case StringRecord.sid:
if (outputNextStringRecord)
{
// String for formula
StringRecord srec = (StringRecord) rec;
thisString = srec.getString();
outputNextStringRecord = false;
}
break;
case NumberRecord.sid:
NumberRecord numRec = (NumberRecord) rec;
row = numRec.getRow();
col = numRec.getColumn();
thisString = formatNumericValue(numRec, numRec
.getValue());
break;
case NoteRecord.sid:
NoteRecord noteRec = (NoteRecord) rec;
row = noteRec.getRow();
col = noteRec.getColumn();
thisString = "";
break;
case EOFRecord.sid:
inSheet = false;
}
if (thisString != null)
{
// do something with the cell value
}
}
});
req.addListenerForAllRecords(listener);
HSSFEventFactory factory = new HSSFEventFactory();
factory.processEvents(req, din);
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
如果您使用 Apache POI 生成大型 Excel 文件,请注意以下行:
sheet.autoSizeColumn((short) p);
因为这会降低性能。
f you are using Apache POI to generate large excel file, please take note the following line :
sheet.autoSizeColumn((short) p);
Because this will degrade the performance.
我还对数千个大型 Excel 文件进行了一些处理,在我看来 POI 非常快。在 Excel 中加载该 Excel 文件也需要大约 1 分钟。所以我确认问题出在 POI 代码上
I did also some processing with thousands of large excel files and in my opinion POI is very fast. Loading that excel files tooks also about 1 minute in Excel itself. So i would confirm that the problem lies out of POI code
我会尝试使用 poi-beta3 中引入的流式 hssf。这有助于解决具有 1000 多个列的大型电子表格的内存问题。
I would attempt to use the streaming hssf as well introduced in poi-beta3. This helped the memory issues on large spreadsheets with 1000+ columns.
如果您使用 Apache POI 生成大型 Excel 文件,请注意sheet.autoSizeColumn((short) p);线,因为这会影响性能。
http://stanicblog.blogspot.sg/2013 /07/generate-large-excel-report-by-using.html
If you are using Apache POI to generate large excel file, please take note the sheet.autoSizeColumn((short) p); line because this will impact the performance.
http://stanicblog.blogspot.sg/2013/07/generate-large-excel-report-by-using.html
我做了一些更详细的分析,看起来问题实际上出在 POI 之外的代码中。我只是认为这是瓶颈,但我认为这是不正确的。
I did some more detailed profiling and it looks like the problem is actually in code outside of POI. I just assumed this was the bottleneck, but I believe this is incorrect.