How to parse UTF-8 characters in an Excel file using POI
I have been using POI to parse XLS and XLSX files successfully. However, I am unable to correctly extract non-ASCII text, such as Chinese or Japanese characters, from an Excel spreadsheet. I have figured out how to extract data from a UTF-8 encoded CSV or tab-delimited file, but have had no luck with Excel files. Can anyone help?
(Edit: Code snippet from comments)
HSSFSheet sheet = workbook.getSheet(worksheet);
HSSFEvaluationWorkbook ewb = HSSFEvaluationWorkbook.create(workbook);
while (rowCtr <= lastRow && !rowBreakOut)
{
    Row row = sheet.getRow(rowCtr); // rows.next();
    for (int col = firstCell; col < lastCell && !breakOut; col++) {
        // Blank cells come back as null under this policy, so guard against them
        Cell cell = row.getCell(col, Row.RETURN_BLANK_AS_NULL);
        if (cell == null) continue;
        // ctype was not declared in the original snippet; assumed to be the cell type
        int ctype = cell.getCellType();
        if (ctype == Cell.CELL_TYPE_STRING) {
            sValue = cell.getStringCellValue();
            log.warn("String value = " + sValue);
            String encoded = URLEncoder.encode(sValue, "UTF-8");
            log.warn("URL-encoded with UTF-8: " + encoded);
            ....
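The URL-encoding step in the snippet can double as a check on whether the string read from the cell is already correct. Java strings are UTF-16 internally, so if the percent-encoded output matches the expected UTF-8 bytes, the text was most likely read correctly and the garbling happens later, when it is printed or logged. The following is a minimal sketch using only the JDK; the class name UrlEncodeCheck and the sample text are illustrative assumptions, not part of POI or the original code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of POI.
class UrlEncodeCheck {

    // Percent-encodes the UTF-8 bytes of the given string.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Reverses encode(), interpreting the percent-escapes as UTF-8 bytes.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters standing in for a cell value.
        String sValue = "\u4e2d\u6587";
        String encoded = encode(sValue);
        System.out.println(encoded);                        // %E4%B8%AD%E6%96%87
        System.out.println(decode(encoded).equals(sValue)); // true
        // A string that was already damaged to question marks encodes as %3F.
        System.out.println(encode("??"));                   // %3F%3F
    }
}
```

If the log shows escapes such as %E4%B8%AD, the string held the right characters; a run of %3F suggests the characters were already replaced by "?" before this point.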
Comments (4)
I had the same problem while extracting Persian text from an Excel file. I was using Eclipse, and simply going to Project -> Properties and changing the "text file encoding" to UTF-8 solved the problem.
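One plausible reason an encoding setting fixes this is that the string was read correctly and only damaged on output: a console or log stream whose charset cannot represent the characters replaces each one with "?". The sketch below reproduces that effect with the JDK alone; the class name OutputEncodingDemo and the sample text are illustrative assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; it only imitates what a console stream does.
class OutputEncodingDemo {

    // Prints text through a PrintStream using the given charset, then
    // decodes the written bytes again with that same charset.
    static String roundTrip(String text, String charsetName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, charsetName);
            out.print(text);
            out.flush();
            return buffer.toString(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Four Persian letters standing in for a cell value.
        String persian = "\u0633\u0644\u0627\u0645";
        // A stream that cannot represent the letters substitutes '?' for each.
        System.out.println(roundTrip(persian, "US-ASCII"));               // ????
        // A UTF-8 stream carries the text through unchanged.
        System.out.println(roundTrip(persian, "UTF-8").equals(persian));  // true
    }
}
```

Outside an IDE, a comparable step would be wrapping System.out in a PrintStream created with "UTF-8", assuming the terminal itself displays UTF-8.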
In POI you can do it like this:
You can also use another charset from FontCharset.
The solution is simple. To read cell string values in any language (non-English characters), just use the following method:
instead of:
This applies to UTF-8 encoded characters like Chinese, Arabic or Japanese.
P.S. If anybody is using the command-line utility nullpunkt/excel-to-json, which uses the Apache POI library, modify the file converter/ExcelToJsonConverter.java by replacing the occurrences of "getStringCellValue()" to avoid reading non-English characters as "???".
Get the bytes using UTF-8, as follows:
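As a rough sketch of what getting UTF-8 bytes from a Java String looks like, the example below uses only the JDK. The class name Utf8Bytes and its methods are illustrative assumptions, and the sample text stands in for a value returned by getStringCellValue().

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of POI.
class Utf8Bytes {

    // Encodes the string as UTF-8, independent of the platform default charset.
    static byte[] toUtf8(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a string.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Three Japanese characters standing in for a cell value.
        String sValue = "\u65e5\u672c\u8a9e";
        byte[] utf8 = toUtf8(sValue);
        System.out.println(utf8.length);                    // 9
        System.out.println(fromUtf8(utf8).equals(sValue));  // true
        // Decoding the same bytes with a different charset does not restore the text.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(sValue));           // false
    }
}
```

Naming the charset explicitly on both the encoding and the decoding side avoids depending on the JVM's default, which varies between machines.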