Character encoding in Excel spreadsheets (and which Java charset to use to decode it)
I am using the JExcel library to read Excel spreadsheets. Each cell in the spreadsheet may contain localization strings in any of about 44 languages (English, Portuguese, French, Chinese, etc.). Today I don't tell the API anything about the encoding it's supposed to use. It handles the Chinese fine, but it always garbles Portuguese and German. Somehow the default encoding (MacRoman on my dev box, UTF-8 in production) fails to interpret the strings it pulls out of the Excel workbook correctly. There has to be something wrong with how JExcel is interpreting the character encoding of the file.
That being said...
Are all the strings in an Excel workbook encoded with the same character set?
Is there workbook metadata I can query to find out what this character set is (I haven't found it yet)?
If I run all the cells through something like jchardet (http://jchardet.sourceforge.net/), is it likely to be able to divine the character encoding for the whole workbook (this is pretty much predicated on the answer to the first question being "yes, all strings in a given workbook are encoded with the same character set")?
So many questions, so little time.
Comments (2)
Well, I didn't get an answer directly, but the spec Matt found points the way towards an actual answer: http://sc.openoffice.org/excelfileformat.pdf
In the meantime, my problem went away when I set the encoding to always be "Cp1252". I'm not sure exactly why, but I'm not looking a gift horse in the mouth, so to speak, and am moving on.
I'll call this one answered.
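A possible explanation, offered as an assumption rather than something confirmed in this thread: in the BIFF8 format described by that spec, each string carries a flag saying whether it is stored as 16-bit UTF-16LE characters or as "compressed" 8-bit characters. Chinese text needs the 16-bit form, so the reader's configured encoding never touches it, while Portuguese and German text usually fits the 8-bit form and gets decoded with whatever charset is in effect. The sketch below uses only the JDK (the class and method names are made up for illustration) to show how the same single-byte data decodes under Cp1252 versus UTF-8. In JExcel itself, the corresponding knob is `WorkbookSettings.setEncoding("Cp1252")`, with the settings object passed to `Workbook.getWorkbook(file, settings)`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Decode raw bytes with the named charset (hypothetical helper).
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "São" as single-byte data: S = 0x53, ã = 0xE3, o = 0x6F.
        byte[] portuguese = {0x53, (byte) 0xE3, 0x6F};

        // Decoded as Cp1252, the accented letter survives.
        System.out.println(decode(portuguese, "Cp1252"));

        // Decoded as UTF-8, 0xE3 is an incomplete multi-byte sequence,
        // so the decoder substitutes U+FFFD (the replacement character).
        System.out.println(decode(portuguese, "UTF-8"));

        // "中" (U+4E2D) stored as 16-bit little-endian data decodes the
        // same way regardless of the platform default charset.
        byte[] chinese = {0x2D, 0x4E};
        System.out.println(new String(chinese, StandardCharsets.UTF_16LE));
    }
}
```

If this assumption holds, it would also answer the first question above: strings in one workbook do not necessarily share a single storage form, so a detector run over the whole file would be guessing across mixed data.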
I have a problem where, when reading cell values from the Excel file, some characters show up as "?" where there should be accented letters. Would that code resolve this issue? I'm running under Windows, so I can't test as quickly as I could under Linux (which is the OS of the server I'm deploying to).
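One thing worth checking, as a guess rather than a diagnosis: a literal "?" often comes from the output side, not the read side. When Java encodes a string into a charset that cannot represent a character, it silently substitutes "?". On a Linux server whose default charset is ASCII-only, that would turn accented letters into "?" even if the workbook was read correctly. The minimal sketch below (class and method names are hypothetical) reproduces that effect with the JDK alone.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encode text to bytes with the given charset, then decode it back.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String word = "caf\u00E9"; // "café"

        // US-ASCII cannot represent é, so the encoder writes '?' instead.
        System.out.println(roundTrip(word, StandardCharsets.US_ASCII));

        // UTF-8 can represent it, so the text comes back unchanged.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8));

        // The charset the JVM falls back to when none is given explicitly.
        System.out.println(Charset.defaultCharset());
    }
}
```

Printing `Charset.defaultCharset()` on both the Windows machine and the Linux server is a quick way to see whether the two environments differ.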