Reading a UTF-8 encoded text file in Mathematica
How can I read a utf-8 encoded text file in Mathematica?
This is what I'm doing now:
text = Import["charData.txt", "Text", CharacterEncoding -> "UTF8"];
but it tells me that
$CharacterEncoding::utf8: "The byte sequence {240} could not be interpreted as a character in the UTF-8 character encoding"
and so on. I am not sure why. I believe the file is valid utf-8.
Here's the file I'm trying to read:
Comments (1)
Short version: Mathematica's UTF-8 functionality does not work for character codes with more than 16 bits. Use UTF-16 encoding instead, if possible. But be aware that Mathematica's treatment of 17+ bit character codes is generally buggy. The long version follows...
As noted by numerous commenters, the problem appears to be with Mathematica's support for Unicode characters whose codes are larger than 16 bits. The first such character in the cited text file is U+20B9B, which appears on line 10.
Some versions of the Mathematica front-end (like 8.0.1 on 64-bit Windows 7) can handle the character in question when it is entered directly.
But we run into trouble if we attempt to create the character from its Unicode:
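A minimal sketch of that attempt, assuming the character is U+20B9B (decimal 134043). The variable name is illustrative. The answer reports that this fails on 8.0.1; other versions may behave differently.

```mathematica
(* U+20B9B written as a base-16 literal; 16^^20B9B == 134043 *)
code = 16^^20B9B;

(* Try to build the character directly from its code point.
   Reported to fail in the Mathematica 8.0.1 setup described here. *)
FromCharacterCode[code]
```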
One then wonders, what does Mathematica think the code is for this character?
Instead of a single Unicode value as one might expect, we get two codes which happen to match the UTF-16 representation of that character. Mathematica can perform the inverse transformation as well:
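The round trip described above can be sketched as follows. The variable names are illustrative, and the surrogate values are computed from U+20B9B rather than copied from the original session.

```mathematica
(* UTF-16 surrogate pair for U+20B9B, worked out by hand:
   16^^20B9B - 16^^10000 = 16^^10B9B
   high = 16^^D800 + BitShiftRight[16^^10B9B, 10] = 55362
   low  = 16^^DC00 + BitAnd[16^^10B9B, 16^^3FF]   = 57243 *)
surrogates = {55362, 57243};

(* Build the string from the two 16-bit code units ... *)
c = FromCharacterCode[surrogates];

(* ... and ask for its character codes again. With the behaviour
   described in this answer, the same pair comes back. *)
ToCharacterCode[c]
```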
What, then, is Mathematica's conception of the UTF-8 encoding of this character?
The attentive reader will spot that this is the UTF-8 encoding of the UTF-16 encoding of the character. Can Mathematica decode this, um, interesting encoding?
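A sketch of the encode and decode steps, assuming the string `c` is built from the surrogate pair {55362, 57243}. The six bytes in the comment are what results from UTF-8-encoding each surrogate separately; whether a given Mathematica version returns exactly these is version dependent.

```mathematica
c = FromCharacterCode[{55362, 57243}];

(* Ask for the "UTF8" bytes of the string. Encoding each surrogate
   on its own would give {237, 161, 130, 237, 190, 155}. *)
bytes = ToCharacterCode[c, "UTF8"]

(* Decode those bytes again using the same encoding name. *)
FromCharacterCode[bytes, "UTF8"]
```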
Yes it can! But... so what?
How about the real UTF-8 expression of this character:
... but we see the failure reported in the original question.
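For reference, the standard four-byte UTF-8 form of U+20B9B is F0 A0 AE 9B, which is where the leading byte 240 in the error message comes from. A sketch of the failing decode, assuming the behaviour reported for the version tested here:

```mathematica
(* Standard UTF-8 bytes for U+20B9B: 16^^F0, 16^^A0, 16^^AE, 16^^9B *)
utf8Bytes = {240, 160, 174, 155};

(* According to the answer, this triggers the $CharacterEncoding::utf8
   message instead of returning the character. *)
FromCharacterCode[utf8Bytes, "UTF8"]
```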
How about UTF-16? UTF-16 is not on the list of valid character encodings, but "Unicode" is. Since we have already seen that Mathematica seems to use UTF-16 as its native format, let's give it a whirl, using big-endian UTF-16 with a byte-order mark. It works. As a more complete experiment, I re-encoded the cited text file from the question into UTF-16 and imported it successfully.
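The UTF-16 experiment can be sketched as below. The byte list is the big-endian byte-order mark (FE FF) followed by the surrogate pair D842 DF9B; the file name in the last line is hypothetical and stands for a copy of the question's file re-encoded as UTF-16 by an external tool.

```mathematica
(* Check which encoding names this installation accepts. *)
MemberQ[$CharacterEncodings, "Unicode"]

(* BOM (254, 255) followed by the UTF-16BE bytes of U+20B9B:
   16^^D8, 16^^42, 16^^DF, 16^^9B *)
utf16Bytes = {254, 255, 216, 66, 223, 155};

(* Treat the bytes as raw file contents and import them as text. *)
ImportString[FromCharacterCode[utf16Bytes], "Text",
  CharacterEncoding -> "Unicode"]

(* Same idea for a whole file; "charData-utf16.txt" is a hypothetical name. *)
text = Import["charData-utf16.txt", "Text", CharacterEncoding -> "Unicode"];
```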
The Mathematica documentation is largely silent on this subject. It is interesting to note that mention of Unicode in Mathematica appears to be accompanied by the assumption that character codes contain 16 bits. See, for example, references to Unicode in Raw Character Encodings.
The conclusion to be drawn from this is that Mathematica's support for UTF-8 transcoding is missing/buggy for codes longer than 16 bits. UTF-16, the apparent internal format of Mathematica, appears to work correctly. So that is a work-around if you are in a position to re-encode your files and you can accept that the resulting strings will actually be in UTF-16 format, not true Unicode strings.
Postscript
A little while after writing this response, I attempted to re-open the Mathematica notebook that contains it. Every occurrence of the problematic character in the notebook had been wiped out and replaced with gibberish. I guess there are yet more Unicode bugs to iron out, even in Mathematica 8.0.1 ;)