Reading a UTF-8 encoded text file in Mathematica

Posted on 2024-10-30 22:38:59


How can I read a utf-8 encoded text file in Mathematica?

This is what I'm doing now:

text = Import["charData.txt", "Text", CharacterEncoding -> "UTF8"];

but it tells me that

$CharacterEncoding::utf8: "The byte sequence {240} could not be interpreted as a character in the UTF-8 character encoding"

and so on. I am not sure why. I believe the file is valid utf-8.

Here's the file I'm trying to read:

http://dl.dropbox.com/u/38623/charData.txt


Answer by 画骨成沙, 2024-11-06 22:38:59


Short version: Mathematica's UTF-8 functionality does not work for character codes with more than 16 bits. Use UTF-16 encoding instead, if possible. But be aware that Mathematica's treatment of 17+ bit character codes is generally buggy. The long version follows...

As noted by numerous commenters, the problem appears to be with Mathematica's support for Unicode characters whose codes are larger than 16 bits. The first such character in the cited text file is U+20B9B (𠮛), which appears on line 10.

Some versions of the Mathematica front-end (like 8.0.1 on 64-bit Windows 7) can handle the character in question when entered directly:

In[1]:= $c="𠮛";

But we run into trouble if we attempt to create the character from its Unicode:

In[2]:= 134043 // FromCharacterCode

During evaluation of In[2]:= FromCharacterCode::notunicode:
A character code, which should be a non-negative integer less
than 65536, is expected at position 1 in {134043}. >>
Out[2]= FromCharacterCode[134043]

One then wonders, what does Mathematica think the code is for this character?

In[3]:= $c // ToCharacterCode
        BaseForm[%, 16]
        BaseForm[%, 2]

Out[3]= {55362,57243}
Out[4]//BaseForm= {d842, df9b}
Out[5]//BaseForm= {1101100001000010, 1101111110011011}

Instead of a single Unicode value as one might expect, we get two codes which happen to match the UTF-16 representation of that character. Mathematica can perform the inverse transformation as well:

In[6]:= {55362,57243} // FromCharacterCode

Out[6]= 𠮛
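
The two values in Out[3] are what the standard UTF-16 surrogate arithmetic yields for 134043 (U+20B9B). The following is a minimal sketch of that arithmetic; `toSurrogatePair` and `fromSurrogatePair` are hypothetical helper names, not Mathematica built-ins, and only built-in bit operations are used.

```mathematica
(* Hypothetical helper names; only built-in bit operations are used. *)

(* Split a code point above U+FFFF into its UTF-16 surrogate pair. *)
toSurrogatePair[cp_Integer /; 65536 <= cp <= 1114111] :=
  With[{v = cp - 65536},
    {55296 + BitShiftRight[v, 10],   (* high surrogate: 16^^d800 + top 10 bits *)
     56320 + BitAnd[v, 1023]}]       (* low surrogate:  16^^dc00 + low 10 bits *)

(* Recombine a surrogate pair into the code point it stands for. *)
fromSurrogatePair[{hi_Integer, lo_Integer}] :=
  65536 + BitShiftLeft[hi - 55296, 10] + (lo - 56320)

toSurrogatePair[134043]            (* {55362, 57243}, matching Out[3] *)
fromSurrogatePair[{55362, 57243}]  (* 134043, i.e. 16^^20b9b *)
```

Because this is plain integer arithmetic, it should not depend on how a given Mathematica version stores strings internally.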

What, then, is Mathematica's conception of the UTF-8 encoding of this character?

In[7]:= ExportString[$c, "Text", CharacterEncoding -> "UTF8"] // ToCharacterCode
        BaseForm[%, 16]
        BaseForm[%, 2]

Out[7]= {237,161,130,237,190,155}
Out[8]//BaseForm= {ed, a1, 82, ed, be, 9b}
Out[9]//BaseForm= {11101101, 10100001, 10000010, 11101101, 10111110, 10011011}

The attentive reader will spot that this is the UTF-8 encoding of the UTF-16 encoding of the character. Can Mathematica decode this, um, interesting encoding?

In[10]:= ImportString[
           ExportString[{237,161,130,237,190,155}, "Byte"]
         , "Text"
         , CharacterEncoding -> "UTF8"
         ]

Out[10]= 𠮛

Yes it can! But... so what?
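
To see where the six bytes in Out[7] come from, each 16-bit surrogate can be pushed through the ordinary three-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx). This is a sketch under the assumption that Mathematica does exactly that internally, which the output suggests but the documentation does not confirm; `threeByteUTF8` is a hypothetical helper name. The resulting layout matches what Unicode Technical Report #26 calls CESU-8.

```mathematica
(* Hypothetical helper: three-byte UTF-8 form of a 16-bit value u,
   valid for 2048 <= u <= 65535. *)
threeByteUTF8[u_Integer /; 2048 <= u <= 65535] :=
  {224 + BitShiftRight[u, 12],             (* 1110xxxx: top 4 bits *)
   128 + BitAnd[BitShiftRight[u, 6], 63],  (* 10xxxxxx: middle 6 bits *)
   128 + BitAnd[u, 63]}                    (* 10xxxxxx: low 6 bits *)

Flatten[threeByteUTF8 /@ {55362, 57243}]
(* {237, 161, 130, 237, 190, 155}, matching Out[7] *)
```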

How about the real UTF-8 expression of this character:

In[11]:= ImportString[
           ExportString[{240, 160, 174, 155}, "Byte"]
         , "Text"
         , CharacterEncoding -> "UTF8"
         ]
Out[11]= $CharacterEncoding::utf8: The byte sequence {240} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {160} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {174} could not be
interpreted as a character in the UTF-8 character encoding. >>
General::stop: Further output of $CharacterEncoding::utf8 will be suppressed
during this calculation. >>
ð ®

... but we see the failure reported in the original question.
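
The byte sequence itself is well formed. As a quick check, the four-byte UTF-8 pattern (11110xxx followed by three 10xxxxxx bytes) can be decoded by hand; `decodeFourByteUTF8` is a hypothetical helper name, and the sketch assumes its input really is a valid four-byte sequence.

```mathematica
(* Hypothetical helper: code point of a well-formed four-byte UTF-8 sequence. *)
decodeFourByteUTF8[{b1_Integer, b2_Integer, b3_Integer, b4_Integer}] :=
  BitShiftLeft[BitAnd[b1, 7], 18] +   (* 3 payload bits from the lead byte 11110xxx *)
  BitShiftLeft[BitAnd[b2, 63], 12] +  (* 6 payload bits from each 10xxxxxx byte *)
  BitShiftLeft[BitAnd[b3, 63], 6] +
  BitAnd[b4, 63]

decodeFourByteUTF8[{240, 160, 174, 155}]  (* 134043 *)
IntegerString[134043, 16]                 (* "20b9b" *)
```

So {240, 160, 174, 155} does encode U+20B9B, and the error messages point at the importer rather than at the file.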

How about UTF-16? UTF-16 is not on the list of valid character encodings, but "Unicode" is. Since we have already seen that Mathematica seems to use UTF-16 as its native format, let's give it a whirl (using big-endian UTF-16 with a byte-order-mark):

In[12]:= ImportString[
           ExportString[
             FromDigits[#, 16]& /@ {"fe", "ff", "d8", "42", "df", "9b"}
             , "Byte"
           ]
         , "Text"
         , CharacterEncoding -> "Unicode"
         ]
Out[12]= 𠮛

It works. As a more complete experiment, I re-encoded the cited text file from the question into UTF-16 and imported it successfully.
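
One way to do such a re-encoding without leaving Mathematica is to bypass the UTF-8 importer: read the raw bytes, decode them by hand, and write big-endian UTF-16 with a byte order mark. The following is a minimal sketch, not necessarily how the file above was converted. `utf8ToUTF16Units`, `reencodeToUTF16` and the output file name `charData-utf16.txt` are hypothetical, and the decoder assumes well-formed UTF-8 and performs no validation.

```mathematica
(* Hypothetical helper: decode well-formed UTF-8 bytes into UTF-16 code units. *)
utf8ToUTF16Units[bytes_List] :=
  Module[{i = 1, n = Length[bytes], lead, len, cp},
    Flatten[Reap[
      While[i <= n,
        lead = bytes[[i]];
        (* sequence length from the lead byte; no validation is attempted *)
        len = Which[lead < 128, 1, lead < 224, 2, lead < 240, 3, True, 4];
        (* payload bits of the lead byte, then 6 bits per continuation byte *)
        cp = Fold[64 #1 + BitAnd[#2, 63] &,
                  BitAnd[lead, {127, 31, 15, 7}[[len]]],
                  Table[bytes[[i + k]], {k, 1, len - 1}]];
        (* code points above U+FFFF become a surrogate pair *)
        Sow[If[cp < 65536, cp,
               {55296 + BitShiftRight[cp - 65536, 10],
                56320 + BitAnd[cp - 65536, 1023]}]];
        i += len]
    ][[2]]]
  ]

(* Hypothetical helper: write the code units as big-endian UTF-16 with a BOM.
   A leading UTF-8 BOM in the input, if present, is not stripped. *)
reencodeToUTF16[in_String, out_String] :=
  Module[{units, stream},
    units = utf8ToUTF16Units[BinaryReadList[in]];
    stream = OpenWrite[out, BinaryFormat -> True];
    BinaryWrite[stream,
      Join[{254, 255}, Flatten[IntegerDigits[#, 256, 2] & /@ units]]];
    Close[stream]]

utf8ToUTF16Units[{65, 240, 160, 174, 155}]   (* {65, 55362, 57243} *)

reencodeToUTF16["charData.txt", "charData-utf16.txt"];
text = Import["charData-utf16.txt", "Text", CharacterEncoding -> "Unicode"];
```

The final `Import` relies on the "Unicode" encoding handling the byte order mark the same way `ImportString` does in In[12]; that has only been observed here, not documented, so treat it as an assumption to verify on your version.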

The Mathematica documentation is largely silent on this subject. It is interesting to note that mention of Unicode in Mathematica appears to be accompanied by the assumption that character codes contain 16 bits. See, for example, references to Unicode in Raw Character Encodings.

The conclusion to be drawn from this is that Mathematica's support for UTF-8 transcoding is missing or buggy for character codes longer than 16 bits. UTF-16, the apparent internal format of Mathematica, appears to work correctly. So that is a work-around if you are in a position to re-encode your files and you can accept that the resulting strings will hold UTF-16 code units rather than one element per Unicode character: any character above U+FFFF occupies two elements (a surrogate pair), as seen in Out[3].

Postscript

A little while after writing this response, I attempted to re-open the Mathematica notebook that contains it. Every occurrence of the problematic character in the notebook had been wiped out and replaced with gibberish. I guess there are yet more Unicode bugs to iron out, even in Mathematica 8.0.1 ;)
