Flash CS4/AS3: differing behavior between the console and a textarea when printing UTF-16 characters
trace(escape("д"));
will print "%D0%B4", the correct URL encoding for this character (the Cyrillic equivalent of "D").
However, if I do this:
myTextArea.htmlText += unescape("%D0%B4");
What gets printed is:
Ð´
which is of course incorrect. Simply tracing the same unescape call returns the correct Cyrillic character, though! For this textarea, escaping "д" returns its Unicode code point, "%u0434".
I'm not sure what exactly is happening to mess this up, but...
UTF-16 Ð´ in web encoding is: %FE%FF%00%D0%00%B4
Whereas
UTF-16 д in web encoding is: %00%D0%00%B4
So it's padding this value with something at the beginning. Why would a trace provide different text than a print to an (empty) textarea? What's goin' on?
The textarea in question has no weird encoding properties attached to it, if that sort of thing is even possible.
Comments (1)
The problem is unescape (escape could also be a problem, but it's not the culprit in this case). These functions are not multibyte aware. What escape does is this: it takes a byte in the input string and returns its hex representation with a % prepended. unescape does the opposite. The key point here is that they work with bytes, not characters.
What you want is encodeURIComponent / decodeURIComponent. Both use UTF-8 as the string encoding scheme (the encoding used by Flash everywhere). Note that it's not UTF-16 (which you shouldn't care about as far as Flash is concerned).
Now, if you want to dig a bit deeper, here's what's going on (this assumes a basic knowledge of how UTF-8 works).
trace(escape("д"));
This returns %D0%B4. Why?
"д" is treated by Flash as UTF-8. The code point for this character is 0x0434.
In binary:
    0000 0100 0011 0100
It fits in two UTF-8 bytes, so it's encoded thus (where e means encoding bit, and p means payload bit):
    1101 0000 1011 0100
    eeep pppp eepp pppp
Converting it to hex, we get:
    0xd0 0xb4
So, 0xd0,0xb4 is a UTF-8 encoded "д".
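The bit layout described above can be checked mechanically. The following is a minimal sketch in Python rather than ActionScript, used only to reproduce the arithmetic; the helper name utf8_two_byte is hypothetical and handles just the two-byte range that "д" falls into.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes.

    Hypothetical helper for illustration; real code would just call str.encode.
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not fit the two-byte UTF-8 form")
    # Lead byte: encoding bits 110, then the top 5 payload bits.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: encoding bits 10, then the low 6 payload bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


encoded = utf8_two_byte(0x0434)
print(format(0x0434, "016b"))            # 0000010000110100
print(encoded.hex())                     # d0b4
print(encoded == "д".encode("utf-8"))    # True
```

The last line compares the hand-built bytes against Python's own UTF-8 encoder, so the two-byte pattern is not taken on faith.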
This is fed to escape. escape sees two bytes, and gives you:
    %D0%B4
Now, you pass this to unescape. But unescape is a little bit brain-dead, so it thinks one byte is one and the same thing as one char, always. As far as unescape is concerned, you have two bytes, hence, you have two chars. If you look up the code points for 0xd0 and 0xb4, you'll see this:
    0xd0 -> Ð
    0xb4 -> ´
So, unescape returns a string consisting of two chars, Ð and ´ (instead of figuring out that the two bytes it got were actually just one char, UTF-8 encoded). Then, when you assign the text property, you are not really passing д but Ð´, and this is what you see in the text area.
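To see the two decoding strategies side by side, here is a rough sketch in Python, not ActionScript. It assumes that decoding the escaped bytes as Latin-1 is a fair stand-in for the byte-per-character behavior attributed to unescape above, and that decoding them as UTF-8 corresponds to what decodeURIComponent does; quote and unquote are Python's own urllib.parse functions, not Flash APIs.

```python
from urllib.parse import quote, unquote

# Percent-encode the UTF-8 bytes of the character.
escaped = quote("д")
print(escaped)   # %D0%B4

# Byte-per-character decoding (stand-in for the unescape behavior described):
# each escaped byte becomes the character whose code point has that value.
naive = unquote(escaped, encoding="latin-1")
print(naive)     # Ð´

# UTF-8 aware decoding (stand-in for decodeURIComponent):
# the two bytes are read together as one encoded character.
aware = unquote(escaped, encoding="utf-8")
print(aware)     # д

# The mangled string still carries the original bytes, so it can be repaired
# by turning the two characters back into bytes and decoding them as UTF-8.
repaired = naive.encode("latin-1").decode("utf-8")
print(repaired)  # д
```

In the ActionScript from the question, the corresponding change would be to assign decodeURIComponent("%D0%B4") to htmlText instead of unescape("%D0%B4"), as recommended above.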