Unicode and URI encoding, decoding, and escaping in JavaScript

Posted on 2024-08-28 18:37:53

If you look at this table here, it has a list of escape sequences for Unicode characters that don't actually work for me.

For example, for "%96", which should be a –, I get an error when trying to decode it:

decodeURIComponent("%96");
URIError: URI malformed

If I attempt to encode "–", I actually get:

encodeURIComponent("–");
"%E2%80%93"

I searched the internet and found this page, which mentions using escape and unescape together with decodeURIComponent and encodeURIComponent respectively. This doesn't seem to help, because %96 doesn't show up as "–" no matter what I try, and this of course wouldn't work:

decodeURIComponent(escape("%96"));
"%96"

Not very helpful.

How can I get "%96" to be a "–" with JavaScript (without hardcoding a map for every single possible Unicode character I may run into)?

Comments (3)

月朦胧 2024-09-04 18:37:53

The sequence %XX in a URI encodes an "octet", that is, an eight-bit byte. This raises the question of which Unicode character the decoded byte refers to. If my memory serves me correctly, older versions of the URI specification did not define clearly which charset was assumed. Later versions recommended UTF-8 as the default encoding charset. That is, to decode a sequence of bytes, you decode each %XX sequence and then convert the resulting bytes into a string using the UTF-8 character set.

This explains why %96 won't decode. The byte 0x96 on its own isn't a valid UTF-8 sequence. As it lies beyond ASCII, it can only appear as a continuation byte, after a lead byte that announces a multi-byte character. (See the UTF-8 specification for more details.) The JavaScript encodeURIComponent() and decodeURIComponent() functions both assume UTF-8 (as they should), so I wouldn't expect %96 to decode correctly.
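
To make that concrete, here is a small sketch of the check described above. The helper names isContinuationByte and tryDecode are made up for illustration; TextEncoder and decodeURIComponent are the standard built-ins.

```javascript
// U+2013 (en dash) is three bytes in UTF-8: E2 80 93.
const bytes = new TextEncoder().encode("\u2013");
const hex = Array.from(bytes, (b) =>
  b.toString(16).toUpperCase().padStart(2, "0")
);
console.log(hex.join(" ")); // E2 80 93

// Hypothetical helper: UTF-8 continuation bytes have the bit pattern 10xxxxxx.
function isContinuationByte(b) {
  return (b & 0xC0) === 0x80;
}
console.log(isContinuationByte(0x96)); // true: 0x96 cannot start a character

// Hypothetical helper: return the decoded string, or the error name on failure.
function tryDecode(s) {
  try {
    return decodeURIComponent(s);
  } catch (e) {
    return e.name;
  }
}
console.log(tryDecode("%E2%80%93")); // the en dash
console.log(tryDecode("%96")); // URIError
```

A lone %96 is a continuation byte with no lead byte in front of it, which is why the decoder gives up.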

The character you referenced is U+2013, an en dash. How on earth does the page you reference get an en dash from hex 0x96 (decimal 150)? They are obviously not assuming UTF-8 encoding, which is the standard. They are not assuming ASCII, which doesn't contain this character. They are not even assuming ISO-8859-1, a standard one-byte-per-character encoding in which 0x96 is a control character. It turns out they are assuming the special Windows 1252 code page. That is, the URI you are trying to decode assumes that the user is on a Windows machine, and even worse, on a Windows machine in English (or one of a few other Western languages).

In short, the table you're using is bad. It's out of date and assumes that the user is on an English Windows system. The up-to-date and correct way to encode non-ASCII values is to convert them to UTF-8 and then encode each octet using %XX. That's why you got %E2%80%93 when you tried to encode the character, and that's what decodeURIComponent() is expecting. The URI you're using is not encoded correctly. If you have no other choice, you can guess that the URI is using Windows 1252, convert the bytes yourself, and then use a Windows 1252 table to find out what Unicode values were intended. But that's risky: how do you know which URI uses which table? That's why everybody settled on UTF-8. If possible, tell whoever is giving you these URIs to encode them correctly.
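
If you do end up guessing Windows 1252, a minimal sketch of the "convert the bytes yourself" route might look like the following. The names CP1252_HIGH, decodeWindows1252Component and decodeLenient are invented for this example. Windows 1252 only differs from ISO-8859-1 in the range 0x80 to 0x9F, so a 32-entry table covers it rather than a map of every Unicode character. The five unassigned bytes are mapped to the matching control characters here, which is an assumption borrowed from how the WHATWG Encoding Standard treats them.

```javascript
// Code points for Windows-1252 bytes 0x80..0x9F (index = byte - 0x80).
// Unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the same-numbered
// control character; that choice is an assumption of this sketch.
const CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

// Hypothetical helper: treat every %XX as one Windows-1252 byte.
function decodeWindows1252Component(input) {
  return input.replace(/%([0-9A-Fa-f]{2})/g, (match, hexDigits) => {
    const byte = parseInt(hexDigits, 16);
    const codePoint =
      byte >= 0x80 && byte <= 0x9F ? CP1252_HIGH[byte - 0x80] : byte;
    return String.fromCharCode(codePoint);
  });
}

// Hypothetical helper: prefer UTF-8, fall back to the Windows-1252 guess.
function decodeLenient(input) {
  try {
    return decodeURIComponent(input);
  } catch (e) {
    if (!(e instanceof URIError)) throw e;
    return decodeWindows1252Component(input);
  }
}

console.log(decodeLenient("1914%961918")); // 1914, en dash, 1918
console.log(decodeLenient("%E2%80%93") === "\u2013"); // true (valid UTF-8 wins)
```

The fallback is still a guess: a byte sequence that happens to be valid UTF-8 will be decoded as UTF-8 even if the sender meant Windows 1252. Where the runtime supports that label, new TextDecoder("windows-1252") could stand in for the table.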

悟红尘 2024-09-04 18:37:53

Posting as a community wiki entry as it's from "Building Scalable Web Sites" by Cal Henderson. The book says it's OK to reproduce significant portions of the examples, though. You may be able to create a special case for "–" with it.

function escape_utf8(data) {
    if (data == null || data === '') {
        return '';
    }
    data = data.toString();
    var buffer = '';
    for (var i = 0; i < data.length; i++) {
        // codePointAt (ES2015) returns the full code point, even when it
        // is stored as a UTF-16 surrogate pair; charCodeAt would not.
        var c = data.codePointAt(i);
        var bs = [];
        if (c >= 0x10000) {
            // 4 bytes; skip the second half of the surrogate pair
            i++;
            bs[0] = 0xF0 | ((c & 0x1C0000) >>> 18);
            bs[1] = 0x80 | ((c & 0x3F000) >>> 12);
            bs[2] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[3] = 0x80 | (c & 0x3F);
        } else if (c >= 0x800) {
            // 3 bytes
            bs[0] = 0xE0 | ((c & 0xF000) >>> 12);
            bs[1] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[2] = 0x80 | (c & 0x3F);
        } else if (c >= 0x80) {
            // 2 bytes
            bs[0] = 0xC0 | ((c & 0x7C0) >>> 6);
            bs[1] = 0x80 | (c & 0x3F);
        } else {
            // 1 byte (ASCII)
            bs[0] = c;
        }
        // Emit every byte as %XX, including plain ASCII.
        for (var j = 0; j < bs.length; j++) {
            var b = bs[j];
            buffer += '%' + nibble_to_hex((b & 0xF0) >>> 4) + nibble_to_hex(b & 0x0F);
        }
    }
    return buffer;
}
function nibble_to_hex(nibble) {
    var chars = '0123456789ABCDEF';
    return chars.charAt(nibble);
}
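
For comparison, on runtimes that provide TextEncoder the same byte-by-byte escaping can probably be sketched much more briefly. This is not the book's code; percentEncodeUtf8 is a made-up name, and like the function above it escapes every byte, ASCII included.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of the input.
function percentEncodeUtf8(text) {
  if (text == null) {
    return "";
  }
  const bytes = new TextEncoder().encode(String(text));
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncodeUtf8("\u2013")); // %E2%80%93
console.log(percentEncodeUtf8("a")); // %61
console.log(percentEncodeUtf8("\u{1F600}")); // %F0%9F%98%80
```

For the en dash this agrees with encodeURIComponent, which leaves unreserved ASCII characters alone but otherwise produces the same UTF-8 escapes.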

梦里泪两行 2024-09-04 18:37:53

See this question, specifically this answer:

there is a special “%uNNNN” format for
encoding Unicode UTF-16 code points,
instead of encoding UTF-8 bytes

I suspect "–" is one of those characters, since 0x96 in the (extended) ASCII table is û.
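
A quick way to see the "%uNNNN" format in action, as a sketch: escape and unescape are legacy globals (deprecated, but still present in browsers and Node.js), and they work on UTF-16 code units rather than UTF-8 bytes.

```javascript
// Legacy escape(): code units above 0xFF come out in the %uNNNN form.
console.log(escape("\u2013")); // %u2013

// Legacy unescape() reverses it.
console.log(unescape("%u2013") === "\u2013"); // true

// But unescape("%96") yields the control character U+0096, not an en dash.
console.log(unescape("%96") === "\u0096"); // true
console.log(unescape("%96") === "\u2013"); // false

// The UTF-8 based function produces a different escape for the same character.
console.log(encodeURIComponent("\u2013")); // %E2%80%93
```

So the "%uNNNN" format would explain a "%u2013" in a URI, but it does not turn a bare %96 into an en dash.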
