Unicode characters from charcode in JavaScript for charcodes > 0xFFFF

Posted on 2024-10-27 03:07:42

I need to get a string / char from a unicode charcode and finally put it into a DOM TextNode to add into an HTML page using client side JavaScript.

Currently, I am doing:

String.fromCharCode(parseInt(charcode, 16));

where charcode is a hex string containing the charcode, e.g. "1D400". The Unicode character which should be returned is 𝐀 (U+1D400), but 퐀 (U+D400) is returned! Characters in the 16-bit range (0000 ... FFFF) are returned as expected.

Any explanation and / or proposals for correction?

Thanks in advance!

Comments (4)

白况 2024-11-03 03:07:42

String.fromCharCode can only handle code points in the BMP (i.e. up to U+FFFF). To handle higher code points, this function from Mozilla Developer Network may be used to return the surrogate pair representation:

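function fixedFromCharCode (codePt) {
    if (codePt > 0xFFFF) {
        // Supplementary code point: encode it as a high/low surrogate pair.
        codePt -= 0x10000;
        return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
    } else {
        // BMP code point: a single 16-bit code unit is enough.
        return String.fromCharCode(codePt);
    }
}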
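
To make the arithmetic concrete, here is a minimal sketch that works through the question's example value 0x1D400. The helper name toUtf16CodeUnits is hypothetical and not part of any library; the sketch also assumes that String.fromCharCode keeps only the low 16 bits of each argument, which is why the question saw U+D400 instead of U+1D400.

```javascript
// Hypothetical helper (illustrative name): split a code point into UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
    if (codePoint <= 0xFFFF) {
        return [codePoint];                 // BMP: one code unit
    }
    var offset = codePoint - 0x10000;       // 20-bit offset into the supplementary planes
    return [
        0xD800 + (offset >> 10),            // high surrogate carries the top 10 bits
        0xDC00 + (offset & 0x3FF)           // low surrogate carries the bottom 10 bits
    ];
}

var units = toUtf16CodeUnits(0x1D400);      // [0xD835, 0xDC00]
var boldA = String.fromCharCode.apply(null, units);

// Passing the code point directly is truncated to 16 bits: 0x1D400 becomes 0xD400.
var truncated = String.fromCharCode(0x1D400);
```

The resulting string has a length of 2, because length counts code units rather than characters.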
薄荷梦 2024-11-03 03:07:42

The problem is that JavaScript strings are (mostly) UCS-2 encoded: each element is a 16-bit code unit. A character outside the Basic Multilingual Plane can still be represented in JavaScript, but only as a UTF-16 surrogate pair.

The following function is adapted from Converting punycode with dash character to Unicode:

function utf16Encode(input) {
    var output = [], i = 0, len = input.length, value;
    while (i < len) {
        value = input[i++];
        if (value >= 0xD800 && value <= 0xDFFF) { // lone surrogates are not valid code points
            throw new RangeError("UTF-16(encode): Illegal UTF-16 value");
        }
        if (value > 0xFFFF) {
            value -= 0x10000;
            output.push(String.fromCharCode(((value >>>10) & 0x3FF) | 0xD800));
            value = 0xDC00 | (value & 0x3FF);
        }
        output.push(String.fromCharCode(value));
    }
    return output.join("");
}

alert( utf16Encode([0x1D400]) );
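
The "mostly UCS-2" behaviour can be observed directly on the resulting string. The sketch below is illustrative only; it assumes an ES2015 or later engine, since codePointAt and Array.from are not available in older browsers.

```javascript
// U+1D400 written as its surrogate pair, so the source file encoding does not matter.
var pairString = "\uD835\uDC00";

var info = {
    length: pairString.length,                 // counts 16-bit code units
    firstUnit: pairString.charCodeAt(0),       // high surrogate
    secondUnit: pairString.charCodeAt(1),      // low surrogate
    codePoint: pairString.codePointAt(0),      // full code point (ES2015)
    characters: Array.from(pairString).length  // iterates by code point (ES2015)
};
```

Here length and charCodeAt work on code units, while codePointAt and string iteration work on whole code points.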
少女的英雄梦 2024-11-03 03:07:42

Section 8.4 of the ECMAScript language spec says:

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

So you need to encode supplementary code points as pairs of UTF-16 code units.

The article "Supplementary Characters in the Java Platform" gives a good description of how to do this.

UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
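
The range checks described in that passage can be sketched as follows. The function names are hypothetical and chosen for illustration; the decoding formula simply reverses the surrogate pair encoding.

```javascript
// Classify a single 16-bit code unit by the reserved surrogate ranges.
function isHighSurrogate(unit) {
    return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Combine a high/low surrogate pair back into one supplementary code point.
function decodeSurrogatePair(high, low) {
    if (!isHighSurrogate(high) || !isLowSurrogate(low)) {
        throw new RangeError("Not a valid surrogate pair");
    }
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

var decoded = decodeSurrogatePair(0xD801, 0xDC00); // 0x10400
```

Because the two ranges do not overlap with each other or with ordinary characters, each code unit can be classified without looking at its neighbours.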

The following table shows the different representations of a few characters in comparison:

Code point    UTF-16 code units
U+0041        0041
U+00DF        00DF
U+6771        6771
U+10400       D801 DC00

Once you know the UTF-16 code units, you can create a string using the JavaScript function String.fromCharCode:

String.fromCharCode(0xd801, 0xdc00) === '𐐀'
像你 2024-11-03 03:07:42

String.fromCodePoint() seems to do the trick as well. See here.

console.log(String.fromCodePoint(0x1D622, 0x1D623, 0x1D624, 0x1D400));

Output:

𝘢𝘣𝘤𝐀
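
Tying this back to the original question, a minimal sketch of going from a hex string to a DOM TextNode might look like the following. The function name charFromHex is hypothetical, and the sketch assumes an engine that supports String.fromCodePoint (ES2015 or later). The DOM step is guarded so the snippet also runs outside a browser.

```javascript
// Hypothetical helper (illustrative name): hex string such as "1D400" to a one-character string.
function charFromHex(hex) {
    var codePoint = parseInt(hex, 16);
    // Reject non-hex input and values beyond the last Unicode code point.
    if (!/^[0-9a-fA-F]+$/.test(hex) || codePoint > 0x10FFFF) {
        throw new RangeError("Not a valid Unicode code point: " + hex);
    }
    return String.fromCodePoint(codePoint);
}

var text = charFromHex("1D400");

// In a browser, append the character to the page as a text node.
if (typeof document !== "undefined") {
    document.body.appendChild(document.createTextNode(text));
}
```

For engines without String.fromCodePoint, the surrogate pair functions from the other answers can be substituted inside the helper.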