JavaScript strings - UTF-16 vs UCS-2?

Posted 2024-12-24 04:00:51

I've read in some places that JavaScript strings are UTF-16, and in other places they're UCS-2. I did some searching around to try to figure out the difference and found this:

Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.

UCS-2 does not define a distinct data format, because UTF-16 and UCS-2
are identical for purposes of data exchange. Both are 16-bit, and have
exactly the same code unit representation.

Sometimes in the past an implementation has been labeled "UCS-2" to
indicate that it does not support supplementary characters and doesn't
interpret pairs of surrogate code points as characters. Such an
implementation would not handle processing of character properties,
code point boundaries, collation, etc. for supplementary characters.

via: http://www.unicode.org/faq/utf_bom.html#utf16-11

So my question is: is it because the JavaScript string object's methods and indexes act on 16-bit data values instead of characters that some people consider it UCS-2? And if so, would a JavaScript string object oriented around characters instead of 16-bit data chunks be considered UTF-16? Or is there something else I'm missing?

Edit: As requested, here are some sources saying JavaScript strings are UCS-2:

http://blog.mozilla.com/nnethercote/2011/07/01/faster-javascript-parsing/
http://terenceyim.wordpress.com/tag/ucs2/

EDIT: For anyone who may come across this, be sure to check out this link:

http://mathiasbynens.be/notes/javascript-encoding

Comments (5)

还给你自由 2024-12-31 04:00:51

JavaScript (strictly speaking, ECMAScript) predates Unicode 2.0, so in some cases you may find references to UCS-2 simply because that was correct at the time the reference was written. Can you point us to specific citations of JavaScript being "UCS-2"?

Specifications for ECMAScript versions 3 and 5 at least both explicitly declare a String to be a collection of unsigned 16-bit integers and that if those integer values are meant to represent textual data, then they are UTF-16 code units. See


EDIT: I'm no longer sure my answer is entirely correct. See the excellent article mentioned above, which in essence says that while a JavaScript engine may use UTF-16 internally, and most do, the language itself effectively exposes those characters as if they were UCS-2.
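
As a rough illustration of what "exposes those characters as if they were UCS-2" can look like in practice, here is a minimal sketch. The sample character and variable names are my own, not taken from the article; U+1F4A9 is written as an escape so the individual code units are easy to see.

```javascript
// U+1F4A9 lies outside the BMP, so it is stored as two 16-bit code units.
const pile = "\u{1F4A9}";

// length and index-based methods count code units, not characters.
const unitCount = pile.length;          // 2
const firstUnit = pile.charCodeAt(0);   // 0xD83D, the high surrogate
const secondUnit = pile.charCodeAt(1);  // 0xDCA9, the low surrogate

// slice() works on code unit offsets, so it can cut a pair in half.
const half = pile.slice(0, 1);          // a lone high surrogate

// codePointAt() (ES2015) does interpret the pair as UTF-16.
const codePoint = pile.codePointAt(0);  // 0x1F4A9

console.log(
  unitCount,
  firstUnit.toString(16),
  secondUnit.toString(16),
  codePoint.toString(16)
);
```

The older methods (`length`, `charCodeAt`, `slice`) behave like a UCS-2 API, while the newer `codePointAt` reads the same storage as UTF-16.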

拥抱没勇气 2024-12-31 04:00:51

It's UTF-16/UCS-2. It can handle surrogate pairs, but charAt/charCodeAt return a 16-bit code unit and not the Unicode code point. If you want to have it handle surrogate pairs, I suggest a quick read through this.
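
One way to handle surrogate pairs by hand with nothing but `charCodeAt` might look like the sketch below. The helper name `codePointFromUnits` is hypothetical. The arithmetic is the standard UTF-16 decoding formula, and on modern engines the built-in `codePointAt` does the same job.

```javascript
// Hypothetical helper: read the code point that starts at code unit `index`.
function codePointFromUnits(str, index) {
  const high = str.charCodeAt(index);
  // A high (lead) surrogate is in the range 0xD800..0xDBFF.
  if (high >= 0xD800 && high <= 0xDBFF && index + 1 < str.length) {
    const low = str.charCodeAt(index + 1);
    // A low (trail) surrogate is in the range 0xDC00..0xDFFF.
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }
  }
  // BMP character or unpaired surrogate: return the code unit itself.
  return high;
}

const sample = "a\uD83D\uDE00"; // "a" followed by U+1F600
console.log(codePointFromUnits(sample, 0).toString(16)); // 61
console.log(codePointFromUnits(sample, 1).toString(16)); // 1f600
```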

回首观望 2024-12-31 04:00:51

It's just a 16-bit value with no encoding specified in the ECMAScript standard.

See section 7.8.4 String Literals in this document: http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf

蓝眼睛不忧郁 2024-12-31 04:00:51

You need to differentiate how it is stored and how it is interpreted.

In JavaScript, a string is a sequence of 16-bit unsigned integers that is usually, but not necessarily, interpreted as a UTF-16-encoded character sequence. The string itself carries no encoding, and your code, standard JavaScript methods, or REPL terminals may interpret it in whatever encoding they want.

The thirteenth edition of ECMA-262 (ECMAScript® 2022 language specification)

§4.4.20 String value

primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integer values

NOTE A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.

Because of this, JavaScript strings can contain, with no problems, a value sequence that is invalid in UTF-16, such as lone ("unmatched") surrogates.

const javascript_string = "\uDF06"; // a lone surrogate
javascript_string.isWellFormed(); // false
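
`isWellFormed()` is a fairly recent addition (ES2024), so it may be missing on older engines. A plain scan over the code units gives similar information. The sketch below is only an illustration, and the helper name `findLoneSurrogates` is hypothetical.

```javascript
// Hypothetical helper: return the indexes of unpaired surrogate code units.
function findLoneSurrogates(str) {
  const lone = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair, skip the low surrogate
      } else {
        lone.push(i); // high surrogate with no partner
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      lone.push(i); // low surrogate with no preceding high surrogate
    }
  }
  return lone;
}

console.log(findLoneSurrogates("\uDF06"));          // [ 0 ]
console.log(findLoneSurrogates("ok \uD83D\uDE00")); // []
console.log(findLoneSurrogates("\uD83Dx\uDE00"));   // [ 0, 2 ]
```

An empty result means the string is well-formed UTF-16. A non-empty result shows where the unpaired code units sit.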

一身仙ぐ女味 2024-12-31 04:00:51

Things have changed since 2012. JavaScript strings are now UTF-16 for real. Yes, the old string methods still work on 16-bit code units, but the language is now aware of UTF-16 surrogates and knows what to do about them if you use the string iterator. There's also Unicode regex support.

// Before
"😀😂💩".length // 6

// Now
[..."😀😂💩"].length // 3
[..."😀😂💩"]  // [ '😀', '😂', '💩' ]
[... "😀😂💩".matchAll(/./ug) ] // 3 matches as above

// Regexes support unicode character classes
"café".normalize("NFD").match(/\p{L}\p{M}/ug) // [ 'é' ]

// Extract code points
[..."😀😂💩"].map(char => char.codePointAt(0).toString(16)) // [ '1f600', '1f602', '1f4a9' ]
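
A practical consequence of the string iterator is that a string can be shortened without cutting a surrogate pair in half. The sketch below assumes a hypothetical helper named `truncateByCodePoints`. It counts code points, not user-perceived characters, so combining sequences can still be split.

```javascript
// Hypothetical helper: keep at most `maxCodePoints` code points.
function truncateByCodePoints(str, maxCodePoints) {
  let result = "";
  let count = 0;
  for (const ch of str) { // the string iterator yields whole code points
    if (count >= maxCodePoints) break;
    result += ch;
    count++;
  }
  return result;
}

// Same three emoji as above, written as escapes: U+1F600 U+1F602 U+1F4A9.
const faces = "\u{1F600}\u{1F602}\u{1F4A9}";

// slice() counts code units, so this ends in a lone high surrogate.
console.log(faces.slice(0, 3).length);              // 3

// The iterator-based helper keeps two whole code points (4 code units).
console.log(truncateByCodePoints(faces, 2).length); // 4
```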