How many bytes in a JavaScript string?
I have a JavaScript string which is about 500K when sent from the server in UTF-8. How can I tell its size in JavaScript?
I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? However, does it depend on the JavaScript implementation? Or on the page encoding, or maybe the content-type?
You can use a Blob to get the string size in bytes.
Examples:
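The original examples did not survive on this page, so here is a minimal sketch of the idea. The helper name blobByteLength is made up for illustration, and the sketch assumes an environment with a global Blob (browsers, or a recent Node.js). A Blob built from a string stores it as UTF-8, so its size is the UTF-8 byte count.

```javascript
// Hypothetical helper: UTF-8 byte length of a string, measured through Blob.
// Assumes a global Blob (browsers, or a recent Node.js release).
function blobByteLength(str) {
  // Blob encodes string parts as UTF-8; size is the resulting byte count.
  return new Blob([str]).size;
}

console.log(blobByteLength("hello"));      // 5 (ASCII: 1 byte each)
console.log(blobByteLength("Ω"));          // 2
console.log(blobByteLength("☺"));          // 3
console.log(blobByteLength("\u{1F600}"));  // 4 (outside the BMP)
```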
This function will return the byte size of any UTF-8 string you pass to it.
Source
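The function this answer refers to is not shown on this page. As an assumption about what such a function could look like (not necessarily the one the author posted), here is a sketch that counts UTF-8 bytes by percent-encoding the string; the name byteCount is illustrative.

```javascript
// Hypothetical helper: count UTF-8 bytes via percent-encoding.
// encodeURI turns every non-ASCII character into "%XX" escapes, one per UTF-8 byte.
// Characters it leaves untouched are ASCII, so they are one byte each.
function byteCount(str) {
  // Each match of the pattern (one escape or one plain character) is one byte;
  // split() yields one more piece than there are matches.
  return encodeURI(str).split(/%..|./).length - 1;
}

console.log(byteCount("abc"));  // 3
console.log(byteCount("a b"));  // 3 (the space becomes %20, still one byte)
console.log(byteCount("Ω"));    // 2
console.log(byteCount("☺"));    // 3
```

Note that encodeURI throws a URIError if the string contains a lone surrogate, so this sketch only works for well-formed strings.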
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source
If you're using node.js, there is a simpler solution using buffers:
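The code from this answer is missing here; a minimal sketch of the Buffer approach, assuming Node.js (Buffer is not available in browsers). The wrapper name nodeByteLength is made up for illustration; Buffer.byteLength itself is the built-in Node.js function.

```javascript
// Node.js only: Buffer.byteLength reports how many bytes a string
// needs in the given encoding, without allocating a buffer.
function nodeByteLength(str) {
  return Buffer.byteLength(str, "utf8");
}

console.log(nodeByteLength("hello"));  // 5
console.log(nodeByteLength("Ω"));      // 2
console.log(nodeByteLength("☺"));      // 3
```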
There is an npm lib for that: https://www.npmjs.org/package/utf8-binary-cutter (written by yours truly)
String values are not implementation dependent: according to the ECMA-262 3rd Edition Specification, each character represents a single 16-bit unit of UTF-16 text.
These are 3 ways I use:
TextEncoder
Blob
Buffer
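The snippets for these three did not survive on this page, so the following is a sketch of how each is commonly used. It assumes an environment where all three globals exist (a recent Node.js); in a browser the Buffer line would have to be dropped.

```javascript
const text = "Ω☺A"; // 2 + 3 + 1 bytes in UTF-8

// 1. TextEncoder: encodes to a Uint8Array of UTF-8 bytes.
const viaTextEncoder = new TextEncoder().encode(text).length;

// 2. Blob: stores the string as UTF-8 and exposes the byte count as size.
const viaBlob = new Blob([text]).size;

// 3. Buffer (Node.js only): computes the byte length without allocating.
const viaBuffer = Buffer.byteLength(text, "utf8");

console.log(viaTextEncoder, viaBlob, viaBuffer); // 6 6 6
```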
Try this combination using the unescape JS function:
Full encode process example:
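The original snippets are missing here. A sketch of the usual combination, as an assumption about what the answer showed: encodeURIComponent writes each UTF-8 byte as a %XX escape, and unescape turns each escape back into a single character whose code is that byte. The function names are illustrative. unescape is a legacy function, so this is best seen as a compatibility trick.

```javascript
// Hypothetical helper: UTF-8 byte length via the unescape/encodeURIComponent pair.
function unescapeByteLength(str) {
  // After unescape, the result has exactly one character per UTF-8 byte.
  return unescape(encodeURIComponent(str)).length;
}

// Full encode process: string -> "binary string" -> array of byte values.
function toUtf8Bytes(str) {
  const binary = unescape(encodeURIComponent(str));
  return Array.from(binary, (ch) => ch.charCodeAt(0));
}

console.log(unescapeByteLength("Ω"));                          // 2
console.log(toUtf8Bytes("Ω").map((b) => b.toString(16)));      // [ 'ce', 'a9' ]
console.log(toUtf8Bytes("☺").map((b) => b.toString(16)));      // [ 'e2', '98', 'ba' ]
```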
Note that if you're targeting node.js you can use Buffer.from(string).length.
The size of a JavaScript string is 2 bytes per character before ES6, and either 2 bytes per character or 5 or more bytes per character in ES6 and later.
Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 strings can use 3 or 4 byte characters, it would violate the 2 byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two byte characters used are valid UTF-16 characters. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.
ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using a unicode escape looks like this: \u{1D306}
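For reference, a small sketch of what an ES6 runtime reports for a string written with that escape. It only prints the values of length, codePointAt and the UTF-8 encoded size; it makes no claim about how a particular engine lays the string out in memory. TextEncoder is assumed to be available (browsers, or Node.js).

```javascript
const s = "\u{1D306}"; // the code point escape from the text above

console.log(s.length);                            // 2 (16-bit units in the string value)
console.log(s.length * 2);                        // 4 (bytes, at 2 bytes per unit)
console.log(s.codePointAt(0).toString(16));       // "1d306"
console.log(new TextEncoder().encode(s).length);  // 4 (bytes when encoded as UTF-8)
```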
Practical notes
This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. An engine may also provide external UTF-16 support, but is not mandated to do so.
For ES6, practically speaking, characters will never be more than 5 bytes long (2 bytes for the escape point + 3 bytes for the Unicode code point) because the latest version of Unicode only has 136,755 possible characters, which fits easily into 3 bytes. However, this is technically not limited by the standard, so in principle a single character could use, say, 4 bytes for the code point and 6 bytes total.
Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:
Examples:
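The original function and its examples are missing from this page. The following is a sketch of the loop described above, not the author's code: it walks the string by code point and counts the UTF-8 bytes for each one. It keeps the 2-bytes-per-unit estimate as a separate helper rather than multiplying the UTF-8 total, since the two numbers measure different things. Both function names are made up for illustration.

```javascript
// Hypothetical helper: UTF-8 bytes, counted one code point at a time.
function utf8BytesByCodePoint(str) {
  let bytes = 0;
  for (const ch of str) {            // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp < 0x80) bytes += 1;         // ASCII
    else if (cp < 0x800) bytes += 2;   // up to U+07FF
    else if (cp < 0x10000) bytes += 3; // rest of the BMP
    else bytes += 4;                   // supplementary planes
  }
  return bytes;
}

// Hypothetical helper: size estimate at 2 bytes per 16-bit unit.
function utf16Bytes(str) {
  return str.length * 2;
}

console.log(utf8BytesByCodePoint("A"), utf16Bytes("A"));                  // 1 2
console.log(utf8BytesByCodePoint("Ω"), utf16Bytes("Ω"));                  // 2 2
console.log(utf8BytesByCodePoint("☺"), utf16Bytes("☺"));                  // 3 2
console.log(utf8BytesByCodePoint("\u{1F600}"), utf16Bytes("\u{1F600}"));  // 4 4
```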
The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.
This longer function should handle all strings:
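The function itself is missing from this page. As a sketch of the same idea (an assumption, not the author's code), the version below walks the 16-bit units with charCodeAt, counts a valid surrogate pair as 4 bytes, and counts a lone surrogate as 3 bytes, which is the size of the U+FFFD replacement character that UTF-8 encoders such as TextEncoder emit in its place. The name utf8ByteLength is illustrative.

```javascript
// Hypothetical helper: UTF-8 byte length that tolerates lone surrogates.
function utf8ByteLength(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1;
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xD800 && unit < 0xDC00 && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next < 0xE000) {
        bytes += 4; // valid surrogate pair: one 4-byte code point
        i++;        // skip the low surrogate
      } else {
        bytes += 3; // lone high surrogate, counted as U+FFFD
      }
    } else {
      bytes += 3;   // other BMP character, or a lone surrogate
    }
  }
  return bytes;
}

console.log(utf8ByteLength("\uD834\uDF06")); // 4 (a proper pair)
console.log(utf8ByteLength("\uD800"));       // 3 (lone surrogate)
console.log(utf8ByteLength("a\uD800b"));     // 5

// For comparison, percent-encoding rejects lone surrogates:
try {
  encodeURIComponent("\uD800");
} catch (err) {
  console.log(err instanceof URIError);      // true
}
```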
E.g.
It will correctly calculate the size for strings containing surrogate pairs:
The results can be compared with Node's built-in function Buffer.byteLength:
A single element in a JavaScript String is considered to be a single UTF-16 code unit. That is to say, a String's characters are stored as 16-bit values (1 code unit each), and 16 bits equal 2 bytes (8 bits = 1 byte).
The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The codePointAt() method can be used to return the entire code point value for Unicode characters, as in UTF-32.
When a character can't be represented in a single 16-bit code unit, it is encoded as a surrogate pair and therefore uses two code units (2 x 16 bits = 4 bytes).
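A short sketch of the difference between the two methods, using a string that mixes a one-unit character with one that needs a surrogate pair. The sample string is chosen for illustration only.

```javascript
const sample = "A\u{1D306}"; // "A" plus one character outside the BMP

console.log(sample.length);                         // 3 code units
console.log(sample.length * 2);                     // 6 bytes at 16 bits per unit

// charCodeAt returns individual 16-bit code units.
console.log(sample.charCodeAt(0));                  // 65
console.log(sample.charCodeAt(1).toString(16));     // "d834" (high surrogate)
console.log(sample.charCodeAt(2).toString(16));     // "df06" (low surrogate)

// codePointAt combines the surrogate pair into the full code point.
console.log(sample.codePointAt(1).toString(16));    // "1d306"
```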
See Unicode encodings for different encodings and their code ranges.
The Blob interface's size property returns the size of the Blob or File in bytes.
I'm working with an embedded version of the V8 engine.
I tested a single string, appending 1000 characters at each step, encoded as UTF-8.
The first test used the single-byte (8-bit, ANSI) character "A" (hex: 41), the second the two-byte (16-bit) character "Ω" (hex: CE A9), and the third the three-byte (24-bit) character "☺" (hex: E2 98 BA).
In all three cases the device reported out of memory at 888 000 characters, using about 26 348 KB of RAM.
Result: the characters are not stored with a size that varies by character, and not with only 16 bits each. Perhaps this only holds for my case (embedded device with 128 MB RAM, V8 engine, C++/Qt), but the character encoding has nothing to do with the size the string takes in the JavaScript engine's RAM. Functions such as encodeURI are only useful for high-level data transmission and storage.
Embedded or not, the fact is that the characters are not stored in just 16 bits.
Unfortunately I don't have a 100% answer as to what JavaScript does at the low level.
By the way, I ran the same test (the first one above) with an array of the character "A", pushing 1000 items at each step (exactly the same test, with the string replaced by an array). The system ran out of memory (as intended) after using 10 416 KB, at an array length of 1 337 000.
So the JavaScript engine is not restricted in a simple way; it's somewhat more complex.
You can try this:
It worked for me.