
How many bytes are in a JavaScript string?

Posted on 2024-08-20 18:47:45

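I have a JavaScript string which is about 500K when it is sent from the server in UTF-8. How can I tell its size in JavaScript?

I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? Or does it depend on the JavaScript implementation, or on the page encoding or content type?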


温柔戏命师 2024-08-27 18:47:45

You can use Blob to get the string size in bytes.

Examples:

console.info(
  new Blob(['😂']).size,                             // 4
  new Blob(['👍']).size,                             // 4
  new Blob(['😂👍']).size,                           // 8
  new Blob(['👍😂']).size,                           // 8
  new Blob(['I\'m a string']).size,                  // 12

  // from Premasagar correction of Lauri's answer for
  // strings containing lone characters in the surrogate pair range:
  // https://stackoverflow.com/a/39488643/6225838
  new Blob([String.fromCharCode(55555)]).size,       // 3
  new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);


豆芽 2024-08-27 18:47:45

This function returns the size in bytes of any string you pass to it, as encoded in UTF-8.

function byteCount(s) {
    return encodeURI(s).split(/%..|./).length - 1;
}

Source

JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it's just an implementation detail that won't affect the language's characteristics.

The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.

Source
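To see why the split-based count in byteCount works, here is a small sketch (the sample string and variable name are illustrative, not from the answer). encodeURI writes every byte of a non-ASCII character as a %XX escape, so each match of the pattern, either one escape or one remaining character, stands for one UTF-8 byte.

```javascript
// Same function as in the answer above.
function byteCount(s) {
  return encodeURI(s).split(/%..|./).length - 1;
}

// 1-byte, 2-byte and 3-byte characters in UTF-8
const sample = 'a\u00E9\u20AC'; // "aé€"

console.log(encodeURI(sample)); // a%C3%A9%E2%82%AC
console.log(byteCount(sample)); // 6
```

The split produces one more piece than there are matches, which is why the function subtracts 1.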


带刺的爱情 2024-08-27 18:47:45

If you're using Node.js, there is a simpler solution using buffers:

function getBinarySize(string) {
    return Buffer.byteLength(string, 'utf8');
}

There is an npm library for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours faithfully)


云仙小弟 2024-08-27 18:47:45

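String values are not implementation dependent. According to the ECMA-262 3rd Edition specification, each character represents a single 16-bit unit of UTF-16 text:

4.3.16 String Value
A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.
NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.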
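A small sketch of what this definition means in practice (the helper name utf16ByteLength is hypothetical, chosen only for illustration): length counts 16-bit values, so doubling it gives the size of the string viewed as a sequence of 16-bit units, whatever an engine does internally.

```javascript
// Hypothetical helper: size of a string viewed as a sequence of 16-bit values.
function utf16ByteLength(str) {
  return str.length * 2; // each element is one 16-bit unsigned integer
}

console.log(utf16ByteLength('abc'));    // 6
console.log(utf16ByteLength('\u2620')); // 2
// One character outside the BMP is held as two 16-bit values.
console.log('\uD83D\uDE00'.length);     // 2
```

This is not the UTF-8 size the question asks about; it only illustrates the 16-bit model in the specification.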


放赐 2024-08-27 18:47:45

These are the 3 ways I use:

TextEncoder

new TextEncoder().encode("myString").length

Blob

new Blob(["myString"]).size

Buffer

Buffer.byteLength("myString", 'utf8')
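A quick sketch to check that the three approaches agree, assuming an environment where TextEncoder and Blob exist as globals (browsers and recent Node.js versions). Buffer is Node-specific, so it is guarded here. The sample string and variable names are illustrative.

```javascript
const text = 'h\u00E9llo \u2620'; // "héllo ☠", 7 code units

const viaTextEncoder = new TextEncoder().encode(text).length;
const viaBlob = new Blob([text]).size;
// Buffer only exists in Node.js, so fall back to null elsewhere.
const viaBuffer = typeof Buffer !== 'undefined'
  ? Buffer.byteLength(text, 'utf8')
  : null;

console.log(text.length);    // 7
console.log(viaTextEncoder); // 10
console.log(viaBlob);        // 10
console.log(viaBuffer);      // 10 in Node.js, null in a browser
```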


烏雲後面有陽光 2024-08-27 18:47:45

Try this combination using the unescape function:

const byteAmount = unescape(encodeURIComponent(yourString)).length

Full encoding process example:

const s  = "1 a ф № @ ®"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1,a-1,ф-2,№-3,@-1,®-2]
const s4 = escape(s3); // length is 39
const s5 = decodeURIComponent(s4); // length is 11


狠疯拽 2024-08-27 18:47:45

Note that if you're targeting Node.js you can use Buffer.from(string).length:

var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)


天涯沦落人 2024-08-27 18:47:45

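The size of a JavaScript string is

Pre-ES6: 2 bytes per character
ES6 and later: 2 bytes per character, or 5 or more bytes per character

Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 strings can use 3 or 4 byte characters, that would violate the 2 byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two byte characters used are valid UTF-16 characters. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.

ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. A Unicode escape looks like this: \u{1D306}

Practical notes

This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. An engine may also provide external UTF-16 support, but it is not mandated to do so.

For ES6, practically speaking, characters will never be more than 5 bytes long (2 bytes for the escape point + 3 bytes for the Unicode code point), because the latest version of Unicode only has 136,755 possible characters, which fits easily into 3 bytes. However, this is technically not limited by the standard, so in principle a single character could use, say, 4 bytes for the code point and 6 bytes in total.

Most of the code examples here for calculating byte size don't seem to take ES6 Unicode code point escapes into account, so the results could be incorrect in some cases.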
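As a quick check of what a code point escape produces at run time, here is a minimal sketch (variable names are illustrative, and TextEncoder is assumed to be available). The escape is source-text syntax; the resulting string value can be inspected with length and codePointAt like any other string.

```javascript
const escaped = '\u{1D306}';       // ES6 code point escape
const surrogates = '\uD834\uDF06'; // the same character written as a surrogate pair

console.log(escaped === surrogates);              // true
console.log(escaped.length);                      // 2 (two 16-bit code units)
console.log(escaped.codePointAt(0).toString(16)); // 1d306
console.log(new TextEncoder().encode(escaped).length); // 4 (bytes in UTF-8)
```

In this sketch the escape yields the same two code units as the surrogate pair, so a byte-counting function sees no difference between the two spellings.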


自我难过 2024-08-27 18:47:45

UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).

If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:

function getStringMemorySize( _string ) {
    "use strict";

    var codePoint
        , accum = 0
    ;

    for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
        // charCodeAt returns a 16-bit code unit (0 to 0xFFFF), not a full code point
        codePoint = _string.charCodeAt( stringIndex );

        if( codePoint < 0x100 ) {
            accum += 1;
            continue;
        }

        if( codePoint < 0x10000 ) {
            accum += 2;
            continue;
        }

        // Not reached with charCodeAt, since its result is always below 0x10000
        if( codePoint < 0x1000000 ) {
            accum += 3;
        } else {
            accum += 4;
        }
    }

    return accum * 2;
}

Examples:

getStringMemorySize( 'I'    );     //  2
getStringMemorySize( '❤'    );     //  4
getStringMemorySize( '𠀰'   );     //  8
getStringMemorySize( 'I❤𠀰' );     // 14
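For the UTF-8 side of this, a minimal sketch of per-code-point counting, assuming ES6 string iteration is available (the function name utf8ByteLength is hypothetical). Iterating with for...of yields whole code points, so a surrogate pair is counted once as 4 bytes.

```javascript
// Hypothetical helper: UTF-8 size computed one code point at a time.
function utf8ByteLength(str) {
  let total = 0;
  for (const ch of str) {              // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp < 0x80) total += 1;         // ASCII
    else if (cp < 0x800) total += 2;
    else if (cp < 0x10000) total += 3; // rest of the BMP, lone surrogates included
    else total += 4;                   // supplementary planes
  }
  return total;
}

console.log(utf8ByteLength('I'));            // 1
console.log(utf8ByteLength('\u2764'));       // 3 ("❤")
console.log(utf8ByteLength('\u{20030}'));    // 4
console.log(utf8ByteLength('I\u2764'));      // 4
```

Counting a lone surrogate as 3 bytes is a design choice here; it matches the size of the replacement character that UTF-8 encoders typically substitute.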


红焚 2024-08-27 18:47:45

The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.

byteCount(String.fromCharCode(55555))
// URIError: URI malformed

This longer function should handle all strings:

function bytes (str) {
  var bytes=0, len=str.length, codePoint, next, i;

  for (i=0; i < len; i++) {
    codePoint = str.charCodeAt(i);

    // Lone surrogates cannot be passed to encodeURI
    if (codePoint >= 0xD800 && codePoint < 0xE000) {
      if (codePoint < 0xDC00 && i + 1 < len) {
        next = str.charCodeAt(i + 1);

        if (next >= 0xDC00 && next < 0xE000) {
          bytes += 4;
          i++;
          continue;
        }
      }
    }

    bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
  }

  return bytes;
}

E.g.

bytes(String.fromCharCode(55555))
// 3

It will correctly calculate the size for strings containing surrogate pairs:

bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)

The results can be compared with Node's built-in function Buffer.byteLength:

Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3

Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)


九公里浅绿 2024-08-27 18:47:45

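A single element in a JavaScript string is considered to be a single UTF-16 code unit. That is to say, string characters are stored in 16 bits (1 code unit), and 16 bits equal 2 bytes (8 bits = 1 byte).

The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.

The codePointAt() method can be used to return the entire code point value of a Unicode character, as in UTF-32.

When a character cannot be represented in a single 16-bit code unit, UTF-16 uses a surrogate pair and therefore two code units (2 x 16 bits = 4 bytes).

See Unicode encodings for the different encodings and their code ranges.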
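A short sketch of the difference between the two methods, using one character outside the Basic Multilingual Plane (the variable name is illustrative):

```javascript
const face = '\u{1F600}'; // one character, stored as a surrogate pair

console.log(face.length);                      // 2 (code units)
console.log(face.charCodeAt(0).toString(16));  // d83d (high surrogate)
console.log(face.charCodeAt(1).toString(16));  // de00 (low surrogate)
console.log(face.codePointAt(0).toString(16)); // 1f600 (full code point)
console.log([...face].length);                 // 1 (code points)
```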


冷了相思 2024-08-27 18:47:45

The Blob interface's size property returns the size of the Blob or File in bytes.

const getStringSize = (s) => new Blob([s]).size;


你的背包 2024-08-27 18:47:45

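I'm working with an embedded version of the V8 engine.
I tested a single string, appending 1,000 characters at each step, in UTF-8.

The first test used the single-byte (8-bit, ANSI) character "A" (hex: 41).
The second test used the two-byte (16-bit) character "Ω" (hex: CE A9), and
the third test used the three-byte (24-bit) character "☺" (hex: E2 98 BA).

In all three cases the device reported out of memory at 888,000 characters, using approximately 26,348 KB of RAM.

Result: the characters are not stored dynamically, and not with only 16 bits. OK, perhaps that holds only for my case (embedded device with 128 MB RAM, V8 engine, C++/Qt). The character encoding has nothing to do with the size in RAM inside the JavaScript engine. Functions such as encodeURI are only useful for high-level data transmission and storage.

Embedded or not, the fact is that the characters are not stored in just 16 bits.
Unfortunately I have no 100% answer as to what JavaScript does at the low level.
By the way, I ran the same test (the first one above) with an array of the character "A",
pushing 1,000 items at each step. (Exactly the same test, with the string replaced by an array.) The system ran out of memory (as intended) after using 10,416 KB, at an array length of 1,337,000.
So the JavaScript engine is not simply limited. It is more complex than that.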


彩扇题诗 2024-08-27 18:47:45

You can try this:

  var b = str.match(/[^\x00-\xff]/g);
  return (str.length + (!b ? 0: b.length));

It worked for me.
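For context, a small sketch of what this expression counts, wrapped in a hypothetical function (doubleByteLength is my name, not the answer's) so that it can run. It adds one byte for each character in the \x00-\xff range and two for everything else. That matches some double-byte encodings, but as far as I can tell not UTF-8, where the same characters can take 2 or 3 bytes.

```javascript
// Hypothetical wrapper around the two lines from the answer.
function doubleByteLength(str) {
  var b = str.match(/[^\x00-\xff]/g);
  return str.length + (!b ? 0 : b.length);
}

const cjk = '\u4E2D\u6587'; // "中文"

console.log(doubleByteLength('abc')); // 3
console.log(doubleByteLength(cjk));   // 4
console.log(new TextEncoder().encode(cjk).length); // 6 (UTF-8)
```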


About the author

思念绕指尖
