Truncate text to a specific size (8 KB) using JavaScript

Posted 2024-08-06 12:21:04

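I'm using the Zemanta API, which accepts up to 8 KB of text per call. I'm extracting the text to send to Zemanta from web pages using JavaScript, so I'm looking for a function that will truncate my text to 8 KB.

Zemanta should do this truncation on its own (i.e., if you send it a larger string), but I need to move this text around a bit before making the API call, so I want to keep the payload as small as possible.

Is it safe to assume that 8 KB of text is 8,192 characters, and truncate accordingly? (1 byte per character; 1,024 characters per KB; 8 KB = 8,192 bytes/characters) Or is that inaccurate, or only true under certain circumstances?

Is there a more elegant way to truncate a string based on its actual size in bytes?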

人海汹涌 2024-08-13 12:21:04

If you are using a single-byte encoding, yes, 8192 characters = 8192 bytes. If you are using UTF-16, each character(*) takes two bytes, so 8192 characters = 16384 bytes and 8 KB holds only 4096 characters.

(*: Strictly, these are UTF-16 code units rather than code points. The two differ when surrogate pairs are involved, but let's not worry about that, because JavaScript's length property doesn't either.)
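To make the difference between character count and byte count concrete, here is a minimal sketch. It assumes an environment that provides TextEncoder (modern browsers, Node.js), and sizeReport is a made-up helper name, not part of any API.

```javascript
// Sketch: compare the "character" count of a string with its encoded size.
// sizeReport is a hypothetical helper name used only for this example.
function sizeReport(text) {
    return {
        // length counts UTF-16 code units
        codeUnits: text.length,
        // TextEncoder always produces UTF-8
        utf8Bytes: new TextEncoder().encode(text).length,
        // UTF-16 uses two bytes per code unit
        utf16Bytes: text.length * 2
    };
}

console.log(sizeReport('a'.repeat(8192))); // ASCII only
console.log(sizeReport('日本語'));          // three non-ASCII characters
```

For the ASCII string the UTF-8 size equals the character count, while the three Japanese characters take 9 bytes in UTF-8 and 6 bytes in UTF-16.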

If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:

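function toBytesUTF8(chars) {
    // encodeURIComponent percent-encodes the UTF-8 bytes; unescape turns each
    // %XX back into one character, giving a string with one char per byte.
    return unescape(encodeURIComponent(chars));
}
function fromBytesUTF8(bytes) {
    // The reverse trip; throws URIError if bytes is not valid UTF-8.
    return decodeURIComponent(escape(bytes));
}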

Now you can truncate with:

function truncateByBytesUTF8(chars, n) {
    var bytes = toBytesUTF8(chars).substring(0, n);
    while (true) {
        try {
            return fromBytesUTF8(bytes);
        } catch (e) {
            // The cut landed inside a multibyte sequence: drop a byte and retry.
        }
        bytes = bytes.substring(0, bytes.length - 1);
    }
}

(The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will complain.)
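As an alternative that does not rely on catching exceptions, the same idea can be sketched with TextEncoder and TextDecoder: encode, cut the byte array, and step back if the cut falls inside a multibyte sequence. This assumes an environment that provides both classes (modern browsers, Node.js); truncateUtf8 is a name made up for this example.

```javascript
// Sketch: truncate to at most maxBytes of UTF-8 without splitting a character.
// truncateUtf8 is a hypothetical name used only for this example.
function truncateUtf8(text, maxBytes) {
    const bytes = new TextEncoder().encode(text);
    if (bytes.length <= maxBytes) {
        return text;
    }
    let end = Math.max(0, maxBytes);
    // UTF-8 continuation bytes look like 10xxxxxx. If the first dropped byte
    // is one, the cut is inside a character, so back up to its lead byte.
    while (end > 0 && (bytes[end] & 0xC0) === 0x80) {
        end--;
    }
    // ignoreBOM: true keeps a leading U+FEFF instead of silently stripping it.
    const decoder = new TextDecoder('utf-8', { ignoreBOM: true });
    return decoder.decode(bytes.subarray(0, end));
}

console.log(truncateUtf8('héllo', 2)); // cut falls inside "é"
```

A call such as truncateUtf8(pageText, 8192), where pageText stands for the extracted page text, would then give a payload of at most 8 KB.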

If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.

霊感 2024-08-13 12:21:04

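No, it is not safe to assume that 8 KB of text is 8192 characters, because in some character encodings each character takes up more than one byte.

If you are reading the data from files, can't you just get the file size? Or read it in 8 KB chunks?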

毁梦 2024-08-13 12:21:04

Since unescape is deprecated, you can do something like this instead:

function byteCount(string) {
    // UTF-8 byte length: each %XX escape or literal character is one byte.
    return encodeURI(string).split(/%..|./).length - 1;
}

function truncateByBytes(string, byteSize) {
    // Truncate to at most byteSize UTF-8 bytes without splitting a character.
    if (byteCount(string) <= byteSize) {
        return string;
    }
    let truncated = '';
    let bytesCounter = 0;
    // for...of iterates by code point, so surrogate pairs stay together
    // (split('') would separate them and make encodeURI throw).
    for (const char of string) {
        bytesCounter += byteCount(char);
        if (bytesCounter > byteSize) {
            break;
        }
        truncated += char;
    }
    return truncated;
}
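Where TextEncoder is available (modern browsers, Node.js), the counting can also be handed to its encodeInto method, which only writes whole characters into the target buffer and reports how much of the input it consumed. This is a sketch under that assumption; truncateWithEncodeInto is a name made up for this example.

```javascript
// Sketch: let TextEncoder.encodeInto decide how much of the text fits.
// truncateWithEncodeInto is a hypothetical name used only for this example.
function truncateWithEncodeInto(text, maxBytes) {
    const buffer = new Uint8Array(Math.max(0, maxBytes));
    // encodeInto stops before any character that would not fit completely;
    // read is the number of UTF-16 code units of text that were encoded.
    const { read } = new TextEncoder().encodeInto(text, buffer);
    return text.slice(0, read);
}

console.log(truncateWithEncodeInto('héllo', 2)); // "é" does not fit in the second byte
```

The trade-off is one buffer allocation of maxBytes per call, in exchange for not encoding each character separately.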
