Truncating text to a specific size (8 KB) using JavaScript

Posted 2024-08-06 12:21:04

I'm using the Zemanta API, which accepts up to 8 KB of text per call. I'm extracting the text to send to Zemanta from Web pages using JavaScript, so I'm looking for a function that will truncate my text at exactly 8 KB.

Zemanta should do this truncation on its own (i.e., if you send it a larger string), but I need to shuttle this text around a bit before making the API call, so I want to keep the payload as small as possible.

Is it safe to assume that 8 KB of text is 8,192 characters, and to truncate accordingly? (1 byte per character; 1,024 characters per KB; 8 KB = 8,192 bytes/characters) Or, is that inaccurate or only true given certain circumstances?

Is there a more elegant way to truncate a string based on its actual file size?

Comments (4)

人海汹涌 2024-08-13 12:21:04

If you are using a single-byte encoding, yes, 8192 characters = 8192 bytes. If you are using UTF-16, each character takes 2 bytes, so 8192 bytes hold only 4096 characters(*).

(*: Actually 4096 code units, which is a slightly different thing from characters in the face of surrogate pairs, but let's not worry about that because JavaScript doesn't: a string's length counts code units.)
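
To make the difference between character count and byte count concrete, here is a minimal sketch that compares a string's length with its encoded size. It assumes an environment with the standard `TextEncoder` global (current browsers and recent Node.js); the helper names `utf8ByteLength` and `utf16ByteLength` are made up for this illustration.

```javascript
// Size of the string once encoded as UTF-8.
function utf8ByteLength(text) {
    // TextEncoder always produces UTF-8.
    return new TextEncoder().encode(text).length;
}

// Size of the string as UTF-16 (no byte order mark).
function utf16ByteLength(text) {
    // String.length counts UTF-16 code units, and each one is 2 bytes.
    return text.length * 2;
}

const samples = ["hello", "h\u00e9llo", "日本語", "😀"];
for (const s of samples) {
    console.log(s, "length:", s.length,
        "UTF-8 bytes:", utf8ByteLength(s),
        "UTF-16 bytes:", utf16ByteLength(s));
}
```

For example, "日本語" has a length of 3 but occupies 9 bytes in UTF-8, so cutting at 8,192 characters can overshoot an 8 KB limit by a wide margin.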

If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:

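// Returns a "byte string": one character (code 0-255) per UTF-8 byte of the input.
function toBytesUTF8(chars) {
    return unescape(encodeURIComponent(chars));
}
// Inverse operation: throws a URIError if the byte string is not valid UTF-8.
function fromBytesUTF8(bytes) {
    return decodeURIComponent(escape(bytes));
}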

Now you can truncate with:

function truncateByBytesUTF8(chars, n) {
    var bytes = toBytesUTF8(chars).substring(0, n);
    while (true) {
        try {
            return fromBytesUTF8(bytes);
        } catch (e) {
            // The cut fell inside a multibyte sequence: drop one more byte and retry.
        }
        bytes = bytes.substring(0, bytes.length - 1);
    }
}

(The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will complain.)

If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.
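
Where the standard `TextEncoder` and `TextDecoder` globals are available, the same idea can be sketched without `escape`/`unescape`. This is an alternative under that assumption, not the code from the answer above, and the name `truncateUtf8` is made up for this illustration. Instead of retrying on a decode error, it backs up from the cut point until it is no longer sitting on a UTF-8 continuation byte.

```javascript
// Truncate `text` so that its UTF-8 encoding is at most `maxBytes` bytes,
// without cutting a multibyte character in half.
function truncateUtf8(text, maxBytes) {
    const bytes = new TextEncoder().encode(text);
    if (bytes.length <= maxBytes) {
        return text;
    }
    let end = Math.max(0, maxBytes);
    // bytes[end] is the first byte that would be dropped. If it is a
    // continuation byte (bit pattern 10xxxxxx), the cut is inside a
    // character, so move back to the start of that character.
    while (end > 0 && (bytes[end] & 0xC0) === 0x80) {
        end--;
    }
    return new TextDecoder().decode(bytes.subarray(0, end));
}

const payload = truncateUtf8("日本語".repeat(5000), 8192);
console.log(new TextEncoder().encode(payload).length <= 8192);
```

The result can be slightly shorter than the limit, since a character that does not fit completely is dropped as a whole.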

霊感 2024-08-13 12:21:04

No, it's not safe to assume that 8 KB of text is 8192 characters, since in some character encodings each character takes up multiple bytes.

If you're reading the data from files, can't you just grab the file size? Or read it in chunks of 8 KB?

毁梦 2024-08-13 12:21:04

You can do something like this instead, since unescape is deprecated:

function byteCount( string ) {
    // UTF8
    return encodeURI(string).split(/%..|./).length - 1;
}

function truncateByBytes(string, byteSize) {
    // UTF8
    if (byteCount(string) > byteSize) {
        // Array.from splits by code point, so surrogate pairs stay together;
        // string.split('') would separate them and make encodeURI throw.
        const charsArray = Array.from(string);
        let truncatedStringArray = [];
        let bytesCounter = 0;
        for (let i = 0; i < charsArray.length; i++) {
            bytesCounter += byteCount(charsArray[i]);
            if (bytesCounter <= byteSize) {
                truncatedStringArray.push(charsArray[i]);
            } else {
                break;
            }
        }
        return truncatedStringArray.join('');
    }
    return string;
}

寂寞花火° 2024-08-13 12:21:04

As Dominic says, character encoding is the problem. However, if you can either really ensure that you'll only deal with 8-bit chars (unlikely but possible), or assume 16-bit chars and limit yourself to half the available space, i.e. 4096 chars, then you could attempt this.
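
A minimal sketch of that conservative approach, assuming the 8 KB limit is counted in UTF-16 bytes (if the receiving side counts UTF-8 bytes instead, 4096 characters can still exceed 8,192 bytes). The function name `truncateCodeUnits` is made up for this illustration.

```javascript
// Keep at most `maxUnits` UTF-16 code units, without leaving half of a
// surrogate pair at the end of the result.
function truncateCodeUnits(text, maxUnits) {
    if (text.length <= maxUnits) {
        return text;
    }
    let end = maxUnits;
    // A high surrogate (0xD800-0xDBFF) at the cut means its low surrogate
    // would be dropped, so drop the high surrogate as well.
    const last = text.charCodeAt(end - 1);
    if (last >= 0xD800 && last <= 0xDBFF) {
        end--;
    }
    return text.slice(0, end);
}

const clipped = truncateCodeUnits("x".repeat(10000), 4096);
console.log(clipped.length);
```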

It's a bad idea to rely on JS for this, though, because it can be trivially modified or ignored, and you have complications such as escape characters and encoding to deal with. Better to use JS as a first-chance filter and do the real truncation in whatever server-side language you have available (which will also open up compression).
