An efficient way to calculate a character's byte length according to its encoding

Published 2024-08-30 16:58:01


What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is only known at runtime. In UTF-8, for example, characters have a variable byte length, so each character needs to be measured individually. So far I've come up with this:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;

But this is clumsy and inefficient in a loop, since a new String needs to be created every time. I can't find other, more efficient ways in the Java API. There's String#valueOf(char), but according to its source it does basically the same as the above. I imagine this could be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)

If you question the need for this, check this topic: https://stackoverflow.com/questions/2725897/reverse-massive-text-file-in-java/2725993#2725993


Update: the answer from @Bkkbrad is technically the most efficient:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();

However, as @Stephen C pointed out, there are more problems with this. There may, for example, be combining/surrogate characters which need to be taken into account as well. But that's another problem, which needs to be solved in a step before this one.
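One way to fold the surrogate-pair concern into the measurement is to step through the string by code point rather than by char. A minimal sketch (the helper name byteLengthOf is mine, and it still allocates a temporary String per code point, so it favors correctness over speed):

```java
import java.nio.charset.Charset;

public class CodePointBytes {
    // Byte length of the single code point starting at index i of s,
    // measured under the given charset. codePointAt joins surrogate
    // pairs into one int code point.
    static int byteLengthOf(String s, int i, Charset cs) {
        int cp = s.codePointAt(i);
        String unit = new String(Character.toChars(cp)); // one full code point
        return unit.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println(byteLengthOf("A", 0, utf8));            // 1
        System.out.println(byteLengthOf("中", 0, utf8));            // 3
        System.out.println(byteLengthOf("\uD83D\uDE00", 0, utf8)); // 4 (surrogate pair)
    }
}
```

To walk a whole string this way, advance the index by Character.charCount(cp) rather than by 1.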

Comments (4)

不一样的天空 2024-09-06 16:58:02


Try Charset.forName("UTF-8").encode("string").limit(); — it might be a bit more efficient, maybe not.

献世佛 2024-09-06 16:58:01


Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.

On my system, the following code takes 25 seconds to encode 100,000,000 single characters (10,000 reps of 10,000 characters):

Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}

However, the following code does the same thing in under 4 seconds:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}

Edit: Why do haters gotta hate?

Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = //allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);

int limit = input.limit();
while(input.position() < limit) {
    output.clear();
    input.mark();
    // Look at no more than two chars (a possible surrogate pair),
    // without running past the real end of the input.
    input.limit(Math.min(input.position() + 2, limit));
    if (Character.isHighSurrogate(input.get())
            && (!input.hasRemaining() || !Character.isLowSurrogate(input.get()))) {
        //Malformed surrogate pair; do something!
    }
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}
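A related sketch, built on the same buffer-reuse idea: a small wrapper that measures one code point at a time while reusing the encoder and both buffers across calls. The class and method names are mine, not from the answer above:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class ReusableMeasurer {
    private final CharsetEncoder encoder;
    private final char[] unit = new char[2];               // room for a surrogate pair
    private final CharBuffer in = CharBuffer.wrap(unit);
    private final ByteBuffer out = ByteBuffer.allocate(8); // ample for one code point

    ReusableMeasurer(Charset cs) {
        this.encoder = cs.newEncoder();
    }

    // Byte length of one code point, reusing the buffers on every call.
    int measure(int codePoint) {
        int nChars = Character.toChars(codePoint, unit, 0); // 1 or 2 chars
        in.position(0).limit(nChars);
        out.clear();
        encoder.reset();
        encoder.encode(in, out, true);
        return out.position();
    }

    public static void main(String[] args) {
        ReusableMeasurer m = new ReusableMeasurer(Charset.forName("UTF-8"));
        System.out.println(m.measure('A'));     // 1
        System.out.println(m.measure(0x4E2D));  // 3: '中'
        System.out.println(m.measure(0x1F600)); // 4: emoji above U+FFFF
    }
}
```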
南风起 2024-09-06 16:58:01


If you can guarantee that the input is well-formed UTF-8, then there's no reason to find code points at all. One of the strengths of UTF-8 is that you can detect the start of a code point from any position in the string. Simply search backwards until you find a byte such that (b & 0xc0) != 0x80, and you've found the start of another character. Since a UTF-8 encoded code point is at most 4 bytes (6 under the obsolete pre-RFC 3629 definition), you can copy the intermediate bytes into a fixed-length buffer.
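The backwards scan can be sketched as follows (class and method names are mine): continuation bytes in UTF-8 always match the bit pattern 10xxxxxx, so skipping backwards over them lands on the lead byte of the code point.

```java
public class Utf8Scan {
    // Index of the lead byte of the code point containing byte i.
    // Continuation bytes match the bit pattern 10xxxxxx.
    static int codePointStart(byte[] utf8, int i) {
        while (i > 0 && (utf8[i] & 0xC0) == 0x80) {
            i--; // skip backwards over continuation bytes
        }
        return i;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "a中b".getBytes("UTF-8"); // '中' occupies bytes 1..3
        System.out.println(codePointStart(bytes, 3)); // 1: inside '中'
        System.out.println(codePointStart(bytes, 4)); // 4: 'b' is its own code point
    }
}
```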

Edit: I forgot to mention that even if you don't go with this strategy, a Java "char" is not sufficient to store arbitrary code points, since code point values can exceed 0xffff. You need to store code points in an "int".
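A short illustration of iterating with int code points (the sample string is my own); a char-by-char loop would see the emoji as two meaningless surrogate values:

```java
public class CodePointIteration {
    public static void main(String[] args) {
        String s = "A中\uD83D\uDE00"; // 'A', '中', and an emoji (surrogate pair)
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);    // an int, since values can exceed 0xffff
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
    }
}
```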

我纯我任性 2024-09-06 16:58:01


It is possible for an encoding scheme to encode a given character as a variable number of bytes, depending on what comes before and after it in the character sequence. The byte length you get from encoding a single-character String is therefore not the whole answer.

(For example, you could theoretically receive baudot / teletype characters packed 4 characters to every 3 bytes, or you could theoretically treat UTF-16 plus a stream compressor as an encoding scheme. Yes, it's all a bit implausible, but ...)
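A concrete, non-hypothetical case of this context dependence is a stateful charset such as ISO-2022-JP, where mode-switch escape sequences are shared across a run of characters, so the apparent byte cost of a character depends on its neighbors. A hedged sketch (exact byte counts depend on the JDK's encoder, so the comments only state relative sizes):

```java
import java.nio.charset.Charset;

public class StatefulEncodingDemo {
    public static void main(String[] args) {
        Charset iso2022jp = Charset.forName("ISO-2022-JP");
        int one = iso2022jp.encode("あ").limit();  // mode-switch escapes + the character
        int two = iso2022jp.encode("ああ").limit(); // escapes paid once, shared by both
        System.out.println(one);       // well over 2: escapes dominate
        System.out.println(two - one); // 2: the marginal cost of the second character
    }
}
```

Measuring the first あ in isolation would thus badly overstate what it contributes inside a longer run.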
