An efficient way to calculate a character's byte length according to its encoding

Published 2024-08-30 16:58:01


What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is only known at runtime. In UTF-8, for example, characters have a variable byte length, so each character needs to be measured individually. So far I've come up with this:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;

But this is clumsy and inefficient in a loop, since a new String needs to be created every time. I can't find other, more efficient ways in the Java API. There's String#valueOf(char), but according to its source it does basically the same as the above. I imagine this could be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)

If you question the need for this, check this topic: https://stackoverflow.com/questions/2725897/reverse-massive-text-file-in-java/2725993#2725993


Update: the answer from @Bkkbrad is technically the most efficient:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();

However, as @Stephen C pointed out, there are more problems with this. There may, for example, be combining/surrogate characters which need to be taken into account as well. But that's another problem, which needs to be solved in a step before this one.
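One way to fold the surrogate-pair concern into the measurement is to step through the string by code point rather than by char. A minimal sketch (the helper name byteLengthOf is mine, and it still allocates a temporary String per code point, so it favors correctness over speed):

```java
import java.nio.charset.Charset;

public class CodePointBytes {
    // Byte length of the single code point starting at index i of s,
    // measured under the given charset. codePointAt joins surrogate
    // pairs into one int code point.
    static int byteLengthOf(String s, int i, Charset cs) {
        int cp = s.codePointAt(i);
        String unit = new String(Character.toChars(cp)); // one full code point
        return unit.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println(byteLengthOf("A", 0, utf8));            // 1
        System.out.println(byteLengthOf("中", 0, utf8));            // 3
        System.out.println(byteLengthOf("\uD83D\uDE00", 0, utf8)); // 4 (surrogate pair)
    }
}
```

To walk a whole string this way, advance the index by Character.charCount(cp) rather than by 1.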

Comments (4)

不一样的天空 2024-09-06 16:58:02


Try Charset.forName("UTF-8").encode("string").limit(); — it might be a bit more efficient, maybe not.

献世佛 2024-09-06 16:58:01


Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.

On my system, the following code takes 25 seconds to encode 100,000,000 single characters (10,000 reps of 10,000 characters):

Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}

However, the following code does the same thing in under 4 seconds:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}

Edit: Why do haters gotta hate?

Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = //allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);

int limit = input.limit();
while(input.position() < limit) {
    output.clear();
    input.mark();
    // Look at no more than two chars (a possible surrogate pair),
    // without running past the real end of the input.
    input.limit(Math.min(input.position() + 2, limit));
    if (Character.isHighSurrogate(input.get())
            && (!input.hasRemaining() || !Character.isLowSurrogate(input.get()))) {
        //Malformed surrogate pair; do something!
    }
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}
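A related sketch, built on the same buffer-reuse idea: a small wrapper that measures one code point at a time while reusing the encoder and both buffers across calls. The class and method names are mine, not from the answer above:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class ReusableMeasurer {
    private final CharsetEncoder encoder;
    private final char[] unit = new char[2];               // room for a surrogate pair
    private final CharBuffer in = CharBuffer.wrap(unit);
    private final ByteBuffer out = ByteBuffer.allocate(8); // ample for one code point

    ReusableMeasurer(Charset cs) {
        this.encoder = cs.newEncoder();
    }

    // Byte length of one code point, reusing the buffers on every call.
    int measure(int codePoint) {
        int nChars = Character.toChars(codePoint, unit, 0); // 1 or 2 chars
        in.position(0).limit(nChars);
        out.clear();
        encoder.reset();
        encoder.encode(in, out, true);
        return out.position();
    }

    public static void main(String[] args) {
        ReusableMeasurer m = new ReusableMeasurer(Charset.forName("UTF-8"));
        System.out.println(m.measure('A'));     // 1
        System.out.println(m.measure(0x4E2D));  // 3: '中'
        System.out.println(m.measure(0x1F600)); // 4: emoji above U+FFFF
    }
}
```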
南风起 2024-09-06 16:58:01


If you can guarantee that the input is well-formed UTF-8, then there's no reason to find code points at all. One of the strengths of UTF-8 is that you can detect the start of a code point from any position in the string. Simply search backwards until you find a byte such that (b & 0xc0) != 0x80, and you've found the start of another character. Since a UTF-8 encoded code point is at most 4 bytes (6 under the obsolete pre-RFC 3629 definition), you can copy the intermediate bytes into a fixed-length buffer.
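The backwards scan can be sketched as follows (class and method names are mine): continuation bytes in UTF-8 always match the bit pattern 10xxxxxx, so skipping backwards over them lands on the lead byte of the code point.

```java
public class Utf8Scan {
    // Index of the lead byte of the code point containing byte i.
    // Continuation bytes match the bit pattern 10xxxxxx.
    static int codePointStart(byte[] utf8, int i) {
        while (i > 0 && (utf8[i] & 0xC0) == 0x80) {
            i--; // skip backwards over continuation bytes
        }
        return i;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "a中b".getBytes("UTF-8"); // '中' occupies bytes 1..3
        System.out.println(codePointStart(bytes, 3)); // 1: inside '中'
        System.out.println(codePointStart(bytes, 4)); // 4: 'b' is its own code point
    }
}
```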

Edit: I forgot to mention that even if you don't go with this strategy, a Java "char" is not sufficient to store arbitrary code points, since code point values can exceed 0xffff. You need to store code points in an "int".
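A short illustration of iterating with int code points (the sample string is my own); a char-by-char loop would see the emoji as two meaningless surrogate values:

```java
public class CodePointIteration {
    public static void main(String[] args) {
        String s = "A中\uD83D\uDE00"; // 'A', '中', and an emoji (surrogate pair)
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);    // an int, since values can exceed 0xffff
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
    }
}
```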

我纯我任性 2024-09-06 16:58:01


It is possible for an encoding scheme to encode a given character as a variable number of bytes, depending on what comes before and after it in the character sequence. The byte length you get from encoding a single-character String is therefore not the whole answer.

(For example, you could theoretically receive baudot / teletype characters packed 4 characters to every 3 bytes, or you could theoretically treat UTF-16 plus a stream compressor as an encoding scheme. Yes, it's all a bit implausible, but ...)
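A concrete, non-hypothetical case of this context dependence is a stateful charset such as ISO-2022-JP, where mode-switch escape sequences are shared across a run of characters, so the apparent byte cost of a character depends on its neighbors. A hedged sketch (exact byte counts depend on the JDK's encoder, so the comments only state relative sizes):

```java
import java.nio.charset.Charset;

public class StatefulEncodingDemo {
    public static void main(String[] args) {
        Charset iso2022jp = Charset.forName("ISO-2022-JP");
        int one = iso2022jp.encode("あ").limit();  // mode-switch escapes + the character
        int two = iso2022jp.encode("ああ").limit(); // escapes paid once, shared by both
        System.out.println(one);       // well over 2: escapes dominate
        System.out.println(two - one); // 2: the marginal cost of the second character
    }
}
```

Measuring the first あ in isolation would thus badly overstate what it contributes inside a longer run.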
