How do you correctly compute the length of a String in Java?

Posted on 2024-11-26 05:02:19

I know there is String#length and the various methods in Character which more or less work on code units/code points.

What is the suggested way in Java to actually return the result as specified by Unicode standards (UAX#29), taking things like language/locale, normalization and grapheme clusters into account?

Comments (5)

南汐寒笙箫 2024-12-03 05:02:19

The normal model of Java string length

String.length() is specified as returning the number of char values ("code units") in the String. That is the most generally useful definition of the length of a Java String; see below.

Your description [1] of the semantics of length based on the size of the backing array/array slice is incorrect. The fact that the value returned by length() is also the size of the backing array or array slice is merely an implementation detail of typical Java class libraries. String does not need to be implemented that way. Indeed, I think I've seen Java String implementations where it WASN'T implemented that way.


Alternative models of string length.

To get the number of Unicode code points in a String, use str.codePointCount(0, str.length()); see the javadoc.

To get the size (in bytes) of a String in a specific encoding (i.e. charset), use str.getBytes(charset).length [2].
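As a rough illustration of how these three models diverge, here is a minimal sketch. The class name LengthModels and its helper methods are made up for this example; only length(), codePointCount() and getBytes() come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class LengthModels {
    // Number of UTF-16 code units (char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        String s = "a\uD83D\uDE00";
        System.out.println(codeUnits(s));  // 3
        System.out.println(codePoints(s)); // 2
        System.out.println(utf8Bytes(s));  // 5
    }
}
```

The same string has three different "lengths" depending on which model the task calls for.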

To deal with locale-specific issues, you can use Normalizer to normalize the String to whatever form is most appropriate to your use-case, and then use codePointCount as above. But in some cases, even this won't work; e.g. the Hungarian letter counting rules which the Unicode standard apparently doesn't cater for.
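A minimal sketch of the normalize-then-count idea, assuming NFC or NFD is the form you want. The class and method names are hypothetical; java.text.Normalizer and codePointCount are the real JDK APIs.

```java
import java.text.Normalizer;

public class NormalizedCount {
    // Normalize the string to the given form, then count its code points.
    static int codePointsAfter(String s, Normalizer.Form form) {
        String normalized = Normalizer.normalize(s, form);
        return normalized.codePointCount(0, normalized.length());
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        String s = "e\u0301";
        System.out.println(codePointsAfter(s, Normalizer.Form.NFC)); // 1
        System.out.println(codePointsAfter(s, Normalizer.Form.NFD)); // 2
    }
}
```

The count depends on the chosen form, so pick one form and apply it consistently to every string you compare or measure.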


Using String.length() is generally OK

The reason that most applications use String.length() is that most applications are not concerned with counting the number of characters in words, texts, etcetera in a human-centric way. For instance, if I do this:

String s = "hi mum how are you";
int pos = s.indexOf("mum");
String textAfterMum = s.substring(pos + "mum".length());

it really doesn't matter that "mum".length() is not returning code points or that it is not a linguistically correct character count. It is measuring the length of the string using the model that is appropriate to the task at hand. And it works.

Obviously, things get a bit more complicated when you do multilingual text analysis; e.g. searching for words. But even then, if you normalize your text and parameters before you start, you can safely code in terms of "code units" rather than "code points" most of the time; i.e. length() still works.


1 - This description was on some versions of the question. See the edit history ... if you have sufficient rep points.
2 - Using str.getBytes(charset).length entails doing the encoding and then throwing the result away. There is possibly a general way to do this without that copy. It would entail wrapping the String as a CharBuffer, creating a custom ByteBuffer with no backing to act as a byte counter, and then using CharsetEncoder.encode(...) to count the bytes. Note: I have not tried this, and I would not recommend trying unless you have clear evidence that getBytes(charset) is a significant performance bottleneck.
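For what it's worth, here is an untuned sketch of the idea in footnote 2. It is not the custom ByteBuffer described there: instead it reuses one small scratch buffer and adds up how many bytes the encoder writes into it, which avoids allocating the full encoded array. The class name EncodedSize and the 1024-byte scratch size are arbitrary choices for this example, and whether it is actually faster than getBytes(charset) would need measuring.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodedSize {
    // Counts the bytes the charset would produce without keeping them.
    static long encodedLength(String s, Charset charset) {
        // REPLACE mirrors what String.getBytes(Charset) does with bad input.
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer scratch = ByteBuffer.allocate(1024); // reused for every chunk
        long total = 0;
        CoderResult result;
        do {
            // Encode as much as fits, count it, then discard it.
            result = encoder.encode(in, scratch, true);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        do {
            // Some charsets emit trailing bytes on flush.
            result = encoder.flush(scratch);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        return total;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("a\uD83D\uDE00", StandardCharsets.UTF_8)); // 5
    }
}
```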

亚希 2024-12-03 05:02:19

java.text.BreakIterator is able to iterate over text and can report on "character", word, sentence and line boundaries.

Consider this code:

def length(text: String, locale: java.util.Locale = java.util.Locale.ENGLISH) = {
  val charIterator = java.text.BreakIterator.getCharacterInstance(locale)
  charIterator.setText(text)

  var result = 0
  while(charIterator.next() != java.text.BreakIterator.DONE) result += 1
  result
}

Running it:

scala> val text = "Thîs lóo̰ks we̐ird!"
text: java.lang.String = Thîs lóo̰ks we̐ird!

scala> val graphemes = length(text)
graphemes: Int = 17

scala> val codepoints = text.codePointCount(0, text.length)
codepoints: Int = 21 

With surrogate pairs:

scala> val parens = "\uDBFF\uDFFCsurpi\u0301se!\uDBFF\uDFFD"
parens: java.lang.String = ????surpíse!????

scala> val graphemes = length(parens)
graphemes: Int = 10

scala> val codepoints = parens.codePointCount(0, parens.length)
codepoints: Int = 11

scala> val codeunits = parens.length
codeunits: Int = 13

This should do the job in most cases.
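Since the question is about Java, the same counting loop can be sketched in Java as follows. The class name GraphemeCount is made up for this example; BreakIterator.getCharacterInstance, setText, next and DONE are the real java.text API. Results for more complex sequences (emoji, for instance) may differ between JDK versions.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCount {
    // Counts user-perceived characters by walking character boundaries.
    static int count(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        int count = 0;
        // Each call to next() advances to the following boundary.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'i' followed by U+0301 is one user-perceived character.
        String s = "surpi\u0301se!";
        System.out.println(s.length());              // 9
        System.out.println(count(s, Locale.ENGLISH)); // 8
    }
}
```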

绝對不後悔。 2024-12-03 05:02:19

It depends on exactly what you mean by "length of [the] String":

  • String.length() returns the number of chars (UTF-16 code units) in the String. This is normally only useful for programming-related tasks like allocating buffers, because characters outside the Basic Multilingual Plane are stored as surrogate pairs, so one char does not necessarily correspond to one Unicode code point.
  • String.codePointCount(int, int) and Character.codePointCount(CharSequence, int, int) both return the number of Unicode code points in the String. This is normally only useful for programming-related tasks that need to treat a String as a series of Unicode code points without surrogate pairs getting in the way.
  • BreakIterator.getCharacterInstance(Locale) can be used to get the next grapheme in a String for the given Locale. Using this multiple times can allow you to count the number of graphemes in a String. Since graphemes are basically letters (in most circumstances) this method is useful for getting the number of writable characters the String contains. Essentially this method returns approximately the same number you would get if you manually counted the number of letters in the String, making it useful for things like sizing user interfaces and splitting Strings without corrupting the data.

To give you an idea of how each of the different methods can return different lengths for the exact same data, I created this class to quickly generate the lengths of the Unicode text contained within this page, which is designed to offer a comprehensive test of many different languages with non-English characters. Here are the results of executing that code after normalizing the input file in three different ways (no normalizing, NFC, NFD):

Input UTF-8 String
>>  String.length() = 3431
>>  String.codePointCount(int,int) = 3431
>>  BreakIterator.getCharacterInstance(Locale) = 3386
NFC Normalized UTF-8 String
>>  String.length() = 3431
>>  String.codePointCount(int,int) = 3431
>>  BreakIterator.getCharacterInstance(Locale) = 3386
NFD Normalized UTF-8 String
>>  String.length() = 3554
>>  String.codePointCount(int,int) = 3554
>>  BreakIterator.getCharacterInstance(Locale) = 3386

As you can see, even the "same-looking" String could give different results for the length if you use either String.length() or String.codePointCount(int,int).
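The effect can be reproduced on a tiny input. The sketch below is not the class used to produce the numbers above (that one is only linked); LengthsByForm and its methods are hypothetical names, and passing null for the form is this example's own convention for "no normalization".

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Locale;

public class LengthsByForm {
    // Counts character boundaries reported by BreakIterator.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ENGLISH);
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    // Returns {code units, code points, graphemes}; a null form means "as is".
    static int[] measure(String s, Normalizer.Form form) {
        String n = (form == null) ? s : Normalizer.normalize(s, form);
        return new int[] { n.length(), n.codePointCount(0, n.length()), graphemes(n) };
    }

    public static void main(String[] args) {
        String s = "cafe\u0301"; // "café" written with a combining accent
        System.out.println(Arrays.toString(measure(s, null)));                // [5, 5, 4]
        System.out.println(Arrays.toString(measure(s, Normalizer.Form.NFC))); // [4, 4, 4]
        System.out.println(Arrays.toString(measure(s, Normalizer.Form.NFD))); // [5, 5, 4]
    }
}
```

Only the BreakIterator count stays the same across the forms, which matches the pattern in the results above.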

For more information on this topic and other similar topics you should read this blog post that covers a variety of basics on using Java to properly handle Unicode.

紙鸢 2024-12-03 05:02:19

If you mean, counting the length of a string according to the grammatical rules of a language, then the answer is no, there's no such algorithm in Java, nor anywhere else.

Not unless the algorithm also does a full semantic analysis of the text.

In Hungarian for example sz and zs can count as one letter or two, which depends on the composition of the word they appear in. (E.g.: ország is 5 letters, whereas torzság is 7.)

Update: If all you want is the Unicode standard character count (which, as I pointed out, isn't accurate), transforming your string to the NFKC form with java.text.Normalizer could be a solution.
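A minimal sketch of what that might look like, assuming a code point count of the NFKC form is what you are after. The class name NfkcCount is made up for this example.

```java
import java.text.Normalizer;

public class NfkcCount {
    // Normalizes to NFKC, then counts code points.
    static int nfkcCodePoints(String s) {
        String n = Normalizer.normalize(s, Normalizer.Form.NFKC);
        return n.codePointCount(0, n.length());
    }

    public static void main(String[] args) {
        // 'e' + combining acute accent composes into a single code point.
        System.out.println(nfkcCodePoints("e\u0301")); // 1
        // U+FB01 LATIN SMALL LIGATURE FI expands to "fi" under NFKC.
        System.out.println(nfkcCodePoints("\uFB01"));  // 2
    }
}
```

Note that NFKC also rewrites compatibility characters such as ligatures, so the count can go up as well as down.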

人海汹涌 2024-12-03 05:02:19

The .indexOf() method gives a hint:

int length = (yourString + "whatever").indexOf("whatever");
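A quick check of this trick, wrapped in a hypothetical helper class, suggests it returns the same value as length() (a code unit count) only when the marker does not already occur in the string:

```java
public class IndexOfLength {
    // The trick from above, wrapped in a method for testing.
    static int viaIndexOf(String s) {
        return (s + "whatever").indexOf("whatever");
    }

    public static void main(String[] args) {
        System.out.println(viaIndexOf("hello"));     // 5, same as "hello".length()
        // indexOf finds the first occurrence, so a string that already
        // contains the marker gives the wrong answer.
        System.out.println(viaIndexOf("whatever!")); // 0, but length() is 9
    }
}
```

So it offers nothing over String.length() and fails for inputs that contain the marker.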