I know there is `String#length` and the various methods in `Character`, which more or less work on code units/code points.

What is the suggested way in Java to actually return the result as specified by Unicode standards (UAX#29), taking things like language/locale, normalization and grapheme clusters into account?

The normal model of Java string length

`String.length()` is specified as returning the number of `char` values ("code units") in the String. That is the most generally useful definition of the length of a Java String; see below.

Your description [1] of the semantics of `length` based on the size of the backing array/array slice is incorrect. The fact that the value returned by `length()` is also the size of the backing array or array slice is merely an implementation detail of typical Java class libraries. `String` does not need to be implemented that way. Indeed, I think I've seen Java String implementations where it WASN'T implemented that way.

Alternative models of string length

To get the number of Unicode code points in a String, use `str.codePointCount(0, str.length())`; see the javadoc.

To get the size (in bytes) of a String in a specific encoding (i.e. charset), use `str.getBytes(charset).length` [2].

To deal with locale-specific issues, you can use `Normalizer` to normalize the String to whatever form is most appropriate to your use-case, and then use `codePointCount` as above. But in some cases, even this won't work; e.g. the Hungarian letter counting rules, which the Unicode standard apparently doesn't cater for.

Using String.length() is generally OK

The reason that most applications use `String.length()` is that most applications are not concerned with counting the number of characters in words, texts, etcetera in a human-centric way. For instance, if I do this:

it really doesn't matter that `"mum".length()` is not returning code points or that it is not a linguistically correct character count. It is measuring the length of the string using the model that is appropriate to the task at hand. And it works.

Obviously, things get a bit more complicated when you do multilingual text analysis; e.g. searching for words. But even then, if you normalize your text and parameters before you start, you can safely code in terms of "code units" rather than "code points" most of the time; i.e. `length()` still works.

1 - This description was on some versions of the question. See the edit history ... if you have sufficient rep points.

2 - Using `str.getBytes(charset).length` entails doing the encoding and throwing it away. There is possibly a general way to do this without that copy. It would entail wrapping the `String` as a `CharBuffer`, creating a custom `ByteBuffer` with no backing to act as a byte counter, and then using `CharsetEncoder.encode(...)` to count the bytes. Note: I have not tried this, and I would not recommend trying unless you have clear evidence that `getBytes(charset)` is a significant performance bottleneck.
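
The three length models described in this answer (code units, code points, encoded bytes) can be compared side by side. The following is only a minimal sketch: the class name `LengthModels` and its method names are made up for illustration, and the sample string is an assumption chosen to include one character outside the Basic Multilingual Plane.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LengthModels {
    // Number of UTF-16 code units (char values): what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies in the given charset.
    // This encodes the whole string and then discards the result.
    static int encodedBytes(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 written as a surrogate pair, then U+00E9.
        String s = "a\uD83D\uDE00\u00E9";
        System.out.println(codeUnits(s));                                // 4
        System.out.println(codePoints(s));                               // 3
        System.out.println(encodedBytes(s, StandardCharsets.UTF_8));     // 7
        System.out.println(encodedBytes(s, StandardCharsets.UTF_16BE));  // 8
    }
}
```

The same three-character text yields four different numbers, which is why the right method depends on what the length is going to be used for.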
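
Footnote 2's idea of counting encoded bytes without keeping the encoded copy might look roughly like the sketch below. It is not the answer author's code. `ByteBuffer` cannot be subclassed outside the JDK, so instead of a custom buffer with no backing, this version substitutes a small scratch buffer that is reused and whose contents are ignored. The class name `EncodedLength` and the 1024-byte scratch size are assumptions, and the same caveat applies: this is probably not worth doing unless `getBytes(charset)` is a measured bottleneck.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

final class EncodedLength {
    // Hypothetical helper (not a JDK method): counts the bytes the string
    // would occupy in the charset without allocating one full-size byte array.
    static long count(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                // Replace bad input instead of failing, as getBytes(charset) does.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer scratch = ByteBuffer.allocate(1024); // reused; contents are ignored
        long total = 0;
        CoderResult result;
        // Encode chunk by chunk; OVERFLOW just means the scratch buffer filled up.
        do {
            result = encoder.encode(in, scratch, true);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        // Some encoders emit trailing bytes on flush; count those as well.
        do {
            result = encoder.flush(scratch);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        return total;
    }
}
```

For example, `EncodedLength.count(text, StandardCharsets.UTF_8)` should agree with `text.getBytes(StandardCharsets.UTF_8).length` while only ever holding 1024 encoded bytes at a time.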

`java.text.BreakIterator` is able to iterate over text and can report on "character", word, sentence and line boundaries.

Consider this code:
Running it:
With surrogate pairs:
This should do the job in most cases.
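
As an illustration of the approach described in this answer, a user-perceived character count can be obtained by stepping a character-instance `BreakIterator` across the text. This is a minimal sketch and not the answer's original listing; the class name `GraphemeCount` is made up for the example.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class GraphemeCount {
    // Counts the "character" boundaries BreakIterator reports for the locale.
    static int count(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        int count = 0;
        // next() advances to the following boundary, or returns DONE at the end.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("abc", Locale.ROOT));             // 3
        // 'a' followed by U+1F600, which is stored as a surrogate pair.
        System.out.println(count("a\uD83D\uDE00", Locale.ROOT));   // 2
    }
}
```

A surrogate pair is reported as a single unit here, whereas `length()` would count it as two. A base letter followed by a combining mark is normally treated as one unit as well, although the exact boundary rules depend on the JDK version.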

It depends on exactly what you mean by "length of [the] String":

`String.length()` returns the number of `char`s in the `String`. This is normally only useful for programming-related tasks like allocating buffers, because characters outside the Basic Multilingual Plane are stored as surrogate pairs, which means one `char` doesn't necessarily correspond to one Unicode code point.

`String.codePointCount(int, int)` and `Character.codePointCount(CharSequence, int, int)` both return the number of Unicode code points in the `String`. This is normally only useful for programming-related tasks that require looking at a `String` as a series of Unicode code points without needing to worry about surrogate pairs interfering.

`BreakIterator.getCharacterInstance(Locale)` can be used to get the next grapheme in a `String` for the given `Locale`. Using this multiple times allows you to count the number of graphemes in a `String`. Since graphemes are basically letters (in most circumstances), this method is useful for getting the number of writable characters the `String` contains. Essentially, it returns approximately the same number you would get if you manually counted the letters in the `String`, making it useful for things like sizing user interfaces and splitting `String`s without corrupting the data.

To give you an idea of how each of the different methods can return different lengths for the exact same data, I created this class to quickly generate the lengths of the Unicode text contained within this page, which is designed to offer a comprehensive test of many different languages with non-English characters. Here are the results of executing that code after normalizing the input file in three different ways (no normalizing, NFC, NFD):

As you can see, even the "same-looking" `String` could give different results for the length if you use either `String.length()` or `String.codePointCount(int,int)`.

For more information on this topic and other similar topics you should read this blog post that covers a variety of basics on using Java to properly handle Unicode.
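
The effect of normalization on the reported length can be reproduced with a single word. The sketch below is not the class mentioned in this answer; `NormalizedLengths` and `describe` are hypothetical names, and the sample word is an assumption chosen because it contains one precomposed accented letter.

```java
import java.text.Normalizer;

final class NormalizedLengths {
    // Normalizes s to the given form and reports both length models.
    static String describe(String s, Normalizer.Form form) {
        String n = Normalizer.normalize(s, form);
        return form + ": length=" + n.length()
                + ", codePoints=" + n.codePointCount(0, n.length());
    }

    public static void main(String[] args) {
        // "cafe" with a precomposed e-acute (U+00E9) as the last letter.
        String s = "caf\u00E9";
        System.out.println(describe(s, Normalizer.Form.NFC)); // NFC: length=4, codePoints=4
        System.out.println(describe(s, Normalizer.Form.NFD)); // NFD: length=5, codePoints=5
    }
}
```

NFD splits the accented letter into a base letter plus a combining mark, so both `length()` and `codePointCount` grow by one even though the rendered text looks identical.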

If you mean, counting the length of a string according to the grammatical rules of a language, then the answer is no, there's no such algorithm in Java, nor anywhere else. Not unless the algorithm also does a full semantic analysis of the text.

In Hungarian, for example, `sz` and `zs` can count as one letter or two, depending on the composition of the word they appear in. (E.g. `ország` is 5 letters, whereas `torzság` is 7.)

Update: If all you want is the Unicode standard character count (which, as I pointed out, isn't accurate), transforming your string to the `NFKC` form with `java.text.Normalizer` could be a solution.
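
The NFKC suggestion can be sketched as below. Counting code points after normalizing is an assumption about what "character count" means here, and `NfkcCount` is a made-up name. The Hungarian words from this answer show the limitation it describes.

```java
import java.text.Normalizer;

final class NfkcCount {
    // Normalizes to NFKC, then counts Unicode code points.
    static int count(String s) {
        String n = Normalizer.normalize(s, Normalizer.Form.NFKC);
        return n.codePointCount(0, n.length());
    }

    public static void main(String[] args) {
        System.out.println(count("e\u0301"));      // 1: base letter + combining acute composes
        System.out.println(count("\uFB01"));       // 2: the "fi" ligature expands to two letters
        System.out.println(count("orsz\u00E1g"));  // 6, although Hungarian counts 5 letters
        System.out.println(count("torzs\u00E1g")); // 7, which happens to match
    }
}
```

Normalization evens out different encodings of the same text, but it cannot know that `sz` is a single letter in one word and two letters in another.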

The `.indexOf()` method gives a hint: