What is the easiest/best/most correct way to iterate through the characters of a string in Java?
Some ways to iterate through the characters of a string in Java are:
- Using StringTokenizer?
- Converting the String to a char[] and iterating over that.
What is the easiest/best/most correct way to iterate?
Elaborating on this answer and this answer.
The answers above point out the problem with many of the solutions here, which don't iterate by code point value: they would have trouble with any surrogate chars. The Java docs also outline the issue here (see "Unicode Character Representations"). Anyhow, here's some code that uses some actual surrogate chars from the supplementary Unicode set, and converts them back to a String. Note that .toChars() returns an array of chars: if you're dealing with surrogates, you'll necessarily have two chars. This code should work for any Unicode character.
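The answer's original snippet is not preserved here, so the following is only a minimal sketch of the idea it describes. It takes a few code points (chosen here purely for illustration), converts each one back to a String with Character.toChars(), and shows that supplementary code points occupy two chars. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateRoundTrip {
    // Converts each code point to its own String via Character.toChars().
    static List<String> toStrings(int[] codePoints) {
        List<String> out = new ArrayList<>();
        for (int cp : codePoints) {
            // One char for a BMP code point, two (a surrogate pair) otherwise.
            char[] units = Character.toChars(cp);
            out.add(new String(units));
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative sample: 'A', U+1D11E and U+1F600 (both supplementary).
        int[] codePoints = {0x41, 0x1D11E, 0x1F600};
        StringBuilder sb = new StringBuilder();
        for (String s : toStrings(codePoints)) {
            sb.append(s);
            System.out.println(s + " uses " + s.length() + " char(s)");
        }
        String all = sb.toString();
        // length() counts chars, codePointCount() counts code points.
        System.out.println("length() = " + all.length()
                + ", code points = " + all.codePointCount(0, all.length()));
    }
}
```

The gap between length() and codePointCount() is exactly what a plain char-by-char loop gets wrong for supplementary characters.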
This example code will help you out!
So typically there are two ways to iterate through a string in Java, which have already been covered by multiple people in this thread; I'm just adding my version.
First is using
If performance is at stake, I would recommend the first one, which works in constant time; if it is not, the second one makes your work easier, considering the immutability of the String class in Java.
If you need all characters one by one as String you can use this:
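The snippet this answer refers to is missing from the page. As an assumption about what it might have looked like, here is one simple way to get each character as its own String, using String.valueOf on the result of charAt(). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CharsAsStrings {
    // Returns every char of the input as a one-character String.
    // Note: a supplementary character would be split into its two surrogates.
    static List<String> oneByOne(String text) {
        List<String> out = new ArrayList<>(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.add(String.valueOf(text.charAt(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : oneByOne("text")) {
            System.out.println(s);
        }
    }
}
```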
I use a for loop to iterate the string and use charAt() to get each character to examine it. Since the String is implemented with an array, the charAt() method is a constant time operation.
That's what I would do. It seems the easiest to me.
As far as correctness goes, I don't believe that exists here. It is all based on your personal style.
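The loop described in this answer can be sketched roughly as follows. Counting digits is only a stand-in for "examining" each character, and the class and method names are hypothetical.

```java
class CharAtLoop {
    // Walks the string by index and examines each char with charAt().
    static int countDigits(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countDigits("a1b22c333"));
    }
}
```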
Two options
or
The first is probably faster; the second is probably more readable.
Note that most of the other techniques described here break down if you're dealing with characters outside of the BMP (Unicode Basic Multilingual Plane), i.e. code points outside the U+0000 to U+FFFF range. This will only happen rarely, since the code points outside this range are mostly assigned to dead languages. But there are some useful characters outside it, for example some code points used for mathematical notation, and some used to encode proper names in Chinese.
In that case your code will be:
The Character.charCount(int) method requires Java 5+.
Source: http://mindprod.com/jgloss/codepoint.html
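The code for this answer did not survive on the page. A minimal sketch of a code-point-aware loop built on String.codePointAt() and the Character.charCount(int) method mentioned above might look like this (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

class CodePointLoop {
    // Collects the code points of a string, stepping over surrogate pairs.
    static List<Integer> codePointsOf(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            // Advance by 1 for BMP code points, by 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // "\uD835\uDD38" is the surrogate pair for U+1D538, a mathematical symbol.
        for (int cp : codePointsOf("a\uD835\uDD38b")) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```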
In Java 8 we can solve it as:
The method chars() returns an IntStream, as mentioned in the doc:
The method codePoints() also returns an IntStream, as per the doc:
How are char and code point different? As mentioned in this article:
Finally, why forEachOrdered and not forEach?
The behaviour of forEach is explicitly nondeterministic, whereas forEachOrdered performs an action for each element of this stream, in the encounter order of the stream if the stream has a defined encounter order. So forEach does not guarantee that the order would be kept. Also check this question for more.
For the difference between a character, a code point, a glyph and a grapheme, check this question.
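The snippets from this answer are missing, so here is a hedged sketch of the two stream-based approaches it names: chars() and codePoints(), each consumed with forEachOrdered. The class and method names are hypothetical.

```java
class StreamIteration {
    // Rebuilds the string char by char; chars() yields UTF-16 code units as ints.
    static String viaChars(String s) {
        StringBuilder sb = new StringBuilder();
        s.chars().forEachOrdered(c -> sb.append((char) c));
        return sb.toString();
    }

    // Rebuilds the string code point by code point.
    static String viaCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEachOrdered(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by one supplementary character
        System.out.println("chars: " + s.chars().count());
        System.out.println("code points: " + s.codePoints().count());
        System.out.println(viaChars(s).equals(viaCodePoints(s)));
    }
}
```

Both methods reproduce the input because forEachOrdered keeps the encounter order; the two streams differ only in how many elements they yield for supplementary characters.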
I agree that StringTokenizer is overkill here. Actually, I tried out the suggestions above and timed them.
My test was fairly simple: create a StringBuilder with about a million characters, convert it to a String, and traverse each of them with charAt() / after converting to a char array / with a CharacterIterator a thousand times (of course making sure to do something on the string so the compiler can't optimize away the whole loop :-) ).
The result on my 2.6 GHz Powerbook (that's a mac :-) ) and JDK 1.5:
As the results are significantly different, the most straightforward way also seems to be the fastest one. Interestingly, charAt() of a StringBuilder seems to be slightly slower than the one of String.
BTW I suggest not to use CharacterIterator as I consider its abuse of the '\uFFFF' character as "end of iteration" a really awful hack. In big projects there's always two guys that use the same kind of hack for two different purposes and the code crashes really mysteriously.
Here's one of the tests:
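The author's test code and timings are not preserved on this page. The following is only a rough sketch of how such a comparison could be set up, covering charAt() and the char-array variant; the string size, repetition count, and all names are assumptions, and a naive System.nanoTime() loop like this is not a rigorous benchmark.

```java
class IterationTiming {
    // Sums all chars using charAt(); the sum keeps the loop from being optimized away.
    static long sumWithCharAt(String s) {
        long sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Sums all chars after copying the string into a char array.
    static long sumWithCharArray(String s) {
        long sum = 0;
        for (char c : s.toCharArray()) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) {
            sb.append((char) ('a' + i % 26));
        }
        String s = sb.toString();
        int repetitions = 100; // assumed value, tune for your machine

        long start = System.nanoTime();
        long total = 0;
        for (int r = 0; r < repetitions; r++) total += sumWithCharAt(s);
        long charAtNanos = System.nanoTime() - start;

        start = System.nanoTime();
        for (int r = 0; r < repetitions; r++) total += sumWithCharArray(s);
        long arrayNanos = System.nanoTime() - start;

        System.out.println("charAt:      " + charAtNanos / 1_000_000 + " ms");
        System.out.println("toCharArray: " + arrayNanos / 1_000_000 + " ms");
        System.out.println("checksum:    " + total);
    }
}
```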
There are some dedicated classes for this:
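The classes this answer lists are not preserved on the page. One candidate, named elsewhere in this thread, is java.text.CharacterIterator with its standard implementation StringCharacterIterator; a minimal sketch of using it follows (class and method names are hypothetical).

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

class CharacterIteratorDemo {
    // Collects the chars produced by a CharacterIterator until it reports DONE.
    static String collect(String s) {
        StringBuilder sb = new StringBuilder();
        CharacterIterator it = new StringCharacterIterator(s);
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect("hello"));
        // DONE is '\uFFFF', so this common loop stops early if the text contains it.
        System.out.println(collect("a\uFFFFb").length());
    }
}
```

Because DONE is the value '\uFFFF', a string that actually contains that char ends this loop early, which is the drawback another answer in this thread complains about.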
If you have Guava on your classpath, the following is a pretty readable alternative. Guava even has a fairly sensible custom List implementation for this case, so this shouldn't be inefficient.
UPDATE: As @Alex noted, with Java 8 there's also CharSequence#chars to use. Its type is IntStream, though, so it can be mapped to chars like:
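The original one-liner is missing. One way the mapping from the IntStream back to characters could look, using only the JDK, is sketched below; the class and method names are hypothetical, and the Guava call itself is not reproduced here.

```java
import java.util.List;
import java.util.stream.Collectors;

class CharsToCharacters {
    // Maps each int from chars() back to a char, which is boxed to Character.
    static List<Character> toCharacters(CharSequence text) {
        return text.chars()
                .mapToObj(c -> (char) c)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (Character c : toCharacters("abc")) {
            System.out.println(c);
        }
    }
}
```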
If you need to iterate through the code points of a String (see this answer), a shorter / more readable way is to use the CharSequence#codePoints method added in Java 8:
or using the stream directly instead of a for loop:
There is also CharSequence#chars if you want a stream of the characters (although it is an IntStream, since there is no CharStream).
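Since the snippets are missing, here is a sketch of both forms this answer describes: a for-each loop over codePoints() (by treating the stream's iterator method as an Iterable) and the equivalent direct stream pipeline. Counting letters is just an illustrative task, and the names are hypothetical.

```java
class CodePointsForEach {
    // For-each over code points: the method reference adapts the stream to Iterable.
    static int countLettersLoop(String s) {
        int n = 0;
        for (int cp : (Iterable<Integer>) s.codePoints()::iterator) {
            if (Character.isLetter(cp)) {
                n++;
            }
        }
        return n;
    }

    // Same computation using the stream directly.
    static int countLettersStream(String s) {
        return (int) s.codePoints().filter(cp -> Character.isLetter(cp)).count();
    }

    public static void main(String[] args) {
        String s = "ab\uD83D\uDE001"; // two letters, one emoji, one digit
        System.out.println(countLettersLoop(s));
        System.out.println(countLettersStream(s));
    }
}
```

The cast works because Iterable has a single abstract method, iterator(), and the IntStream's iterator is an Iterator of Integer; the resulting Iterable can only be traversed once.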
If you need performance, then you must test in your own environment. There is no other way.
Here is example code:
On Java online I get:
On Android x86 API 17 I get:
I wouldn't use StringTokenizer as it is one of the classes in the JDK that's legacy.
The javadoc says:
Output:
See The Java Tutorials: Strings.
Put the length into int len and use a for loop.
StringTokenizer is totally unsuited to the task of breaking a string into its individual characters. With String#split() you can do that easily by using a regex that matches nothing, e.g.:
But StringTokenizer doesn't use regexes, and there's no delimiter string you can specify that will match the nothing between characters. There is one cute little hack you can use to accomplish the same thing: use the string itself as the delimiter string (making every character in it a delimiter) and have it return the delimiters:
However, I only mention these options for the purpose of dismissing them. Both techniques break the original string into one-character strings instead of char primitives, and both involve a great deal of overhead in the form of object creation and string manipulation. Compare that to calling charAt() in a for loop, which incurs virtually no overhead.
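The two snippets from this answer are not preserved. A sketch of both techniques follows, for illustration only: the regex "(?!^)" is one pattern that matches the empty gap between characters (an assumption about what the answer used), and the StringTokenizer call passes the string as its own delimiter set with returnDelims set to true. The names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class OneCharStrings {
    // Splits on every empty position except the very start of the input.
    static String[] viaSplit(String s) {
        return s.split("(?!^)");
    }

    // Every char is a delimiter, and delimiters are returned as tokens.
    static List<String> viaTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, s, true);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", viaSplit("abc")));
        System.out.println(String.join(",", viaTokenizer("abc")));
    }
}
```

Both produce one String object per character, which is the overhead the answer warns about when compared with a plain charAt() loop.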