从 Java 字符串中删除所有不可打印字符的最快方法
在 Java 中从 String
中删除所有不可打印字符的最快方法是什么?
到目前为止,我已经尝试并测量了 138 字节、131 个字符的字符串:
- String 的
replaceAll()
- 最慢方法- 517009 个结果/秒
- 预编译一个 Pattern,然后使用 Matcher 的
replaceAll()
- 637836 个结果/秒
- 使用 StringBuffer,使用
codepointAt()
逐一获取代码点并附加到 StringBuffer- 711946 个结果/秒
- 使用 StringBuffer,使用
charAt()
逐一获取字符并附加到 StringBuffer- 1052964 个结果/秒
- 预分配一个
char[]
缓冲区,使用charAt()
逐一获取字符并填充此缓冲区,然后转换回 String- 2022653 个结果/秒
- 预分配 2 个
char[]
缓冲区 - 新旧缓冲区,使用getChars()
一次获取现有字符串的所有字符,迭代旧缓冲区一对一填充新缓冲区,然后将新缓冲区转换为字符串 - 我自己最快的版本- 2502502 个结果/秒
- 具有 2 个缓冲区的相同内容 - 仅使用
byte[]
、getBytes()
并将编码指定为“utf-8”- 857485 个结果/秒
- 具有 2 个
byte[]
缓冲区的相同内容,但将编码指定为常量Charset.forName("utf-8")
- 791076 个结果/秒
- 与 2 个
byte[]
缓冲区相同的内容,但将编码指定为 1 字节本地编码(几乎不是明智之举)- 370164 个结果/秒
我最好的尝试如下:
char[] oldChars = new char[s.length()];
s.getChars(0, s.length(), oldChars, 0);
char[] newChars = new char[s.length()];
int newLen = 0;
for (int j = 0; j < s.length(); j++) {
char ch = oldChars[j];
if (ch >= ' ') {
newChars[newLen] = ch;
newLen++;
}
}
s = new String(newChars, 0, newLen);
关于如何让它更快?
回答一个非常奇怪的问题的奖励积分:为什么直接使用“utf-8”字符集名称比使用预分配的静态 const Charset.forName("utf-8")
产生更好的性能?
来自棘轮怪人的更新
- 建议产生了令人印象深刻的每秒 3105590 个结果的性能,提高了 +24%!
- Ed Staub 的建议带来了另一项改进 - 每秒 3471017 个结果,比之前的最佳结果提高了 12%。
更新 2
我已尽力收集所有建议的解决方案及其交叉突变,并将其发布为 github 上的小型基准测试框架。目前它支持 17 种算法。其中之一是“特殊” - Voo1 算法(由 SO 用户 Voo 提供)采用复杂的反射技巧,从而实现了惊人的速度,但它很混乱JVM 字符串的状态,因此它是单独进行基准测试的。
欢迎您检查并运行它以确定您的盒子上的结果。这是我得到的结果的摘要。它的规格:
- Debian sid
- Linux 2.6.39-2-amd64 (x86_64)
- Java 从包
sun-java6-jdk-6.24-1
安装,JVM 将自身标识为- Java(TM) SE 运行时环境(版本 1.6.0_24-b07)
- Java HotSpot(TM) 64 位服务器虚拟机(版本 19.1-b02,混合模式)
给定不同的输入集,不同的算法最终会显示不同的结果数据。我在 3 种模式下运行了基准测试:
相同的单个字符串
此模式适用于由 StringSource
类作为常量提供的相同的单个字符串。摊牌是:
Ops / s │ Algorithm ──────────┼────────────────────────────── 6 535 947 │ Voo1 ──────────┼────────────────────────────── 5 350 454 │ RatchetFreak2EdStaub1GreyCat1 5 249 343 │ EdStaub1 5 002 501 │ EdStaub1GreyCat1 4 859 086 │ ArrayOfCharFromStringCharAt 4 295 532 │ RatchetFreak1 4 045 307 │ ArrayOfCharFromArrayOfChar 2 790 178 │ RatchetFreak2EdStaub1GreyCat2 2 583 311 │ RatchetFreak2 1 274 859 │ StringBuilderChar 1 138 174 │ StringBuilderCodePoint 994 727 │ ArrayOfByteUTF8String 918 611 │ ArrayOfByteUTF8Const 756 086 │ MatcherReplace 598 945 │ StringReplaceAll 460 045 │ ArrayOfByteWindows1251
以图表形式:
(来源:greycat.ru)
多个字符串,100% 的字符串包含控制字符
源字符串提供程序使用 (0..127) 字符集预先生成大量随机字符串 - 因此几乎所有字符串都包含至少一个控制字符。算法以循环方式从这个预先生成的数组接收字符串。
Ops / s │ Algorithm ──────────┼────────────────────────────── 2 123 142 │ Voo1 ──────────┼────────────────────────────── 1 782 214 │ EdStaub1 1 776 199 │ EdStaub1GreyCat1 1 694 628 │ ArrayOfCharFromStringCharAt 1 481 481 │ ArrayOfCharFromArrayOfChar 1 460 067 │ RatchetFreak2EdStaub1GreyCat1 1 438 435 │ RatchetFreak2EdStaub1GreyCat2 1 366 494 │ RatchetFreak2 1 349 710 │ RatchetFreak1 893 176 │ ArrayOfByteUTF8String 817 127 │ ArrayOfByteUTF8Const 778 089 │ StringBuilderChar 734 754 │ StringBuilderCodePoint 377 829 │ ArrayOfByteWindows1251 224 140 │ MatcherReplace 211 104 │ StringReplaceAll
以图表形式表示:
(来源:greycat.ru)
多个字符串,1% 的字符串包含控制字符
与之前相同,但只有 1% 的字符串是使用控制字符生成的 - 其他 99% 是使用 [32..127] 字符集生成的,因此它们不能包含控制字符全部。这种合成负载最接近我所在的该算法的实际应用。
Ops / s │ Algorithm ──────────┼────────────────────────────── 3 711 952 │ Voo1 ──────────┼────────────────────────────── 2 851 440 │ EdStaub1GreyCat1 2 455 796 │ EdStaub1 2 426 007 │ ArrayOfCharFromStringCharAt 2 347 969 │ RatchetFreak2EdStaub1GreyCat2 2 242 152 │ RatchetFreak1 2 171 553 │ ArrayOfCharFromArrayOfChar 1 922 707 │ RatchetFreak2EdStaub1GreyCat1 1 857 010 │ RatchetFreak2 1 023 751 │ ArrayOfByteUTF8String 939 055 │ StringBuilderChar 907 194 │ ArrayOfByteUTF8Const 841 963 │ StringBuilderCodePoint 606 465 │ MatcherReplace 501 555 │ StringReplaceAll 381 185 │ ArrayOfByteWindows1251
以图表形式表示:
(来源:greycat.ru)
我很难决定谁提供了最佳答案,但考虑到现实世界的应用程序最佳解决方案是由 Ed Staub 给出/启发的,我想标记他的答案是公平的。感谢所有参与本次活动的人,你们的意见非常有帮助且非常宝贵。请随意在您的机器上运行测试套件并提出更好的解决方案(有效的 JNI 解决方案,有人吗?)。
参考
- GitHub 存储库 以及基准测试套件
What is the fastest way to strip all non-printable characters from a String
in Java?
So far I've tried and measured on 138-byte, 131-character String:
- String's
replaceAll()
- slowest method- 517009 results / sec
- Precompile a Pattern, then use Matcher's
replaceAll()
- 637836 results / sec
- Use StringBuffer, get codepoints using
codepointAt()
one-by-one and append to StringBuffer- 711946 results / sec
- Use StringBuffer, get chars using
charAt()
one-by-one and append to StringBuffer- 1052964 results / sec
- Preallocate a
char[]
buffer, get chars usingcharAt()
one-by-one and fill this buffer, then convert back to String- 2022653 results / sec
- Preallocate 2
char[]
buffers - old and new, get all chars for existing String at once usinggetChars()
, iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version- 2502502 results / sec
- Same stuff with 2 buffers - only using
byte[]
,getBytes()
and specifying encoding as "utf-8"- 857485 results / sec
- Same stuff with 2
byte[]
buffers, but specifying encoding as a constantCharset.forName("utf-8")
- 791076 results / sec
- Same stuff with 2
byte[]
buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do)- 370164 results / sec
My best try was the following:
char[] oldChars = new char[s.length()];
s.getChars(0, s.length(), oldChars, 0);
char[] newChars = new char[s.length()];
int newLen = 0;
for (int j = 0; j < s.length(); j++) {
char ch = oldChars[j];
if (ch >= ' ') {
newChars[newLen] = ch;
newLen++;
}
}
s = new String(newChars, 0, newLen);
Any thoughts on how to make it even faster?
Bonus points for answering a very strange question: why using "utf-8" charset name directly yields better performance than using pre-allocated static const Charset.forName("utf-8")
?
Update
- Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!
- Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.
Update 2
I've tried my best to collected all the proposed solutions and its cross-mutations and published it as a small benchmarking framework at github. Currently it sports 17 algorithms. One of them is "special" - Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks thus achieving stellar speeds, but it messes up JVM strings' state, thus it's benchmarked separately.
You're welcome to check it out and run it to determine results on your box. Here's a summary of results I've got on mine. It's specs:
- Debian sid
- Linux 2.6.39-2-amd64 (x86_64)
- Java installed from a package
sun-java6-jdk-6.24-1
, JVM identifies itself as- Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
- Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
Different algorithms show ultimately different results given a different set of input data. I've ran a benchmark in 3 modes:
Same single string
This mode works on a same single string provided by StringSource
class as a constant. The showdown is:
Ops / s │ Algorithm ──────────┼────────────────────────────── 6 535 947 │ Voo1 ──────────┼────────────────────────────── 5 350 454 │ RatchetFreak2EdStaub1GreyCat1 5 249 343 │ EdStaub1 5 002 501 │ EdStaub1GreyCat1 4 859 086 │ ArrayOfCharFromStringCharAt 4 295 532 │ RatchetFreak1 4 045 307 │ ArrayOfCharFromArrayOfChar 2 790 178 │ RatchetFreak2EdStaub1GreyCat2 2 583 311 │ RatchetFreak2 1 274 859 │ StringBuilderChar 1 138 174 │ StringBuilderCodePoint 994 727 │ ArrayOfByteUTF8String 918 611 │ ArrayOfByteUTF8Const 756 086 │ MatcherReplace 598 945 │ StringReplaceAll 460 045 │ ArrayOfByteWindows1251
In charted form:
(source: greycat.ru)
Multiple strings, 100% of strings contain control characters
Source string provider pre-generated lots of random strings using (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.
Ops / s │ Algorithm ──────────┼────────────────────────────── 2 123 142 │ Voo1 ──────────┼────────────────────────────── 1 782 214 │ EdStaub1 1 776 199 │ EdStaub1GreyCat1 1 694 628 │ ArrayOfCharFromStringCharAt 1 481 481 │ ArrayOfCharFromArrayOfChar 1 460 067 │ RatchetFreak2EdStaub1GreyCat1 1 438 435 │ RatchetFreak2EdStaub1GreyCat2 1 366 494 │ RatchetFreak2 1 349 710 │ RatchetFreak1 893 176 │ ArrayOfByteUTF8String 817 127 │ ArrayOfByteUTF8Const 778 089 │ StringBuilderChar 734 754 │ StringBuilderCodePoint 377 829 │ ArrayOfByteWindows1251 224 140 │ MatcherReplace 211 104 │ StringReplaceAll
In charted form:
(source: greycat.ru)
Multiple strings, 1% of strings contain control characters
Same as previous, but only 1% of strings was generated with control characters - other 99% was generated in using [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes the closest to real world application of this algorithm at my place.
Ops / s │ Algorithm ──────────┼────────────────────────────── 3 711 952 │ Voo1 ──────────┼────────────────────────────── 2 851 440 │ EdStaub1GreyCat1 2 455 796 │ EdStaub1 2 426 007 │ ArrayOfCharFromStringCharAt 2 347 969 │ RatchetFreak2EdStaub1GreyCat2 2 242 152 │ RatchetFreak1 2 171 553 │ ArrayOfCharFromArrayOfChar 1 922 707 │ RatchetFreak2EdStaub1GreyCat1 1 857 010 │ RatchetFreak2 1 023 751 │ ArrayOfByteUTF8String 939 055 │ StringBuilderChar 907 194 │ ArrayOfByteUTF8Const 841 963 │ StringBuilderCodePoint 606 465 │ MatcherReplace 501 555 │ StringReplaceAll 381 185 │ ArrayOfByteWindows1251
In charted form:
(source: greycat.ru)
It's very hard for me to decide on who provided the best answer, but given the real-world application best solution was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks for all who took part in this, your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).
References
- GitHub repository with a benchmarking suite
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
使用 1 个字符数组可能会工作得更好一些
,并且我避免了重复调用 s.length();
另一个可能有效的微优化是
using 1 char array could work a bit better
and I avoided repeated calls to
s.length();
another micro-optimization that might work is
如果将此方法嵌入到不跨线程共享的类中是合理的,那么您可以重用缓冲区:
等等...
这是一个巨大的胜利 - 20% 左右,据我了解当前的最佳情况。
如果要在可能很大的字符串上使用它并且需要考虑内存“泄漏”,则可以使用弱引用。
If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:
etc...
This is a big win - 20% or so, as I understand the current best case.
If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
嗯,根据我的测量,我已经击败了当前最好的方法(使用预分配数组的怪物解决方案)大约 30%。如何?通过出卖我的灵魂。
我确信到目前为止关注过讨论的每个人都知道这几乎违反了任何基本的编程原则,但是哦,好吧。无论如何,只有当字符串的使用字符数组不在其他字符串之间共享时,以下内容才有效 - 如果确实如此,那么任何必须调试它的人都将有权决定杀死你(无需调用 substring() 并在文字字符串上使用它这应该可以工作,因为我不明白为什么 JVM 会保留从外部源读取的唯一字符串)。不过,不要忘记确保基准测试代码不会这样做——这是极有可能的,并且显然会对反射解决方案有所帮助。
无论如何,我们开始吧:
对于我的测试字符串,它得到
3477148.18ops/s
与旧版本的2616120.89ops/s
。我很确定解决这个问题的唯一方法可能是用 C 语言编写它(但可能不是),或者到目前为止还没有人考虑过的一些完全不同的方法。虽然我绝对不确定不同平台上的计时是否稳定 - 至少在我的机器(Java7、Win7 x64)上产生可靠的结果。Well I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measures. How? By selling my soul.
As I'm sure everyone that has followed the discussion so far knows this violates pretty much any basic programming principle, but oh well. Anyways the following only works if the used character array of the string isn't shared between other strings - if it does whoever has to debug this will have every right deciding to kill you (without calls to substring() and using this on literal strings this should work as I don't see why the JVM would intern unique strings read from an outside source). Though don't forget to make sure the benchmark code doesn't do it - that's extremely likely and would help the reflection solution obviously.
Anyways here we go:
For my teststring that gets
3477148.18ops/s
vs.2616120.89ops/s
for the old variant. I'm quite sure the only way to beat that could be to write it in C (probably not though) or some completely different approach nobody has thought about so far. Though I'm absolutely not sure if the timing is stable across different platforms - produces reliable results on my box (Java7, Win7 x64) at least.您可以根据处理器的数量将任务拆分为多个并行子任务。
You could split the task into a several parallel subtasks, depending of processor's quantity.
我很自由,为不同的算法写了一个小基准。它并不完美,但我在随机字符串上运行给定算法 10000 次(默认情况下大约有 32/200% 不可打印),最少运行 1000 次。这应该处理诸如 GC、初始化等问题 - 没有太多的开销以至于任何算法都不应该在没有太多阻碍的情况下至少运行一次。
没有特别详细的记录,但是哦,很好。 我们开始 - 我包括了棘轮怪胎的算法和基本版本。目前,我随机初始化一个 200 个字符长的字符串,其中字符在 [0, 200) 范围内均匀分布。
I was so free and wrote a small benchmark for different algorithms. It's not perfect, but I take the minimum of 1000 runs of a given algorithm 10000 times over a random string (with about 32/200% non printables by default). That should take care of stuff like GC, initialization and so on - there's not so much overhead that any algorithm shouldn't have at least one run without much hindrance.
Not especially well documented, but oh well. Here we go - I included both of ratchet freak's algorithms and the basic version. At the moment I randomly initialize a 200 chars long string with uniformly distributed chars in the range [0, 200).
IANA 低级 java 性能迷,但是您尝试过 展开主循环吗?看起来它可以允许某些 CPU 并行执行检查。
此外,这有一些有趣的优化想法。
IANA low-level java performance junkie, but have you tried unrolling your main loop? It appears that it could allow some CPU's to perform checks in parallel.
Also, this has some fun ideas for optimizations.
它可以走得更快。更快*。如何?通过利用
System.arraycopy
这是本机
方法。回顾一下:如果“干净”,则返回相同的
字符串
。避免在每次迭代时分配新的
char[]
使用
System.arraycopy
将元素x
位置移回该类不是线程安全的,但我想如果有人想在单独的线程上处理大量字符串,那么他们可以在
ThreadLocal
琐事中提供 4-8 个StripAlgorithm
实现实例
我用作参考
RatchetFreak2EdStaub1GreyCat2
解决方案。我很惊讶这在我的机器上表现不佳。。然后我错误地认为“救助”机制不起作用,最后我把它移走了。它的性能飙升。然后我想“等一下”,然后我意识到这种情况总是有效,只是最后会更好。我不知道为什么。<前><代码> ...
6.RatchetFreak2EdStaub1GreyCat早期保释 3508771.93 3.54x +3.9%
...
2.RatchetFreak2EdStaub1GreyCatLateBail 6060606.06 6.12x +13.9%
该测试并非 100% 准确。起初我是一个利己主义者,我把测试放在了一系列算法上。它在第一次运行时出现了一些糟糕的结果,然后我在最后移动了它(让其他人为我预热 JVM:)),然后它首先出现。
结果
哦,当然还有结果。 Windows 7,jdk1.8.0_111 在相对较旧的计算机上,因此预计在较新的硬件和/或操作系统上会出现不同的结果。
* Reflection -Voo 的回答
我在
更快
语句上加了一个星号。我认为在这种情况下,没有什么比反思更快的了。它改变字符串的内部状态并避免新的字符串分配。我不认为有人能打败它。我尝试取消注释并运行 Voo 的算法,但出现
offset
字段不退出的错误。 IntelliJ 抱怨它也无法解析count
。另外(如果我没有记错的话)安全管理器可能会切断对私有字段的反射访问,因此该解决方案将不起作用。这就是为什么这个算法没有出现在我的测试运行中。否则我很想看看自己,尽管我相信非反射解决方案不可能更快。It can go even faster. Much faster*. How? By leveraging
System.arraycopy
which isnative
method. So to recap:Return the same
String
if it's "clean".Avoid allocating a new
char[]
on every iterationUse
System.arraycopy
for moving the elementsx
positions backThis class is not thread safe but I guess that if one wants to handle a gazillion of strings on separate threads then they can afford 4-8 instances of the
StripAlgorithm
implementation inside aThreadLocal<>
Trivia
I used as reference the
RatchetFreak2EdStaub1GreyCat2
solution. I was surprised that this wasn't performing any good on my machine. Then I wrongfully thought that the "bailout" mechanism didn't work and I moved it at the end. It skyrocketed performance. Then I though "wait a minute" and I realized that the condition works always it's just better at the end. I don't know why.The test is not 100% accurate. At first I was an egoist and I've put my test second on the array of algorithms. It had some lousy results on the first run and then I moved it at the end (let the others warm up the JVM for me :) ) and then it came first.
Results
Oh and of course the results. Windows 7, jdk1.8.0_111 on a relatively old machine, so expect different results on newer hardware and or OS.
* Reflection -Voo's answer
I've put an asterisk on the
Much faster
statement. I don't think that anything can go faster than reflection in that case. It mutates the String's internal state and avoids new String allocations. I don't think one can beat that.I tried to uncomment and run Voo's algorithm and I got an error that
offset
field doesn't exit. IntelliJ complains that it can't resolvecount
either. Also (if I'm not mistaken) the security manager might cut reflection access to private fields and thus this solution won't work. That's why this algorithm doesn't appear in my test run. Otherwise I was curious to see myself although I believe that a non reflective solution can't be faster.如果您的意思是
String#getBytes("utf-8")
等:这不应该更快 - 除了一些更好的缓存 - 因为Charset.forName("utf-8")如果未缓存字符集,则在内部使用
。一件事可能是您使用不同的字符集(或者您的某些代码可能是透明的),但
StringCoding
中缓存的字符集不会改变。If you mean
String#getBytes("utf-8")
etc.: This shouldn't be faster - except for some better caching - sinceCharset.forName("utf-8")
is used internally, if the charset is not cached.One thing might be that you're using different charsets (or maybe some of your code does transparently) but the charset cached in
StringCoding
doesn't change.