从 Java 字符串中删除所有不可打印字符的最快方法

发布于 2024-12-01 03:13:18 字数 5906 浏览 2 评论 0原文

在 Java 中从 String 中删除所有不可打印字符的最快方法是什么?

到目前为止,我已经尝试并测量了 138 字节、131 个字符的字符串:

  • String 的 replaceAll() - 最慢​​方法
    • 517009 个结果/秒
  • 预编译一个 Pattern,然后使用 Matcher 的 replaceAll()
    • 637836 个结果/秒
  • 使用 StringBuffer,使用 codepointAt() 逐一获取代码点并附加到 StringBuffer
    • 711946 个结果/秒
  • 使用 StringBuffer,使用 charAt() 逐一获取字符并附加到 StringBuffer
    • 1052964 个结果/秒
  • 预分配一个 char[] 缓冲区,使用 charAt() 逐一获取字符并填充此缓冲区,然后转换回 String
    • 2022653 个结果/秒
  • 预分配 2 个 char[] 缓冲区 - 新旧缓冲区,使用 getChars() 一次获取现有字符串的所有字符,迭代旧缓冲区一对一填充新缓冲区,然后将新缓冲区转换为字符串 - 我自己最快的版本
    • 2502502 个结果/秒
  • 具有 2 个缓冲区的相同内容 - 仅使用 byte[]getBytes() 并将编码指定为“utf-8”
    • 857485 个结果/秒
  • 具有 2 个 byte[] 缓冲区的相同内容,但将编码指定为常量 Charset.forName("utf-8")
    • 791076 个结果/秒
  • 与 2 个 byte[] 缓冲区相同的内容,但将编码指定为 1 字节本地编码(几乎不是明智之举)
    • 370164 个结果/秒

我最好的尝试如下:

    char[] oldChars = new char[s.length()];
    s.getChars(0, s.length(), oldChars, 0);
    char[] newChars = new char[s.length()];
    int newLen = 0;
    for (int j = 0; j < s.length(); j++) {
        char ch = oldChars[j];
        if (ch >= ' ') {
            newChars[newLen] = ch;
            newLen++;
        }
    }
    s = new String(newChars, 0, newLen);

关于如何让它更快?

回答一个非常奇怪的问题的奖励积分:为什么直接使用“utf-8”字符集名称比使用预分配的静态 const Charset.forName("utf-8") 产生更好的性能?

来自棘轮怪人的更新

  • 建议产生了令人印象深刻的每秒 3105590 个结果的性能,提高了 +24%!
  • Ed Staub 的建议带来了另一项改进 - 每秒 3471017 个结果,比之前的最佳结果提高了 12%。

更新 2

我已尽力收集所有建议的解决方案及其交叉突变,并将其发布为 github 上的小型基准测试框架。目前它支持 17 种算法。其中之一是“特殊” - Voo1 算法(由 SO 用户 Voo 提供)采用复杂的反射技巧,从而实现了惊人的速度,但它很混乱JVM 字符串的状态,因此它是单独进行基准测试的。

欢迎您检查并运行它以确定您的盒子上的结果。这是我得到的结果的摘要。它的规格:

  • Debian sid
  • Linux 2.6.39-2-amd64 (x86_64)
  • Java 从包 sun-java6-jdk-6.24-1 安装,JVM 将自身标识为
    • Java(TM) SE 运行时环境(版本 1.6.0_24-b07)
    • Java HotSpot(TM) 64 位服务器虚拟机(版本 19.1-b02,混合模式)

给定不同的输入集,不同的算法最终会显示不同的结果数据。我在 3 种模式下运行了基准测试:

相同的单个字符串

此模式适用于由 StringSource 类作为常量提供的相同的单个字符串。摊牌是:

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
6 535 947 │ Voo1
──────────┼──────────────────────────────
5 350 454 │ RatchetFreak2EdStaub1GreyCat1
5 249 343 │ EdStaub1
5 002 501 │ EdStaub1GreyCat1
4 859 086 │ ArrayOfCharFromStringCharAt
4 295 532 │ RatchetFreak1
4 045 307 │ ArrayOfCharFromArrayOfChar
2 790 178 │ RatchetFreak2EdStaub1GreyCat2
2 583 311 │ RatchetFreak2
1 274 859 │ StringBuilderChar
1 138 174 │ StringBuilderCodePoint
  994 727 │ ArrayOfByteUTF8String
  918 611 │ ArrayOfByteUTF8Const
  756 086 │ MatcherReplace
  598 945 │ StringReplaceAll
  460 045 │ ArrayOfByteWindows1251

以图表形式: 相同的单个字符串图表
(来源:greycat.ru

多个字符串,100% 的字符串包含控制字符

源字符串提供程序使用 (0..127) 字符集预先生成大量随机字符串 - 因此几乎所有字符串都包含至少一个控制字符。算法以循环方式从这个预先生成的数组接收字符串。

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
2 123 142 │ Voo1
──────────┼──────────────────────────────
1 782 214 │ EdStaub1
1 776 199 │ EdStaub1GreyCat1
1 694 628 │ ArrayOfCharFromStringCharAt
1 481 481 │ ArrayOfCharFromArrayOfChar
1 460 067 │ RatchetFreak2EdStaub1GreyCat1
1 438 435 │ RatchetFreak2EdStaub1GreyCat2
1 366 494 │ RatchetFreak2
1 349 710 │ RatchetFreak1
  893 176 │ ArrayOfByteUTF8String
  817 127 │ ArrayOfByteUTF8Const
  778 089 │ StringBuilderChar
  734 754 │ StringBuilderCodePoint
  377 829 │ ArrayOfByteWindows1251
  224 140 │ MatcherReplace
  211 104 │ StringReplaceAll

以图表形式表示: 多个字符串, 100%浓度
(来源:greycat.ru

多个字符串,1% 的字符串包含控制字符

与之前相同,但只有 1% 的字符串是使用控制字符生成的 - 其他 99% 是使用 [32..127] 字符集生成的,因此它们不能包含控制字符全部。这种合成负载最接近我所在的该算法的实际应用。

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
3 711 952 │ Voo1
──────────┼──────────────────────────────
2 851 440 │ EdStaub1GreyCat1
2 455 796 │ EdStaub1
2 426 007 │ ArrayOfCharFromStringCharAt
2 347 969 │ RatchetFreak2EdStaub1GreyCat2
2 242 152 │ RatchetFreak1
2 171 553 │ ArrayOfCharFromArrayOfChar
1 922 707 │ RatchetFreak2EdStaub1GreyCat1
1 857 010 │ RatchetFreak2
1 023 751 │ ArrayOfByteUTF8String
  939 055 │ StringBuilderChar
  907 194 │ ArrayOfByteUTF8Const
  841 963 │ StringBuilderCodePoint
  606 465 │ MatcherReplace
  501 555 │ StringReplaceAll
  381 185 │ ArrayOfByteWindows1251

以图表形式表示: 多个字符串, 1%浓度
(来源:greycat.ru

我很难决定谁提供了最佳答案,但考虑到现实世界的应用程序最佳解决方案是由 Ed Staub 给出/启发的,我想标记他的答案是公平的。感谢所有参与本次活动的人,你们的意见非常有帮助且非常宝贵。请随意在您的机器上运行测试套件并提出更好的解决方案(有效的 JNI 解决方案,有人吗?)。

参考

What is the fastest way to strip all non-printable characters from a String in Java?

So far I've tried and measured on 138-byte, 131-character String:

  • String's replaceAll() - slowest method
    • 517009 results / sec
  • Precompile a Pattern, then use Matcher's replaceAll()
    • 637836 results / sec
  • Use StringBuffer, get codepoints using codepointAt() one-by-one and append to StringBuffer
    • 711946 results / sec
  • Use StringBuffer, get chars using charAt() one-by-one and append to StringBuffer
    • 1052964 results / sec
  • Preallocate a char[] buffer, get chars using charAt() one-by-one and fill this buffer, then convert back to String
    • 2022653 results / sec
  • Preallocate 2 char[] buffers - old and new, get all chars for existing String at once using getChars(), iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version
    • 2502502 results / sec
  • Same stuff with 2 buffers - only using byte[], getBytes() and specifying encoding as "utf-8"
    • 857485 results / sec
  • Same stuff with 2 byte[] buffers, but specifying encoding as a constant Charset.forName("utf-8")
    • 791076 results / sec
  • Same stuff with 2 byte[] buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do)
    • 370164 results / sec

My best try was the following:

    char[] oldChars = new char[s.length()];
    s.getChars(0, s.length(), oldChars, 0);
    char[] newChars = new char[s.length()];
    int newLen = 0;
    for (int j = 0; j < s.length(); j++) {
        char ch = oldChars[j];
        if (ch >= ' ') {
            newChars[newLen] = ch;
            newLen++;
        }
    }
    s = new String(newChars, 0, newLen);

Any thoughts on how to make it even faster?

Bonus points for answering a very strange question: why using "utf-8" charset name directly yields better performance than using pre-allocated static const Charset.forName("utf-8")?

Update

  • Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!
  • Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.

Update 2

I've tried my best to collected all the proposed solutions and its cross-mutations and published it as a small benchmarking framework at github. Currently it sports 17 algorithms. One of them is "special" - Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks thus achieving stellar speeds, but it messes up JVM strings' state, thus it's benchmarked separately.

You're welcome to check it out and run it to determine results on your box. Here's a summary of results I've got on mine. It's specs:

  • Debian sid
  • Linux 2.6.39-2-amd64 (x86_64)
  • Java installed from a package sun-java6-jdk-6.24-1, JVM identifies itself as
    • Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
    • Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Different algorithms show ultimately different results given a different set of input data. I've ran a benchmark in 3 modes:

Same single string

This mode works on a same single string provided by StringSource class as a constant. The showdown is:

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
6 535 947 │ Voo1
──────────┼──────────────────────────────
5 350 454 │ RatchetFreak2EdStaub1GreyCat1
5 249 343 │ EdStaub1
5 002 501 │ EdStaub1GreyCat1
4 859 086 │ ArrayOfCharFromStringCharAt
4 295 532 │ RatchetFreak1
4 045 307 │ ArrayOfCharFromArrayOfChar
2 790 178 │ RatchetFreak2EdStaub1GreyCat2
2 583 311 │ RatchetFreak2
1 274 859 │ StringBuilderChar
1 138 174 │ StringBuilderCodePoint
  994 727 │ ArrayOfByteUTF8String
  918 611 │ ArrayOfByteUTF8Const
  756 086 │ MatcherReplace
  598 945 │ StringReplaceAll
  460 045 │ ArrayOfByteWindows1251

In charted form:
Same single string chart
(source: greycat.ru)

Multiple strings, 100% of strings contain control characters

Source string provider pre-generated lots of random strings using (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
2 123 142 │ Voo1
──────────┼──────────────────────────────
1 782 214 │ EdStaub1
1 776 199 │ EdStaub1GreyCat1
1 694 628 │ ArrayOfCharFromStringCharAt
1 481 481 │ ArrayOfCharFromArrayOfChar
1 460 067 │ RatchetFreak2EdStaub1GreyCat1
1 438 435 │ RatchetFreak2EdStaub1GreyCat2
1 366 494 │ RatchetFreak2
1 349 710 │ RatchetFreak1
  893 176 │ ArrayOfByteUTF8String
  817 127 │ ArrayOfByteUTF8Const
  778 089 │ StringBuilderChar
  734 754 │ StringBuilderCodePoint
  377 829 │ ArrayOfByteWindows1251
  224 140 │ MatcherReplace
  211 104 │ StringReplaceAll

In charted form:
Multiple strings, 100% concentration
(source: greycat.ru)

Multiple strings, 1% of strings contain control characters

Same as previous, but only 1% of strings was generated with control characters - other 99% was generated in using [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes the closest to real world application of this algorithm at my place.

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
3 711 952 │ Voo1
──────────┼──────────────────────────────
2 851 440 │ EdStaub1GreyCat1
2 455 796 │ EdStaub1
2 426 007 │ ArrayOfCharFromStringCharAt
2 347 969 │ RatchetFreak2EdStaub1GreyCat2
2 242 152 │ RatchetFreak1
2 171 553 │ ArrayOfCharFromArrayOfChar
1 922 707 │ RatchetFreak2EdStaub1GreyCat1
1 857 010 │ RatchetFreak2
1 023 751 │ ArrayOfByteUTF8String
  939 055 │ StringBuilderChar
  907 194 │ ArrayOfByteUTF8Const
  841 963 │ StringBuilderCodePoint
  606 465 │ MatcherReplace
  501 555 │ StringReplaceAll
  381 185 │ ArrayOfByteWindows1251

In charted form:
Multiple strings, 1% concentration
(source: greycat.ru)

It's very hard for me to decide on who provided the best answer, but given the real-world application best solution was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks for all who took part in this, your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).

References

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

溺深海 2024-12-08 03:13:18

使用 1 个字符数组可能会工作得更好一些

int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);

,并且我避免了重复调用 s.length();

另一个可能有效的微优化是

int length = s.length();
char[] oldChars = new char[length+1];
s.getChars(0, length, oldChars, 0);
oldChars[length]='\0';//avoiding explicit bound check in while
int newLen=-1;
while(oldChars[++newLen]>=' ');//find first non-printable,
                       // if there are none it ends on the null char I appended
for (int  j = newLen; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);

using 1 char array could work a bit better

int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);

and I avoided repeated calls to s.length();

another micro-optimization that might work is

int length = s.length();
char[] oldChars = new char[length+1];
s.getChars(0, length, oldChars, 0);
oldChars[length]='\0';//avoiding explicit bound check in while
int newLen=-1;
while(oldChars[++newLen]>=' ');//find first non-printable,
                       // if there are none it ends on the null char I appended
for (int  j = newLen; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);
清风不识月 2024-12-08 03:13:18

如果将此方法嵌入到不跨线程共享的类中是合理的,那么您可以重用缓冲区:

char [] oldChars = new char[5];

String stripControlChars(String s)
{
    final int inputLen = s.length();
    if ( oldChars.length < inputLen )
    {
        oldChars = new char[inputLen];
    }
    s.getChars(0, inputLen, oldChars, 0);

等等...

这是一个巨大的胜利 - 20% 左右,据我了解当前的最佳情况。

如果要在可能很大的字符串上使用它并且需要考虑内存“泄漏”,则可以使用弱引用。

If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:

char [] oldChars = new char[5];

String stripControlChars(String s)
{
    final int inputLen = s.length();
    if ( oldChars.length < inputLen )
    {
        oldChars = new char[inputLen];
    }
    s.getChars(0, inputLen, oldChars, 0);

etc...

This is a big win - 20% or so, as I understand the current best case.

If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.

守护在此方 2024-12-08 03:13:18

嗯,根据我的测量,我已经击败了当前最好的方法(使用预分配数组的怪物解决方案)大约 30%。如何?通过出卖我的灵魂。

我确信到目前为止关注过讨论的每个人都知道这几乎违反了任何基本的编程原则,但是哦,好吧。无论如何,只有当字符串的使用字符数组不在其他字符串之间共享时,以下内容才有效 - 如果确实如此,那么任何必须调试它的人都将有权决定杀死你(无需调用 substring() 并在文字字符串上使用它这应该可以工作,因为我不明白为什么 JVM 会保留从外部源读取的唯一字符串)。不过,不要忘记确保基准测试代码不会这样做——这是极有可能的,并且显然会对反射解决方案有所帮助。

无论如何,我们开始吧:

    // Has to be done only once - so cache those! Prohibitively expensive otherwise
    private Field value;
    private Field offset;
    private Field count;
    private Field hash;
    {
        try {
            value = String.class.getDeclaredField("value");
            value.setAccessible(true);
            offset = String.class.getDeclaredField("offset");
            offset.setAccessible(true);
            count = String.class.getDeclaredField("count");
            count.setAccessible(true);
            hash = String.class.getDeclaredField("hash");
            hash.setAccessible(true);               
        }
        catch (NoSuchFieldException e) {
            throw new RuntimeException();
        }

    }

    @Override
    public String strip(final String old) {
        final int length = old.length();
        char[] chars = null;
        int off = 0;
        try {
            chars = (char[]) value.get(old);
            off = offset.getInt(old);
        }
        catch(IllegalArgumentException e) {
            throw new RuntimeException(e);
        }
        catch(IllegalAccessException e) {
            throw new RuntimeException(e);
        }
        int newLen = off;
        for(int j = off; j < off + length; j++) {
            final char ch = chars[j];
            if (ch >= ' ') {
                chars[newLen] = ch;
                newLen++;
            }
        }
        if (newLen - off != length) {
            // We changed the internal state of the string, so at least
            // be friendly enough to correct it.
            try {
                count.setInt(old, newLen - off);
                // Have to recompute hash later on
                hash.setInt(old, 0);
            }
            catch(IllegalArgumentException e) {
                e.printStackTrace();
            }
            catch(IllegalAccessException e) {
                e.printStackTrace();
            }
        }
        // Well we have to return something
        return old;
    }

对于我的测试字符串,它得到 3477148.18ops/s 与旧版本的 2616120.89ops/s 。我很确定解决这个问题的唯一方法可能是用 C 语言编写它(但可能不是),或者到目前为止还没有人考虑过的一些完全不同的方法。虽然我绝对不确定不同平台上的计时是否稳定 - 至少在我的机器(Java7、Win7 x64)上产生可靠的结果。

Well I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measures. How? By selling my soul.

As I'm sure everyone that has followed the discussion so far knows this violates pretty much any basic programming principle, but oh well. Anyways the following only works if the used character array of the string isn't shared between other strings - if it does whoever has to debug this will have every right deciding to kill you (without calls to substring() and using this on literal strings this should work as I don't see why the JVM would intern unique strings read from an outside source). Though don't forget to make sure the benchmark code doesn't do it - that's extremely likely and would help the reflection solution obviously.

Anyways here we go:

    // Has to be done only once - so cache those! Prohibitively expensive otherwise
    private Field value;
    private Field offset;
    private Field count;
    private Field hash;
    {
        try {
            value = String.class.getDeclaredField("value");
            value.setAccessible(true);
            offset = String.class.getDeclaredField("offset");
            offset.setAccessible(true);
            count = String.class.getDeclaredField("count");
            count.setAccessible(true);
            hash = String.class.getDeclaredField("hash");
            hash.setAccessible(true);               
        }
        catch (NoSuchFieldException e) {
            throw new RuntimeException();
        }

    }

    @Override
    public String strip(final String old) {
        final int length = old.length();
        char[] chars = null;
        int off = 0;
        try {
            chars = (char[]) value.get(old);
            off = offset.getInt(old);
        }
        catch(IllegalArgumentException e) {
            throw new RuntimeException(e);
        }
        catch(IllegalAccessException e) {
            throw new RuntimeException(e);
        }
        int newLen = off;
        for(int j = off; j < off + length; j++) {
            final char ch = chars[j];
            if (ch >= ' ') {
                chars[newLen] = ch;
                newLen++;
            }
        }
        if (newLen - off != length) {
            // We changed the internal state of the string, so at least
            // be friendly enough to correct it.
            try {
                count.setInt(old, newLen - off);
                // Have to recompute hash later on
                hash.setInt(old, 0);
            }
            catch(IllegalArgumentException e) {
                e.printStackTrace();
            }
            catch(IllegalAccessException e) {
                e.printStackTrace();
            }
        }
        // Well we have to return something
        return old;
    }

For my teststring that gets 3477148.18ops/s vs. 2616120.89ops/s for the old variant. I'm quite sure the only way to beat that could be to write it in C (probably not though) or some completely different approach nobody has thought about so far. Though I'm absolutely not sure if the timing is stable across different platforms - produces reliable results on my box (Java7, Win7 x64) at least.

怎言笑 2024-12-08 03:13:18

您可以根据处理器的数量将任务拆分为多个并行子任务。

You could split the task into a several parallel subtasks, depending of processor's quantity.

蝶…霜飞 2024-12-08 03:13:18

我很自由,为不同的算法写了一个小基准。它并不完美,但我在随机字符串上运行给定算法 10000 次(默认情况下大约有 32/200% 不可打印),最少运行 1000 次。这应该处理诸如 GC、初始化等问题 - 没有太多的开销以至于任何算法都不应该在没有太多阻碍的情况下至少运行一次。

没有特别详细的记录,但是哦,很好。 我们开始 - 我包括了棘轮怪胎的算法和基本版本。目前,我随机初始化一个 200 个字符长的字符串,其中字符在 [0, 200) 范围内均匀分布。

I was so free and wrote a small benchmark for different algorithms. It's not perfect, but I take the minimum of 1000 runs of a given algorithm 10000 times over a random string (with about 32/200% non printables by default). That should take care of stuff like GC, initialization and so on - there's not so much overhead that any algorithm shouldn't have at least one run without much hindrance.

Not especially well documented, but oh well. Here we go - I included both of ratchet freak's algorithms and the basic version. At the moment I randomly initialize a 200 chars long string with uniformly distributed chars in the range [0, 200).

も星光 2024-12-08 03:13:18

IANA 低级 java 性能迷,但是您尝试过 展开主循环吗?看起来它可以允许某些 CPU 并行执行检查。

此外,有一些有趣的优化想法。

IANA low-level java performance junkie, but have you tried unrolling your main loop? It appears that it could allow some CPU's to perform checks in parallel.

Also, this has some fun ideas for optimizations.

何时共饮酒 2024-12-08 03:13:18

它可以走得更快。更快*。如何?通过利用 System.arraycopy 这是本机 方法。回顾一下:

  • 如果“干净”,则返回相同的字符串

  • 避免在每次迭代时分配新的 char[]

  • 使用 System.arraycopy 将元素 x 位置移回

     公共类 SteliosAdamantidis 实现 StripAlgorithm {
    
          私有 char[] 复制 = 新 char[128];
    
          @覆盖
          public String strip(String s) 抛出异常 {
              int 长度 = s.length();
              if (长度>copy.length) {
                  int newLength = copy.length * 2;
                  while (长度> newLength) newLength *= 2;
                  复制=新字符[新长度];
              }
    
              s.getChars(0, 长度, 副本, 0);
    
              int 开始 = 0; //从哪里开始复制
              整数偏移量=0; //不可打印字符的数量或距离
                              //后面的字符应该被复制到
    
              整数索引 = 0;
              //快进到第一个不可打印字符
              for (; 索引 < 长度; ++index) {
                  if (复制[索引] < ' ') {
                      开始=索引;
                      休息;
                  }
              }
    
              //字符串已经干净了
              if (索引==长度) 返回 s;
    
              for (; 索引 < 长度; ++index) {
                  if (复制[索引] < ' ') {
                      if (开始!=索引) {
                          System.arraycopy(复制, 开始, 复制, 开始 - 偏移量, 索引 - 开始);
                      }
                      ++偏移量;
                      开始=索引+1; //处理后续的不可打印字符
                  }
              }
    
              如果(长度!=开始){
                  //复制剩余部分 - 如果有的话
                  System.arraycopy(复制,开始,复制,开始-偏移量,长度-开始);
              }
              返回新字符串(副本,0,长度-偏移量);
          }
      }
    

该类不是线程安全的,但我想如果有人想在单独的线程上处理大量字符串,那么他们可以在 ThreadLocal 琐事中提供 4-8 个 StripAlgorithm 实现

实例

  1. 我用作参考RatchetFreak2EdStaub1GreyCat2 解决方案。我很惊讶这在我的机器上表现不佳。。然后我错误地认为“救助”机制不起作用,最后我把它移走了。它的性能飙升。然后我想“等一下”,然后我意识到这种情况总是有效,只是最后会更好。我不知道为什么。

    <前><代码> ...
    6.RatchetFreak2EdStaub1GreyCat早期保释 3508771.93 3.54x +3.9%
    ...
    2.RatchetFreak2EdStaub1GreyCatLateBail 6060606.06 6.12x +13.9%

  2. 该测试并非 100% 准确。起初我是一个利己主义者,我把测试放在了一系列算法上。它在第一次运行时出现了一些糟糕的结果,然后我在最后移动了它(让其他人为我预热 JVM:)),然后它首先出现。

结果

哦,当然还有结果。 Windows 7,jdk1.8.0_111 在相对较旧的计算机上,因此预计在较新的硬件和/或操作系统上会出现不同的结果。

    Rankings: (1.000.000 strings)
    17. StringReplaceAll                        990099.01   1.00x   +0.0%
    16. ArrayOfByteWindows1251                  1642036.12  1.66x   +65.8%
    15. StringBuilderCodePoint                  1724137.93  1.74x   +5.0%
    14. ArrayOfByteUTF8Const                    2487562.19  2.51x   +44.3%
    13. StringBuilderChar                       2531645.57  2.56x   +1.8%
    12. ArrayOfByteUTF8String                   2551020.41  2.58x   +0.8%
    11. ArrayOfCharFromArrayOfChar              2824858.76  2.85x   +10.7%
    10. RatchetFreak2                           2923976.61  2.95x   +3.5%
     9. RatchetFreak1                           3076923.08  3.11x   +5.2%
     8. ArrayOfCharFromStringCharAt             3322259.14  3.36x   +8.0%
     7. EdStaub1                                3378378.38  3.41x   +1.7%
     6. RatchetFreak2EdStaub1GreyCatEarlyBail   3508771.93  3.54x   +3.9%
     5. EdStaub1GreyCat1                        3787878.79  3.83x   +8.0%
     4. MatcherReplace                          4716981.13  4.76x   +24.5%
     3. RatchetFreak2EdStaub1GreyCat1           5319148.94  5.37x   +12.8%
     2. RatchetFreak2EdStaub1GreyCatLateBail    6060606.06  6.12x   +13.9%
     1. SteliosAdamantidis                      9615384.62  9.71x   +58.7%

    Rankings: (10.000.000 strings)
    17. ArrayOfByteWindows1251                  1647175.09  1.00x   +0.0%
    16. StringBuilderCodePoint                  1728907.33  1.05x   +5.0%
    15. StringBuilderChar                       2480158.73  1.51x   +43.5%
    14. ArrayOfByteUTF8Const                    2498126.41  1.52x   +0.7%
    13. ArrayOfByteUTF8String                   2591344.91  1.57x   +3.7%
    12. StringReplaceAll                        2626740.22  1.59x   +1.4%
    11. ArrayOfCharFromArrayOfChar              2810567.73  1.71x   +7.0%
    10. RatchetFreak2                           2948113.21  1.79x   +4.9%
     9. RatchetFreak1                           3120124.80  1.89x   +5.8%
     8. ArrayOfCharFromStringCharAt             3306878.31  2.01x   +6.0%
     7. EdStaub1                                3399048.27  2.06x   +2.8%
     6. RatchetFreak2EdStaub1GreyCatEarlyBail   3494060.10  2.12x   +2.8%
     5. EdStaub1GreyCat1                        3818251.24  2.32x   +9.3%
     4. MatcherReplace                          4899559.04  2.97x   +28.3%
     3. RatchetFreak2EdStaub1GreyCat1           5302226.94  3.22x   +8.2%
     2. RatchetFreak2EdStaub1GreyCatLateBail    5924170.62  3.60x   +11.7%
     1. SteliosAdamantidis                      9680542.11  5.88x   +63.4%

* Reflection -Voo 的回答

我在 更快 语句上加了一个星号。我认为在这种情况下,没有什么比反思更快的了。它改变字符串的内部状态并避免新的字符串分配。我不认为有人能打败它。

我尝试取消注释并运行 Voo 的算法,但出现 offset 字段不退出的错误。 IntelliJ 抱怨它也无法解析 count 。另外(如果我没有记错的话)安全管理器可能会切断对私有字段的反射访问,因此该解决方案将不起作用。这就是为什么这个算法没有出现在我的测试运行中。否则我很想看看自己,尽管我相信非反射解决方案不可能更快。

It can go even faster. Much faster*. How? By leveraging System.arraycopy which is native method. So to recap:

  • Return the same String if it's "clean".

  • Avoid allocating a new char[] on every iteration

  • Use System.arraycopy for moving the elements x positions back

      public class SteliosAdamantidis implements StripAlgorithm {
    
          private char[] copy = new char[128];
    
          @Override
          public String strip(String s) throws Exception {
              int length = s.length();
              if (length > copy.length) {
                  int newLength = copy.length * 2;
                  while (length > newLength) newLength *= 2;
                  copy = new char[newLength];
              }
    
              s.getChars(0, length, copy, 0);
    
              int start = 0;  //where to start copying from
              int offset = 0; //number of non printable characters or how far
                              //behind the characters should be copied to
    
              int index = 0;
              //fast forward to the first non printable character
              for (; index < length; ++index) {
                  if (copy[index] < ' ') {
                      start = index;
                      break;
                  }
              }
    
              //string is already clean
              if (index == length) return s;
    
              for (; index < length; ++index) {
                  if (copy[index] < ' ') {
                      if (start != index) {
                          System.arraycopy(copy, start, copy, start - offset, index - start);
                      }
                      ++offset;
                      start = index + 1; //handling subsequent non printable characters
                  }
              }
    
              if (length != start) {
                  //copy the residue -if any
                  System.arraycopy(copy, start, copy, start - offset, length - start);
              }
              return new String(copy, 0, length - offset);
          }
      }
    

This class is not thread safe but I guess that if one wants to handle a gazillion of strings on separate threads then they can afford 4-8 instances of the StripAlgorithm implementation inside a ThreadLocal<>

Trivia

  1. I used as reference the RatchetFreak2EdStaub1GreyCat2 solution. I was surprised that this wasn't performing any good on my machine. Then I wrongfully thought that the "bailout" mechanism didn't work and I moved it at the end. It skyrocketed performance. Then I though "wait a minute" and I realized that the condition works always it's just better at the end. I don't know why.

     ...
     6. RatchetFreak2EdStaub1GreyCatEarlyBail   3508771.93   3.54x   +3.9%
     ...
     2. RatchetFreak2EdStaub1GreyCatLateBail    6060606.06   6.12x   +13.9%
    
  2. The test is not 100% accurate. At first I was an egoist and I've put my test second on the array of algorithms. It had some lousy results on the first run and then I moved it at the end (let the others warm up the JVM for me :) ) and then it came first.

Results

Oh and of course the results. Windows 7, jdk1.8.0_111 on a relatively old machine, so expect different results on newer hardware and or OS.

    Rankings: (1.000.000 strings)
    17. StringReplaceAll                        990099.01   1.00x   +0.0%
    16. ArrayOfByteWindows1251                  1642036.12  1.66x   +65.8%
    15. StringBuilderCodePoint                  1724137.93  1.74x   +5.0%
    14. ArrayOfByteUTF8Const                    2487562.19  2.51x   +44.3%
    13. StringBuilderChar                       2531645.57  2.56x   +1.8%
    12. ArrayOfByteUTF8String                   2551020.41  2.58x   +0.8%
    11. ArrayOfCharFromArrayOfChar              2824858.76  2.85x   +10.7%
    10. RatchetFreak2                           2923976.61  2.95x   +3.5%
     9. RatchetFreak1                           3076923.08  3.11x   +5.2%
     8. ArrayOfCharFromStringCharAt             3322259.14  3.36x   +8.0%
     7. EdStaub1                                3378378.38  3.41x   +1.7%
     6. RatchetFreak2EdStaub1GreyCatEarlyBail   3508771.93  3.54x   +3.9%
     5. EdStaub1GreyCat1                        3787878.79  3.83x   +8.0%
     4. MatcherReplace                          4716981.13  4.76x   +24.5%
     3. RatchetFreak2EdStaub1GreyCat1           5319148.94  5.37x   +12.8%
     2. RatchetFreak2EdStaub1GreyCatLateBail    6060606.06  6.12x   +13.9%
     1. SteliosAdamantidis                      9615384.62  9.71x   +58.7%

    Rankings: (10.000.000 strings)
    17. ArrayOfByteWindows1251                  1647175.09  1.00x   +0.0%
    16. StringBuilderCodePoint                  1728907.33  1.05x   +5.0%
    15. StringBuilderChar                       2480158.73  1.51x   +43.5%
    14. ArrayOfByteUTF8Const                    2498126.41  1.52x   +0.7%
    13. ArrayOfByteUTF8String                   2591344.91  1.57x   +3.7%
    12. StringReplaceAll                        2626740.22  1.59x   +1.4%
    11. ArrayOfCharFromArrayOfChar              2810567.73  1.71x   +7.0%
    10. RatchetFreak2                           2948113.21  1.79x   +4.9%
     9. RatchetFreak1                           3120124.80  1.89x   +5.8%
     8. ArrayOfCharFromStringCharAt             3306878.31  2.01x   +6.0%
     7. EdStaub1                                3399048.27  2.06x   +2.8%
     6. RatchetFreak2EdStaub1GreyCatEarlyBail   3494060.10  2.12x   +2.8%
     5. EdStaub1GreyCat1                        3818251.24  2.32x   +9.3%
     4. MatcherReplace                          4899559.04  2.97x   +28.3%
     3. RatchetFreak2EdStaub1GreyCat1           5302226.94  3.22x   +8.2%
     2. RatchetFreak2EdStaub1GreyCatLateBail    5924170.62  3.60x   +11.7%
     1. SteliosAdamantidis                      9680542.11  5.88x   +63.4%

* Reflection -Voo's answer

I've put an asterisk on the Much faster statement. I don't think that anything can go faster than reflection in that case. It mutates the String's internal state and avoids new String allocations. I don't think one can beat that.

I tried to uncomment and run Voo's algorithm and I got an error that offset field doesn't exit. IntelliJ complains that it can't resolve count either. Also (if I'm not mistaken) the security manager might cut reflection access to private fields and thus this solution won't work. That's why this algorithm doesn't appear in my test run. Otherwise I was curious to see myself although I believe that a non reflective solution can't be faster.

骄兵必败 2024-12-08 03:13:18

为什么直接使用“utf-8”字符集名称比使用预分配的 static const Charset.forName(“utf-8”) 产生更好的性能?

如果您的意思是 String#getBytes("utf-8") 等:这不应该更快 - 除了一些更好的缓存 - 因为 Charset.forName("utf-8")如果未缓存字符集,则在内部使用

一件事可能是您使用不同的字符集(或者您的某些代码可能是透明的),但 StringCoding 中缓存的字符集不会改变。

why using "utf-8" charset name directly yields better performance than using pre-allocated static const Charset.forName("utf-8")?

If you mean String#getBytes("utf-8") etc.: This shouldn't be faster - except for some better caching - since Charset.forName("utf-8") is used internally, if the charset is not cached.

One thing might be that you're using different charsets (or maybe some of your code does transparently) but the charset cached in StringCoding doesn't change.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文