Replicating String.split with StringTokenizer

Encouraged by this, and the fact I have billions of strings to parse, I tried to modify my code to accept StringTokenizer instead of String[].

The only thing left between me and getting that delicious x2 performance boost is the fact that when you're doing

"dog,,cat".split(",")
//output: ["dog","","cat"]

StringTokenizer("dog,,cat")
// nextToken() = "dog"
// nextToken() = "cat"

How can I achieve similar results with the StringTokenizer? Are there faster ways to do this?
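
One workaround, sketched below as an untested idea rather than a definitive answer: construct the StringTokenizer with returnDelims set to true so that every comma comes back as its own token, then emit an empty string whenever two delimiters arrive back to back. This keeps leading empty tokens and drops trailing ones, matching String.split's default behaviour, though returning the delimiters roughly doubles the token count, so some of the speed advantage may evaporate.

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Sketch: split on a single-character delimiter while preserving empty tokens.
static List<String> splitKeepingEmpty(String s, String delim) {
    List<String> tokens = new ArrayList<String>();
    StringTokenizer st = new StringTokenizer(s, delim, true); // true = return delimiters too
    boolean lastWasDelim = true; // a delimiter seen first implies a leading empty token
    while (st.hasMoreTokens()) {
        String t = st.nextToken();
        if (t.equals(delim)) {
            if (lastWasDelim) {
                tokens.add(""); // two delimiters in a row -> empty token between them
            }
            lastWasDelim = true;
        } else {
            tokens.add(t);
            lastWasDelim = false;
        }
    }
    return tokens; // splitKeepingEmpty("dog,,cat", ",") -> [dog, , cat]
}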

Comments (9)

空心空情空意 2024-07-30 17:02:34

Are you only actually tokenizing on commas? If so, I'd write my own tokenizer - it may well end up being even more efficient than the more general purpose StringTokenizer which can look for multiple tokens, and you can make it behave however you'd like. For such a simple use case, it can be a simple implementation.

If it would be useful, you could even implement Iterable<String> and get enhanced-for-loop support with strong typing instead of the Enumeration support provided by StringTokenizer. Let me know if you want any help coding such a beast up - it really shouldn't be too hard.

Additionally, I'd try running performance tests on your actual data before leaping too far from an existing solution. Do you have any idea how much of your execution time is actually spent in String.split? I know you have a lot of strings to parse, but if you're doing anything significant with them afterwards, I'd expect that to be much more significant than the splitting.

冰魂雪魄 2024-07-30 17:02:34

After tinkering with the StringTokenizer class, I could not find a way to satisfy the requirements to return ["dog", "", "cat"].

Furthermore, the StringTokenizer class is retained only for compatibility reasons, and the use of String.split is encouraged. From the API specification for StringTokenizer:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

Since the issue is the supposedly poor performance of the String.split method, we need to find an alternative.

Note: I am saying "supposedly poor performance" because it's hard to determine that every use case will result in StringTokenizer being superior to the String.split method. Furthermore, in many cases, unless the tokenization of the strings is indeed the bottleneck of the application, as determined by proper profiling, I feel it will end up being a premature optimization, if anything. I would be inclined to say: write code that is meaningful and easy to understand before venturing into optimization.

Now, given the current requirements, rolling our own tokenizer probably wouldn't be too difficult.

Roll our own tokenizer!

The following is a simple tokenizer I wrote. I should note that there are no speed optimizations, nor are there error checks to prevent going past the end of the string -- this is a quick-and-dirty implementation:

import java.util.Iterator;

class MyTokenizer implements Iterable<String>, Iterator<String> {
  String delim = ",";
  String s;
  int curIndex = 0;
  int nextIndex = 0;
  boolean nextIsLastToken = false;

  public MyTokenizer(String s, String delim) {
    this.s = s;
    this.delim = delim;
  }

  public Iterator<String> iterator() {
    return this;
  }

  public boolean hasNext() {
    // Find the next delimiter; -1 means the rest of the string is the last token.
    nextIndex = s.indexOf(delim, curIndex);

    if (nextIsLastToken)
      return false;

    if (nextIndex == -1)
      nextIsLastToken = true;

    return true;
  }

  public String next() {
    if (nextIndex == -1)
      nextIndex = s.length();

    // An empty substring here is exactly the empty token we want to preserve.
    String token = s.substring(curIndex, nextIndex);
    curIndex = nextIndex + 1;

    return token;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}

MyTokenizer takes a String to tokenize and a String as the delimiter, and uses the String.indexOf method to search for delimiters. Tokens are produced by the String.substring method.

I would suspect there could be some performance improvement from working on the string at the char[] level rather than at the String level, but I'll leave that as an exercise for the reader.
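
For the curious, here is a rough idea of what such a char[]-level splitter could look like -- an untested sketch, not a benchmarked implementation. Note it always emits the final token, so a trailing delimiter yields a trailing empty token, unlike String.split's default:

import java.util.ArrayList;
import java.util.List;

static List<String> splitOnChar(String s, char delim) {
  List<String> out = new ArrayList<String>();
  char[] chars = s.toCharArray();
  int start = 0;
  for (int i = 0; i < chars.length; i++) {
    if (chars[i] == delim) {
      out.add(new String(chars, start, i - start)); // token before this delimiter (may be empty)
      start = i + 1;
    }
  }
  out.add(new String(chars, start, chars.length - start)); // final token
  return out;
}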

The class also implements Iterable and Iterator in order to take advantage of the for-each loop construct introduced in Java 5. StringTokenizer implements Enumeration, and does not support the for-each construct.
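
For example, assuming the MyTokenizer class above is in scope, the enhanced for loop reads naturally and the empty middle token is preserved:

MyTokenizer mt = new MyTokenizer("dog,,cat", ",");
for (String token : mt) {
  System.out.println("[" + token + "]"); // prints [dog], [], [cat]
}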

Is it any faster?

In order to find out if this is any faster, I wrote a program to compare speeds in the following four methods:

  1. Use of StringTokenizer.
  2. Use of the new MyTokenizer.
  3. Use of String.split.
  4. Use of precompiled regular expression by Pattern.compile.

In all four methods, the string "dog,,cat" was separated into tokens. Although StringTokenizer is included in the comparison, it should be noted that it will not return the desired result of ["dog", "", "cat"].

The tokenizing was repeated a total of 1 million times to allow enough time to notice the differences between the methods.

The code used for the simple benchmark was the following:

long st = System.currentTimeMillis();
for (int i = 0; i < 1e6; i++) {
  StringTokenizer t = new StringTokenizer("dog,,cat", ",");
  while (t.hasMoreTokens()) {
    t.nextToken();
  }
}
System.out.println(System.currentTimeMillis() - st);

st = System.currentTimeMillis();
for (int i = 0; i < 1e6; i++) {
  MyTokenizer mt = new MyTokenizer("dog,,cat", ",");
  for (String t : mt) {
  }
}
System.out.println(System.currentTimeMillis() - st);

st = System.currentTimeMillis();
for (int i = 0; i < 1e6; i++) {
  String[] tokens = "dog,,cat".split(",");
  for (String t : tokens) {
  }
}
System.out.println(System.currentTimeMillis() - st);

st = System.currentTimeMillis();
Pattern p = Pattern.compile(",");
for (int i = 0; i < 1e6; i++) {
  String[] tokens = p.split("dog,,cat");
  for (String t : tokens) {
  }
}
System.out.println(System.currentTimeMillis() - st);

The Results

The tests were run using Java SE 6 (build 1.6.0_12-b04), and results were the following:

                   Run 1    Run 2    Run 3    Run 4    Run 5
                   -----    -----    -----    -----    -----
StringTokenizer      172      188      187      172      172
MyTokenizer          234      234      235      234      235
String.split        1172     1156     1171     1172     1156
Pattern.compile      906      891      891      907      906

So, as can be seen from this limited testing and only five runs, StringTokenizer did in fact come out the fastest, with MyTokenizer a close second. String.split was the slowest, and the precompiled regular expression was slightly faster than the split method.

As with any little benchmark, it probably isn't very representative of real-life conditions, so the results should be taken with a grain (or a mound) of salt.

旧故 2024-07-30 17:02:34

Note: Having done some quick benchmarks, Scanner turns out to be about four times slower than String.split. Hence, do not use Scanner.

(I'm leaving the post up to record the fact that Scanner is a bad idea in this case. (Read as: do not downvote me for suggesting Scanner, please...))

Assuming you are using Java 1.5 or higher, try Scanner, which implements Iterator<String>, as it happens:

Scanner sc = new Scanner("dog,,cat");
sc.useDelimiter(",");
while (sc.hasNext()) {
    System.out.println(sc.next());
}

gives:

dog

cat

半暖夏伤 2024-07-30 17:02:34

Depending on what kind of strings you need to tokenize, you can write your own splitter based on String.indexOf(), for example. You could also create a multi-core solution to improve performance even further, since the tokenization of strings is independent of each other. Work on batches of, say, 100 strings per core, and do the String.split() or whatever else on each batch.
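
As a rough sketch of that batching idea (the method and variable names here are invented for illustration, and the batch size would need tuning on real data):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

static void splitInParallel(List<String> allLines) throws Exception {
    final int BATCH_SIZE = 100; // strings per task, per the suggestion above
    ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
    List<Future<?>> futures = new ArrayList<Future<?>>();
    for (int i = 0; i < allLines.size(); i += BATCH_SIZE) {
        final List<String> batch =
                allLines.subList(i, Math.min(i + BATCH_SIZE, allLines.size()));
        futures.add(pool.submit(new Runnable() {
            public void run() {
                for (String line : batch) {
                    String[] tokens = line.split(",");
                    // do something with tokens here
                }
            }
        }));
    }
    for (Future<?> f : futures) {
        f.get(); // wait for every batch to finish
    }
    pool.shutdown();
}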

哆兒滾 2024-07-30 17:02:34

Rather than StringTokenizer, you could try the StrTokenizer class from Apache Commons Lang, which I quote:

This class can split a String into many smaller strings. It aims to do a similar job to StringTokenizer, however it offers much more control and flexibility including implementing the ListIterator interface.

Empty tokens may be removed or returned as null.

This sounds like what you need, I think?
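
For instance, a minimal sketch, assuming commons-lang3 is on the classpath (check the method names against the javadoc of your version):

import org.apache.commons.lang3.text.StrTokenizer;

StrTokenizer st = new StrTokenizer("dog,,cat", ',');
st.setIgnoreEmptyTokens(false);        // keep the empty token between the commas
String[] tokens = st.getTokenArray();  // ["dog", "", "cat"]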

皓月长歌 2024-07-30 17:02:34

You could do something like this. It's not perfect, but it might work for you.

public static List<String> find(String test, char c) {
    List<String> list = new Vector<String>();
    int i = 0;
    while (i <= test.length()) {
        int start = i;
        while (i < test.length() && test.charAt(i) != c) {
            i++;
        }
        list.add(test.substring(start, i));
        i++;
    }
    return list;
}

If possible you can omit the List and directly do something with the substring:

public static void split(String test, char c) {
    int i = 0;
    while (i <= test.length()) {
        int start = i;
        while (i < test.length() && test.charAt(i) != c) {
            i++;
        }
        String s = test.substring(start, i);
        // do something with the string here
        i++;
    }
}

On my system the last method is faster than the StringTokenizer solution, but you may want to test how it works for you. (Of course you could make this method a little shorter by omitting the {} of the second while loop, and you could use a for loop instead of the outer while loop, folding the last i++ into it, but I didn't do that here because I consider that bad style.)

亣腦蒛氧 2024-07-30 17:02:34

Well, the fastest thing you could do would be to manually traverse the string, e.g.

List<String> split(String s) {
    List<String> out = new ArrayList<String>();
    int idx = 0;
    int next = 0;
    while ((next = s.indexOf(',', idx)) > -1) {
        out.add(s.substring(idx, next));
        idx = next + 1;
    }
    if (idx < s.length()) {
        out.add(s.substring(idx));
    }
    return out;
}

This (informal test) looks to be something like twice as fast as split. However, it's a bit dangerous to iterate this way, for example it will break on escaped commas, and if you end up needing to deal with that at some point (because your list of a billion strings has 3 escaped commas) by the time you've allowed for it you'll probably end up losing some of the speed benefit.

Ultimately it's probably not worth the bother.

陌路终见情 2024-07-30 17:02:34

I would recommend Google's Guava Splitter.
I compared it with coobird's test and got the following results:

StringTokenizer          104
Google Guava Splitter    142
String.split             446
regexp                   299
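
For reference, a minimal usage sketch, assuming Guava is on the classpath -- unlike String.split, Splitter keeps empty tokens by default:

import com.google.common.base.Splitter;

for (String token : Splitter.on(',').split("dog,,cat")) {
    System.out.println("[" + token + "]"); // prints [dog], [], [cat]
}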

清君侧 2024-07-30 17:02:34

If your input is structured, you can have a look at the JavaCC compiler. It generates a Java class that reads your input. It would look like this:

TOKEN { <CAT: "cat"> , <DOG: "dog"> }

input: (cat() | dog())*


cat: <CAT>
   {
   animals.add(new Animal("Cat"));
   }

dog: <DOG>
   {
   animals.add(new Animal("Dog"));
   }