Performance comparison of the StringTokenizer class vs. the String.split method in Java

Published on 2024-11-06 03:44:13


In my software I need to split string into words. I currently have more than 19,000,000 documents with more than 30 words each.

Which of the following two ways is the best way to do this (in terms of performance)?

StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken(); // process each word
}

or

String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i]; // process each word
}


Comments (9)

尐籹人 2024-11-13 03:44:13


If your data is already in a database and you need to parse the strings of words, I would suggest using indexOf repeatedly. It's many times faster than either solution.

However, getting the data from the database is still likely to be much more expensive.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

StringBuilder sb = new StringBuilder();
for (int i = 100000; i < 100000 + 60; i++)
    sb.append(i).append(' ');
String sample = sb.toString();

int runs = 100000;
for (int i = 0; i < 5; i++) {
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            StringTokenizer st = new StringTokenizer(sample);
            List<String> list = new ArrayList<String>();
            while (st.hasMoreTokens())
                list.add(st.nextToken());
        }
        long time = System.nanoTime() - start;
        System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        Pattern spacePattern = Pattern.compile(" ");
        for (int r = 0; r < runs; r++) {
            List<String> list = Arrays.asList(spacePattern.split(sample, 0));
        }
        long time = System.nanoTime() - start;
        System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            List<String> list = new ArrayList<String>();
            int pos = 0, end;
            while ((end = sample.indexOf(' ', pos)) >= 0) {
                list.add(sample.substring(pos, end));
                pos = end + 1;
            }
        }
        long time = System.nanoTime() - start;
        System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
    }
}

prints

StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us

The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours just opening files. The cost of using split vs StringTokenizer is far less than 0.01 ms each. Parsing 19 million documents × 30 words × 8 letters per word (about 4.5 GB of text) should take about 10 seconds at roughly 1 GB per 2 seconds.

If you want to improve performance, I suggest you have far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/

时间你老了 2024-11-13 03:44:13


Split in Java 7 just calls indexOf for this kind of input; see the source. Split should be very fast, close to repeated calls of indexOf.
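
For a single-character delimiter that is not a regex metacharacter, the effect is roughly the hand-rolled loop below. This is a paraphrase of the idea, not the actual JDK source (the real fast path also drops trailing empty strings):

import java.util.ArrayList;
import java.util.List;

// What the Java 7 fast path amounts to: no Pattern is compiled;
// the string is walked with indexOf/substring instead.
static List<String> fastSplit(String s, char delim) {
    List<String> parts = new ArrayList<String>();
    int pos = 0, end;
    while ((end = s.indexOf(delim, pos)) >= 0) {
        parts.add(s.substring(pos, end));
        pos = end + 1;
    }
    parts.add(s.substring(pos)); // trailing segment
    return parts;
}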

原来是傀儡 2024-11-13 03:44:13


The Java API specification recommends using split. See the documentation of StringTokenizer.
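
The note in that documentation even includes an example of the recommended replacement; it is essentially:

String[] result = "this is a test".split("\\s");
for (int x = 0; x < result.length; x++)
    System.out.println(result[x]);

which prints

this
is
a
test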

塔塔猫 2024-11-13 03:44:13


Another important thing, undocumented as far as I can tell, is that asking StringTokenizer to return the delimiters along with the tokenized strings (by using the constructor StringTokenizer(String str, String delim, boolean returnDelims)) also reduces processing time. So, if you're looking for performance, I would recommend using something like:

private static final String DELIM = "#";

public void splitIt(String input) {
    StringTokenizer st = new StringTokenizer(input, DELIM, true);
    while (st.hasMoreTokens()) {
        String next = getNext(st);
        System.out.println(next);
    }
}

private String getNext(StringTokenizer st) {
    String value = st.nextToken();
    if (DELIM.equals(value))
        value = null;   // two adjacent delimiters: this field is empty
    else if (st.hasMoreTokens())
        st.nextToken(); // consume the delimiter that follows the token
    return value;
}

Despite the overhead introduced by the getNext() method, which discards the delimiters for you, it's still 50% faster according to my benchmarks.
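
For illustration, here is what that gives for a made-up input containing an empty field between two delimiters:

// Tokens returned with returnDelims = true for "foo#bar##baz":
//   "foo", "#", "bar", "#", "#", "baz"
splitIt("foo#bar##baz");
// prints, one per line:
//   foo   (getNext consumes the "#" that follows it)
//   bar
//   null  (a bare "#" token marks an empty field)
//   baz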

邮友 2024-11-13 03:44:13


Use split.

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method instead.
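
As a sketch, the tokenizer loop from the question then becomes something like this (note that the regex "\\s+" also collapses runs of whitespace, which a plain " " split does not):

for (String word : s.split("\\s+")) {
    // process word
}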

尹雨沫 2024-11-13 03:44:13


What do the 19,000,000 documents have to do with it? Do you have to split the words in all the documents on a regular basis, or is it a one-shot problem?

If you display/request one document at a time, with only 30 words, this is such a tiny problem that any method would work.

If you have to process all the documents at once, with only 30 words each, it is still such a tiny problem that you are more likely to be I/O bound anyway.

冰葑 2024-11-13 03:44:13


While running micro (and in this case, even nano) benchmarks, there is a lot that affects your results: JIT optimizations and garbage collection, to name just a few.

In order to get meaningful results out of micro benchmarks, check out the JMH library. It comes bundled with excellent samples of how to run good benchmarks.
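
A minimal sketch of what such a benchmark might look like (class and method names are illustrative, and it assumes the jmh-core and jmh-generator-annprocess dependencies are on the classpath):

import java.util.StringTokenizer;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class SplitBenchmark {
    String sample = "the quick brown fox jumps over the lazy dog";

    @Benchmark
    public String[] split() {
        // returning the result keeps the JIT from eliminating the work
        return sample.split(" ");
    }

    @Benchmark
    public int tokenize() {
        StringTokenizer st = new StringTokenizer(sample, " ");
        int count = 0;
        while (st.hasMoreTokens()) {
            st.nextToken();
            count++;
        }
        return count;
    }
}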

记忆之渊 2024-11-13 03:44:13


Regardless of its legacy status, I would expect StringTokenizer to be significantly quicker than String.split() for this task, because it doesn't use regular expressions: it just scans the input directly, much as you would yourself via indexOf(). In fact, String.split() has to compile the regex every time you call it (although, as another answer notes, Java 7 added a fast path that skips the regex machinery for single-character delimiters), so it isn't even as efficient as using a regular expression directly yourself.
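
If you do go the regex route, compiling the pattern once and reusing it avoids that per-call cost; a minimal sketch (class and field names are illustrative):

import java.util.regex.Pattern;

class Splitter {
    // Compiled once, reused for every document.
    private static final Pattern SPACE = Pattern.compile(" ");

    static String[] words(String s) {
        return SPACE.split(s);
    }
}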

昨迟人 2024-11-13 03:44:13


This could be a reasonable benchmark, using Java 1.6.0:

http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml#.V6-CZvnhCM8