Performance comparison of the StringTokenizer class vs. the String.split method in Java

Published on 2024-11-06 03:44:13


In my software I need to split string into words. I currently have more than 19,000,000 documents with more than 30 words each.

Which of the following two ways is the best way to do this (in terms of performance)?

StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken(); // process each word
}

or

String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i]; // process each word
}


Comments (9)

尐籹人 2024-11-13 03:44:13


If your data is already in a database and you need to parse the strings of words, I would suggest using indexOf repeatedly. It's many times faster than either solution.

However, getting the data from the database is still likely to be much more expensive.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

StringBuilder sb = new StringBuilder();
for (int i = 100000; i < 100000 + 60; i++)
    sb.append(i).append(' ');
String sample = sb.toString();

int runs = 100000;
for (int i = 0; i < 5; i++) {
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            StringTokenizer st = new StringTokenizer(sample);
            List<String> list = new ArrayList<String>();
            while (st.hasMoreTokens())
                list.add(st.nextToken());
        }
        long time = System.nanoTime() - start;
        System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        Pattern spacePattern = Pattern.compile(" ");
        for (int r = 0; r < runs; r++) {
            List<String> list = Arrays.asList(spacePattern.split(sample, 0));
        }
        long time = System.nanoTime() - start;
        System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            List<String> list = new ArrayList<String>();
            int pos = 0, end;
            while ((end = sample.indexOf(' ', pos)) >= 0) {
                list.add(sample.substring(pos, end));
                pos = end + 1;
            }
        }
        long time = System.nanoTime() - start;
        System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
    }
}

prints

StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us

The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours just opening files. The cost of using split vs StringTokenizer is far less than 0.01 ms each. Parsing 19 million documents × 30 words × 8 letters per word (about 4.5 GB of text) should take about 10 seconds at roughly 1 GB per 2 seconds.

If you want to improve performance, I suggest you have far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/

时间你老了 2024-11-13 03:44:13


Split in Java 7 just calls indexOf for this kind of input; see the source. Split should be very fast, close to repeated calls of indexOf.
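
For a single-character delimiter that is not a regex metacharacter, the effect is roughly the hand-rolled loop below. This is a paraphrase of the idea, not the actual JDK source (the real fast path also drops trailing empty strings):

import java.util.ArrayList;
import java.util.List;

// What the Java 7 fast path amounts to: no Pattern is compiled;
// the string is walked with indexOf/substring instead.
static List<String> fastSplit(String s, char delim) {
    List<String> parts = new ArrayList<String>();
    int pos = 0, end;
    while ((end = s.indexOf(delim, pos)) >= 0) {
        parts.add(s.substring(pos, end));
        pos = end + 1;
    }
    parts.add(s.substring(pos)); // trailing segment
    return parts;
}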

原来是傀儡 2024-11-13 03:44:13


The Java API specification recommends using split. See the documentation of StringTokenizer.
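
The note in that documentation even includes an example of the recommended replacement; it is essentially:

String[] result = "this is a test".split("\\s");
for (int x = 0; x < result.length; x++)
    System.out.println(result[x]);

which prints

this
is
a
test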

塔塔猫 2024-11-13 03:44:13


Another important thing, undocumented as far as I can tell, is that asking StringTokenizer to return the delimiters along with the tokenized strings (by using the constructor StringTokenizer(String str, String delim, boolean returnDelims)) also reduces processing time. So, if you're looking for performance, I would recommend using something like:

private static final String DELIM = "#";

public void splitIt(String input) {
    StringTokenizer st = new StringTokenizer(input, DELIM, true);
    while (st.hasMoreTokens()) {
        String next = getNext(st);
        System.out.println(next);
    }
}

private String getNext(StringTokenizer st) {
    String value = st.nextToken();
    if (DELIM.equals(value))
        value = null;   // two adjacent delimiters: this field is empty
    else if (st.hasMoreTokens())
        st.nextToken(); // consume the delimiter that follows the token
    return value;
}

Despite the overhead introduced by the getNext() method, which discards the delimiters for you, it's still 50% faster according to my benchmarks.
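
For illustration, here is what that gives for a made-up input containing an empty field between two delimiters:

// Tokens returned with returnDelims = true for "foo#bar##baz":
//   "foo", "#", "bar", "#", "#", "baz"
splitIt("foo#bar##baz");
// prints, one per line:
//   foo   (getNext consumes the "#" that follows it)
//   bar
//   null  (a bare "#" token marks an empty field)
//   baz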

邮友 2024-11-13 03:44:13


Use split.

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method instead.
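
As a sketch, the tokenizer loop from the question then becomes something like this (note that the regex "\\s+" also collapses runs of whitespace, which a plain " " split does not):

for (String word : s.split("\\s+")) {
    // process word
}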

尹雨沫 2024-11-13 03:44:13


What do the 19,000,000 documents have to do with it? Do you have to split the words in all the documents on a regular basis, or is it a one-shot problem?

If you display/request one document at a time, with only 30 words, this is such a tiny problem that any method would work.

If you have to process all the documents at once, with only 30 words each, it is still such a tiny problem that you are more likely to be I/O bound anyway.

冰葑 2024-11-13 03:44:13


While running micro (and in this case, even nano) benchmarks, there is a lot that affects your results: JIT optimizations and garbage collection, to name just a few.

In order to get meaningful results out of micro benchmarks, check out the JMH library. It comes bundled with excellent samples of how to run good benchmarks.
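
A minimal sketch of what such a benchmark might look like (class and method names are illustrative, and it assumes the jmh-core and jmh-generator-annprocess dependencies are on the classpath):

import java.util.StringTokenizer;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class SplitBenchmark {
    String sample = "the quick brown fox jumps over the lazy dog";

    @Benchmark
    public String[] split() {
        // returning the result keeps the JIT from eliminating the work
        return sample.split(" ");
    }

    @Benchmark
    public int tokenize() {
        StringTokenizer st = new StringTokenizer(sample, " ");
        int count = 0;
        while (st.hasMoreTokens()) {
            st.nextToken();
            count++;
        }
        return count;
    }
}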

记忆之渊 2024-11-13 03:44:13


Regardless of its legacy status, I would expect StringTokenizer to be significantly quicker than String.split() for this task, because it doesn't use regular expressions: it just scans the input directly, much as you would yourself via indexOf(). In fact, String.split() has to compile the regex every time you call it (although, as another answer notes, Java 7 added a fast path that skips the regex machinery for single-character delimiters), so it isn't even as efficient as using a regular expression directly yourself.
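
If you do go the regex route, compiling the pattern once and reusing it avoids that per-call cost; a minimal sketch (class and field names are illustrative):

import java.util.regex.Pattern;

class Splitter {
    // Compiled once, reused for every document.
    private static final Pattern SPACE = Pattern.compile(" ");

    static String[] words(String s) {
        return SPACE.split(s);
    }
}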

昨迟人 2024-11-13 03:44:13


This could be a reasonable benchmark, using Java 1.6.0:

http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml#.V6-CZvnhCM8