Java's Scanner vs. String.split() vs. StringTokenizer: which should I use?

Posted on 2024-07-16 15:48:31

I am currently using split() to scan through a file in which each line holds a number of strings delimited by '~'. I read somewhere that Scanner could do a better job with a long file, performance-wise, so I thought about checking it out.

My question is: would I have to create two instances of Scanner? That is, one to read each line and another, built from that line, to extract the delimited tokens? If so, I doubt I would gain anything from using it. Maybe I am missing something here?

够运 2024-07-23 15:48:31

I took some measurements of these in a single-threaded setup, and here are the results I got.

Time metrics (ms)

Tokenizer | String.split() | while+substring | Scanner | Scanner with compiled Pattern
----------+----------------+-----------------+---------+------------------------------
4.0       | 5.1            | 1.2             | 0.5     | 0.1
4.4       | 4.8            | 1.1             | 0.1     | 0.1
3.5       | 4.7            | 1.2             | 0.1     | 0.1
3.5       | 4.7            | 1.1             | 0.1     | 0.1
3.5       | 4.7            | 1.1             | 0.1     | 0.1

The outcome is that Scanner gives the best performance. The same comparison still needs to be run in a multithreaded setup. One of my senior colleagues says that StringTokenizer causes a CPU spike and String.split() does not.

凉城已无爱 2024-07-23 15:48:31

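To process the file line by line you can use Scanner, and to get the tokens from each line you can use split().

// Needs: import java.io.File; import java.util.Scanner;
// new Scanner(File) throws FileNotFoundException, which the caller must handle.
try (Scanner scanner = new Scanner(new File(loc))) {
    while (scanner.hasNextLine()) {
        // "~" has no special meaning in a regex, so it needs no escaping
        String[] tokens = scanner.nextLine().split("~");
        // do the processing for tokens here
    }
} // try-with-resources closes the scanner, like the explicit finally block would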
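
One detail that matters for '~'-delimited lines: String.split(regex) drops trailing empty strings, so a line that ends in empty fields yields fewer tokens than it has fields. The following is a minimal sketch of the difference, with a made-up sample line and a hypothetical class name; the two-argument overload with a negative limit keeps the trailing empties.

```java
import java.util.Arrays;

// Hypothetical demo class; the sample line is made up for illustration.
class SplitLimitDemo {
    // Splits on '~' and drops trailing empty fields (the default behaviour).
    static String[] splitDefault(String line) {
        return line.split("~");
    }

    // Splits on '~' and keeps trailing empty fields (negative limit).
    static String[] splitKeepEmpty(String line) {
        return line.split("~", -1);
    }

    public static void main(String[] args) {
        String line = "a~b~~";
        System.out.println(Arrays.toString(splitDefault(line)));   // [a, b]
        System.out.println(Arrays.toString(splitKeepEmpty(line))); // [a, b, , ]
    }
}
```

Whether the empty fields should be kept depends on the file format, so this is a choice to make deliberately rather than a fix to apply blindly.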

貪欢 2024-07-23 15:48:31

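You can use the useDelimiter("~") method to iterate through the tokens with hasNext()/next(), while still using hasNextLine()/nextLine() to iterate through the lines themselves. Be aware that with "~" as the only delimiter a line break is not a token boundary, so next() can return a token that runs from the end of one line into the start of the next; to get strictly per-line tokens, either include the line terminator in the delimiter pattern or tokenize each line separately.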
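
As a sketch of the "two Scanners" arrangement the question asks about: an outer Scanner reads lines and a short-lived inner Scanner, configured with useDelimiter("~"), walks the tokens of that one line. The class name and sample text here are hypothetical, and a String stands in for the file so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; the input text is made up for illustration.
class ScannerDelimiterDemo {
    // Reads lines with an outer Scanner and tokenizes each line with an inner one.
    static List<List<String>> tokenize(String text) {
        List<List<String>> rows = new ArrayList<>();
        try (Scanner lines = new Scanner(text)) {
            while (lines.hasNextLine()) {
                List<String> row = new ArrayList<>();
                // The inner Scanner only ever sees one line, so tokens cannot span lines.
                try (Scanner tokens = new Scanner(lines.nextLine()).useDelimiter("~")) {
                    while (tokens.hasNext()) {
                        row.add(tokens.next());
                    }
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a~b~c\nd~e")); // [[a, b, c], [d, e]]
    }
}
```

For a real file the outer Scanner would be built from a File instead of a String. Creating a Scanner per line has its own cost, so this shape is mainly useful when the per-token parsing helpers of Scanner (nextInt() and friends) are wanted.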

EDIT: If you're going to do a performance comparison, you should pre-compile the regex when you do the split() test:

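// Needs: import java.util.regex.Pattern;
// bufferedReader is assumed to be an already-open java.io.BufferedReader.
Pattern splitRegex = Pattern.compile("~"); // compile once, outside the loop
String line;
while ((line = bufferedReader.readLine()) != null)
{
  String[] tokens = splitRegex.split(line);
  // etc.
}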

If you use String#split(String regex), the regex will be recompiled every time. (Scanner automatically caches all regexes the first time it compiles them.) Once the pattern is pre-compiled, I wouldn't expect to see much difference in performance.

晨曦慕雪 2024-07-23 15:48:31

I would say split() is fastest, and probably good enough for what you're doing. It is less flexible than Scanner, though. StringTokenizer is a legacy class that is retained only for compatibility (it is not formally deprecated, but its Javadoc discourages use in new code), so don't use it.

EDIT: You could always test both implementations to see which one is faster. I'm curious myself whether Scanner could be faster than split(). split() might be faster than Scanner for a given input size, but I can't be certain of that.
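
A rough way to run that comparison is sketched below. It is a naive System.nanoTime() loop, not a rigorous benchmark (there is no warm-up control, so JIT effects will skew the numbers, and a harness such as JMH would be more trustworthy). The class name, sample line, and iteration count are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical harness; names, sample data and iteration count are illustrative.
class SplitTimingSketch {
    private static final Pattern TILDE = Pattern.compile("~");

    static List<String> withSplit(String line) {
        return Arrays.asList(line.split("~"));
    }

    static List<String> withPattern(String line) {
        return Arrays.asList(TILDE.split(line));
    }

    static List<String> withScanner(String line) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(line).useDelimiter(TILDE)) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "one~two~three~four~five";
        int iterations = 100_000;
        long sink = 0; // accumulate results so the loops have an observable effect

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += withSplit(line).size();
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += withPattern(line).size();
        long t2 = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += withScanner(line).size();
        long t3 = System.nanoTime();

        System.out.printf("split:   %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("Pattern: %d ms%n", (t2 - t1) / 1_000_000);
        System.out.printf("Scanner: %d ms%n", (t3 - t2) / 1_000_000);
        System.out.println("tokens seen: " + sink);
    }
}
```

The timings will vary by JVM and machine, so no expected numbers are given here; the point is only that all three approaches can be exercised on identical input.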

行雁书 2024-07-23 15:48:31

You don't actually need a regex here, because you are splitting on a fixed string. StringUtils.split() from Apache Commons Lang splits on plain strings.

For high-volume splits, where the splitting is the bottleneck rather than, say, file I/O, I've found this to be up to 10 times faster than String.split(). However, I did not test it against a compiled regex.

Guava also has a Splitter, implemented in a more object-oriented way, but I found it was significantly slower than StringUtils for high-volume splits.
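
To illustrate what regex-free splitting on a fixed character looks like, here is a minimal hand-rolled sketch built on indexOf() and substring(). It is not the StringUtils or Guava implementation, the class and method names are hypothetical, and it keeps empty fields, whereas Commons Lang documents StringUtils.split() as treating adjacent separators as one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not taken from Commons Lang or Guava.
class FixedCharSplit {
    // Splits on a single character using indexOf/substring; empty fields are kept.
    static List<String> split(String line, char sep) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = line.indexOf(sep, start)) >= 0) {
            out.add(line.substring(start, idx));
            start = idx + 1;
        }
        out.add(line.substring(start)); // remainder after the last separator
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("a~b~~c", '~')); // [a, b, , c]
    }
}
```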
