Replicating String.split with StringTokenizer
Encouraged by this, and the fact that I have billions of strings to parse, I tried to modify my code to accept a StringTokenizer instead of a String[].

The only thing left between me and getting that delicious x2 performance boost is the fact that when you're doing

"dog,,cat".split(",")
// output: ["dog", "", "cat"]

new StringTokenizer("dog,,cat", ",")
// nextToken() = "dog"
// nextToken() = "cat"

How can I achieve similar results with the StringTokenizer? Are there faster ways to do this?
Are you only actually tokenizing on commas? If so, I'd write my own tokenizer - it may well end up being even more efficient than the more general-purpose StringTokenizer, which can look for multiple tokens, and you can make it behave however you'd like. For such a simple use case, it can be a simple implementation.

If it would be useful, you could even implement Iterable<String> and get enhanced-for-loop support with strong typing instead of the Enumeration support provided by StringTokenizer. Let me know if you want any help coding such a beast up - it really shouldn't be too hard.

Additionally, I'd try running performance tests on your actual data before leaping too far from an existing solution. Do you have any idea how much of your execution time is actually spent in String.split? I know you have a lot of strings to parse, but if you're doing anything significant with them afterwards, I'd expect that to be much more significant than the splitting.
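To make that concrete, here is a minimal sketch of such a comma-only splitter (the class and method names are just illustrative); unlike StringTokenizer, it keeps the empty token between adjacent commas:

import java.util.ArrayList;
import java.util.List;

public class CommaSplitter {

    // Splits on ',' only, keeping empty tokens, with no regex machinery involved.
    public static List<String> split(String s) {
        List<String> tokens = new ArrayList<String>();
        int start = 0;
        int comma;
        while ((comma = s.indexOf(',', start)) != -1) {
            tokens.add(s.substring(start, comma));
            start = comma + 1;
        }
        tokens.add(s.substring(start)); // last token, possibly empty
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("dog,,cat")); // prints [dog, , cat]
    }
}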
After tinkering with the StringTokenizer class, I could not find a way to satisfy the requirement to return ["dog", "", "cat"].

Furthermore, the StringTokenizer class is left only for compatibility reasons, and the use of String.split is encouraged. From the API specification for StringTokenizer:

"StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."

Since the issue is the supposedly poor performance of the String.split method, we need to find an alternative.

Note: I am saying "supposedly poor performance" because it's hard to determine that every use case is going to result in StringTokenizer being superior to the String.split method. Furthermore, in many cases, unless the tokenization of the strings is indeed the bottleneck of the application, as determined by proper profiling, I feel that it will end up being a premature optimization, if anything. I would be inclined to say: write code that is meaningful and easy to understand before venturing into optimization.

Now, given the current requirements, rolling our own tokenizer probably wouldn't be too difficult.

Roll our own tokenizer!

The following is a simple tokenizer I wrote. I should note that there are no speed optimizations, nor are there error checks to prevent going past the end of the string -- this is a quick-and-dirty implementation:
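The original listing isn't reproduced here, but a rough sketch matching the description below (indexOf-based delimiter search, substring-produced tokens, Iterable/Iterator support, no bounds checking) could look like this:

import java.util.Iterator;

// A quick-and-dirty sketch: no guard against calling next() past the last token.
public class MyTokenizer implements Iterable<String>, Iterator<String> {

    private final String str;
    private final String delim;
    private int pos = 0;

    public MyTokenizer(String str, String delim) {
        this.str = str;
        this.delim = delim;
    }

    public Iterator<String> iterator() {
        return this;
    }

    public boolean hasNext() {
        return pos <= str.length();
    }

    // Adjacent delimiters yield empty tokens.
    public String next() {
        int next = str.indexOf(delim, pos);
        String token;
        if (next == -1) {
            token = str.substring(pos);
            pos = str.length() + 1; // move past the end: no more tokens
        } else {
            token = str.substring(pos, next);
            pos = next + delim.length();
        }
        return token;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}

With this in place, for (String token : new MyTokenizer("dog,,cat", ",")) yields "dog", "", "cat".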
MyTokenizer will take a String to tokenize and a String as a delimiter, and use the String.indexOf method to perform the search for delimiters. Tokens are produced by the String.substring method.

I would suspect there could be some performance improvements by working on the string at the char[] level rather than at the String level, but I'll leave that as an exercise to the reader.

The class also implements Iterable<String> and Iterator<String> in order to take advantage of the for-each loop construct that was introduced in Java 5. StringTokenizer is an Enumeration, and does not support the for-each construct.

Is it any faster?
In order to find out whether this is any faster, I wrote a program to compare the speeds of the following four methods:

1. StringTokenizer
2. MyTokenizer
3. String.split
4. Pattern.compile

In the four methods, the string "dog,,cat" was separated into tokens. Although StringTokenizer is included in the comparison, it should be noted that it will not return the desired result of ["dog", "", "cat"].

The tokenizing was repeated for a total of 1 million times, to allow enough time to notice the difference between the methods.

The code used for the simple benchmark was the following:
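The benchmark source isn't reproduced here, but a harness in that spirit might look like the following sketch (simple wall-clock timing over one million iterations of "dog,,cat", reusing the MyTokenizer sketch from above):

import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class SplitBenchmark {

    private static final String INPUT = "dog,,cat";
    private static final int RUNS = 1000000;

    public static void main(String[] args) {
        long start;

        // 1. StringTokenizer (note: it silently drops the empty token)
        start = System.currentTimeMillis();
        for (int i = 0; i < RUNS; i++) {
            StringTokenizer st = new StringTokenizer(INPUT, ",");
            while (st.hasMoreTokens()) {
                st.nextToken();
            }
        }
        System.out.println("StringTokenizer: " + (System.currentTimeMillis() - start) + " ms");

        // 2. MyTokenizer -- the Iterable/Iterator sketch shown earlier
        start = System.currentTimeMillis();
        for (int i = 0; i < RUNS; i++) {
            for (String token : new MyTokenizer(INPUT, ",")) {
                // just consume the token
            }
        }
        System.out.println("MyTokenizer:     " + (System.currentTimeMillis() - start) + " ms");

        // 3. String.split, called as-is on every iteration
        start = System.currentTimeMillis();
        for (int i = 0; i < RUNS; i++) {
            INPUT.split(",");
        }
        System.out.println("String.split:    " + (System.currentTimeMillis() - start) + " ms");

        // 4. Pattern compiled once, then reused
        Pattern comma = Pattern.compile(",");
        start = System.currentTimeMillis();
        for (int i = 0; i < RUNS; i++) {
            comma.split(INPUT);
        }
        System.out.println("Pattern.compile: " + (System.currentTimeMillis() - start) + " ms");
    }
}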
The Results
The tests were run using Java SE 6 (build 1.6.0_12-b04), and the results were the following:
So, as can be seen from the limited testing and only five runs, StringTokenizer did in fact come out the fastest, but MyTokenizer came in a close second. Then, String.split was the slowest, and the precompiled regular expression was slightly faster than the split method.

As with any little benchmark, it probably isn't very representative of real-life conditions, so the results should be taken with a grain (or a mound) of salt.
Note: Having done some quick benchmarks, Scanner turns out to be about four times slower than String.split. Hence, do not use Scanner.
(I'm leaving the post up to record the fact that Scanner is a bad idea in this case. (Read as: do not downvote me for suggesting Scanner, please...))
Assuming you are using Java 1.5 or higher, try Scanner, which implements Iterator<String>, as it happens:
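A sketch of the kind of usage meant here; unlike StringTokenizer, Scanner should report the empty token between the two commas (worth verifying on your JDK):

import java.util.Scanner;

public class ScannerSplit {

    public static void main(String[] args) {
        Scanner sc = new Scanner("dog,,cat");
        sc.useDelimiter(",");
        while (sc.hasNext()) {
            System.out.println(sc.next());
        }
        // Expected output (the blank line is the empty token):
        // dog
        //
        // cat
    }
}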
Depending on what kind of strings you need to tokenize, you can write your own splitter based on String.indexOf(), for example. You could also create a multi-core solution to improve performance even further, as the tokenization of the strings is independent of each other. Work on batches of, let's say, 100 strings per core. Do the String.split() or whatever else.
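A sketch of the batching idea; the thread count, the batch size of 100, and the use of String.split() inside the worker are all placeholders for whatever actually fits:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSplit {

    // Splits batches of lines on separate threads and collects the results.
    public static List<String[]> splitAll(List<String> lines, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<String[]>>> futures = new ArrayList<Future<List<String[]>>>();
        int batchSize = 100;
        for (int i = 0; i < lines.size(); i += batchSize) {
            final List<String> batch = lines.subList(i, Math.min(i + batchSize, lines.size()));
            futures.add(pool.submit(new Callable<List<String[]>>() {
                public List<String[]> call() {
                    List<String[]> result = new ArrayList<String[]>();
                    for (String line : batch) {
                        result.add(line.split(",")); // or any other splitter
                    }
                    return result;
                }
            }));
        }
        List<String[]> all = new ArrayList<String[]>();
        for (Future<List<String[]>> f : futures) {
            all.addAll(f.get());
        }
        pool.shutdown();
        return all;
    }
}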
Rather than StringTokenizer, you could try the StrTokenizer class from Apache Commons Lang, which offers much more control over tokenizing than StringTokenizer does, including how empty tokens are handled.

This sounds like what you need, I think?
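A usage sketch, assuming Commons Lang 3 (the package is org.apache.commons.lang.text in older versions); if I recall the defaults correctly, empty tokens are skipped unless you explicitly ask for them:

import org.apache.commons.lang3.text.StrTokenizer;

public class StrTokenizerExample {

    public static void main(String[] args) {
        StrTokenizer tokenizer = new StrTokenizer("dog,,cat", ',');
        tokenizer.setIgnoreEmptyTokens(false); // keep the token between the two commas
        for (String token : tokenizer.getTokenArray()) {
            System.out.println("[" + token + "]"); // [dog] [] [cat]
        }
    }
}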
You could do something like the following. It's not perfect, but it might work for you.

If possible, you can omit the List part and directly do something with each substring:
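For instance (a sketch only; handleToken is a hypothetical stand-in for whatever should happen to each piece):

public class DirectSplit {

    // Hypothetical placeholder for the real per-token work.
    private static void handleToken(String token) {
        System.out.println("[" + token + "]");
    }

    public static void split(String s) {
        int i = 0;
        while (i <= s.length()) {
            int start = i;
            while (i < s.length() && s.charAt(i) != ',') {
                i++;
            }
            handleToken(s.substring(start, i)); // token up to the next comma, possibly empty
            i++; // step over the comma (or past the end after the last token)
        }
    }

    public static void main(String[] args) {
        split("dog,,cat"); // [dog] [] [cat]
    }
}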
On my system the last method is faster than the StringTokenizer solution, but you might want to test how it works for you. (Of course you could make this method a little shorter by omitting the {} of the second while loop, and of course you could use a for loop instead of the outer while loop and fold the last i++ into it, but I didn't do that here because I consider that bad style.)
Well, the fastest thing you could do would be to manually traverse the string, e.g.
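For example, a sketch of such a manual traversal (collecting into a list just to have something comparable to split's output):

import java.util.ArrayList;
import java.util.List;

public class ManualTraversal {

    // Builds the token list by walking the characters once; naive about
    // escaped commas, as noted below.
    public static List<String> split(String s) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ',') {
                tokens.add(current.toString()); // may be empty for ",,"
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        tokens.add(current.toString()); // final token
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("dog,,cat")); // [dog, , cat]
    }
}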
This (informal test) looks to be something like twice as fast as split. However, it's a bit dangerous to iterate this way, for example it will break on escaped commas, and if you end up needing to deal with that at some point (because your list of a billion strings has 3 escaped commas) by the time you've allowed for it you'll probably end up losing some of the speed benefit.
Ultimately it's probably not worth the bother.
I would recommend Google's Guava Splitter.
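A minimal usage sketch; unlike StringTokenizer, Splitter keeps the empty token unless you explicitly call omitEmptyStrings():

import com.google.common.base.Splitter;

public class GuavaSplitExample {

    public static void main(String[] args) {
        Iterable<String> tokens = Splitter.on(',').split("dog,,cat");
        for (String token : tokens) {
            System.out.println("[" + token + "]"); // [dog] [] [cat]
        }
    }
}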
I compared it with coobird's test and got the following results:
If your input is structured, you can have a look at the JavaCC compiler. It generates a Java class that reads your input. It would look like this: