String.Split efficiency question
I am writing a search application that tokenizes a large text corpus.
The text parser needs to remove any gibberish from the text (i.e. anything matching [^a-zA-Z0-9]).
I had two ideas for how to do this:
1) Put the text in a string, convert it to a char array using String.ToCharArray, and then walk it char by char with a loop -> while (position < string.Length)
This way I can tokenize the entire text in a single pass.
2) Strip all non-digit/alpha characters using String.Replace, and then call String.Split with some delimiters. This means I have to run over the entire string twice:
once to remove the bad chars and then again to split it.
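For readers who want something concrete to compare, here is a minimal sketch of the two approaches. It is not the code that was actually tested in the question; the class and method names are hypothetical. Two assumptions: the sketch of #1 indexes the string directly instead of calling ToCharArray, and the sketch of #2 uses Regex.Replace (String.Replace does not accept a pattern) and replaces each run of unwanted characters with a space so that there is something left to split on.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class TokenizerSketch
{
    // Approach #1: a single pass that cuts a token whenever it meets
    // a character outside [a-zA-Z0-9].
    public static List<string> SinglePass(string text)
    {
        var tokens = new List<string>();
        int start = -1; // index where the current token began, -1 if none
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            bool keep = (c >= 'a' && c <= 'z')
                     || (c >= 'A' && c <= 'Z')
                     || (c >= '0' && c <= '9');
            if (keep)
            {
                if (start < 0) start = i; // a new token starts here
            }
            else if (start >= 0)
            {
                // One allocation per token, not one per character.
                tokens.Add(text.Substring(start, i - start));
                start = -1;
            }
        }
        if (start >= 0) tokens.Add(text.Substring(start)); // trailing token
        return tokens;
    }

    private static readonly Regex NonAlnum =
        new Regex("[^a-zA-Z0-9]+", RegexOptions.Compiled);

    // Approach #2: replace every run of unwanted characters with a space
    // (first pass), then split on spaces (second pass).
    public static string[] ReplaceThenSplit(string text)
    {
        string cleaned = NonAlnum.Replace(text, " ");
        return cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Both methods should return the same tokens for the same input, e.g. TokenizerSketch.SinglePass("Hello, world! 42") and TokenizerSketch.ReplaceThenSplit("Hello, world! 42").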
I assumed that since #1 does the same as #2 but in a single O(n) pass, it would be quicker, but after testing both, #2 is way (way!) faster.
I went even further and viewed the code behind String.Split using Red Gate's .NET Reflector.
It runs through the string char by char in unmanaged code, just like #1 does, but it is still much, much faster.
I have no clue why #2 is so much faster than #1.
Any ideas?
Comments (2)
How about this idea:
I don't know if this will be any faster than what you've already tried, but timing the above should at least answer that question.
djTeller,
The fact that #2 is faster is merely relative to your #1 method.
You might want to share your #1 method with us; maybe it's just very slow, and it may even be possible to make it faster than #2.
Yes, both are essentially O(n), but is your ACTUAL implementation O(n)? How did you actually do #1?
Also, when you said you tested both, I hope you did so with large amounts of input, to overcome the margin of error and see a significant difference between the two.
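To make both points concrete, here is a rough sketch. Since the original #1 code was never posted, TokenizeConcat is only a guess at one way a char-by-char loop can end up slow: it grows each token by string concatenation, which allocates a new string for every character. TokenizeSubstring does the same scan with one allocation per token. All names here are hypothetical, and Time is just a simple Stopwatch loop with a warm-up call, not a rigorous benchmark.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

// Hypothetical names throughout; a sketch, not the asker's code.
public static class TimingSketch
{
    static bool IsAsciiAlnum(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Possible slow variant of #1: every "token += c" copies the token so far.
    public static List<string> TokenizeConcat(string text)
    {
        var tokens = new List<string>();
        string token = "";
        foreach (char c in text)
        {
            if (IsAsciiAlnum(c)) token += c;
            else if (token.Length > 0) { tokens.Add(token); token = ""; }
        }
        if (token.Length > 0) tokens.Add(token);
        return tokens;
    }

    // Same scan, but remembers where the token started and copies it once.
    public static List<string> TokenizeSubstring(string text)
    {
        var tokens = new List<string>();
        int start = -1;
        for (int i = 0; i < text.Length; i++)
        {
            if (IsAsciiAlnum(text[i])) { if (start < 0) start = i; }
            else if (start >= 0) { tokens.Add(text.Substring(start, i - start)); start = -1; }
        }
        if (start >= 0) tokens.Add(text.Substring(start));
        return tokens;
    }

    // Builds a large synthetic input such as "word0, word1, word2, ".
    public static string MakeInput(int words)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < words; i++) sb.Append("word").Append(i).Append(", ");
        return sb.ToString();
    }

    // Times several runs after one warm-up call, so JIT compilation is not measured.
    public static TimeSpan Time(Func<string, List<string>> tokenize, string input, int runs)
    {
        tokenize(input);
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++) tokenize(input);
        sw.Stop();
        return sw.Elapsed;
    }
}
```

A call such as TimingSketch.Time(TimingSketch.TokenizeConcat, TimingSketch.MakeInput(1000000), 10) could then be compared against the same call with TokenizeSubstring, or with whatever #1 and #2 implementations are actually under test. Run it in a release build and with an input large enough that the difference is clearly above the noise.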