String.Split efficiency problem

Posted on 2024-10-01 02:04:25

I am writing a search application that tokenizes a large text corpus.

The text parser needs to remove any gibberish from the text (i.e. anything matching [^a-zA-Z0-9]).

I have two ideas for how to do this:

1) Put the text in a string, convert it to a char array using String.ToCharArray, and then walk it char by char with a loop -> while (position < string.Length).
That way I can tokenize the entire text in a single pass.

2) Strip all non-alphanumeric characters using string.Replace, and then call string.Split with some delimiters. This means I have to run over the entire string twice:
once to remove the bad chars and then again to split it.
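
For concreteness, the two approaches could be sketched roughly as follows. This is only a minimal illustration, not the code that was actually measured: the class and method names are made up, approach 1 indexes the string directly instead of calling String.ToCharArray, and approach 2 assumes Regex.Replace, since string.Replace does not accept a character class such as [^a-zA-Z0-9].

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical names; a sketch of the two approaches described above.
public static class TokenizerSketch
{
    // Matches the character class [a-zA-Z0-9] from the question.
    private static bool IsAsciiAlphanumeric(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');

    // Approach 1: one pass over the text, cutting a token at every
    // non-alphanumeric character.
    public static List<string> SinglePass(string text)
    {
        var tokens = new List<string>();
        int start = -1; // index where the current token began, -1 if none

        for (int i = 0; i < text.Length; i++)
        {
            if (IsAsciiAlphanumeric(text[i]))
            {
                if (start < 0) start = i;
            }
            else if (start >= 0)
            {
                tokens.Add(text.Substring(start, i - start));
                start = -1;
            }
        }

        // Flush a token that runs to the end of the text.
        if (start >= 0) tokens.Add(text.Substring(start));
        return tokens;
    }

    // Approach 2: replace every run of unwanted characters with a space
    // (assumption: Regex.Replace stands in for string.Replace), then split.
    public static string[] ReplaceThenSplit(string text)
    {
        string cleaned = Regex.Replace(text, "[^a-zA-Z0-9]+", " ");
        return cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Both methods should produce the same tokens for the same input, so any speed difference comes from how each one allocates and copies, not from what it computes.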

I assumed that since #1 does the same work as #2 but in a single O(n) pass, it would be quicker, but after testing both, #2 is way (way!) faster.

I went even further and viewed the code behind String.Split using Red Gate's .NET Reflector.
It runs through the string char by char in unmanaged code, just like #1, but it is still much, much faster.

I have no clue why #2 is way faster than #1.

Any ideas?

Comments (2)

清君侧 2024-10-08 02:04:25

How about this idea:

  1. Create a string
  2. Load the entire data set into the string
  3. Create a StringBuilder with enough pre-allocated space to hold the entire string
  4. Go character by character through the string and if the character is alphanumeric, add it to the StringBuilder.
  5. At the end, get the string out of the StringBuilder.

I don't know if this will be any faster than what you've already tried, but timing the above should at least answer that question.
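
The steps above could be sketched like this. It is a minimal illustration with made-up names, and it assumes char.IsLetterOrDigit as the "alphanumeric" test, which also accepts non-ASCII letters and digits and is therefore broader than [a-zA-Z0-9].

```csharp
using System;
using System.Text;

// Hypothetical names; a sketch of the StringBuilder idea described above.
public static class AlphanumericFilter
{
    public static string KeepAlphanumeric(string text)
    {
        // Pre-allocate enough room for the whole input so the builder
        // never has to grow while appending.
        var sb = new StringBuilder(text.Length);

        foreach (char c in text)
        {
            // Assumption: char.IsLetterOrDigit is an acceptable test here.
            if (char.IsLetterOrDigit(c))
            {
                sb.Append(c);
            }
        }

        return sb.ToString();
    }
}
```

Note that this only filters: dropped characters are not replaced by a delimiter, so neighbouring words run together unless a separator is appended in their place.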

墟烟 2024-10-08 02:04:25

djTeller,
The fact that #2 is faster is only relative to your implementation of #1.
You might want to share your #1 code with us; maybe it's just very slow, and it may even be possible to make it faster than #2.
Yes, both are essentially O(n), but is the ACTUAL implementation O(n)? How did you actually do #1?

Also, when you say you tested both, I hope you did so with large amounts of input, to overcome the margin of error and see a significant difference between the two.
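
One rough way to time this is with System.Diagnostics.Stopwatch, a warm-up run, and an average over several iterations. The helper below is only a sketch: its names, the sample sentence, and the iteration counts are all made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical names; a sketch of a simple timing harness.
public static class TimingSketch
{
    // Builds a large synthetic input by repeating a short sample sentence.
    public static string BuildInput(int repetitions)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < repetitions; i++)
        {
            sb.Append("Hello, world! 42 #tokens?? ");
        }
        return sb.ToString();
    }

    // Runs the action once as a warm-up (so JIT compilation is not timed),
    // then returns the average time per run in milliseconds.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        action(); // warm-up run, not measured

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

For example, with `string text = TimingSketch.BuildInput(1000000);`, a call such as `TimingSketch.AverageMilliseconds(() => text.Split(' '), 20)` gives a per-run average for one candidate. Running each candidate on the same input, in a release build, keeps the comparison fair.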
