任意长度字符串的基数排序
我需要对任意长度的大量文本字符串进行排序。我认为基数排序是这里最好的选择。列表确实很大,因此将字符串填充到相同的长度是完全不可能的。
是否有任何现成的实现可以完成此任务,最好是用 C# 实现?
I need to sort a huge list of text strings of arbitrary length. I suppose radix sort is the best option here. List is really huge, so padding strings to the same length is completely impossible.
Is there any ready-made implementation for this task, preferably in C#?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
根据您的需要,您可能会发现将所有字符串插入某种形式的 Trie 中是最佳解决方案。即使是基本的三元搜索 Trie 也比字符串数组/列表具有更小的内存占用,并且会按排序顺序存储字符串。
插入、查找和删除都是
O(k * log(a))
,其中a
是字母表的大小(字符可能值的数量)。由于a
是常量,因此log(a)
也是常量,因此您最终会得到O(n * k)
排序算法。编辑:如果您不熟悉 Tries,它们基本上是 n 叉树,其中每条边代表键的单个字符。插入时,您检查根节点是否包含与键的第一个字符匹配的边(或子节点,等等)。如果是这样,您将遵循该路径并重复第二个字符,依此类推。如果没有,则添加一条新边。在三元搜索 Trie 中,边/子元素存储在二叉树中,因此字符按排序顺序排列,并且可以在
log(a)
时间内搜索。如果你想浪费内存,你可以将边/子元素存储在一个大小为 a 的数组中,这样你就可以在每一步进行不断的查找。Depending on what you need, you might find inserting all the strings into some form of Trie to be the best solution. Even a basic Ternary Search Trie will have a smaller memory footprint than an array/list of strings and will store the strings in sorted order.
Insertion, lookup and removal are all
O(k * log(a))
wherea
is the size of your alphabet (the number of possible values for a character). Sincea
is constant so islog(a)
so you end up with aO(n * k)
algorithm for sorting.Edit: In case you are unfamiliar with Tries, they are basically n-ary trees where each edge represents a single character of the key. When inserting, you check if the root node contains an edge (or child, whatever) that matches the first character of your key. If so, you follow that path and repeat with the second character and so on. If not, you add a new edge. In a Ternary Search Trie, the edges/children are stored in a binary tree so the characters are in sorted order and can be searched in
log(a)
time. If you want to waste memory you can store the edges/children in an array of sizea
which gives you constant lookup at each step.请参阅此线程。 基数排序 或这个 基数排序实现
See this thread. radix sort or this one radix sort implementation
有多少是多少,一百万?
内置
List.Sort()
平均需要 O(n * log(n)) 。log2(10^6) ~=20,对于 10^6 元素来说,这并不比 O(n) 慢很多。如果字符串长度超过 20 个字符,基数排序 O(n * k) 将会“变慢”。
我怀疑基数排序会比内置排序快得多。但测量和比较会很有趣。
How many are many, one million?
The built in
List<string>.Sort()
takes O(n * log(n)) on average.log2(10^6) ~=20, that is not very much slower than O(n) for 10^6 elements. If your strings are more than 20 characters long radix sort O(n * k) will be "slower".
I doubt a radix sort will be significantly faster than the built in sort. But it would be fun to measure and compare.
编辑:我之前的这些言论有一定道理,但总体来说是错误的。
基数排序是对大量字符串使用的错误排序。对于像这样的事情,
您将有一堆条目落入同一个桶中。您可以通过散列来避免这种情况,但是对散列码进行排序有什么用呢?
使用快速排序或合并排序等。 (快速排序通常性能更好并且占用更少的内存,但是许多示例的最坏情况性能为 O(N^2),这在实践中几乎不会发生;合并排序的性能不太好,但通常实现得很稳定,并且它是很容易做到部分在内存中部分在磁盘上。)即使用内置的排序功能。
编辑:嗯,事实证明,至少在非常大的文件上,开头有很长的重复(例如源代码)并且许多行完全相同(实际上是 100 次重复),基数排序确实开始与快速排序竞争。我很惊讶!但是,无论如何,这是我用来实现基数排序的代码。它是用 Scala 编写的,而不是 C# 编写的,但我以相当迭代的方式编写了它,因此它的工作原理应该相当明显。唯一两个棘手的位是
(a(i)(ch) & 0xFF)
是从字节数组数组中提取 0-255 字节(字节是有符号的),即>counts.scanLeft(0)(_ + _)
形成计数的累积和,从零开始(然后indices.clone.take(257)
获取除最后一个之外的所有计数)一),并且 Scala 允许多个参数列表(因此我将始终提供的参数与具有递归中使用的默认值的参数分开)。 时间是这样的:对于 7M 行源代码(70k 行的 100 倍重复),基数排序与内置库排序并列,并随后获胜。
Edit: there is a point to these statements I made previously, but the point is wrong overall.
Radix sort is the wrong sort to use on large numbers of strings. For things like
you will have a bunch of entries falling in the same bucket. You could avoid this by hashing, but what use is sorting a hash code?
Use quicksort or mergesort or the like. (Quicksort generally performs better and takes less memory, but many examples have worst-case performance of O(N^2) which almost never occurs in practice; Mergesort doesn't perform quite as well but is usually implemented to be stable, and it's easy to do part in memory and part on disk.) That is, use the built-in sort function.
Edit: Well, it turns out that at least on very large files with long repeats at the beginning (e.g. source code) and with many lines exactly the same (100x repeats, in fact), radix sort does start becoming competitive with quicksort. I'm surprised! But, anyway, here is the code I used to implement radix sort. It's in Scala, not C#, but I've written it in fairly iterative style so it should be reasonably obvious how things work. The only two tricky bits are that
(a(i)(ch) & 0xFF)
is to extract a 0-255 byte from an array of arrays of bytes (bytes are signed), thatcounts.scanLeft(0)(_ + _)
forms a cumulative sum of the counts, starting from zero (and thenindices.clone.take(257)
takes all but the last one), and that Scala allows multiple parameter lists (so I split up the always-provided argument from the arguments that have defaults that are used in recursion). Here it is:And the timings are that with 7M lines of source code (100x duplication of 70k lines), the radix sort ties the built-in library sort, and wins thereafter.
String.Compare() 重载正在使用此类字符串比较。看看你需要什么,将其提供给你的
排序算法。
更新
这是实现:
很难用您自己的实现来击败这个本机非托管实现。
String.Compare() overloads are using such string comparison. See what you need is to feed this to your
sort algorithm.
UPDATE
This is the implementation:
Hard to beat this native unmanaged implementation with your own implementation.