当前位置：文江博客话题详情

任意长度字符串的基数排序

发布于 2024-09-18 05:36:44 字数 107 浏览 12 评论 0原文

我需要对任意长度的大量文本字符串进行排序。我认为基数排序是这里最好的选择。列表确实很大，因此将字符串填充到相同的长度是完全不可能的。
是否有任何现成的实现可以完成此任务，最好是用 C# 实现？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

时光清浅 2024-09-25 05:36:44

根据您的需要，您可能会发现将所有字符串插入某种形式的 Trie 中是最佳解决方案。即使是基本的三元搜索 Trie 也比字符串数组/列表具有更小的内存占用，并且会按排序顺序存储字符串。

插入、查找和删除都是 O(k * log(a))，其中 a 是字母表的大小（字符可能值的数量）。由于 a 是常量，因此 log(a) 也是常量，因此您最终会得到 O(n * k) 排序算法。

编辑：如果您不熟悉 Tries，它们基本上是 n 叉树，其中每条边代表键的单个字符。插入时，您检查根节点是否包含与键的第一个字符匹配的边（或子节点，等等）。如果是这样，您将遵循该路径并重复第二个字符，依此类推。如果没有，则添加一条新边。在三元搜索 Trie 中，边/子元素存储在二叉树中，因此字符按排序顺序排列，并且可以在 log(a) 时间内搜索。如果你想浪费内存，你可以将边/子元素存储在一个大小为 a 的数组中，这样你就可以在每一步进行不断的查找。

回复收藏 0 原文

滥情空心 2024-09-25 05:36:44

请参阅此线程。基数排序或这个基数排序实现

回复收藏 0 原文

美羊羊 2024-09-25 05:36:44

有多少是多少，一百万？

内置 List.Sort()平均需要 O(n * log(n)) 。

log2(10^6) ~=20，对于 10^6 元素来说，这并不比 O(n) 慢很多。如果字符串长度超过 20 个字符，基数排序 O(n * k) 将会“变慢”。

我怀疑基数排序会比内置排序快得多。但测量和比较会很有趣。

回复收藏 0 原文

红焚 2024-09-25 05:36:44

编辑：我之前的这些言论有一定道理，但总体来说是错误的。

基数排序是对大量字符串使用的错误排序。对于像这样的事情，

I really like squirrels. Yay, yay, yay!
I really like blue jays. Yay, yay, yay!
I really like registers. Yay, yay, yay!

您将有一堆条目落入同一个桶中。您可以通过散列来避免这种情况，但是对散列码进行排序有什么用呢？

使用快速排序或合并排序等。（快速排序通常性能更好并且占用更少的内存，但是许多示例的最坏情况性能为 O(N^2)，这在实践中几乎不会发生；合并排序的性能不太好，但通常实现得很稳定，并且它是很容易做到部分在内存中部分在磁盘上。）即使用内置的排序功能。

编辑：嗯，事实证明，至少在非常大的文件上，开头有很长的重复（例如源代码）并且许多行完全相同（实际上是 100 次重复），基数排序确实开始与快速排序竞争。我很惊讶！但是，无论如何，这是我用来实现基数排序的代码。它是用 Scala 编写的，而不是 C# 编写的，但我以相当迭代的方式编写了它，因此它的工作原理应该相当明显。唯一两个棘手的位是 (a(i)(ch) & 0xFF) 是从字节数组数组中提取 0-255 字节（字节是有符号的），即 >counts.scanLeft(0)(_ + _) 形成计数的累积和，从零开始（然后 indices.clone.take(257) 获取除最后一个之外的所有计数）一），并且 Scala 允许多个参数列表（因此我将始终提供的参数与具有递归中使用的默认值的参数分开）。时间是这样的：

def radixSort(a: Array[Array[Byte]])(i0: Int = 0, i1: Int = a.length, ch: Int = 0) {
  val counts = new Array[Int](257)
  var i = i0
  while (i < i1) {
    if (a(i).length <= ch) counts(0) += 1
    else { counts((a(i)(ch)&0xFF)+1) += 1 }
    i += 1
  }
  val indices = counts.scanLeft(0)(_ + _)
  val starts = indices.clone.take(257)
  i = i0
  while (i < i1) {
    val bucket = if (a(i).length <= ch) 0 else (a(i)(ch)&0xFF)+1
    if (starts(bucket)+i0 <= i && i < starts(bucket)+i0+counts(bucket)) {
      if (indices(bucket) <= i) indices(bucket) = i+1
      i += 1
    }
    else {
      val temp = a(indices(bucket)+i0)
      a(indices(bucket)+i0) = a(i)
      a(i) = temp
      indices(bucket) += 1
    }
  }
  i = 1
  while (i < counts.length) {
    if (counts(i)>1) {
      radixSort(a)(i0+starts(i),i0+starts(i)+counts(i),ch+1)
    }
    i += 1
  }
}

对于 7M 行源代码（70k 行的 100 倍重复），基数排序与内置库排序并列，并随后获胜。

Edit: there is a point to these statements I made previously, but the point is wrong overall.

Radix sort is the wrong sort to use on large numbers of strings. For things like

I really like squirrels. Yay, yay, yay!
I really like blue jays. Yay, yay, yay!
I really like registers. Yay, yay, yay!

you will have a bunch of entries falling in the same bucket. You could avoid this by hashing, but what use is sorting a hash code?

Use quicksort or mergesort or the like. (Quicksort generally performs better and takes less memory, but many examples have worst-case performance of O(N^2) which almost never occurs in practice; Mergesort doesn't perform quite as well but is usually implemented to be stable, and it's easy to do part in memory and part on disk.) That is, use the built-in sort function.

Edit: Well, it turns out that at least on very large files with long repeats at the beginning (e.g. source code) and with many lines exactly the same (100x repeats, in fact), radix sort does start becoming competitive with quicksort. I'm surprised! But, anyway, here is the code I used to implement radix sort. It's in Scala, not C#, but I've written it in fairly iterative style so it should be reasonably obvious how things work. The only two tricky bits are that (a(i)(ch) & 0xFF) is to extract a 0-255 byte from an array of arrays of bytes (bytes are signed), that counts.scanLeft(0)(_ + _) forms a cumulative sum of the counts, starting from zero (and then indices.clone.take(257) takes all but the last one), and that Scala allows multiple parameter lists (so I split up the always-provided argument from the arguments that have defaults that are used in recursion). Here it is:

def radixSort(a: Array[Array[Byte]])(i0: Int = 0, i1: Int = a.length, ch: Int = 0) {
  val counts = new Array[Int](257)
  var i = i0
  while (i < i1) {
    if (a(i).length <= ch) counts(0) += 1
    else { counts((a(i)(ch)&0xFF)+1) += 1 }
    i += 1
  }
  val indices = counts.scanLeft(0)(_ + _)
  val starts = indices.clone.take(257)
  i = i0
  while (i < i1) {
    val bucket = if (a(i).length <= ch) 0 else (a(i)(ch)&0xFF)+1
    if (starts(bucket)+i0 <= i && i < starts(bucket)+i0+counts(bucket)) {
      if (indices(bucket) <= i) indices(bucket) = i+1
      i += 1
    }
    else {
      val temp = a(indices(bucket)+i0)
      a(indices(bucket)+i0) = a(i)
      a(i) = temp
      indices(bucket) += 1
    }
  }
  i = 1
  while (i < counts.length) {
    if (counts(i)>1) {
      radixSort(a)(i0+starts(i),i0+starts(i)+counts(i),ch+1)
    }
    i += 1
  }
}

And the timings are that with 7M lines of source code (100x duplication of 70k lines), the radix sort ties the built-in library sort, and wins thereafter.

回复收藏 0 原文

终遇你 2024-09-25 05:36:44

String.Compare() 重载正在使用此类字符串比较。看看你需要什么，将其提供给你的
排序算法。

更新

这是实现：

[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern int nativeCompareString(int lcid, string string1, int offset1, int length1, string string2, int offset2, int length2, int flags);

很难用您自己的实现来击败这个本机非托管实现。

String.Compare() overloads are using such string comparison. See what you need is to feed this to your
sort algorithm.

UPDATE

This is the implementation:

[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern int nativeCompareString(int lcid, string string1, int offset1, int length1, string string2, int offset2, int length2, int flags);

Hard to beat this native unmanaged implementation with your own implementation.

回复收藏 0 原文

~没有更多了~