高效的字符串排序算法

发布于 2024-11-28 11:08:18 字数 138 浏览 4 评论 0原文

通过比较对字符串进行排序（例如标准 QuickSort + strcmp 类函数）可能会有点慢，特别是对于共享公共前缀的长字符串（比较函数需要 O(s) 时间，其中 s 是字符串的长度），因此标准解的复杂度为 O(s * nlog n)。有没有已知的更快的算法？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

盗琴音 2024-12-05 11:08:18

如果您知道字符串仅包含某些字符（几乎总是如此），则可以使用 BucketSort 或 RadixSort。

回复收藏 0 原文

梦旅人picnic 2024-12-05 11:08:18

您可以构建一个trie，它应该是O(s*n)，我相信。

回复收藏 0 原文

心奴独伤 2024-12-05 11:08:18

请搜索“Sedgewick多键快速排序”（Sedgewick用C和Java写了著名的算法教科书）。他的算法相对容易实现并且速度相当快。就避免了你上面说的问题。有一种声称更快的突发排序算法，但我不知道有什么实现。

有一篇文章 C# 和 F# 中的快速字符串排序描述了算法并引用了 Sedgewick 的代码以及 C# 代码。（披露：这是我根据 Sedgewick 的论文编写的文章和代码）。

回复收藏 0 原文

又爬满兰若 2024-12-05 11:08:18

摘要

我发现了 string_sorting
由 Tommi Rantala 提供的 repo 非常全面，它包括许多已知的高效（字符串）排序算法，例如 MSD 基数排序、burstsort 和多键快速排序。此外，它们中的大多数也具有缓存效率。

我的经验

在我看来，三路基数/字符串快速排序是最快的字符串排序算法之一。另外，MSD 基数排序也是一个不错的选择。 Sedgewick 的优秀算法书中介绍了它们。

以下是对 leipzig1M.txt 进行排序的一些结果，取自此处：

$ wc leipzig1M.txt
#  lines       words       characters
  1'000'000  21'191'455   129'644'797    leipzig1M.txt

方法	时间
霍尔	7.8792s
Quick3Way	7.5074s
Fast3Way	5.78015s
RadixSort	4.86149s
Quick3String	4.3685s
堆排序	32.8318s
MergeSort	16.94s
std::sort/introsort	6.10666s
MSD+Q3S	3.74214s

三向基数/字符串快速排序的迷人之处在于它实现起来非常简单，实际上只需要大约十行源代码。

template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
    const int len = last - first;
    for (int i = 1; i < len; ++i) {
        // insert a[i] into the sorted sequence a[0..i-1]
        for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
            iter_swap(first + j, first + j - 1);
    }
}

template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
    if (last - first < 2) return;
#if 0 // seems not to help much
    if (last - first <= 8) { // change the threshold as you like
        insertion_sort(first, last, d);
        return;
    }
#endif
    typedef typename std::iterator_traits<RandomIt>::value_type String;
    typedef typename string_traits<String>::value_type CharT;
    typedef std::make_unsigned_t<CharT> UCharT;

    RandomIt lt = first, i = first + 1, gt = last - 1;
    /* make lo = median of {lo, mid, hi} */
    RandomIt mid = lt + ((gt - lt) >> 1);
    if ((*mid)[d] < (*lt)[d]) iter_swap(lt, mid);
    if ((*mid)[d] < (*gt)[d]) iter_swap(gt, mid);
    // now mid is the largest of the three, then make lo the median
    if ((*lt)[d] < (*gt)[d]) iter_swap(lt, gt);

    UCharT pivot = (*first)[d];
    while (i <= gt) {
        int diff = (UCharT) (*i)[d] - pivot;
        if      (diff < 0) iter_swap(lt++, i++);
        else if (diff > 0) iter_swap(i, gt--);
        else               ++i;
    }
    // Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
    quick3string(first, lt, d);      // sort a[lo..lt-1]
    if (pivot != '\0')
        quick3string(lt, gt+1, d+1); // sort a[lt..gt] on following character
    quick3string(gt+1, last, d);     // sort a[gt+1..hi]
}

/*
 * Three-way string quicksort.
 * Similar to MSD radix sort, we first sort the array on the leading character
 * (using quicksort), then apply this method recursively on the subarrays. On
 * first sorting, a pivot v is chosen, then partition it in 3 parts, strings
 * whose first character are less than v, equal to v, and greater than v. Just
 * like the partitioning in classic quicksort but with comparing only the 1st
 * character instead of the whole string. After partitioning, only the middle
 * (equal-to-v) part can sort on the following character (index of d+1). The
 * other two recursively sort on the same depth (index of d) because these two
 * haven't been sorted on the dth character (just partitioned them: <v or >v).
 *
 * Time complexity: O(N~N*lgN), space complexity: O(lgN).
 * Explaination: N * string length (for partitioning, find equal-to-v part) +
 *               O(N*lgN) (to do the quicksort thing)
 * character comparisons (instead of string comparisons in normal quicksort).
 */
template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
    quick3string(first, last, 0);
}

注意：但是如果你像我一样在 Google 上搜索“最快的字符串排序算法”，很可能是 burstsort，一种缓存感知的 MSD 基数排序变体（论文）。我还找到了 Bentley 和 Sedgewick 的这篇论文很有帮助，它使用了多键快速排序。

Summary

I found the string_sorting
repo by Tommi Rantala comprehensive, it includes many known efficient (string) sorting algorithms, e.g. MSD radix sort, burstsort and multi-key-quicksort. In addition, most of them are also cache efficient.

My Experience

It appears to me three-way radix/string quicksort is one of the fastest string sorting algorithms. Also, MSD radix sort is a good one. They are introduced in Sedgewick's excellent Algorithms book.

Here are some results to sort leipzig1M.txt taken from here:

$ wc leipzig1M.txt
#  lines       words       characters
  1'000'000  21'191'455   129'644'797    leipzig1M.txt

Method	Time
Hoare	7.8792s
Quick3Way	7.5074s
Fast3Way	5.78015s
RadixSort	4.86149s
Quick3String	4.3685s
Heapsort	32.8318s
MergeSort	16.94s
std::sort/introsort	6.10666s
MSD+Q3S	3.74214s

The charming thing about three-way radix/string quicksort is it is really simple to implement, effectively only about ten source lines of code.

template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
    const int len = last - first;
    for (int i = 1; i < len; ++i) {
        // insert a[i] into the sorted sequence a[0..i-1]
        for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
            iter_swap(first + j, first + j - 1);
    }
}

template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
    if (last - first < 2) return;
#if 0 // seems not to help much
    if (last - first <= 8) { // change the threshold as you like
        insertion_sort(first, last, d);
        return;
    }
#endif
    typedef typename std::iterator_traits<RandomIt>::value_type String;
    typedef typename string_traits<String>::value_type CharT;
    typedef std::make_unsigned_t<CharT> UCharT;

    RandomIt lt = first, i = first + 1, gt = last - 1;
    /* make lo = median of {lo, mid, hi} */
    RandomIt mid = lt + ((gt - lt) >> 1);
    if ((*mid)[d] < (*lt)[d]) iter_swap(lt, mid);
    if ((*mid)[d] < (*gt)[d]) iter_swap(gt, mid);
    // now mid is the largest of the three, then make lo the median
    if ((*lt)[d] < (*gt)[d]) iter_swap(lt, gt);

    UCharT pivot = (*first)[d];
    while (i <= gt) {
        int diff = (UCharT) (*i)[d] - pivot;
        if      (diff < 0) iter_swap(lt++, i++);
        else if (diff > 0) iter_swap(i, gt--);
        else               ++i;
    }
    // Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
    quick3string(first, lt, d);      // sort a[lo..lt-1]
    if (pivot != '\0')
        quick3string(lt, gt+1, d+1); // sort a[lt..gt] on following character
    quick3string(gt+1, last, d);     // sort a[gt+1..hi]
}

/*
 * Three-way string quicksort.
 * Similar to MSD radix sort, we first sort the array on the leading character
 * (using quicksort), then apply this method recursively on the subarrays. On
 * first sorting, a pivot v is chosen, then partition it in 3 parts, strings
 * whose first character are less than v, equal to v, and greater than v. Just
 * like the partitioning in classic quicksort but with comparing only the 1st
 * character instead of the whole string. After partitioning, only the middle
 * (equal-to-v) part can sort on the following character (index of d+1). The
 * other two recursively sort on the same depth (index of d) because these two
 * haven't been sorted on the dth character (just partitioned them: <v or >v).
 *
 * Time complexity: O(N~N*lgN), space complexity: O(lgN).
 * Explaination: N * string length (for partitioning, find equal-to-v part) +
 *               O(N*lgN) (to do the quicksort thing)
 * character comparisons (instead of string comparisons in normal quicksort).
 */
template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
    quick3string(first, last, 0);
}

NOTE: But if you like me searching Google for "fastest string sorting algorithm", chances are it's burstsort, a cache-aware MSD radix sort variant (paper). I also found this paper by Bentley and Sedgewick helpful, which used a Multikey Quicksort.

回复收藏 0 原文

~没有更多了~