高效的字符串排序算法

发布于 2024-11-28 11:08:18 字数 138 浏览 4 评论 0原文

通过比较对字符串进行排序(例如标准 QuickSort + strcmp 类函数)可能会有点慢,特别是对于共享公共前缀的长字符串(比较函数需要 O(s) 时间,其中 s 是字符串的长度),因此标准解的复杂度为 O(s * nlog n)。有没有已知的更快的算法?

Sorting strings by comparisons (e.g. standard QuickSort + strcmp-like function) may be a bit slow, especially for long strings sharing a common prefix (the comparison function takes O(s) time, where s is the length of string), thus a standard solution has the complexity of O(s * nlog n). Are there any known faster algorithms?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

盗琴音 2024-12-05 11:08:18

如果您知道字符串仅包含某些字符(几乎总是如此),则可以使用 BucketSortRadixSort

If you know that the string consist only of certain characters (which is almost always the case), you can use a variant of BucketSort or RadixSort.

梦旅人picnic 2024-12-05 11:08:18

您可以构建一个trie,它应该是O(s*n),我相信。

You could build a trie, which should be O(s*n), I believe.

心奴独伤 2024-12-05 11:08:18

请搜索“Sedgewick多键快速排序”(Sedgewick用C和Java写了著名的算法教科书)。他的算法相对容易实现并且速度相当快。就避免了你上面说的问题。有一种声称更快的突发排序算法,但我不知道有什么实现。

有一篇文章 C# 和 F# 中的快速字符串排序 描述了算法并引用了 Sedgewick 的代码以及 C# 代码。 (披露:这是我根据 Sedgewick 的论文编写的文章和代码)。

Please search for "Sedgewick Multikey quick sort" (Sedgewick wrote famous algorithms textbooks in C and Java). His algorithm is relatively easy to implement and quite fast. It avoids the problem you are talking above. There is the burst sort algorithm which claims to be faster, but I don't know of any implementation.

There is an article Fast String Sort in C# and F# that describes the algorithm and has a reference to Sedgewick's code as well as to C# code. (disclosure: it's an article and code that I wrote based on Sedgewick's paper).

又爬满兰若 2024-12-05 11:08:18

摘要

我发现了 string_sorting
Tommi Rantala 提供的 repo 非常全面,它包括许多已知的高效(字符串)排序算法,例如 MSD 基数排序、burstsort 和多键快速排序。此外,它们中的大多数也具有缓存效率。

我的经验

在我看来,三路基数/字符串快速排序是最快的字符串排序算法之一。另外,MSD 基数排序也是一个不错的选择。 Sedgewick 的优秀算法书中介绍了它们。

以下是对 leipzig1M.txt 进行排序的一些结果,取自 此处

$ wc leipzig1M.txt
#  lines       words       characters
  1'000'000  21'191'455   129'644'797    leipzig1M.txt
方法时间
霍尔7.8792s
Quick3Way7.5074s
Fast3Way5.78015s
RadixSort4.86149s
Quick3String4.3685s
堆排序32.8318s
MergeSort16.94s
std::sort/introsort6.10666s
MSD+Q3S3.74214s

三向基数/字符串快速排序的迷人之处在于它实现起来非常简单,实际上只需要大约十行源代码。

template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
    const int len = last - first;
    for (int i = 1; i < len; ++i) {
        // insert a[i] into the sorted sequence a[0..i-1]
        for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
            iter_swap(first + j, first + j - 1);
    }
}

template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
    if (last - first < 2) return;
#if 0 // seems not to help much
    if (last - first <= 8) { // change the threshold as you like
        insertion_sort(first, last, d);
        return;
    }
#endif
    typedef typename std::iterator_traits<RandomIt>::value_type String;
    typedef typename string_traits<String>::value_type CharT;
    typedef std::make_unsigned_t<CharT> UCharT;

    RandomIt lt = first, i = first + 1, gt = last - 1;
    /* make lo = median of {lo, mid, hi} */
    RandomIt mid = lt + ((gt - lt) >> 1);
    if ((*mid)[d] < (*lt)[d]) iter_swap(lt, mid);
    if ((*mid)[d] < (*gt)[d]) iter_swap(gt, mid);
    // now mid is the largest of the three, then make lo the median
    if ((*lt)[d] < (*gt)[d]) iter_swap(lt, gt);

    UCharT pivot = (*first)[d];
    while (i <= gt) {
        int diff = (UCharT) (*i)[d] - pivot;
        if      (diff < 0) iter_swap(lt++, i++);
        else if (diff > 0) iter_swap(i, gt--);
        else               ++i;
    }
    // Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
    quick3string(first, lt, d);      // sort a[lo..lt-1]
    if (pivot != '\0')
        quick3string(lt, gt+1, d+1); // sort a[lt..gt] on following character
    quick3string(gt+1, last, d);     // sort a[gt+1..hi]
}

/*
 * Three-way string quicksort.
 * Similar to MSD radix sort, we first sort the array on the leading character
 * (using quicksort), then apply this method recursively on the subarrays. On
 * first sorting, a pivot v is chosen, then partition it in 3 parts, strings
 * whose first character are less than v, equal to v, and greater than v. Just
 * like the partitioning in classic quicksort but with comparing only the 1st
 * character instead of the whole string. After partitioning, only the middle
 * (equal-to-v) part can sort on the following character (index of d+1). The
 * other two recursively sort on the same depth (index of d) because these two
 * haven't been sorted on the dth character (just partitioned them: <v or >v).
 *
 * Time complexity: O(N~N*lgN), space complexity: O(lgN).
 * Explaination: N * string length (for partitioning, find equal-to-v part) +
 *               O(N*lgN) (to do the quicksort thing)
 * character comparisons (instead of string comparisons in normal quicksort).
 */
template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
    quick3string(first, last, 0);
}

注意:但是如果你像我一样在 Google 上搜索“最快的字符串排序算法”,很可能是 burstsort,一种缓存感知的 MSD 基数排序变体(论文)。我还找到了 Bentley 和 Sedgewick 的这篇论文很有帮助,它使用了多键快速排序。

Summary

I found the string_sorting
repo by Tommi Rantala comprehensive, it includes many known efficient (string) sorting algorithms, e.g. MSD radix sort, burstsort and multi-key-quicksort. In addition, most of them are also cache efficient.

My Experience

It appears to me three-way radix/string quicksort is one of the fastest string sorting algorithms. Also, MSD radix sort is a good one. They are introduced in Sedgewick's excellent Algorithms book.

Here are some results to sort leipzig1M.txt taken from here:

$ wc leipzig1M.txt
#  lines       words       characters
  1'000'000  21'191'455   129'644'797    leipzig1M.txt
MethodTime
Hoare7.8792s
Quick3Way7.5074s
Fast3Way5.78015s
RadixSort4.86149s
Quick3String4.3685s
Heapsort32.8318s
MergeSort16.94s
std::sort/introsort6.10666s
MSD+Q3S3.74214s

The charming thing about three-way radix/string quicksort is it is really simple to implement, effectively only about ten source lines of code.

template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
    const int len = last - first;
    for (int i = 1; i < len; ++i) {
        // insert a[i] into the sorted sequence a[0..i-1]
        for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
            iter_swap(first + j, first + j - 1);
    }
}

template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
    if (last - first < 2) return;
#if 0 // seems not to help much
    if (last - first <= 8) { // change the threshold as you like
        insertion_sort(first, last, d);
        return;
    }
#endif
    typedef typename std::iterator_traits<RandomIt>::value_type String;
    typedef typename string_traits<String>::value_type CharT;
    typedef std::make_unsigned_t<CharT> UCharT;

    RandomIt lt = first, i = first + 1, gt = last - 1;
    /* make lo = median of {lo, mid, hi} */
    RandomIt mid = lt + ((gt - lt) >> 1);
    if ((*mid)[d] < (*lt)[d]) iter_swap(lt, mid);
    if ((*mid)[d] < (*gt)[d]) iter_swap(gt, mid);
    // now mid is the largest of the three, then make lo the median
    if ((*lt)[d] < (*gt)[d]) iter_swap(lt, gt);

    UCharT pivot = (*first)[d];
    while (i <= gt) {
        int diff = (UCharT) (*i)[d] - pivot;
        if      (diff < 0) iter_swap(lt++, i++);
        else if (diff > 0) iter_swap(i, gt--);
        else               ++i;
    }
    // Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
    quick3string(first, lt, d);      // sort a[lo..lt-1]
    if (pivot != '\0')
        quick3string(lt, gt+1, d+1); // sort a[lt..gt] on following character
    quick3string(gt+1, last, d);     // sort a[gt+1..hi]
}

/*
 * Three-way string quicksort.
 * Similar to MSD radix sort, we first sort the array on the leading character
 * (using quicksort), then apply this method recursively on the subarrays. On
 * first sorting, a pivot v is chosen, then partition it in 3 parts, strings
 * whose first character are less than v, equal to v, and greater than v. Just
 * like the partitioning in classic quicksort but with comparing only the 1st
 * character instead of the whole string. After partitioning, only the middle
 * (equal-to-v) part can sort on the following character (index of d+1). The
 * other two recursively sort on the same depth (index of d) because these two
 * haven't been sorted on the dth character (just partitioned them: <v or >v).
 *
 * Time complexity: O(N~N*lgN), space complexity: O(lgN).
 * Explaination: N * string length (for partitioning, find equal-to-v part) +
 *               O(N*lgN) (to do the quicksort thing)
 * character comparisons (instead of string comparisons in normal quicksort).
 */
template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
    quick3string(first, last, 0);
}

NOTE: But if you like me searching Google for "fastest string sorting algorithm", chances are it's burstsort, a cache-aware MSD radix sort variant (paper). I also found this paper by Bentley and Sedgewick helpful, which used a Multikey Quicksort.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文