What is the rule for the "(n log n) barrier" for sorting algorithms?
I wrote a simple program that sorts in O(n). It is highly memory inefficient, but that's not the point. It uses the principle behind a HashMap for sorting:
public class NLogNBreak {
    public static class LinkedListBack {
        public Node first = null;

        public LinkedListBack(int val) {
            first = new Node();
            first.val = val;
        }

        public void insert(int i) {
            Node n = new Node();
            n.val = i;
            n.next = first;
            first = n;
        }
    }

    private static class Node {
        public Node next = null;
        public int val;
    }

    // Precondition: max > in[i] > 0
    public static LinkedListBack[] sorted(int[] in, int max) {
        LinkedListBack[] ar = new LinkedListBack[max + 1];
        for (int i = 0; i < in.length; i++) {
            int val = in[i];
            if (ar[val] == null) {
                ar[val] = new LinkedListBack(val);
            } else {
                ar[val].insert(val);
            }
        }
        return ar;
    }
}
So does this count as a sort of O(n), even though it returns the result in a funky format?
To directly answer your question:
So what is this Ω(n log n) barrier? Where does it come from? And how do you break it?
The Ω(n log n) Barrier
The Ω(n log n) barrier is the information-theoretic lower bound on the average-case speed of any comparison-based sorting algorithm. If the only operation you are permitted to apply to array elements to distinguish them is some sort of comparison, then your sorting algorithm can't do better than Ω(n log n) in the average case.
To understand why this is, let's think about the state of the algorithm at any point during its execution. As the algorithm is running, it can gain some amount of information about the way that the input elements were ordered. Let's say that if the algorithm has some set of information X about the original ordering of the input elements, then the algorithm is in state X.
The crux of the Ω(n log n) argument (and several related arguments, as I'll discuss later) is that the algorithm has to have the ability to get into a large number of different states based on what the input is. Let's assume for now that the input to the sorting algorithm is an array of n distinct values. Because the algorithm can't tell anything about those elements other than the way that they're ordered, it doesn't really matter what the values being sorted are. All that matters is the relative ordering of those n elements.
Now for the key step - let's suppose that there are f(n) unique ways of ordering the n input elements and suppose that our sorting algorithm can't get into at least f(n) different states. If this is the case, then there have to be two different orderings of the elements in the array that the algorithm always groups together into the same state. If this happens, then the sorting algorithm can't possibly sort both of those input arrays correctly. The reasoning behind this is that because the algorithm treats the two arrays identically, whatever steps it uses to reorder the elements of the first array will be the same as the steps it uses to reorder the elements of the second array. Since the two arrays aren't the same, there has to be at least one element that will be out of place in one of the two cases. Consequently, we know that the sorting algorithm has to be able to get into at least f(n) different states.
But how can the algorithm get into these different states? Well, let's think about this. Initially, the algorithm has no information at all about the ordering of the elements. When it makes its first comparison (say, between elements A[i] and A[j]), the algorithm can get into one of two states - one where A[i] < A[j] and one where A[i] > A[j]. More generally, every comparison that the algorithm makes can, in the best case, put the algorithm into one of two new states based on the result of the comparison. We can therefore think of a large binary tree structure describing the states that the algorithm can be in - each state has up to two children describing what state the algorithm gets into based on the result of the comparison that's made. If we take any path from the root of the tree down to a leaf, we get the series of comparisons that end up getting made by the algorithm on a particular input. In order to sort as quickly as possible, we want to make as few comparisons as possible, and so we want this tree structure to have the smallest height possible.
Now, we know two things. First, we can think of all of the states the algorithm can get into as a binary tree. Second, that binary tree has to have at least f(n) different nodes in it. Given this, the smallest possible binary tree we can build has to have height at least Ω(log f(n)). This means that if there are f(n) different possible ways of ordering the array elements, we have to make at least Ω(log f(n)) comparisons on average, since otherwise we can't get into enough differing states.
To conclude the proof that you can't beat Ω(n log n), note that if the array has n distinct elements in it, then there are n! different possible ways of ordering the elements. Using Stirling's approximation, we have that log n! = Ω(n log n), and so we have to make at least Ω(n log n) comparisons in the average case to sort the input sequence.
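The bound log n! = Ω(n log n) is easy to check numerically. As a small sketch (the class name StirlingDemo is mine, not from the answer), here is log2(n!) computed as a sum of logs and compared against n·log2(n); the ratio tends toward 1 as n grows:

```java
public class StirlingDemo {
    // Computes log2(n!) = sum_{i=2..n} log2(i) without overflowing,
    // to compare against the n*log2(n) upper estimate.
    public static double log2Factorial(int n) {
        double s = 0;
        for (int i = 2; i <= n; i++) s += Math.log(i) / Math.log(2);
        return s;
    }

    public static void main(String[] args) {
        for (int n : new int[]{10, 100, 1000}) {
            double lhs = log2Factorial(n);
            double nlogn = n * Math.log(n) / Math.log(2);
            // ratio climbs toward 1, illustrating log n! = Theta(n log n)
            System.out.printf("n=%d  log2(n!)=%.1f  n*log2(n)=%.1f  ratio=%.3f%n",
                              n, lhs, nlogn, lhs / nlogn);
        }
    }
}
```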
Exceptions to the Rule
As we just saw, if you have n array elements that are all distinct, you cannot sort them with a comparison sort any faster than Ω(n log n). However, this starting assumption isn't necessarily valid. Many arrays that we'd like to sort may have duplicated elements in them. For example, suppose that I want to sort arrays that are composed solely of zeros and ones, such as this one:

    0 1 0 1 1 1 0
In this case, it is not true that there are n! different arrays of zeros and ones of length n. In fact, there are only 2^n of them. From our result above, this means the lower bound is only Ω(log 2^n) = Ω(n), so we might hope to sort in O(n) time using a purely comparison-based sorting algorithm. In fact, we absolutely can do this; here's a sketch of how we'd do it:
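One way the sketch could go in code, comparing every element only against the first one and splitting into three groups (the class name ZeroOneSort is illustrative, not from the answer):

```java
import java.util.ArrayList;
import java.util.List;

public class ZeroOneSort {
    // Sorts an array of zeros and ones using only comparisons
    // against the first element - O(n) comparisons in total.
    public static int[] sort(int[] a) {
        if (a.length == 0) return a;
        List<Integer> less = new ArrayList<>();
        List<Integer> equal = new ArrayList<>();
        List<Integer> greater = new ArrayList<>();
        for (int x : a) {
            if (x < a[0]) less.add(x);
            else if (x > a[0]) greater.add(x);
            else equal.add(x);
        }
        // Concatenate less, equal, greater - for a 0/1 array this
        // is all the zeros followed by all the ones.
        int[] out = new int[a.length];
        int i = 0;
        for (int x : less) out[i++] = x;
        for (int x : equal) out[i++] = x;
        for (int x : greater) out[i++] = x;
        return out;
    }
}
```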
To see that this works: if 0 is our first element, then the less array will be empty, the equal array will hold all the zeros, and the greater array will hold all the ones; concatenating them puts all the zeros before all the ones. Otherwise, if 1 is our first element, then the less array will hold the zeros, the equal array will hold the ones, and the greater array will be empty. Their concatenation is thus all zeros followed by all ones, as required. In practice, you wouldn't use this algorithm (you'd use a counting sort, as described below), but it shows that you can indeed beat Ω(n log n) with a comparison-based algorithm if the number of possible inputs to the algorithm is small.
Some comparison-based sorting algorithms are known to work very quickly on inputs that have multiple duplicated values. For example, it is known that Quicksort with a special partitioning step can take advantage of duplicated elements in the input array.
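That special partitioning step is usually the "Dutch national flag" three-way partition; a minimal sketch (the class ThreeWayQuicksort is my illustration, not a specific library's implementation) groups all pivot-equal elements in one pass and never recurses into them:

```java
public class ThreeWayQuicksort {
    // Quicksort with three-way (Dutch-national-flag) partitioning:
    // elements equal to the pivot are grouped together in a single
    // pass, so arrays with many duplicates need far fewer recursions.
    public static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    private static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[lo];
        int lt = lo, gt = hi, i = lo;
        while (i <= gt) {
            if (a[i] < pivot)      swap(a, lt++, i++);
            else if (a[i] > pivot) swap(a, i, gt--);
            else                   i++;
        }
        sort(a, lo, lt - 1);  // strictly smaller elements only
        sort(a, gt + 1, hi);  // strictly larger elements only
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```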
Non-Comparison Sorts
All of this discussion has assumed that we're talking about comparison-based sorting, where the only permitted operation on array elements is a comparison. However, if we know more about what elements we're going to be sorting and can perform operations on those elements beyond simple comparisons, then none of the above bounds hold any more. We're breaking the starting assumptions that led us to construct a binary tree of all the states of the algorithm, and so there's no reason to suspect that those bounds will still hold.
For example, if you know that the input values are drawn from a universe that only has |U| elements in it, then you can sort in O(n + |U|) time using a clever algorithm. Start off by creating |U| different buckets into which we can place the elements from the original array. Then, iterate across the array and distribute all of the array elements into the corresponding buckets. Finally, visit each of the buckets, starting with the bucket holding copies of the smallest element and ending with the bucket containing copies of the largest element, concatenating together all of the values you find. For example, let's see how to sort arrays consisting of the values 1 - 5. If we have this starting array:

    3 1 4 2 5 3

then we can put those elements into buckets like this:

    1: 1
    2: 2
    3: 3 3
    4: 4
    5: 5

Iterating across the buckets and concatenating their values together yields this:

    1 2 3 3 4 5

which, sure enough, is a sorted version of our original array! The runtime here is O(n) time to distribute the original array elements into the buckets, plus O(n + |U|) time to iterate across all the buckets putting the elements back together. Notice that if |U| = O(n), this runs in O(n) time, breaking the Ω(n log n) sorting barrier.
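A minimal sketch of this bucket scheme for int values in 0..max (since equal ints are indistinguishable, each "bucket" can be just a count - this is the counting sort mentioned earlier; the class name BucketSort is mine):

```java
public class BucketSort {
    // Sorts values known to lie in 0..max, i.e. |U| = max + 1.
    // O(n) to distribute into buckets, O(n + |U|) to read them back.
    public static int[] sort(int[] in, int max) {
        int[] count = new int[max + 1];
        for (int v : in) count[v]++;              // distribute: O(n)
        int[] out = new int[in.length];
        int i = 0;
        for (int v = 0; v <= max; v++)            // concatenate: O(n + |U|)
            for (int c = 0; c < count[v]; c++) out[i++] = v;
        return out;
    }
}
```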
If you are sorting integers, you can do much better than this by using radix sort, which runs in O(n lg |U|). If you're dealing with primitive ints, lg |U| is usually 32 or 64, so this is extremely fast. If you're willing to implement a particularly tricky data structure, you can use a van Emde Boas tree to sort integers from 0 to U - 1 in time O(n lg lg U), again by exploiting the fact that integers consist of groups of bits that can be manipulated in blocks. Similarly, if you know that your elements are strings, you can sort very quickly by building a trie out of the strings, then iterating across the trie to rebuild all the strings. Alternatively, you could consider the strings as numbers written in a large base (say, base 128 for ASCII text) and then use one of the integer sorting algorithms from above.
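A sketch of a least-significant-digit radix sort on non-negative ints, processing one byte per pass so the lg |U| = 32 bits are handled in four stable counting-sort passes (the class name RadixSort is mine; handling negative numbers would need an extra step for the sign bit):

```java
public class RadixSort {
    // LSD radix sort on non-negative ints, base 256: four passes of a
    // stable counting sort, one per byte, lowest byte first.
    public static void sort(int[] a) {
        int[] aux = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            // Count how many elements have each byte value.
            for (int x : a) count[((x >> shift) & 0xFF) + 1]++;
            // Prefix-sum: count[b] = index where byte value b starts.
            for (int r = 0; r < 256; r++) count[r + 1] += count[r];
            // Stable distribution by the current byte.
            for (int x : a) aux[count[(x >> shift) & 0xFF]++] = x;
            System.arraycopy(aux, 0, a, 0, a.length);
        }
    }
}
```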
In each of these cases, the reason that you can beat the information-theoretic barrier is that you're breaking the barrier's starting assumption, namely that you can only apply comparisons. If you can treat the input elements as numbers, or as strings, or as anything else that reveals more structure, all bets are off and you can sort extremely efficiently.
That is called radix sort, and yes, it breaks the n log(n) barrier, which is only a barrier on the comparison model. On the Wikipedia page linked for the comparison model you can see a list of sorts that use it, and a few that do not. Radix sort works by putting each element in a bucket based on its value and then concatenating all the buckets together again at the end. It only works with types like integers that have a finite number of possible values.
Normally a radix sort is done one byte or nibble at a time to reduce the number of buckets. See the Wikipedia article on it, or search for more info.
Yours can also be improved to sort negative numbers and to allocate memory only for the buckets it actually uses.