当前位置：文江博客话题详情

Java 7 是否使用 Tim Sort 作为方法 Arrays.Sort？

发布于 2024-09-28 16:15:52 字数 95 浏览 9 评论 0原文

我找不到Java 7的文档，我只能找到关于Java 6的信息，它仍然是快速或合并的。有谁知道如何找到Java 7中Arrays.sort方法的文档吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

静水深流 2024-10-05 16:15:52

Java 7 到 Java 22 使用双轴快速排序 (Dual-Pivot Quicksort) 来处理基元，使用 TimSort 来处理对象。

根据原语的 Java 7 API 文档：

实现说明：排序
算法是双枢轴快速排序
弗拉基米尔·雅罗斯拉夫斯基、乔恩·本特利、
和约书亚·布洛赫。这个算法
在许多方面提供 O(n log(n)) 性能
导致其他快速排序的数据集
降低到二次性能，
并且通常比
传统（单枢）快速排序
实现。

根据对象的 Java 7 API 文档：

该实施改编自
Tim Peters 的 Python 列表排序（
TimSort）。它使用了 Peter 的技术
麦克罗伊的“乐观排序和
信息论复杂性”，载于
第四届年会论文集
ACM-SIAM 离散研讨会
算法，第 467-474 页，1993 年 1 月。

Timsort 是一种混合“合并排序和插入排序” ”。

同上在 Java 22 中。

不确定这是否与 Java 6 中的情况有很大不同，对于 Arrays.sort JDK6：

调整后的快速排序，改编自 Jon L.
本特利和道格拉斯·麦克罗伊
“设计一个排序函数”，
软件实践与经验，卷。
23(11) P. 1249-1265（1993 年 11 月）

对于 Object[] 或集合 (Collections.sort())，使用合并排序。

回复收藏 0 原文

意犹 2024-10-05 16:15:52

是的！ ... 也没有。

总结

在撰写本文时（2016 年），在当前的 Open JDK ⁰ 实现中，Tim Sort 通常用于对对象数组进行排序（即 数组。 sort(Object[]) 和朋友） - 但对于原始数组（Arrays.sort方法的其余部分）还有各种其他方法被使用。

对于基元，启发式方法会在各种排序方法中进行选择，例如快速排序、合并排序、计数排序³。取决于正在排序的数据。大多数这些决策都是根据要排序的数组的类型和大小预先做出的，但对于 int 和 long 元素，该决策实际上是根据测量数组的有序性。因此，在许多情况下，在适应/内省（TimSort 或类似的合并排序）之上，您还需要适应/内省（选择算法的启发式方法）！

详细信息

Tim Sort 用于大多数类型的对象，例如 Arrays.sort(Object[] a) ，除非用户通过设置系统属性 java.util 专门请求旧行为。 Arrays.useLegacyMergeSort 为 true。

对于基元来说，情况更为复杂。至少从 JDK 8（版本 1.8.0_111）开始，根据要排序的数组的大小、原始类型和测量的数组“排序度”，使用各种启发式方法。概述如下：

对于除 bytes¹ 之外的所有基元类型，少于 47 个元素的数组只需使用插入排序进行排序（请参阅 DualPivotQuicksort.INSERTION_SORT_THRESHOLD）。在对使用合并或快速排序时出现的子数组进行排序并且子数组的大小低于阈值时，也会使用此阈值。因此，某种形式的插入排序将用于所有排序，并且对于小型数组，它是唯一使用的算法。
对于原始类型 byte、short 和 char，计数排序用于较大的数组。这是一种简单的排序，需要 O(n + range) 时间，其中 range 是字节 (256) 或短/字符 (65536) 值的总数。排序涉及分配范围值的底层数组，因此仅当要排序的元素数量占总范围的很大一部分时才使用它。特别是，它用于大于 29 个元素的字节数组（即范围的 11%）和大于 3200 个元素的短/字符数组（范围的 5%）。
对于字节数组，始终使用上述两种方法之一。
对于高于插入排序阈值的 int 和 long 数组以及高于插入排序阈值的 short/char 数组并且低于计数排序阈值，可以使用两种算法之一：双主元快速排序或合并排序。使用哪一个取决于数组排序的度量：输入被分为升序或降序元素的运行。如果此类运行的次数大于 66，则该数组被认为大部分未排序，并使用双枢轴快速排序进行排序。否则，数组被认为大部分已排序，并使用归并排序（使用已枚举的运行作为起点）。

查找运行然后使用归并排序对它们进行排序的想法实际上与 TimSort 非常相似，尽管存在一些差异。因此，至少对于某些参数，JDK 使用运行感知归并排序，但对于许多其他参数组合，它使用不同的算法，并且总共至少使用 5 种不同的算法！

基本原理

Object[] 与基元的不同排序行为背后的原因可能至少有两个方面：

1) Object[] 的排序必须是 stable：同等排序的对象将以与输入相同的顺序出现。对于原始数组，不存在这样的概念：原始数组完全由它们的值定义，因此稳定排序和不稳定排序之间没有区别。这使得原始排序不需要稳定的算法来提高速度。

2) 各种Object[]需要涉及Object.compare()方法，该方法可能非常复杂且昂贵。即使 compare() 方法很简单，通常也会存在方法调用开销，除非可以内联整个排序方法²。因此，Object[] 通常会偏向于最小化总比较，即使以一些额外的算法复杂性为代价。

另一方面，原始数组的排序只是直接比较原始值，通常需要一个或两个周期的数量级。在这种情况下，应该考虑比较成本和周围算法来优化算法，因为它们可能具有相同的量级。

⁰ 至少对于 Java 7 和 Java 9 之间的版本来说是这样，当然这也包括 Oracle 的 JDK，因为它基于 Open JDK。其他实现很可能使用类似的方法，但我没有检查过。

¹ 对于字节数组，插入排序阈值实际上是 29 个元素，因为这是使用计数排序的下限。

² 这似乎不太可能，因为它相当大。

³ 计数排序仅适用于 16 位或更少范围相对有限的值：byte、short 或 char代码>.

Yes! ... and also no.

Summary

At the time of writing (2016), in current Open JDK ⁰ implementations Tim Sort is generally used for sorting arrays of objects (i.e., Arrays.sort(Object[]) and friends) - but for primitive arrays (the remainder of the Arrays.sort methods) a variety of other methods are used.

For primitives, the heuristics choose among a variety of sorting methods such as quicksort, merge sort, counting sort³. depending on the data being sorted. Most of these decisions are simply made up-front based on the type and size of the array being sorted, but for int and long elements the decision is actually adaptive based on the measured sortedness of the array. So you have adaptation/introspection (the heuristics to pick an algorithm) on top of adaptation/introspection (TimSort or similar merge sort) in many cases!

Details

Tim Sort is used for most sorts of objects, such as Arrays.sort(Object[] a), unless the user has specifically requested the legacy behavior by setting system property java.util.Arrays.useLegacyMergeSort to true.

For primitives, the situation is more complex. At least as of JDK 8 (version 1.8.0_111) a variety of heurstics are used depending on the size of the arrays being sorted, the primitive type and the measured "sortedness" of the array. Here's an overview:

For all primitive types other than bytes¹, arrays of less than 47 elements are simply sorted using insertion sort (see DualPivotQuicksort.INSERTION_SORT_THRESHOLD). This threshold is also used when sorting sub-arrays that arise when merge or quicksort are used and the size of the subarray falls below the threshold. So some form of insertion sort will be used in all sorts, and for small arrays it is the only algorithm used.
For primitive types byte, short and char, a counting sort is used for largish arrays. This is a simple sort that takes O(n + range) time, where range is the total number of byte (256) or short/char (65536) values. The sort involves allocating an underlying array of range values, so it is only used when the number of elements to sort is a significant fraction of the total range. In particular, it is used for byte arrays greater than 29 elements (i.e. ~11% of the range), and short/char arrays greater than 3200 elements (~5% of the range).
For byte arrays, one of the two approaches above is always used.
For int and long arrays above the insertion sort threshold and for short/char arrays both above the insertion sort threshold and below the counting sort threshold, one of two algorithms may be used: dual pivot quicksort, or merge sort. Which one is used depends on a measure of the sortedness of the array: the input is divided into runs of ascending or descending elements. If the number of such runs is greater than 66, then the array is considered mostly unsorted and is sorted with dual-pivot quicksort. Otherwise, the array is considered mostly sorted, and mergesort is used (using the already enumerated runs as a starting point).

The idea of finding runs and then using mergesort to sort them is in fact very similar to TimSort, although there are some differences. So at least for some parameters, the JDK is using a run-aware mergesort, but for many other combinations of parameters it is using a different algorithm, and at least 5 distinct algorithms are used in total!

Rationale

The reasoning behind the different sort behavior of Object[] versus primitive is probably at least two-fold:

1) Sorts of Object[] are required to be stable: objects which sort equally will appear in the same order as the input. For primitive arrays, no such concept exists: primitives are fully defined by their value, so there is no distinction between a stable and an unstable sort. This allows primitive sorts to dispense with the need for stable algorithms in favor of speed.

2) Sorts of Object[] need to involve the Object.compare() method, which may be arbitrarily complex and expensive. Even if the compare() method is simple, there will generally be method call overhead unless the entire sort method can be inlined². So sorts of Object[] will generally be biased towards minimizing total comparisons, even at the cost of some additional algorithmic complexity.

Sorts of primitive arrays, on the other hand, just directly compare primitive values which typically takes on the order of a cycle or two. In this case, the algorithm should be optimized considering both the cost of comparisons and the surrounding algorithm, since the are likely to be of the same magnitude.

⁰ At least for versions between Java 7 and Java 9, and of course this also includes Oracle's JDK as it is based on Open JDK. It is likely that other implementations use a similar approach, but I haven't checked.

¹ For byte arrays, the insertion sort threshold is effectively 29 elements since that's the lower cutoff above which counting sort is used.

² This seems unlikely, since it is quite large.

³ Counting sort is only used for values with relatively limited range of 16-bits or less: byte, short or char.

回复收藏 0 原文