Is partitioning easier than sorting?

Posted 2024-09-10 05:07:34

This is a question that's been lingering in my mind for some time ...

Suppose I have a list of items and an equivalence relation on them, and comparing two items takes constant time.
I want to return a partition of the items, e.g. a list of linked lists, each containing all equivalent items.

One way of doing this is to extend the equivalence to an ordering on the items and order them (with a sorting algorithm); then all equivalent items will be adjacent.

But can it be done more efficiently than with sorting? Is the time complexity of this problem lower than that of sorting? If not, why not?
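
For concreteness, here is a minimal sketch of the sort-then-group approach described above, assuming a hypothetical `key` function that extends the equivalence to a total order (equivalent items get equal keys):

```python
from itertools import groupby

def partition_by_sorting(items, key):
    # Sort by a key that refines the equivalence, then group adjacent
    # equal keys; O(n log n) comparisons overall.
    ordered = sorted(items, key=key)
    return [list(group) for _, group in groupby(ordered, key=key)]

# Example: "equivalent" means "same value modulo 3"
print(partition_by_sorting([7, 1, 4, 9, 3, 5], key=lambda x: x % 3))
# [[9, 3], [7, 1, 4], [5]]
```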

Answers (8)

仅此而已 2024-09-17 05:07:34

You seem to be asking two different questions at one go here.

1) If only equality checks are allowed, is partitioning easier than it would be with an ordering? The answer is no. You need Ω(n^2) comparisons to determine the partition in the worst case (for instance, when all the items are distinct).

2) If an ordering is allowed, is partitioning easier than sorting? Again, the answer is no, because of the Element Distinctness Problem: even determining whether all objects are distinct requires Ω(n log n) comparisons. Since sorting can be done in O(n log n) time (and also has an Ω(n log n) lower bound) and solves the partition problem, the two are asymptotically equally hard.

If you pick an arbitrary hash function, equal objects need not have the same hash, in which case you haven't done any useful work by putting them in a hash table.

Even if you do come up with such a hash (equivalent objects guaranteed to have the same hash), the expected time is O(n) for good hashes, and the worst case is Ω(n^2).

Whether to use hashing or sorting completely depends on other constraints not available in the question.

The other answers also seem to be forgetting that your question is (mainly) about comparing partitioning and sorting!

他夏了夏天 2024-09-17 05:07:34

If you can define a hash function for the items as well as an equivalence relation, then you should be able to do the partition in linear time -- assuming computing the hash is constant time. The hash function must map equivalent items to the same hash value.
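
A minimal sketch of that idea, assuming a hypothetical constant-time `canon` function that maps equivalent items to the same key (playing the role of the hash function):

```python
from collections import defaultdict

def partition_by_hash(items, canon):
    # One dictionary insertion per item: expected O(n) overall.
    buckets = defaultdict(list)
    for item in items:
        buckets[canon(item)].append(item)
    return list(buckets.values())

# Example: strings considered equivalent up to case
print(partition_by_hash(["Ada", "ada", "Bob", "BOB", "eve"], canon=str.lower))
# [['Ada', 'ada'], ['Bob', 'BOB'], ['eve']]
```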

Without a hash function, you would have to compare every new item to be inserted into the partitioned lists against the head of each existing list. The efficiency of that strategy depends on how many partitions there will eventually be.

Let's say you have 100 items, and they will eventually be partitioned into 3 lists. Then each item would have to be compared against at most 3 other items before inserting it into one of the lists.

However, if those 100 items would eventually be partitioned into 90 lists (i.e., very few equivalent items), it's a different story. Now your runtime is closer to quadratic than linear.
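
A sketch of that comparison-only strategy, assuming a hypothetical `equivalent` predicate; with m final groups it performs O(n*m) comparisons, matching the behaviour described above:

```python
def partition_by_comparison(items, equivalent):
    # Compare each item against one representative (the head) of every
    # existing group; O(n*m) comparisons for m groups, up to O(n^2).
    groups = []
    for item in items:
        for group in groups:
            if equivalent(item, group[0]):
                group.append(item)
                break
        else:
            groups.append([item])
    return groups

# Example: numbers equivalent when they have the same parity
print(partition_by_comparison([3, -1, 4, -5, 0], lambda a, b: a % 2 == b % 2))
# [[3, -1, -5], [4, 0]]
```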

甜味拾荒者 2024-09-17 05:07:34

If you don't care about the final ordering of the equivalence sets, then partitioning into equivalence sets could be quicker. However, it depends on the algorithm and the numbers of elements in each set.

If there are very few items in each set, then you might as well just sort the elements and then find the adjacent equal elements. A good sorting algorithm is O(n log n) for n elements.

If there are a few sets, each with many elements, then you can take each element and compare it to the existing sets. If it belongs in one of them, add it; otherwise create a new set. This will be O(n*m), where n is the number of elements and m is the number of equivalence sets, which is less than O(n log n) for large n and small m, but worse as m tends to n.

A combined sorting/partitioning algorithm may be quicker.

秋心╮凉 2024-09-17 05:07:34

If a comparator must be used, then the lower bound is Ω(n log n) comparisons for sorting or partitioning. The reason is that all elements must be inspected, which is Ω(n), and a comparator must perform log n comparisons per element to uniquely identify or place that element relative to the others (each comparison divides the space in two, so a space of size n needs log n comparisons).

If each element can be associated with a unique key that can be derived in constant time, then the lower bound is Ω(n) for both sorting and partitioning (cf. radix sort).

写下不归期 2024-09-17 05:07:34

Comparison-based sorting generally has a lower bound of Ω(n log n).

Assume you iterate over your set of items and put each one into a bucket of items with the same comparative value, for example into a set of lists (say, using a hash set). This operation is clearly O(n), even after retrieving the list of lists from the set.

--- EDIT: ---

This of course requires two assumptions:

  • There exists a constant-time hash algorithm for each element to be partitioned.
  • The number of buckets does not depend on the amount of input.

Thus, under these assumptions, partitioning can be done in O(n).

泪痕残 2024-09-17 05:07:34

Partitioning is faster than sorting, in general, because you don't have to compare each element to every potentially-equivalent already-sorted element; you only have to compare it to the already-established keys of your partitioning. Take a close look at radix sort. The first step of radix sort is to partition the input based on some part of the key. Radix sort is O(kn). If your data set has keys bounded by a given length k, you can radix sort it in O(n). If your data are comparable and don't have a bounded key, but you choose a bounded key with which to partition the set, sorting the set would be O(n log n) while the partitioning would be O(n).
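
As an illustration of that per-digit partitioning, here is a minimal LSD radix sort sketch for non-negative integers with at most `key_len` digits; each pass is a partition into `base` buckets and costs O(n), for O(kn) total:

```python
def radix_sort(nums, key_len, base=10):
    for d in range(key_len):
        # Partition the current order into `base` buckets by digit d.
        buckets = [[] for _ in range(base)]
        for x in nums:
            buckets[(x // base ** d) % base].append(x)
        # Stable concatenation preserves the ordering of earlier passes.
        nums = [x for bucket in buckets for x in bucket]
    return nums

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66], key_len=3))
# [2, 24, 45, 66, 75, 90, 170, 802]
```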

一瞬间的火花 2024-09-17 05:07:34

This is a classic problem in data structures, and yes, it is easier than sorting. If you also want to be able to quickly look up which set each element belongs to, what you want is the disjoint-set data structure together with the union-find operation. See here: http://en.wikipedia.org/wiki/Disjoint-set_data_structure
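
For reference, a minimal sketch of the disjoint-set (union-find) structure the answer points to; note it applies when the equivalent pairs are given (or discovered) explicitly, and the final partition is read off by grouping elements under their representatives:

```python
from collections import defaultdict

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Path halving keeps trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra              # union by size
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

# Example: elements 0..4, with (0,1) and (3,4) declared equivalent
ds = DisjointSet(5)
for a, b in [(0, 1), (3, 4)]:
    ds.union(a, b)
groups = defaultdict(list)
for i in range(5):
    groups[ds.find(i)].append(i)
print(list(groups.values()))  # [[0, 1], [2], [3, 4]]
```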

太阳公公是暖光 2024-09-17 05:07:34

The time required to perform a possibly-imperfect partition using a hash function will be O(n+bucketcount) [not O(n*bucketcount)]. Making the bucket count large enough to avoid all collisions will be expensive, but if the hash function works at all well there should be a small number of distinct values in each bucket. If one can easily generate multiple statistically-independent hash functions, one could take each bucket whose keys don't all match the first one and use another hash function to partition the contents of that bucket.

Assuming a constant number of buckets at each step, the time is going to be O(n log n), but if one sets the number of buckets to something like sqrt(n), the average number of passes should be O(1) and the work in each pass O(n).
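
A hedged sketch of that multi-pass idea, assuming hypothetical hash functions in `hash_fns` that always give equivalent items the same value (so "imperfection" means only collisions between non-equivalent items), plus an `equivalent` predicate used to check buckets and as a final fallback:

```python
def refine_partition(items, equivalent, hash_fns):
    def pairwise(bucket):
        # Fallback: comparison-only grouping within a single bucket.
        groups = []
        for x in bucket:
            for g in groups:
                if equivalent(x, g[0]):
                    g.append(x)
                    break
            else:
                groups.append([x])
        return groups

    def split(bucket, depth):
        if all(equivalent(x, bucket[0]) for x in bucket):
            return [bucket]                      # bucket is already one class
        if depth == len(hash_fns):
            return pairwise(bucket)              # ran out of hash functions
        sub = {}
        for x in bucket:
            sub.setdefault(hash_fns[depth](x), []).append(x)
        return [g for b in sub.values() for g in split(b, depth + 1)]

    return split(list(items), 0)

# Example: items equivalent when equal mod 6, refined by two weak hashes
print(refine_partition(range(12),
                       equivalent=lambda a, b: a % 6 == b % 6,
                       hash_fns=[lambda x: x % 2, lambda x: x % 3]))
# [[0, 6], [2, 8], [4, 10], [1, 7], [3, 9], [5, 11]]
```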
