哪种排序算法可以最快地提供一页结果? (部分结果集)

发布于 2024-12-23 07:13:17 字数 598 浏览 3 评论 0原文

我有一个“大型”数据集,需要显示前 10 行数据或最后 10 行数据,并允许排序操作在用户查看第一页结果时在后台运行。

编辑:有关“大”含义的详细信息

我正在将多个主机的系统日志和事件日志数据收集到可搜索的存储库中。由于我将查看以不同时间间隔突发/垃圾邮件事件日志数据的 N 台计算机,因此如果我搜索的项目不是按默认顺序 Machine\Log\Event DateTime

根据我收到的答案,我可以在插入数据时填充二级索引,以便初始视图非常有效。

我认为,当用户 80% 的时间只关心前 10 个条目或最后 10 个条目时,先对所有数据进行排序,然后交付整个结果集是低效的。

提供部分结果集并在后台继续处理的最佳算法是什么?

根据此示例,堆、快速排序和 shell 似乎提供了最佳性能,并且可能提供一种开箱即用的结果页面。

我如何判断这些算法何时准备好为第一页提供服务?我会关注什么门槛?

I have a "large" dataset where I need to display the first or last 10 rows of data and allow the sort operation to operate in the background as the user views the first page of results.

Edit: Details on what "large" means

I am collecting Syslog and EventLog data for several hosts into a searchable repository. Since I will be looking at N computers that burst/spam event log data in varying intervals the item that I search upon can grow quickly if it's not in the default order of Machine\Log\Event DateTime

Based on the answer I receive I may populate a secondary index upon inserting the data so that the initial view is very efficient.

I think it would be inefficient to sort all the data first, and then deliver the entire result set when 80% of the time the user only cares about the first or last 10 entries.

What is the best algorithm to deliver a partial result set, and continue processing in the background?

Based on this sample, the heap, quick sort, and shell seem to offer the best performance and might offer one page of results out of the box.

How would I tell when any of those algorithms are ready to serve the first page? What threshold would I watch?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

伴我老 2024-12-30 07:13:17

您可以在 O(n) 时间内选择前 K 个项目。请参阅http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements。您可以使用快速选择算法选择前 10 个(这会将它们放在数组的前面),对后 10 个再次执行此操作(放在数组的末尾),然后在后台运行排序,对项目 10 到 n-10 进行排序。

实际上,当要选择的项目数小于总项目数的 1% 时,Heapselect 比 Quickselect 更快。即从n列表中选择k个项目,如果k k 则应使用Heapselect。 n/100。如果k是10并且n是一百万,Heapselect将比Quickselect快得多。

Heapselect 的缺点是需要 O(k) 的额外空间。但当k == 10时,这就不是什么大问题了。

这取决于您的数据的性质。如果要显示的总行数通常超过 1,000,那么您应该使用 Heapselect。否则,请使用快速选择。它们都很容易实现。

有关更多信息,请参阅当理论遇见实践两种选择算法之间的差异。

You can select the top K items in O(n) time. See http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements. You could use a Quickselect algorithm to select the top 10 (which would put them at the front of the array), do it again for the bottom 10 (put at the end of the array), and then have the sort run in the background, sorting items 10 through n-10.

In practice, Heapselect is faster than Quickselect when the number of items you want to select is less than 1% of the total number of items. That is to select k items from a list of n, you should use Heapselect if k < n/100. If k is 10 and n is a million, Heapselect is going to be hugely faster than Quickselect.

The disadvantage of Heapselect is that it requires O(k) extra space. But when k == 10, that's not a big deal.

It depends on the nature of your data. If the total number of rows to be displayed is usually more than 1,000, then you should use Heapselect. Otherwise, use Quickselect. They're both easy to implement.

See When Theory Meets Practice for more information on the difference between the two selection algorithms.

廻憶裏菂餘溫 2024-12-30 07:13:17

根据此示例,堆、快速排序和 shell 似乎提供了
最佳性能,并可能提供一页结果
框。

为此,您将需要一种按顺序对列表进行排序的算法。列表的每次迭代都会将下一个最大(或最小)元素放置在列表中的正确位置。因此,在第一遍之后,最小的元素位于位置 1,第二遍之后,第二小的元素位于位置 2。为此,您将需要类似选择排序的东西]2

与其他算法相比,在数据顺序方面存在的问题是,这些算法的性能可能远远优于其他算法。因此,即使您在排序后“快速”获取前 10 条记录,另一种算法也可能在相同的时间内对整个列表进行排序。

Based on this sample, the heap, quick sort, and shell seem to offer
the best performance and might offer one page of results out of the
box.

To do this you are going to need an algorithm that sorts the list in order. Each iteration of of the list places the next largest (or smallest) element in the list it's proper place. So, after the first pass, the smallest element is in position 1, and after the second pass the second smallest is in position 2. To do this you are going to need something like a Selection Sort]2.

Here's the problem compared to other algorithms in and data order, these can be vastly outperformed. So even though you are "quickly" grabbing the first 10 records after they are sorted, another algorithm might have sorted the entire list in the same amount of time.

于我来说 2024-12-30 07:13:17

使用堆或最终使用优先级队列,可以在 O(nlogk) 时间内找到前 K 个项目,这比 O(nlogn) 快得多。
策略是遍历列表一次,然后保留迄今为止找到的前 k 个元素的列表。为了有效地做到这一点,您必须始终知道此 top-k 中的最小元素,因此您可以用更大的元素替换它。堆结构使得维护这个列表变得很容易,而不需要浪费任何精力。

您还可以使用 http://en.wikipedia.org/wiki/Selection_algorithm 中提到的选择算法用于查找列表中前 k 个最小或最大的条目。这比对整个列表进行排序要快。

我建议您首先对 k 个条目进行排序并显示结果,然后在后台使用合并、堆或快速排序对剩余条目进行排序。

Finding the top K items can be done in O(nlogk) time, which is much, much faster than O(nlogn), using a heap Or eventually a priority queue.
The strategy is to go through the list once, and as you go, keep a list of the top k elements that you found so far. To do this efficiently, you have to always know the smallest element in this top-k, so you can possibly replace it with one that is larger. The heap structure makes it easy to maintain this list without wasting any effort.

You can also use selection algorithm mentioned at http://en.wikipedia.org/wiki/Selection_algorithm for finding the first k smallest or largest entries in your list. which is faster that sorting the entire list.

I would suggest that you sort k entries first and display the result and in back ground sort the remaining entries using merge, heap or quick sort.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文