What is the fastest way to calculate the frequency distribution of an array in C#?
I am just wondering what the best approach is for this calculation. Let's assume I have an input array of values and an array of boundaries, and I want to calculate/bucketize the frequency distribution for each segment in the boundaries array.
Is it a good idea to use bucket search for that?
Actually, I found the question Calculating frequency distribution of a collection with .Net/C#, but I do not understand how to use buckets for that purpose, because in my situation the size of each bucket can be different.
EDIT:
After all the discussion I have an inner/outer loop solution, but I would still like to eliminate the inner loop with a Dictionary to get O(n) performance. If I understood correctly, that means I need to hash input values into a bucket index, so I need some sort of hash function with O(1) complexity. Any ideas how to do that?
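As a point of reference for the question above: variable-width buckets do not admit a general constant-time hash, but the inner loop can be replaced with an O(log m) lookup per value using `Array.BinarySearch` on the sorted boundaries. The names and boundary semantics below are assumptions for illustration (a value falls into bucket `i` when it is less than or equal to `boundaries[i]`):

```csharp
using System;

public class BinarySearchBucketize
{
    // Array.BinarySearch returns the index of an exact match, or the bitwise
    // complement (~i) of the index of the first larger element when the value
    // is not found. Either way, that index is the bucket the value falls into.
    // Note: values larger than the last boundary yield boundaries.Length.
    public static int BucketIndex(double value, double[] boundaries)
    {
        int i = Array.BinarySearch(boundaries, value);
        return i >= 0 ? i : ~i;
    }

    public static void Main()
    {
        var boundaries = new[] { 2.0, 5.0, 10.0 }; // buckets: <=2, (2,5], (5,10]
        Console.WriteLine(BucketIndex(1.0, boundaries)); // 0
        Console.WriteLine(BucketIndex(2.0, boundaries)); // 0 (exact match)
        Console.WriteLine(BucketIndex(3.0, boundaries)); // 1
        Console.WriteLine(BucketIndex(7.5, boundaries)); // 2
    }
}
```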
2 Answers
Bucket sort is already O(n^2) in the worst case, so I would just do a simple inner/outer loop here. Since your bucket array is necessarily shorter than your input array, keep it in the inner loop. Since you're using custom bucket sizes, there are really no mathematical tricks that can eliminate that inner loop.
This is also O(n^2) in the worst case, but you can't beat the code's simplicity. I wouldn't worry about optimization until it becomes a real issue. If you have a larger boundary array, you could use a binary search of some sort. But since frequency distributions typically have fewer than 100 buckets, I doubt you'd see much real-world performance benefit.
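A minimal sketch of the inner/outer loop approach described above. The array names and boundary semantics are assumptions: here the boundaries are sorted ascending, and a value falls into bucket `i` when it is less than or equal to `boundaries[i]`:

```csharp
using System;

public class FrequencyDistribution
{
    // Count how many input values fall into each segment defined by the
    // sorted, ascending boundaries array. A value v lands in bucket i when
    // v <= boundaries[i] and v is greater than every earlier boundary.
    public static int[] Bucketize(double[] input, double[] boundaries)
    {
        var counts = new int[boundaries.Length];
        foreach (var value in input)                     // outer loop: values
        {
            for (int i = 0; i < boundaries.Length; i++)  // inner loop: buckets
            {
                if (value <= boundaries[i])
                {
                    counts[i]++;
                    break;  // each value is counted in exactly one bucket
                }
            }
        }
        return counts;
    }

    public static void Main()
    {
        var input = new[] { 1.0, 2.5, 3.0, 7.0, 7.5, 10.0 };
        var boundaries = new[] { 2.0, 5.0, 10.0 };  // buckets: <=2, (2,5], (5,10]
        Console.WriteLine(string.Join(", ", Bucketize(input, boundaries))); // 1, 2, 3
    }
}
```

Note that values above the last boundary are silently dropped in this sketch; whether to count them in a trailing overflow bucket is a design choice.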
If your input array represents real-world data (with its patterns) and the array of boundaries is too large to iterate over again and again in an inner loop, you can consider the following approach:
First of all, sort your input array. If you work with real-world data,
I would recommend considering Timsort - Wiki for this. It
provides very good performance guarantees for the patterns that can be seen in
real-world data.
Then traverse the sorted array, comparing each element against the current value in the array of boundaries:
In code it can look like this:
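A sketch of that single pass, under the assumptions that the boundaries are sorted ascending and a value belongs to the bucket of the first boundary greater than or equal to it. The built-in `Array.Sort` (introsort) stands in for the Timsort recommended above, since .NET does not ship one for arrays:

```csharp
using System;

public class SortedBucketize
{
    // Because the input is sorted, the boundary index only ever moves forward,
    // so the scan is O(n + m) after the O(n log n) sort
    // (n = input length, m = number of boundaries).
    public static int[] Bucketize(double[] input, double[] boundaries)
    {
        var sorted = (double[])input.Clone();
        Array.Sort(sorted);

        var counts = new int[boundaries.Length];
        int b = 0;  // current boundary/bucket index
        foreach (var value in sorted)
        {
            // Advance to the first boundary that can hold this value.
            // Values beyond the last boundary fall into the last bucket here.
            while (b < boundaries.Length - 1 && value > boundaries[b])
                b++;
            counts[b]++;
        }
        return counts;
    }

    public static void Main()
    {
        var counts = Bucketize(
            new[] { 7.0, 1.0, 10.0, 2.5, 7.5, 3.0 },
            new[] { 2.0, 5.0, 10.0 });  // buckets: <=2, (2,5], (5,10]
        Console.WriteLine(string.Join(", ", counts)); // 1, 2, 3
    }
}
```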