What is the fastest way to calculate the frequency distribution of an array in C#?
I am just wondering what the best approach is for this calculation. Let's assume I have an input array of values and an array of boundaries, and I want to calculate/bucketize the frequency distribution for each segment in the boundaries array.
Is it a good idea to use bucket search for that?
Actually, I found the question Calculating frequency distribution of a collection with .Net/C#, but I do not understand how to use buckets for that purpose, because in my situation the size of each bucket can be different.
EDIT:
After all the discussion I have an inner/outer loop solution, but I would still like to eliminate the inner loop with a Dictionary to get O(n) performance. If I understood correctly, that means I need to hash input values into a bucket index, so I need some sort of hash function with O(1) complexity. Any ideas how to do that?
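As a point of reference for the question above: variable-width buckets do not admit a general constant-time hash, but the inner loop can be replaced with an O(log m) lookup per value using `Array.BinarySearch` on the sorted boundaries. The names and boundary semantics below are assumptions for illustration (a value falls into bucket `i` when it is less than or equal to `boundaries[i]`):

```csharp
using System;

public class BinarySearchBucketize
{
    // Array.BinarySearch returns the index of an exact match, or the bitwise
    // complement (~i) of the index of the first larger element when the value
    // is not found. Either way, that index is the bucket the value falls into.
    // Note: values larger than the last boundary yield boundaries.Length.
    public static int BucketIndex(double value, double[] boundaries)
    {
        int i = Array.BinarySearch(boundaries, value);
        return i >= 0 ? i : ~i;
    }

    public static void Main()
    {
        var boundaries = new[] { 2.0, 5.0, 10.0 }; // buckets: <=2, (2,5], (5,10]
        Console.WriteLine(BucketIndex(1.0, boundaries)); // 0
        Console.WriteLine(BucketIndex(2.0, boundaries)); // 0 (exact match)
        Console.WriteLine(BucketIndex(3.0, boundaries)); // 1
        Console.WriteLine(BucketIndex(7.5, boundaries)); // 2
    }
}
```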
2 Answers
Bucket sort is already O(n^2) in the worst case, so I would just do a simple inner/outer loop here. Since your bucket array is necessarily shorter than your input array, keep it in the inner loop. Since you're using custom bucket sizes, there are really no mathematical tricks that can eliminate that inner loop.
This is also O(n^2) in the worst case, but you can't beat the code's simplicity. I wouldn't worry about optimization until it becomes a real issue. If you have a larger boundary array, you could use a binary search of some sort. But since frequency distributions typically have fewer than 100 buckets, I doubt you'd see much real-world performance benefit.
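A minimal sketch of the inner/outer loop approach described above. The array names and boundary semantics are assumptions: here the boundaries are sorted ascending, and a value falls into bucket `i` when it is less than or equal to `boundaries[i]`:

```csharp
using System;

public class FrequencyDistribution
{
    // Count how many input values fall into each segment defined by the
    // sorted, ascending boundaries array. A value v lands in bucket i when
    // v <= boundaries[i] and v is greater than every earlier boundary.
    public static int[] Bucketize(double[] input, double[] boundaries)
    {
        var counts = new int[boundaries.Length];
        foreach (var value in input)                     // outer loop: values
        {
            for (int i = 0; i < boundaries.Length; i++)  // inner loop: buckets
            {
                if (value <= boundaries[i])
                {
                    counts[i]++;
                    break;  // each value is counted in exactly one bucket
                }
            }
        }
        return counts;
    }

    public static void Main()
    {
        var input = new[] { 1.0, 2.5, 3.0, 7.0, 7.5, 10.0 };
        var boundaries = new[] { 2.0, 5.0, 10.0 };  // buckets: <=2, (2,5], (5,10]
        Console.WriteLine(string.Join(", ", Bucketize(input, boundaries))); // 1, 2, 3
    }
}
```

Note that values above the last boundary are silently dropped in this sketch; whether to count them in a trailing overflow bucket is a design choice.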
If your input array represents real-world data (with its patterns) and the array of boundaries is too large to iterate over again and again in an inner loop, you can consider the following approach:
First of all, sort your input array. If you work with real-world data,
I would recommend considering Timsort - Wiki for this. It
provides very good performance guarantees for the patterns that can be seen in
real-world data.
Then traverse the sorted array, comparing each element against the current value in the array of boundaries:
In code it can look like this:
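A sketch of that single pass, under the assumptions that the boundaries are sorted ascending and a value belongs to the bucket of the first boundary greater than or equal to it. The built-in `Array.Sort` (introsort) stands in for the Timsort recommended above, since .NET does not ship one for arrays:

```csharp
using System;

public class SortedBucketize
{
    // Because the input is sorted, the boundary index only ever moves forward,
    // so the scan is O(n + m) after the O(n log n) sort
    // (n = input length, m = number of boundaries).
    public static int[] Bucketize(double[] input, double[] boundaries)
    {
        var sorted = (double[])input.Clone();
        Array.Sort(sorted);

        var counts = new int[boundaries.Length];
        int b = 0;  // current boundary/bucket index
        foreach (var value in sorted)
        {
            // Advance to the first boundary that can hold this value.
            // Values beyond the last boundary fall into the last bucket here.
            while (b < boundaries.Length - 1 && value > boundaries[b])
                b++;
            counts[b]++;
        }
        return counts;
    }

    public static void Main()
    {
        var counts = Bucketize(
            new[] { 7.0, 1.0, 10.0, 2.5, 7.5, 3.0 },
            new[] { 2.0, 5.0, 10.0 });  // buckets: <=2, (2,5], (5,10]
        Console.WriteLine(string.Join(", ", counts)); // 1, 2, 3
    }
}
```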