Searching for a fast/efficient histogram algorithm (with pre-specified bins)

Posted 2024-10-08 20:57:57

I don't do much coding outside of Matlab, but I have a need to export my Matlab code to another language, most likely C. My Matlab code includes a histogram function, histc(), that places my input data (which is double-precision, not integer) into a specified array of bins, to form a histogram.

I'm sure I can piece together a couple of nested loops to generate a histogram function, but I need this function to be fast and memory-light, as it will be called repeatedly and often.

To avoid re-inventing the wheel, does anyone know if C has any existing histogram function(s) available for use, or whether people needing such a thing generally write it themselves?

Anyone know an efficient algorithm for creating a histogram? Pseudo-code is fine.

Thanks in advance.

Comments (3)

静水深流 2024-10-15 20:57:57

The "ideal" histogram algorithm will depend upon the range you expect to capture. Generally any histogram algorithm will look like this:

const int NSAMPLES = whatever;
double samples[NSAMPLES] = { 1.0, 3.93, 1e30, ... }; // your data set
const int NBUCKETS = 10; // or whatever
int counts[NBUCKETS] = { 0 };
for (int i = 0; i != NSAMPLES; ++i) {
    counts[TRANSFER(samples[i])]++;
}

where TRANSFER() is some function that maps your inputs to a bin (with the 0th or Nth bin serving as the "out of range" bin, if applicable).
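As a rough illustration of one possible TRANSFER(), here is a minimal sketch that assumes equal-width bins over a known range [lo, hi) and reserves one slot at each end for out-of-range samples. The names transfer_uniform and histogram_uniform, and the choice of slot 0 for underflow and the last slot for overflow, are assumptions made for this example, not something the answer specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: maps x to one of nbins equal-width bins over [lo, hi).
 * Slot 0 collects underflow (and NaN), slots 1..nbins are the regular bins,
 * and slot nbins + 1 collects overflow. */
static size_t transfer_uniform(double x, double lo, double hi, size_t nbins)
{
    assert(nbins > 0 && hi > lo);
    if (!(x >= lo)) return 0;            /* x < lo, or x is NaN */
    if (x >= hi) return nbins + 1;       /* overflow slot */
    size_t k = (size_t)((x - lo) / (hi - lo) * (double)nbins);
    if (k >= nbins) k = nbins - 1;       /* guard against rounding at the top edge */
    return k + 1;
}

/* counts must have room for nbins + 2 entries. */
static void histogram_uniform(const double *samples, size_t nsamples,
                              double lo, double hi,
                              size_t nbins, unsigned long *counts)
{
    for (size_t i = 0; i < nbins + 2; ++i) counts[i] = 0;
    for (size_t i = 0; i < nsamples; ++i)
        counts[transfer_uniform(samples[i], lo, hi, nbins)]++;
}
```

With lo = 0, hi = 10 and 10 bins, the sample 3.93 lands in slot 4 and 1e30 lands in the overflow slot 11. The work per sample is one subtraction, one multiply/divide and a couple of comparisons, so the whole histogram is a single pass over the data.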

The exact implementation of TRANSFER() depends a lot on the expected distribution of your samples and on which part of the range you need detail in. Some common approaches I have seen:

  • uniform distribution in range [a,b] (requires linear transform)
  • logarithmic distribution of unsigned integer values (best when combined with some bit twiddling hacks to quickly determine the nearest power-of-two or similar).
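For the logarithmic case, a minimal sketch might look like the following. It assumes 32-bit unsigned inputs and uses a plain shift loop to find the highest set bit; the names log2_bucket, histogram_log2 and LOG2_BUCKETS are made up for this example. Compilers such as GCC and Clang also provide intrinsics like __builtin_clz that can replace the loop, at the cost of portability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LOG2_BUCKETS = 33 };   /* bucket 0 for the value 0, then one per bit */

/* Hypothetical helper: bucket 0 holds the value 0; bucket k (k >= 1) holds
 * values in [2^(k-1), 2^k). */
static unsigned log2_bucket(uint32_t v)
{
    unsigned k = 0;
    while (v != 0) {          /* shift until the highest set bit falls off */
        v >>= 1;
        ++k;
    }
    return k;
}

static void histogram_log2(const uint32_t *samples, size_t n,
                           unsigned long counts[LOG2_BUCKETS])
{
    assert(samples != NULL || n == 0);
    for (size_t i = 0; i < LOG2_BUCKETS; ++i) counts[i] = 0;
    for (size_t i = 0; i < n; ++i)
        counts[log2_bucket(samples[i])]++;
}
```

Under this scheme 2 and 3 share bucket 2, while 1024 goes to bucket 11.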

If you don't know the distribution up-front, then you really can't have an efficient mechanism to bin them effectively: you'll either have to guess (biased or uninformative results) or store everything and sort it at the end, binning into equal-sized buckets (poor performance).
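The store-and-sort fallback could be sketched roughly as below, reading "equal-sized buckets" as buckets holding about the same number of samples. The function name equal_count_edges is hypothetical, and the sketch assumes the data fits in memory and contains no NaN values. The qsort call makes it O(n log n), which is the performance cost mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: sorts samples in place and writes nbins + 1 edges so
 * that each bucket between consecutive edges holds roughly n / nbins samples.
 * edges must have room for nbins + 1 entries. */
static void equal_count_edges(double *samples, size_t n,
                              size_t nbins, double *edges)
{
    assert(n > 0 && nbins > 0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    for (size_t k = 0; k < nbins; ++k)
        edges[k] = samples[k * n / nbins];   /* first sample of bucket k */
    edges[nbins] = samples[n - 1];           /* largest sample closes the last bucket */
}
```

For the eight samples 1 through 8 (in any order) and four buckets, the resulting edges are 1, 3, 5, 7, 8.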

黑凤梨 2024-10-15 20:57:57

GSL (GNU Scientific Library) contains a histogram implementation.

Here is the documentation: http://www.gnu.org/software/gsl/manual/html_node/Histograms.html.

And here is an example use: http://www.gnu.org/software/gsl/manual/html_node/Example-programs-for-histograms.html.
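If pulling in a library is not an option, binning against pre-specified edges (the histc() case from the question) can be sketched with a binary search over the edge array. This is a hand-rolled illustration, not GSL code: the names find_bin and histogram_edges are invented here, and the sketch assumes strictly increasing edges and half-open bins [edges[k], edges[k+1]), so it is not claimed to match histc() at the last edge.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: edges[0..nedges-1] must be strictly increasing.
 * Returns k such that edges[k] <= x < edges[k+1], or -1 when x is NaN or
 * lies outside [edges[0], edges[nedges-1]). */
static long find_bin(double x, const double *edges, size_t nedges)
{
    assert(nedges >= 2);
    if (!(x >= edges[0]) || x >= edges[nedges - 1]) return -1;
    size_t lo = 0, hi = nedges - 1;   /* invariant: edges[lo] <= x < edges[hi] */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (x >= edges[mid]) lo = mid; else hi = mid;
    }
    return (long)lo;
}

/* counts must have room for nedges - 1 entries; out-of-range samples are skipped. */
static void histogram_edges(const double *samples, size_t n,
                            const double *edges, size_t nedges,
                            unsigned long *counts)
{
    for (size_t k = 0; k + 1 < nedges; ++k) counts[k] = 0;
    for (size_t i = 0; i < n; ++i) {
        long k = find_bin(samples[i], edges, nedges);
        if (k >= 0) counts[k]++;
    }
}
```

Each lookup costs O(log m) comparisons for m edges, and the only memory used is the edge array plus the counts, so non-uniform bins stay cheap.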

听,心雨的声音 2024-10-15 20:57:57

I've written my own histogram code in C, as it's simple enough that I didn't even think to look for a library. Normally you just need to create an array to contain the number of bins that you want [num_bins = (int)(max_val - min_val + 1);], and as you encounter each sample you can divide its offset from min_val by the bin width [bin_idx = (int)((value - min_val) / bin_width);] (where bin_width = (max_val - min_val)/num_bins) to find where it belongs and then increment the bin counter. This is an easy, fast, single pass through the data. Do check my arithmetic above for edge cases: for example, value == max_val yields bin_idx == num_bins, which is one past the end of the array, so that index needs to be clamped.
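One way to handle the top-edge case is sketched below. It assumes the caller guarantees min_val <= value <= max_val, and simply folds value == max_val into the last bin; the function name bin_index is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the equal-width bin containing value,
 * for a closed range [min_val, max_val] split into num_bins bins. */
static size_t bin_index(double value, double min_val, double max_val,
                        size_t num_bins)
{
    assert(num_bins > 0 && max_val > min_val);
    assert(value >= min_val && value <= max_val);
    double bin_width = (max_val - min_val) / (double)num_bins;
    size_t bin_idx = (size_t)((value - min_val) / bin_width);
    if (bin_idx >= num_bins)          /* value == max_val, or rounding error */
        bin_idx = num_bins - 1;
    return bin_idx;
}
```

With min_val = 0, max_val = 10 and 5 bins, the value 10 computes to index 5 and is clamped to 4, the last valid bin.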

The problem you might encounter is that the domain of your input might not be known. Having 100 bins over the whole range of double isn't going to be much good if all your data is within only a small fraction of that. The solution is to make a first pass over the data to find the min/max of your range. There's really no quick fix to this and most libraries will ask for min/max up front.
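The extra pass for the range is short. A minimal sketch, assuming a non-empty array with no NaN values (find_range is a name made up for this example):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: first pass over the data, recording its range. */
static void find_range(const double *samples, size_t n,
                       double *min_val, double *max_val)
{
    assert(n > 0);
    double lo = samples[0], hi = samples[0];
    for (size_t i = 1; i < n; ++i) {
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
    }
    *min_val = lo;
    *max_val = hi;
}
```

This doubles the number of passes over the data but needs no extra memory. If the range is known ahead of time, or stays the same across repeated calls, it can be skipped entirely.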
