Histogram approximation for streaming data
This question is a slight extension of the one answered here. I am working on re-implementing a version of the histogram approximation found in Section 2.1 of this paper, and I would like to get all my ducks in a row before beginning this process again. Last time, I used boost::multi_index, but performance wasn't the greatest, and I would like to avoid the logarithmic-in-the-number-of-buckets insert/find complexity of a std::set. Because of the number of histograms I'm using (one per feature per class per leaf node of a random tree in a random forest), the computational complexity must be as close to constant as possible.
A standard technique used to implement a histogram involves mapping the input real value to a bin number. To accomplish this, one method is to:
- initialize a standard C array of size N, where N = number of bins; and
- multiply the input value (real number) by some factor and floor the result to get its index in the C array.
This works well for histograms with uniform bin size, and is quite efficient. However, Section 2.1 of the above-linked paper provides a histogram algorithm without uniform bin sizes.
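As a point of reference, a minimal sketch of that multiply-and-floor mapping for a fixed-width histogram (the class and names here are mine for illustration, not from the paper; dividing by the bin width is the same as multiplying by a factor of 1.0 / bin_width):

```
#include <cmath>
#include <cstddef>

// Hypothetical fixed-width histogram: N bins of width bin_width starting at
// lower_bound. Out-of-range values are clamped here, but they could just as
// well be rejected or trigger a resize.
template <std::size_t N>
struct UniformHistogram {
    double lower_bound;
    double bin_width;
    unsigned long count[N] = {};

    void insert(double x) {
        double idx = std::floor((x - lower_bound) / bin_width);
        if (idx < 0.0) idx = 0.0;                           // clamp below range
        if (idx > static_cast<double>(N - 1))
            idx = static_cast<double>(N - 1);               // clamp above range
        ++count[static_cast<std::size_t>(idx)];             // O(1) insert
    }
};
```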
Another issue is that simply multiplying the input real value by a factor and using the resulting product as an index fails with negative numbers. To resolve this, I considered identifying a '0' bin somewhere in the array. This bin would be centered at 0.0; the bins above/below it could be calculated using the same multiply-and-floor method just explained, with the slight modification that the floored product is added to or subtracted from the index of the '0' bin as necessary.
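One possible reading of that zero-bin idea, sketched under the assumption that the array holds 2*K + 1 bins with bin K centered at 0.0 (the function and parameter names are hypothetical):

```
#include <cmath>

// Map a signed input x to a bin index, where bin K is centered at 0.0.
// factor is 1.0 / bin_width; the + 0.5 keeps the '0' bin centered rather
// than left-aligned (an assumption on my part).
inline int signed_bin_index(double x, double factor, int K) {
    int offset = static_cast<int>(std::floor(std::fabs(x) * factor + 0.5));
    return (x < 0.0) ? K - offset : K + offset;
}
```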
This then raises the question of merges: the algorithm in the paper merges the two closest bins, as measured from center to center. In practice, this creates a 'jagged' histogram approximation, because some bins have extremely large counts while others do not. Of course, this is due to the non-uniform bin sizes, and it doesn't in itself cause any loss of precision. A loss of precision does, however, occur if we try to normalize the non-uniform-sized bins to make them uniform. This is because of the assumption that m/2 samples fall to the left and right of the bin center, where m = bin count. We could model each bin as a Gaussian, but this would still result in a loss of precision (albeit a minimal one).
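For what it's worth, my reading of the paper's merge step is that the two bins with the closest centers are replaced by a single bin at their count-weighted mean center; a sketch (the structure and function names are mine, not the paper's):

```
#include <cstddef>
#include <vector>

struct Bin { double center; unsigned long count; };

// Merge the two adjacent bins whose centers are closest, assuming the bins
// are kept sorted by center. The weighted mean preserves the first moment
// of the two merged bins.
inline void merge_closest(std::vector<Bin>& bins) {
    if (bins.size() < 2) return;
    std::size_t best = 0;
    double best_gap = bins[1].center - bins[0].center;
    for (std::size_t i = 1; i + 1 < bins.size(); ++i) {
        double gap = bins[i + 1].center - bins[i].center;
        if (gap < best_gap) { best_gap = gap; best = i; }
    }
    Bin& a = bins[best];
    Bin& b = bins[best + 1];
    unsigned long m = a.count + b.count;
    a.center = (a.center * a.count + b.center * b.count) / m;
    a.count = m;
    bins.erase(bins.begin() + best + 1);
}
```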
So that's where I'm stuck right now, leading to this major question: What's the best way to implement a histogram accepting streaming data and storing each sample in bins of uniform size?
1 Answer
Keep four variables.

When a new sample x arrives, compute double i = floor((x - lower_bound) / bin_size). If i >= 0 && i < N, then increment count[i]. If i >= N, then repeatedly double bin_size until x - lower_bound < N * bin_size. On every doubling, adjust the counts (optimize this by exploiting sparsity for multiple doublings).

The case i < 0 is trickier, since we need to decrease lower_bound as well as increase bin_size (again, optimize for sparsity or adjust the counts in one step).

The exceptional cases are expensive but happen only a logarithmic number of times in the range of your data over the initial bin size.
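A sketch of how I read this answer, with the four variables made explicit. The pair-wise collapsing of counts on each doubling is my own unoptimized interpretation (it assumes N is even), not the answerer's code:

```
#include <cmath>
#include <vector>

// The four variables: N (number of bins), lower_bound, bin_size, count[].
struct GrowingHistogram {
    int N;
    double lower_bound;
    double bin_size;
    std::vector<unsigned long> count;

    GrowingHistogram(int n, double lb, double bs)
        : N(n), lower_bound(lb), bin_size(bs),
          count(static_cast<std::size_t>(n), 0) {}

    void insert(double x) {
        // i >= N case: double bin_size until x fits under the top of the range.
        while (x - lower_bound >= N * bin_size) grow_up();
        // i < 0 case: lower_bound has to move down as well.
        while (x < lower_bound) grow_down();
        int i = static_cast<int>(std::floor((x - lower_bound) / bin_size));
        ++count[static_cast<std::size_t>(i)];
    }

private:
    // Double bin_size, keeping lower_bound fixed: old bins (2j, 2j+1) collapse
    // into new bin j; the upper half of the array becomes empty.
    void grow_up() {
        for (int j = 0; j < N / 2; ++j) count[j] = count[2 * j] + count[2 * j + 1];
        for (int j = N / 2; j < N; ++j) count[j] = 0;
        bin_size *= 2.0;
    }

    // Double bin_size and move lower_bound down by the old span: the old
    // contents land in the upper half of the new, coarser range.
    void grow_down() {
        for (int j = N - 1; j >= N / 2; --j)
            count[j] = count[2 * j - N] + count[2 * j - N + 1];
        for (int j = 0; j < N / 2; ++j) count[j] = 0;
        lower_bound -= N * bin_size;
        bin_size *= 2.0;
    }
};
```

Usage would be along the lines of GrowingHistogram h(64, 0.0, 1.0); followed by h.insert(x) for each streamed sample.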
If you implement this in floating-point, be mindful that floating-point numbers are not real numbers and that statements like lower_bound -= N * bin_size may misbehave (in this case, if N * bin_size is much smaller than lower_bound). I recommend that bin_size be a power of the radix (usually two) at all times.
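One way to follow that last recommendation is to store only an exponent and derive bin_size from it, so every doubling is exact; a small sketch assuming the bin size is an exact power of two:

```
#include <cmath>

// Hypothetical helper: represent bin_size as 2^k (k may be negative), so
// doubling never accumulates floating-point rounding error.
struct PowerOfTwoBinSize {
    int k = 0;                                     // bin_size == 2^k
    double value() const { return std::ldexp(1.0, k); }
    void double_size() { ++k; }
};
```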