Accurate running statistical mean of a large byte array

Posted 2024-09-16 22:33:14

I have a two dimensional array of bytes which looks like this:

int n = 100000;
int d = 128;
byte[][] samples = new byte[n][d];
/* proceed to fill samples with some delicious data */
byte[] mean = new byte[d];
findMean(mean,samples);

My findMean function proceeds to fill mean such that:

mean[k] = mean(samples[:][k])

Simple enough so far. The problem is that, due to overflow, this mean function cannot simply sum and divide. So my current attempt is to calculate a running mean, the workhorse of which looks something like this:

for (int i = 0; i < samples.length; i++) {
    byte diff = (byte) (samples[i][k] - mean[k]);
    mean[k] = (byte) (mean[k] + Math.round((double) diff / (double) (i + 1)));
}

Now this doesn't work at all; each round, the precision loss leaves the mean quite far off the correct value, which I have verified on small (and therefore calculable) sets of 1000 random samples.

Also, due to the very memory issues I'm trying to avoid by using byte arrays in the first place, it is quite impossible to allocate a large proxy float array to calculate the true mean and later cast it to a byte.

Loading this data in chunks is... well, it's possible, but I'm considering that my final alternative, and anyway, surely that just displaces the problem to the chunk size?

Anyway: accurate calculation of the mean of an array of bytes, using a running algorithm to avoid overflow issues. Is there a good solution here?

Cheers

Comments (3)

合约呢 2024-09-23 22:33:15

Right. So I decided that I would have to hold a double at least to calculate any given dimension's mean.

The issue was that I was approaching this problem by going:

for each sample, get the array it is to update
    for each dimension in that array, calculate its running mean given the new sample

The issue with that is that a double[][] would have to be kept, holding the current running mean for each dimension of each element to update. Therefore, I have now rearranged my loop to look more like this:

for each array to be updated
    for each sample that will update this array
        for each dimension in the array to be updated calculate the running mean

This way round requires some pre-processing: I need to loop through all the samples to find which samples will update which arrays (a single indices array), but my overall saving is that I can now hold a SINGLE double, updated for each sample that updates a given array, for a given dimension of that sample.

This double can then be cast to the appropriate low precision type, in my case, a byte.
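
For what it's worth, a minimal sketch of that rearranged loop might look like the following. The method name and the assignedSampleIndices parameter are hypothetical; it assumes the pre-processing step has already produced, for each output array, the indices of the samples that update it:

static void updateMean(byte[] mean, byte[][] samples, int[] assignedSampleIndices) {
    for (int k = 0; k < mean.length; k++) {
        // A single double accumulator, reused for every dimension of this array.
        double runningMean = 0.0;
        for (int i = 0; i < assignedSampleIndices.length; i++) {
            double diff = samples[assignedSampleIndices[i]][k] - runningMean;
            runningMean += diff / (i + 1);   // incremental mean update, no overflow
        }
        mean[k] = (byte) Math.round(runningMean);   // cast back to the low-precision type
    }
}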

The overall saving in terms of storage space I was initially going for was:

replace Integers (costing 4*128*numberOfSamples) with Bytes (costing 1*128*numberOfSamples)

That didn't work, but I have now formulated a solution that costs something like (128*numberOfSamples + numberOfSamples), a saving of 127*numberOfSamples, which in my worst case is something approaching 15 GB of RAM :-)

So yeah, there we go: a night's sleep and I answered my own question.

Thanks for the help, fellows!

笨笨の傻瓜 2024-09-23 22:33:14

You could use a larger integer type (long / BigInteger), or even arbitrary-precision arithmetic, to calculate the sum. In that case you don't really need the online algorithm, although keeping it wouldn't have any impact other than making the calculation slower.

When you do divide the sum by the count to calculate the mean, you will of course be restricted by the precision of the floating-point type you are using, so keep that in mind. If you go down the APA route, this will not be an issue.
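
A minimal sketch of that suggestion, reusing the findMean signature from the question (variable names are my own): per-dimension sums held in longs cannot overflow here, since each sum is at most about 128 * 100000 in magnitude.

static void findMean(byte[] mean, byte[][] samples) {
    long[] sums = new long[mean.length];
    for (byte[] sample : samples) {
        for (int k = 0; k < mean.length; k++) {
            sums[k] += sample[k];   // a long comfortably holds 100000 * 128
        }
    }
    for (int k = 0; k < mean.length; k++) {
        // One division per dimension; double precision is the only remaining limit.
        mean[k] = (byte) Math.round((double) sums[k] / samples.length);
    }
}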

殤城〤 2024-09-23 22:33:14

If you are calculating 128 means, could you not afford to allocate 128 doubles (say, dmean[]) to hold them, and use

double diff = samples[i][k] - dmean[k];

dmean[k] = dmean[k] + diff/(i+1) ;

to update the mean?
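
Fleshed out, and assuming the looping structure from the question applies, this answer's approach might look like the sketch below: the 128 running means live in doubles and are only cast down to bytes at the very end.

static void findMean(byte[] mean, byte[][] samples) {
    double[] dmean = new double[mean.length];
    for (int i = 0; i < samples.length; i++) {
        for (int k = 0; k < mean.length; k++) {
            double diff = samples[i][k] - dmean[k];
            dmean[k] = dmean[k] + diff / (i + 1);   // running-mean update in double precision
        }
    }
    for (int k = 0; k < mean.length; k++) {
        mean[k] = (byte) Math.round(dmean[k]);
    }
}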
