Accurate running statistical mean of a large byte array
I have a two-dimensional array of bytes which looks like this:
int n = 100000;
int d = 128;
byte[][] samples = new byte[n][d];
/* proceed to fill samples with some delicious data */
byte[] mean = new byte[d];
findMean(mean,samples);
My findMean function proceeds to fill mean such that:
mean[k] = mean(samples[:][k])
Simple enough so far. The issue is that, due to overflow concerns, this mean function cannot simply sum and divide. So my current attempt is to calculate a running mean, the workhorse of which looks something like this:
for (int i = 0; i < samples.length; i++) {
    byte diff = (byte) (samples[i][k] - mean[k]);
    mean[k] = (byte) ((double) mean[k] + Math.round((double) diff / (double) (i + 1)));
}
Now this doesn't work at all: on every round the precision loss leaves the mean quite far off the correct value, which I have verified on small (and therefore easily checked) sets of 1000 random samples.
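A minimal sketch of that check, for a single dimension's worth of values (the class name and fixed RNG seed here are just for illustration), pits the byte-based update against the same update kept in a double:

import java.util.Random;

public class RunningMeanCheck {
    public static void main(String[] args) {
        Random rng = new Random(42);
        byte[] values = new byte[1000];    // one dimension's worth of random samples
        rng.nextBytes(values);

        byte byteMean = 0;                 // running mean kept in a byte, as above
        double doubleMean = 0;             // the same running mean kept in a double

        for (int i = 0; i < values.length; i++) {
            byte diff = (byte) (values[i] - byteMean);
            byteMean = (byte) (byteMean + Math.round((double) diff / (i + 1)));
            doubleMean += (values[i] - doubleMean) / (i + 1);
        }
        System.out.println("byte running mean:   " + byteMean);
        System.out.println("double running mean: " + doubleMean);
    }
}

In the sketch the byte version effectively stops moving once Math.round(diff / (i + 1)) becomes zero, which illustrates the precision loss.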
Also, because of the very memory constraints I'm trying to avoid by using byte arrays in the first place, it is quite impossible to allocate a large proxy float array to calculate the true mean and later cast it down to bytes.
Loading this data in chunks is... well, it's possible, but I'm keeping that as my final alternative, and anyway, surely that just displaces the problem to the chunk size?
Anyway: an accurate calculation of the mean of a byte array, using a running algorithm to avoid overflow issues. Is there a good solution here?
Cheers
3 Answers
Right. So I decided that I would have to hold at least a double to calculate any given dimension's mean.
The issue was with how I was originally approaching the problem: a double[][] would have had to be held, storing the current running mean for every dimension of every element to be updated. I have therefore rearranged my loop.
This way round requires some pre-processing: I need to loop through all the samples to find which samples will update which arrays (a single indices array), but my overall saving is that I can now hold a SINGLE double, which is updated for each sample that updates a given array, for a given dimension of that sample.
This double can then be cast to the appropriate low-precision type, in my case a byte.
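As a rough sketch of what I mean, for a single target array and a single dimension k (the runningMeanForDimension and sampleIndices names here are just illustrative, not my actual code), the update boils down to:

static byte runningMeanForDimension(byte[][] samples, int[] sampleIndices, int k) {
    double mean = 0.0;                     // the single double held per (array, dimension)
    for (int c = 0; c < sampleIndices.length; c++) {
        byte value = samples[sampleIndices[c]][k];
        mean += (value - mean) / (c + 1);  // incremental mean: no overflow, no per-step byte rounding
    }
    return (byte) Math.round(mean);        // cast down to the low-precision type only at the end
}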
The overall saving in storage space I was initially going for was:
replace integers (costing 4*128*numberOfSamples) with bytes (costing 1*128*numberOfSamples).
That didn't work out, but I have now formulated a solution that costs something like 128*numberOfSamples + numberOfSamples. A saving of 127*numberOfSamples, which in my worst case is something approaching 15 GB of RAM :-)
So yeah, there we go: a night's sleep and I answered my own question.
Thanks for the help, fellows!
You could use a larger integer type (long / BigInteger), or even arbitrary-precision arithmetic, to calculate the sum. In that case you don't really need the online algorithm, although keeping it wouldn't have any impact other than making the calculation slower.
When you do divide the sum by the count to calculate the mean, you will of course be restricted by the precision of the floating-point type you are using, so keep that in mind. If you go down the APA route, this will not be an issue.
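For example, a sketch of the summation route with one long per dimension (assuming the samples layout from the question; summing 100000 byte values comes nowhere near overflowing a long):

static void findMean(byte[] mean, byte[][] samples) {
    int d = mean.length;
    long[] sums = new long[d];                 // per-dimension sums; no overflow risk with long
    for (byte[] sample : samples) {
        for (int k = 0; k < d; k++) {
            sums[k] += sample[k];
        }
    }
    for (int k = 0; k < d; k++) {
        mean[k] = (byte) Math.round((double) sums[k] / samples.length);   // one division per dimension
    }
}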
If you are calculating 128 means, could you not afford to allocate 128 doubles (dmean[], say) to hold them, and use
double diff = samples[i][k] - dmean[k];
dmean[k] = dmean[k] + diff / (i + 1);
to update the mean?
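A fuller sketch of that idea (assuming the samples and mean arrays from the question; 128 doubles cost only about 1 KB):

static void findMean(byte[] mean, byte[][] samples) {
    double[] dmean = new double[mean.length];          // one double running mean per dimension
    for (int i = 0; i < samples.length; i++) {
        for (int k = 0; k < dmean.length; k++) {
            double diff = samples[i][k] - dmean[k];
            dmean[k] = dmean[k] + diff / (i + 1);      // incremental mean update in double precision
        }
    }
    for (int k = 0; k < dmean.length; k++) {
        mean[k] = (byte) Math.round(dmean[k]);         // cast back to byte only once, at the end
    }
}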