Why does bfloat16 have so many exponent bits?
It's clear why a 16-bit floating-point format has started seeing use for machine learning; it reduces the cost of storage and computation, and neural networks turn out to be surprisingly insensitive to numeric precision.
What I find particularly surprising is that practitioners abandoned the already-defined half-precision format in favor of one that allocates only 7 bits to the significand, but 8 bits to the exponent – fully as many as 32-bit FP. (Wikipedia compares the brain-float bfloat16 layout against IEEE binary16 and some 24-bit formats.)
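To make that comparison concrete, here is a quick sketch (not from the post) that derives each format's range and relative precision purely from its exponent/significand widths, assuming the usual IEEE-style bias and implicit leading bit:

```python
# Derive range and precision from the bit layouts alone
# (IEEE-style bias and implicit leading significand bit assumed).

def format_limits(exp_bits, frac_bits):
    emax = 2**(exp_bits - 1) - 1               # largest unbiased exponent
    emin = 1 - emax                            # smallest normal exponent
    max_finite = (2 - 2**-frac_bits) * 2.0**emax
    min_normal = 2.0**emin
    rel_step = 2.0**-frac_bits                 # gap between 1.0 and the next value
    return max_finite, min_normal, rel_step

for name, exp_bits, frac_bits in [("binary16 (FP16)", 5, 10),
                                  ("bfloat16",        8,  7),
                                  ("binary32 (FP32)", 8, 23)]:
    max_finite, min_normal, rel_step = format_limits(exp_bits, frac_bits)
    print(f"{name:15s}  max ~ {max_finite:8.3g}   min normal ~ {min_normal:8.3g}"
          f"   relative step ~ {rel_step:.1g}")
```

This prints roughly 6.6e4 / 6.1e-5 for binary16 versus 3.4e38 / 1.2e-38 for both bfloat16 and binary32: matching the 8-bit exponent buys the full single-precision range, while the 7-bit significand leaves only two to three decimal digits of precision.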
Why so many exponent bits? So far, the only explanation I have found is https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus, which says:
Based on our years of experience training and deploying a wide variety of neural networks across Google’s products and services, we knew when we designed Cloud TPUs that neural networks are far more sensitive to the size of the exponent than that of the mantissa. To ensure identical behavior for underflows, overflows, and NaNs, bfloat16 has the same exponent size as FP32. However, bfloat16 handles denormals differently from FP32: it flushes them to zero. Unlike FP16, which typically requires special handling via techniques such as loss scaling [Mic 17], BF16 comes close to being a drop-in replacement for FP32 when training and running deep neural networks.
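The loss-scaling remark can be illustrated with a small NumPy sketch (an illustration of the general technique, not Google's recipe; the scale factor 2**16 is an arbitrary choice): gradient magnitudes that are tiny but still meaningful underflow in FP16's narrow range, yet survive in any format with an 8-bit exponent.

```python
import numpy as np

# Small gradients underflow in FP16, whose smallest (subnormal) value is ~6e-8;
# bfloat16 and FP32 keep them, since their 8-bit exponent reaches down to
# ~1e-38 for normal numbers.
grads = np.array([1e-3, 1e-6, 1e-9], dtype=np.float32)
print(grads.astype(np.float16))      # the 1e-9 entry becomes exactly 0.0

# Loss scaling works around this: multiply the loss (hence every gradient)
# by a constant before the backward pass, then divide it back out in higher
# precision before the optimizer step, so values pass through FP16 in a
# representable range.
scale = 2.0**16
scaled_fp16 = (grads * scale).astype(np.float16)
print(scaled_fp16.astype(np.float32) / scale)   # ~[1e-3, 1e-6, 1e-9] recovered
```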
I haven't run neural network experiments on anything like Google scale, but in such as I have run, a weight or activation with absolute value much greater than 1.0 means it's gone into the weeds, is going to spiral off into infinity, and the computer would be doing you a favor if it were to promptly crash with an error message. I have never seen or heard of any case that needs a dynamic range anything like the 1e38 of single-precision floating point.
So what am I missing?
Are there cases where neural networks really need huge dynamic range? If so, how, why?
Is there some reason why it is considered very beneficial for bfloat16 to use the same exponent as single precision, even though the significand is much smaller?
Or is it the case that the real goal was to shrink the significand to the absolute minimum that would do the job, in order to minimize the chip area and energy cost of the multipliers, being the most expensive part of an FPU; it so happened this turned out to be around 7 bits; the total size should be a power of 2 for alignment reasons; it would not quite fit in 8 bits; going up to 16, left surplus bits that might as well be used for something, and the most elegant solution was to keep the 8-bit exponent?
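On that last point, a rough back-of-envelope (my numbers, under the common assumption that a parallel multiplier's area grows roughly with the square of the significand width, implicit bit included) suggests how much the narrow significand buys:

```python
# Crude area model: multiplier cost ~ quadratic in significand width
# (implicit leading bit included); the exponent path is a small adder
# and is ignored here.
formats = {"binary32": 24, "binary16": 11, "bfloat16": 8}

base = formats["binary32"] ** 2
for name, sig_bits in formats.items():
    print(f"{name:9s} {sig_bits:2d}-bit significand -> "
          f"relative multiplier area ~ {sig_bits**2 / base:.2f}")
```

Under this crude model a bfloat16 multiplier is roughly a ninth of a binary32 one and about half of a binary16 one, while the three extra exponent bits cost only a slightly wider exponent adder.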
Collecting some of the discussion from the comments: