How to handle a discontinuous input distribution in a neural network
I am using Keras to set up neural networks.
As input data, I use vectors in which each coordinate is either 0 (feature not present or not measured) or a value that can range, for instance, between 5000 and 10000.
So my input value distribution is roughly a Gaussian centred around, say, 7500, plus a very thin peak at 0.
I cannot remove the vectors that have 0 in some of their coordinates, because almost all of them have 0s at some locations.
So my question is: "How do I best normalize the input vectors?" I see two possibilities:
- Just subtract the mean and divide by the standard deviation. The problem is that the mean is biased by the large number of meaningless 0s, and the std is overestimated, which erases the fine variations in the meaningful measurements.
- Compute the mean and standard deviation on the non-zero coordinates only, which is more meaningful. But then all the 0 values that correspond to non-measured data come out as large negative values, which gives some importance to meaningless data...
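Both effects can be checked on a toy column (illustrative numbers only; the values here are made up to match the 5000–10000 range described above):

```python
import numpy as np

# Toy feature column: measured values near 7500, zeros mark missing entries.
x = np.array([7200.0, 0.0, 7600.0, 7900.0, 0.0, 7500.0])

# Option 1: naive standardisation over everything, zeros included.
mean_all, std_all = x.mean(), x.std()
z_naive = (x - mean_all) / std_all

# Option 2: statistics over the non-zero entries only.
nz = x != 0
mean_nz, std_nz = x[nz].mean(), x[nz].std()
z_nz = (x - mean_nz) / std_nz

print(mean_all, mean_nz)  # the naive mean is dragged toward 0 by the zeros
print(z_nz[~nz])          # missing entries become large negative outliers
```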
Does anyone have advice on how to proceed?
Thanks!
Comments (1)
Instead, represent each of your features as 2 dimensions: the value itself (normalized over the non-zero entries, 0 when missing) and a binary flag that is 1 when the feature is missing.
You can think of this as encoding an extra feature saying "the other feature is missing". This way the scale of each feature is normalised, and all information is preserved.
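A minimal sketch of this value-plus-mask encoding, assuming the input is a 2-D array with 0 marking missing entries (`encode_with_mask` is a hypothetical helper name, not from the original post):

```python
import numpy as np

def encode_with_mask(X):
    """For each feature, emit a normalised value column and a missing-data flag.

    Normalisation statistics are computed over the non-zero entries only;
    missing entries get value 0 and flag 1, so their scale stays bounded.
    """
    X = np.asarray(X, dtype=float)
    missing = (X == 0).astype(float)   # 1 where the feature was not measured
    out_val = np.zeros_like(X)
    for j in range(X.shape[1]):
        col = X[:, j]
        nz = col != 0
        if nz.any():
            mu, sigma = col[nz].mean(), col[nz].std()
            if sigma == 0:
                sigma = 1.0            # constant feature: avoid division by zero
            out_val[nz, j] = (col[nz] - mu) / sigma
    # Stack value columns and mask columns side by side:
    # shape (n_samples, 2 * n_features)
    return np.concatenate([out_val, missing], axis=1)
```

The doubled input can be fed to a Keras `Dense` layer directly; the network can then learn to discount the value dimension whenever the corresponding flag is set.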