Probability and Neural Networks
Is it good practice to use sigmoid or tanh output layers in neural networks directly to estimate probabilities?
i.e., the probability that a given input occurs is the output of the sigmoid function in the NN.
EDIT
I want to use a neural network to learn and predict the probability that a given input occurs.
You may consider the input as a State1-Action-State2 tuple.
Hence the output of the NN is the probability that State2 happens when applying Action on State1.
I hope that clears things up.
EDIT
When training the NN, I perform a random Action on State1 and observe the resultant State2; then I teach the NN that the input State1-Action-State2 should result in the output 1.0.
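Roughly, the training data is collected like this (a toy sketch in Python; the integer states and the apply_action helper are just stand-ins for my actual environment):

    import random

    # toy stand-in for the environment: an action simply shifts the state
    def apply_action(state, action):
        return state + action

    # every observed (State1, Action, State2) tuple becomes a training example
    # whose target output is 1.0
    examples = []
    for _ in range(100):
        s1 = random.randint(0, 9)
        a = random.choice([-1, +1])
        s2 = apply_action(s1, a)
        examples.append(((s1, a, s2), 1.0))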
3 Answers
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output from each layer, but the functions themselves are not output layers (nor do they calculate probabilities).
Additionally, your question recites a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives; rather, the term 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function--the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron, which either fires (+1) or doesn't (-1).) A look at the key properties of sigmoidal functions shows why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: (i) real-valued and differentiable, (ii) having exactly one inflection point, and (iii) having a pair of horizontal asymptotes.
In turn, the sigmoidal function is one category of functions used as the activation function (aka "squashing function") in FF neural networks solved using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed in as an argument to the activation function, which returns the output for that layer. Another group of functions apparently used as activation functions is the piecewise linear functions. The step function is the binary variant of a PLF:
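Something along these lines (an illustrative sketch; the name step_fn and the threshold at 0 are just conventions):

    def step_fn(x):
        # binary step: outputs 1 above the threshold, 0 otherwise
        return 1 if x > 0 else 0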
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there is an unlimited number of possible activation functions, but in practice you only see a handful; in fact, just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
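A minimal sketch (using NumPy; the names and exact formulation are just illustrative):

    import numpy as np

    def sigmoid(x):
        # logistic sigmoid: output bounded in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # hyperbolic tangent: output bounded in (-1, 1)
        return np.tanh(x)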
What are the factors to consider in selecting an activation function?
First, the function has to give the desired behavior (arising from or as evidenced by sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, which is the optimization technique used during training to 'fill in' the values of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written):
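d/dx tanh(x) = 1 - tanh(x)^2; i.e., if y is the layer's output, the derivative is simply 1 - y^2, or in python:

    def tanh_deriv(y):
        # derivative of tanh expressed in terms of the output y = tanh(x)
        return 1.0 - y**2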
Beyond those two requirements, what makes one function better than another is how efficiently it trains the network--i.e., which one causes convergence (reaching the local minimum error) in the fewest epochs?
#-------- Edit (see OP's comment below) ---------#
I am not quite sure I understood--sometimes it's difficult to communicate details of a NN without the code, so I should probably just say that it's fine subject to this proviso: what you want the NN to predict must be the same as the dependent variable used during training. So for instance, if you train your NN using two states (e.g., 0, 1) as the single dependent variable (which is obviously missing from your testing/production data), then that's what your NN will return when run in "prediction mode" (post training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
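y = h(x) + ε with ε ~ N(0, σ²); under that model, maximizing the likelihood of the observed targets is equivalent to minimizing the sum of squared errors Σ_i (y_i - h(x_i))².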
You estimate the probabilities directly. Your model is:
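P(Y=1 | x, h) = h(x), where h(x) is the network's (sigmoid) output for input x.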
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
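h_ML = argmax_h ∏_i P(y_i | x_i, h) = argmax_h ∏_i h(x_i)^(y_i) · (1 - h(x_i))^(1 - y_i)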
This leads to the "cross entropy" loss function.
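Taking the negative logarithm gives the quantity to minimize: E = -Σ_i [ y_i ln h(x_i) + (1 - y_i) ln(1 - h(x_i)) ].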
See chapter 6 in Mitchell's Machine Learning for the loss function and its derivation.
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it will not be guaranteed that the network represents a valid probability density function, since the integral of the network is not guaranteed to equal 1.
E.g., a neural network could map every input from R^n to 1.0, but that clearly cannot be a valid probability density.
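A valid density p would have to satisfy ∫_{R^n} p(x) dx = 1, while a function that is 1.0 everywhere on R^n integrates to infinity.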
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" code samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks) see contrastive backprop.