How can the output of an artificial neural network be converted to a probability?



I read about neural networks a little while ago, and I understand how an ANN (especially a multilayer perceptron that learns via backpropagation) can learn to classify an event as true or false.

I think there are two ways:

1) You get one output neuron. If its value is > 0.5, the event is likely true; if its value is <= 0.5, the event is likely false.

2) You get two output neurons. If the value of the first is greater than the value of the second, the event is likely true, and vice versa.

In these cases, the ANN tells you whether an event is likely true or likely false. It does not tell you how likely that is.

Is there a way to convert this value to some odds, or to get odds directly out of the ANN? I'd like to get an output like "The event has an 84% probability of being true".


Comments (9)

宛菡 2024-08-13 16:46:22


Once a NN has been trained, e.g. using backpropagation as mentioned in the question (whereby the backpropagation logic has "nudged" the weights in ways that minimize the error function), the weights associated with all individual inputs ("outside" inputs or intra-NN inputs) are fixed. The NN can then be used for classification purposes.

While the math (and the "options") during the learning phase can get a bit thick, it is relatively simple and straightforward when operating as a classifier. The main algorithm is to compute an activation value for each neuron, as the sum of input x weight for that neuron. This value is then fed to an activation function whose purpose is to normalize it and convert it to a boolean (in typical cases, as some networks do not have an all-or-nothing rule for some of their layers). The activation function can be more complex than you indicated; in particular it needn't be linear, but whatever its shape (typically sigmoid), it operates in the same fashion: figuring out where the activation fits on the curve, and, if applicable, whether it lies above or below a threshold. The basic algorithm then processes all neurons at a given layer before proceeding to the next.

With this in mind, the question of using the perceptron's ability to qualify its guess (or indeed guesses, plural) with a percentage value finds an easy answer: you bet it can. Its output(s) are real-valued (if anything, in need of normalizing) before we convert them to a discrete value (a boolean, or a category ID in the case of several categories), using the activation functions and the threshold/comparison methods described in the question.

So... how and where do I get "my percentages"? It all depends on the NN implementation, and, more importantly, the implementation dictates the type of normalization functions that can be used to bring activation values into the 0 to 1 range, in such a fashion that the sum of all percentages "adds up" to 1. In its simplest form, the activation function can be used to normalize the values, and the weights of the inputs to the output layer can be used as factors to ensure the "adds up to 1" requirement (provided that these weights are themselves normalized accordingly).

Et voilà!

Clarification (following Mathieu's note):
One doesn't need to change anything in the way the neural network itself works; the only thing needed is to somehow "hook into" the logic of the output neurons to access the [real-valued] activation value they computed, or, possibly better, to access the real-valued output of the activation function, prior to its boolean conversion (which is typically based on a threshold value or on some stochastic function).

In other words, the NN works as before, neither its training nor its recognition logic is altered, the inputs to the NN stay the same, and so do the connections between the various layers, etc. We only take a copy of the real-valued activations of the neurons in the output layer, and we use these to compute a percentage. The actual formula for the percentage calculation depends on the nature of the activation value and its associated function (its scale, its range relative to other neurons' outputs, etc.).
Here are a few simple cases (taken from the question's suggested output rules); a small sketch follows the list.
1) If there is a single output neuron: the ratio of the value provided by the activation function, relative to the range of that function, should do.
2) If there are two (or more) output neurons, as with classifiers for example: if all output neurons have the same activation function, the percentage for a given neuron is its activation function value divided by the sum of all activation function values. If the activation functions differ, it becomes a case-by-case situation, because the distinct activation functions may indicate a deliberate intent to give more weight to some of the neurons, and the percentages should respect this.
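
A minimal sketch of those two cases, assuming plain NumPy; the function names and the activation range are illustrative, not taken from any particular NN library:

import numpy as np

def single_output_percentage(activation, range_lo=0.0, range_hi=1.0):
    # Case 1: one output neuron. Where the activation sits within the
    # activation function's output range, expressed as a percentage.
    return 100.0 * (activation - range_lo) / (range_hi - range_lo)

def multi_output_percentages(activations):
    # Case 2: several output neurons sharing the same activation function.
    # Each value is divided by the sum of all values.
    activations = np.asarray(activations, dtype=float)
    return 100.0 * activations / activations.sum()

print(single_output_percentage(0.84))               # -> 84.0
print(multi_output_percentages([0.6, 0.3, 0.1]))    # -> [60. 30. 10.]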

自此以后,行同陌路 2024-08-13 16:46:22


What you can do is use a sigmoid transfer function on the output-layer nodes (one that accepts inputs in (-inf, inf) and outputs a value in [-1, 1]).
Then, by using 1-of-n output encoding (one node for each class), you can map the range [-1, 1] to [0, 1] and use the result as a probability for each class value (note that this works naturally for more than two classes).
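
A quick sketch of that mapping, assuming NumPy and made-up output values:

import numpy as np

outputs = np.array([0.6, -0.2, -0.8])    # raw [-1, 1] outputs, one node per class (1-of-n)
mapped = (outputs + 1.0) / 2.0           # map [-1, 1] -> [0, 1]
probs = mapped / mapped.sum()            # optional extra step: renormalize so the classes sum to 1
print(mapped, probs)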

孤云独去闲 2024-08-13 16:46:22


The activation value of a single output neuron is a linearly weighted sum, and it may be directly interpreted as an approximate probability if the network is trained to give outputs in the range 0 to 1. This would tend to be the case if the transfer function (or output function), both in the preceding stage and in the one providing the final output, is in the 0 to 1 range too (typically the sigmoidal logistic function). However, there is no guarantee that it will be, though repairs are possible. Moreover, unless the sigmoids are logistic and the weights are constrained to be positive and to sum to 1, it is unlikely. Generally a neural network will train in a more balanced way using the tanh sigmoid, with weights and activations that range positive and negative (due to the symmetry of this model). Another factor is the prevalence of the class: if it is 50%, then a 0.5 threshold is likely to be effective for logistic and a 0.0 threshold for tanh. The sigmoid is designed to push things towards the centre of the range (in backpropagation) and to constrain them from going out of range (in feedforward). The significance of the performance (with respect to the Bernoulli distribution) can also be interpreted as a probability that the neuron is making real predictions rather than guessing. Ideally, the bias of the predictor towards positives should match the prevalence of positives in the real world (which may vary at different times and places, e.g. bull vs bear markets, or the creditworthiness of people applying for loans vs people who fail to make loan payments); calibrating to probabilities has the advantage that any desired bias can be set easily.

If you have two neurons for two classes, each can be interpreted independently as above, and so can the halved difference between them; it is like flipping the negative-class neuron and averaging. The differences can also give rise to an estimate of the significance probability (using the t-test).
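
A tiny illustration of that "flip and average" reading (the values are made up):

p_pos, p_neg = 0.80, 0.30                      # independent [0, 1] readings of the two output neurons
p_combined = (p_pos + (1.0 - p_neg)) / 2.0     # flip the negative-class neuron, then average
print(p_combined)                              # 0.75, i.e. 0.5 plus half the difference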

The Brier score and its Murphy decomposition give a more direct estimate of the probability that an average answer is correct, while Informedness gives the probability that the classifier is making an informed decision rather than a guess, ROC AUC gives the probability that a positive example will be ranked higher than a negative example (by a positive predictor), and Kappa will give a similar number that matches Informedness when prevalence equals bias.

What you normally want is both a significance probability for the overall classifier (to ensure that you are playing on a real field, and not in an imaginary framework of guesstimates) and a probability estimate for a specific example. There are various ways to calibrate, including doing a regression (linear or nonlinear) versus probability and using its inverse function to remap scores to a more accurate probability estimate. This can be seen in the Brier score improving, with the calibration component reducing towards 0 but the discrimination component remaining the same, as should ROC AUC and Informedness (Kappa is subject to bias and may worsen).
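
One common concrete instance of the "regression versus probability" idea is Platt-style scaling, i.e. fitting a logistic regression from raw scores to labels on a held-out set. The sketch below assumes scikit-learn purely for convenience and is not taken from the answer itself:

import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([[-2.1], [-0.4], [0.3], [1.7], [2.5]])    # raw network scores on a calibration set
labels = np.array([0, 0, 1, 1, 1])                          # true classes for those examples
calibrator = LogisticRegression().fit(scores, labels)
print(calibrator.predict_proba([[0.8]])[:, 1])              # calibrated P(positive) for a new score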

A simple non-linear way to calibrate to probabilities is to use the ROC curve: as the threshold changes for the output of a single neuron, or for the difference between two competing neurons, we plot the resulting true and false positive rates on a ROC curve (the false and true negative rates are naturally the complements, since what isn't really a positive is a negative). Then you scan the ROC curve (a polyline) point by point (each time the gradient changes), sample by sample, and the proportion of positive samples gives you a probability estimate for positives corresponding to the neural threshold that produced that point. Values between points on the curve can be linearly interpolated between those represented in the calibration set; in fact, any bad points in the ROC curve, represented by deconvexities (dents), can be smoothed over by the convex hull, probabilistically interpolating between the endpoints of the hull segment. Flach and Wu propose a technique that actually flips the segment, but this depends on information being used the wrong way round, and although it could be applied repeatedly for arbitrary improvement on the calibration set, it will be increasingly unlikely to generalize to a test situation.
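
A much-simplified sketch of that empirical idea (quantile bins of the calibration scores rather than a true ROC convex hull; every name here is illustrative):

import numpy as np

def fit_calibration(scores, labels, n_bins=10):
    # Sort the calibration set by score and record, per score bin, the observed
    # fraction of positives. n_bins should not exceed the number of samples.
    order = np.argsort(scores)
    scores = np.asarray(scores, dtype=float)[order]
    labels = np.asarray(labels, dtype=float)[order]
    bins = np.array_split(np.arange(len(scores)), n_bins)
    bin_centers = np.array([scores[b].mean() for b in bins])
    bin_probs = np.array([labels[b].mean() for b in bins])
    return bin_centers, bin_probs

def calibrated_probability(score, bin_centers, bin_probs):
    # Linear interpolation between calibration points, as described above.
    return np.interp(score, bin_centers, bin_probs)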

(I came here looking for papers I'd seen ages ago on these ROC-based approaches - so this is from memory and without these lost references.)

再可℃爱ぅ一点好了 2024-08-13 16:46:22


I would be very prudent about interpreting the outputs of a neural network (in fact, of any machine learning classifier) as probabilities. The machine is trained to discriminate between classes, not to estimate the probability density. In fact, we don't have this information in the data; we have to infer it. From my experience, I wouldn't advise anyone to interpret the outputs directly as probabilities.

等待圉鍢 2024-08-13 16:46:22


Did you try Prof. Hinton's suggestion of training the network with a softmax activation function and cross-entropy error?

As an example, create a three-layer network with the following:

linear neurons   [ number of features ]
sigmoid neurons [ 3 x number of features ]
linear neurons   [ number of classes ]

Then train it with the cross-entropy error and softmax transfer, using your favourite optimizer (stochastic gradient descent, iRprop+, plain gradient descent). After training, the output neurons should be normalized so that they sum to 1.
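
A minimal NumPy sketch of the softmax output and cross-entropy error mentioned above (generic illustration code, not the Shark framework's API):

import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()       # outputs are positive and sum to 1

def cross_entropy(probs, target_index):
    return -np.log(probs[target_index])

logits = np.array([2.0, 0.5, -1.0])      # made-up raw values from the final linear layer
probs = softmax(logits)
print(probs, probs.sum(), cross_entropy(probs, target_index=0))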

Please see http://en.wikipedia.org/wiki/Softmax_activation_function for details. The Shark Machine Learning framework provides the softmax feature by combining two models. And Prof. Hinton has an excellent online course at http://coursera.com covering the details.

别把无礼当个性 2024-08-13 16:46:22


I remember seeing an example of a neural network trained with backpropagation to approximate the probability of an outcome in the book Introduction to the Theory of Neural Computation (Hertz, Krogh, Palmer). I think the key to the example was a special learning rule, so that you didn't have to convert the output of a unit to a probability; instead you automatically got the probability as output.
If you have the opportunity, try to check that book.

(By the way, "Boltzmann machines", although less famous, are neural networks designed specifically to learn probability distributions; you may want to check them out as well.)

梦初启 2024-08-13 16:46:22


When an ANN is used for 2-class classification and the logistic sigmoid activation function is used in the output layer, the output values can be interpreted as probabilities.

So if you are choosing between 2 classes, you train using 1-of-C encoding, where the 2 ANN outputs have training values (1,0) and (0,1) for the two classes respectively.

To get the probability of the first class in percent, just multiply the first ANN output by 100. To get the probability of the other class, use the second output.

This can be generalized to multi-class classification using the softmax activation function.
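
An illustrative reading of the two-output case above (the numbers are made up):

ann_outputs = [0.84, 0.17]    # logistic-sigmoid outputs of a net trained on targets (1,0) / (0,1)
print(f"Class 1: {ann_outputs[0] * 100:.0f}%")   # "The event has an 84% probability of being true"
print(f"Class 2: {ann_outputs[1] * 100:.0f}%")   # the two sigmoid outputs need not sum exactly to 1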

You can read more, including proofs of the probabilistic interpretation, here:

[1] Bishop, Christopher M. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

吝吻 2024-08-13 16:46:22


I know this is not exactly the same problem, but for me, who had a classification problem with multiple classes, this video helped a ton (and it's really simple to understand):

https://www.youtube.com/watch?v=SFsc2P240rw&ab_channel=KapilSachdeva

In a nutshell, you make the output positive through exponentiation. To do this, you simply call np.exp().
Then you normalize the outputs so that the sum of all values adds up to 1.
This can be done by dividing the exponentiated output by the sum of the exponentiated output.

So, step 1:

exponentiated_output = np.exp(output)

Step 2:

probabilistic_output = exponentiated_output / np.sum(exponentiated_output, axis=0)

Now the output adds up to 1, is positive, and the values are between 0 and 1. Therefore, the requirements for a probability distribution are satisfied.

In the video, he also explains temperature scaling, so that the probabilities differ from each other less sharply. Hope that helps.
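
A hedged sketch of that temperature scaling (T is a free parameter named here for illustration; T > 1 flattens the distribution, values near 0 sharpen it):

import numpy as np

def softmax_with_temperature(output, T=1.0):
    z = np.asarray(output, dtype=float) / T
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax_with_temperature([2.0, 1.0, 0.1], T=1.0))
print(softmax_with_temperature([2.0, 1.0, 0.1], T=3.0))   # flatter: the probabilities differ less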

于我来说 2024-08-13 16:46:22


I am not a data scientist, so my answer may not be very useful; however, I am facing the same question, and my idea is to actually calculate the probability on a validation subset of the data once the neural network is trained. The calculation should answer the question, "What is the probability of successful classification if the NN score is higher than X?" or "What is the probability of successful classification if the NN score is within the range [X, X+delta]?" By simply counting the cases across several values of X, we can rebuild a discrete function (a table) that links the NN score to a probability on the validation dataset. What do you think? It is a somewhat spartan approach, but the result is a real probability on a real validation dataset.
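
One way to build that discrete table, assuming NN scores in [0, 1] and NumPy (every name here is illustrative):

import numpy as np

def score_probability_table(scores, correct, delta=0.1):
    # For each score bin [X, X+delta), record the observed probability of a
    # successful classification and how many validation samples fell in the bin.
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, int(round(1.0 / delta)) + 1)
    table = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = scores <= hi if hi == edges[-1] else scores < hi
        mask = (scores >= lo) & upper
        if mask.any():
            table.append((lo, hi, correct[mask].mean(), int(mask.sum())))
    return table   # list of (bin_low, bin_high, P(success | score in bin), n_samples)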
