Probability and Neural Networks
Is it good practice to use sigmoid or tanh output layers in neural networks directly to estimate probabilities?
i.e., the probability that a given input occurs is the output of the sigmoid function in the NN.
EDIT
I want to use a neural network to learn and predict the probability that a given input occurs.
You may consider the input as a State1-Action-State2 tuple.
Hence the output of the NN is the probability that State2 happens when applying Action on State1.
I hope that clears things up.
EDIT
When training the NN, I perform a random Action on State1 and observe the resultant State2; then I teach the NN that the input State1-Action-State2 should result in the output 1.0.
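Roughly, the training data is collected like this (a toy sketch in Python; the integer states and the apply_action helper are just stand-ins for my actual environment):

    import random

    # toy stand-in for the environment: an action simply shifts the state
    def apply_action(state, action):
        return state + action

    # every observed (State1, Action, State2) tuple becomes a training example
    # whose target output is 1.0
    examples = []
    for _ in range(100):
        s1 = random.randint(0, 9)
        a = random.choice([-1, +1])
        s2 = apply_action(s1, a)
        examples.append(((s1, a, s2), 1.0))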
3 Answers
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output from each layer, but the functions themselves are not output layers (nor do they calculate probabilities).
Additionally, your question recites a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives; rather, the term 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function--the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron, which either fires (+1) or doesn't (-1).) A look at the key properties of sigmoidal functions shows why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: (i) real-valued and differentiable, (ii) having exactly one inflection point, and (iii) having a pair of horizontal asymptotes.
In turn, the sigmoidal function is one category of functions used as the activation function (aka "squashing function") in FF neural networks solved using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed in as an argument to the activation function, which returns the output for that layer. Another group of functions apparently used as activation functions is the piecewise linear functions. The step function is the binary variant of a PLF:
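Something along these lines (an illustrative sketch; the name step_fn and the threshold at 0 are just conventions):

    def step_fn(x):
        # binary step: outputs 1 above the threshold, 0 otherwise
        return 1 if x > 0 else 0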
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there is an unlimited number of possible activation functions, but in practice you only see a handful; in fact, just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
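A minimal sketch (using NumPy; the names and exact formulation are just illustrative):

    import numpy as np

    def sigmoid(x):
        # logistic sigmoid: output bounded in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # hyperbolic tangent: output bounded in (-1, 1)
        return np.tanh(x)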
What are the factors to consider in selecting an activation function?
First, the function has to give the desired behavior (arising from or as evidenced by sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, which is the optimization technique used during training to 'fill in' the values of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written):
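d/dx tanh(x) = 1 - tanh(x)^2; i.e., if y is the layer's output, the derivative is simply 1 - y^2, or in python:

    def tanh_deriv(y):
        # derivative of tanh expressed in terms of the output y = tanh(x)
        return 1.0 - y**2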
Beyond those two requirements, what makes one function better than another is how efficiently it trains the network--i.e., which one causes convergence (reaching the local minimum error) in the fewest epochs?
#-------- Edit (see OP's comment below) ---------#
I am not quite sure I understood--sometimes it's difficult to communicate details of a NN without the code, so I should probably just say that it's fine subject to this proviso: what you want the NN to predict must be the same as the dependent variable used during training. So for instance, if you train your NN using two states (e.g., 0, 1) as the single dependent variable (which is obviously missing from your testing/production data), then that's what your NN will return when run in "prediction mode" (post training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
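y = h(x) + ε with ε ~ N(0, σ²); under that model, maximizing the likelihood of the observed targets is equivalent to minimizing the sum of squared errors Σ_i (y_i - h(x_i))².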
You estimate the probabilities directly. Your model is:
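P(Y=1 | x, h) = h(x), where h(x) is the network's (sigmoid) output for input x.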
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
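h_ML = argmax_h ∏_i P(y_i | x_i, h) = argmax_h ∏_i h(x_i)^(y_i) · (1 - h(x_i))^(1 - y_i)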
This leads to the "cross entropy" loss function.
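Taking the negative logarithm gives the quantity to minimize: E = -Σ_i [ y_i ln h(x_i) + (1 - y_i) ln(1 - h(x_i)) ].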
See chapter 6 in Mitchell's Machine Learning for the loss function and its derivation.
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it will not be guaranteed that the network represents a valid probability density function, since the integral of the network is not guaranteed to equal 1.
E.g., a neural network could map every input from R^n to 1.0, but that clearly cannot be a valid probability density.
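A valid density p would have to satisfy ∫_{R^n} p(x) dx = 1, while a function that is 1.0 everywhere on R^n integrates to infinity.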
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" code samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks) see contrastive backprop.