Are Reverse Mode AD and Forward Mode AD functionally the same?

Posted 2025-01-09 07:53:48

Hi, I am an old hand at programming but very new to Julia, so the answer may be obvious.

Forward Mode AD is often compared with the forward pass of a neural net, and Reverse Mode AD with back propagation, and clearly you cannot replace back propagation with a forward pass.
Forward and Reverse Mode AD both compute the gradient. But are they the same function, or, ignoring efficiency, is Reverse Mode doing something that Forward Mode is not? Alternatively, are there applications using Reverse Mode where, ignoring efficiency, Forward Mode could not be used?
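
For example (a minimal sketch, assuming the ForwardDiff.jl and Zygote.jl packages as stand-ins for forward and reverse mode, and a toy function f of my own), both modes seem to produce the same numbers:

```julia
# Sketch only: ForwardDiff.jl and Zygote.jl are assumed here as example
# forward-mode and reverse-mode implementations; f is a made-up test function.
using ForwardDiff, Zygote

f(x) = sum(abs2, x) / 2              # f: R^n -> R, with gradient ∇f(x) = x

x = randn(5)

g_fwd = ForwardDiff.gradient(f, x)   # forward mode
g_rev = Zygote.gradient(f, x)[1]     # reverse mode (returns one gradient per argument)

g_fwd ≈ g_rev ≈ x                    # same gradient, up to floating point
```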

The reason for asking this is that the paper
https://richarde.dev/papers/2022/ad/higher-order-ad.pdf Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation
defines Reverse Mode AD as a correct and efficient way to compute the gradient when there is a large number of inputs. Yet their definition computes the same function as Forward Mode AD.

Any correction of my understanding much appreciated.

Clarification:

Let Forward Algorithm mean an algorithm that takes dx and computes df(x), and Reverse Algorithm mean an algorithm that takes df(x) and computes a possible dx.

The easiest Automatic Differentiation algorithm to understand is a Forward Algorithm called Forward Mode Automatic Differentiation, but it is not efficient when there is a large vector of inputs (x). Hence Reverse Algorithms were invented that are efficient with a large vector of inputs; these were called Reverse Mode Automatic Differentiation. These Reverse Algorithms appear much harder both to understand and to implement.
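
To make the cost difference concrete, here is a rough sketch of what I mean (again assuming ForwardDiff.jl and Zygote.jl and the same toy f, none of which come from the paper): the Forward Algorithm needs one directional-derivative pass per basis vector, so n passes for n inputs, while the Reverse Algorithm recovers the whole gradient from a single pullback seeded with 1.

```julia
# Sketch only: the packages and f are assumptions for illustration.
using ForwardDiff, Zygote
using LinearAlgebra: I

f(x) = sum(abs2, x) / 2
x = randn(4)
n = length(x)

# Forward Algorithm: one directional derivative J*e_i per one-hot vector e_i,
# i.e. n separate passes to assemble the gradient.
basis = Matrix{Float64}(I, n, n)
jvp(v) = ForwardDiff.derivative(t -> f(x .+ t .* v), 0.0)
g_fwd = [jvp(basis[:, i]) for i in 1:n]

# Reverse Algorithm: a single pass; the pullback applies J' to the seed 1.
_, back = Zygote.pullback(f, x)
g_rev = back(1.0)[1]

g_fwd ≈ g_rev                        # same gradient, n passes vs. one
```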

In the paper cited, a Haskell forward algorithm for computing Automatic Differentiation is given that is efficient with a large vector of inputs. Because of that efficiency it is, not unreasonably, called Reverse Mode Automatic Differentiation. On the assumption both that they are correct and that their algorithm can be implemented for Julia programs, does this mean that Reverse Algorithms are no longer useful? Or are there some use cases (not Automatic Differentiation) for which Reverse Algorithms are still needed?

Comments (1)

葮薆情 2025-01-16 07:53:48

It depends on what you mean by "computes the same function". Using it the right way, the same numbers (up to numerics) will fall out at the end -- the gradient. The operations involved are different, though.

AD always evaluates a certain linear operator at a given point. In the case of forward mode, the operator is the Jacobian J, and for backward mode, it's its adjoint J'. But as you have the operator only in algorithmic form, not in matrix form, you cannot just "transpose it" to get one from the other. Instead, you can evaluate both, but in different bases, to recover the same gradient.

In essence, it's a question of matrix multiplication: do you compute Jv for all one-hot vectors v, or dJ' for the single scalar seed d = 1 to get the whole gradient at once? The inefficiency of forward mode comes from the fact that for R^n -> R functions, the size of this basis (the number of one-hot vectors) scales with n.
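
A toy illustration of that point in Julia, with an explicit made-up Jacobian standing in for the operator you normally only have in algorithmic form:

```julia
# Toy numbers: J is an invented 1×3 Jacobian of some f: R^3 -> R at a point.
using LinearAlgebra: I

J = [2.0 3.0 5.0]

# "Forward mode": one product J*e per one-hot vector e -> n products.
g_fwd = [only(J * e) for e in eachcol(Matrix{Float64}(I, 3, 3))]

# "Reverse mode": one product of the adjoint J' with the scalar seed d = 1.
g_rev = vec(J' * 1.0)

g_fwd ≈ g_rev                        # both recover the gradient [2, 3, 5]
```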

So: they are not the same function but adjoints of each other, and you can recover either result from the other.

Sorry if that's a bit dense, but anything more would become a lecture introducing AD. SO is probably not the right place for that. You should take the time to find some introductory material, though.
