Trying to understand Hutchinson's approximation of the diagonal of the Hessian

Posted 2025-01-25 04:30:34


I am reading this paper [1] and I have an implementation taken from here. At some point in the code, the diagonal of the Hessian matrix is approximated by the function set_hessian, which you can find below. At the end of set_hessian(), the comment # approximate the expected values of z*(H@z) appears. However, when I print p.hess I get

tensor([[[[ 2.3836e+01,  1.4929e+01,  4.1799e+00],
          [-1.6726e+01,  6.3954e+00, -5.1418e+00],
          [ 2.2580e+01, -1.1916e+01, -2.5049e+00]],
         [[-1.8261e+01,  8.7626e+00,  1.8244e+00],
          [-1.0819e+01, -2.9184e-01,  1.1601e+01],
          [-1.6267e+01,  5.6232e+00,  3.4282e+00]],
         ....
         [[-3.1088e+01,  4.3013e+01, -4.2021e+01],
          [ 1.5338e+01, -2.9806e+01, -3.0049e+01],
          [-9.8979e+00, -2.2835e+00, -6.0549e+00]]]], device='cuda:0')

How is p.hess considered a diagonal approximation of the Hessian? The reason I am trying to understand this structure is that I want to get the smallest eigenvalue, the inverse of the diagonal matrix, and the product between the Hessian and the gradient, which is a vector. We know that the smallest eigenvalue of a diagonal matrix is the smallest element of its diagonal, while the inverse of a diagonal matrix can be computed by inverting the elements of the diagonal. Could someone please shed some light on the structure of p.hess?

    @torch.no_grad()
    def set_hessian(self):
        """
        Computes the Hutchinson approximation of the hessian trace and accumulates it for each trainable parameter.
        """

        params = []
        for p in filter(lambda p: p.grad is not None, self.get_params()):
            if self.state[p]["hessian step"] % self.update_each == 0:  # compute the trace only each `update_each` step
                params.append(p)
            self.state[p]["hessian step"] += 1

        if len(params) == 0:
            return

        if self.generator.device != params[0].device:  # hackish way of casting the generator to the right device
            self.generator = torch.Generator(params[0].device).manual_seed(2147483647)

        grads = [p.grad for p in params]

        for i in range(self.n_samples):
            zs = [torch.randint(0, 2, p.size(), generator=self.generator, device=p.device,
                                dtype=torch.float32) * 2.0 - 1.0 for p in params]  # Rademacher distribution {-1.0, 1.0}
            h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, only_inputs=True,
                                       retain_graph=i < self.n_samples - 1)
            for h_z, z, p in zip(h_zs, zs, params):
                p.hess += h_z * z / self.n_samples  # approximate the expected values of z*(H@z)
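
For context, here is roughly what I plan to do with p.hess once I understand its layout. This is just a sketch, assuming each element of p.hess estimates the Hessian diagonal entry for the corresponding element of p:

    diag = p.hess.flatten()                    # per-element diagonal estimates for this parameter
    smallest_eig = diag.min()                  # smallest eigenvalue of a diagonal matrix is its smallest entry
    inv_diag = 1.0 / diag                      # inverse of the diagonal matrix, entry by entry
    diag_times_grad = diag * p.grad.flatten()  # diag(H) applied to the gradient, element-wise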

[1] ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning


Comments (2)

无力看清 2025-02-01 04:30:34

Hutchinson gives you an approximation of the trace of the Hessian matrix, not the diagonal of the Hessian matrix.
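
A minimal, purely illustrative sketch of the classic Hutchinson trace estimator on a small dense matrix; the matrix H and the sample count below are arbitrary stand-ins:

    import torch

    torch.manual_seed(0)
    A = torch.randn(5, 5)
    H = A + A.T                                       # symmetric stand-in for a Hessian

    n_samples = 10000
    trace_est = 0.0
    for _ in range(n_samples):
        z = torch.randint(0, 2, (5,), dtype=torch.float32) * 2.0 - 1.0  # Rademacher {-1, +1}
        trace_est += (z @ (H @ z)) / n_samples        # E[z^T H z] equals trace(H)

    print(trace_est.item(), torch.trace(H).item())    # the two numbers should be close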

在风中等你 2025-02-01 04:30:34

One of the subquestions here was "the product between the Hessian and [...] a vector". This can be computed exactly and efficiently using the "Hessian-vector product" method, with no need for Hutchinson or any approximation.
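
A minimal sketch of an exact Hessian-vector product via double backward in PyTorch; the toy model and the names x, w, and v below are illustrative and not part of the AdaHessian code:

    import torch

    # toy model: loss as a function of the parameter w
    x = torch.randn(8, 3)
    w = torch.randn(3, requires_grad=True)
    loss = ((x @ w) ** 2).mean()

    # first backward pass; keep the graph so the gradient can be differentiated again
    (grad,) = torch.autograd.grad(loss, w, create_graph=True)

    v = grad.detach()                                       # any vector works; here, the gradient itself
    (hv,) = torch.autograd.grad(grad, w, grad_outputs=v)    # exact Hessian-vector product H @ v

    print(hv)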
