I'm trying to recreate a transformer that was written in PyTorch and port it to TensorFlow. Everything was going pretty well until each framework's version of multi-head attention started giving extremely different outputs. Both are implementations of the multi-headed attention described in the paper "Attention Is All You Need", so they should be able to produce the same output.
I'm converting
self_attn = nn.MultiheadAttention(dModel, nheads, dropout=dropout)
to
self_attn = MultiHeadAttention(num_heads=nheads, key_dim=dModel, dropout=dropout)
For my tests, dropout is 0.
I'm calling them with:
self_attn(x,x,x)
where x is a tensor with shape (10, 128, 50).
As expected from the documentation, the PyTorch version returns a tuple of two tensors (described in the docs in terms of the target sequence length and embedding dimension), both with dimensions [10, 128, 50].
I'm having trouble getting the TensorFlow version to do the same thing. With TensorFlow I only get one tensor back (shape [10, 128, 50]), and it looks like neither the target-sequence-length nor the embedding-dimension tensor from PyTorch.
Based on the TensorFlow documentation, I should be getting something comparable.
How can I get them to operate the same way? I'm guessing I'm doing something wrong on the TensorFlow side, but I can't figure out what.
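For reference, a minimal sketch of the comparison described above; dModel = 50 is taken from the input shape, while n_heads = 5 is just a placeholder value I picked for illustration:

import torch
import torch.nn as nn
import tensorflow as tf

seq_len, batch_size, d_model, n_heads = 10, 128, 50, 5  # n_heads is a placeholder value

# PyTorch layer and call: returns a tuple (attn_output, attn_output_weights)
torch_attn = nn.MultiheadAttention(d_model, n_heads, dropout=0.0)
x_pt = torch.rand(seq_len, batch_size, d_model)
pt_out, pt_weights = torch_attn(x_pt, x_pt, x_pt)

# TensorFlow layer and call, built as in the snippet above: returns a single tensor
tf_attn = tf.keras.layers.MultiHeadAttention(num_heads=n_heads, key_dim=d_model, dropout=0.0)
x_tf = tf.random.uniform((seq_len, batch_size, d_model))
tf_out = tf_attn(x_tf, x_tf, x_tf)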
Comments (1)
nn.MultiheadAttention outputs by default a tuple with two tensors:

attn_output -- the result of the self-attention operation
attn_output_weights -- the attention weights, averaged (!) over the heads

At the same time, tf.keras.layers.MultiHeadAttention outputs only one tensor by default, attention_output (which corresponds to attn_output in PyTorch). The attention weights of all heads will also be returned if the parameter return_attention_scores is set to True, like this:
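A sketch of that call (mha and x below are placeholder names for the layer and its input):

attention_output, scores = mha(x, x, x, return_attention_scores=True)
# scores has shape (batch_size, num_heads, target_seq_len, source_seq_len)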
The scores tensor should also be averaged over the heads to achieve full correspondence with PyTorch:
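For example (axis 1 of the Keras scores tensor is the heads axis):

# average over the heads axis to match PyTorch's head-averaged attn_output_weights
scores = tf.math.reduce_mean(scores, axis=1)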
While rewriting, keep in mind that by default (as in the snippet in the question) nn.MultiheadAttention expects its input in the form (seq_length, batch_size, embed_dim), but tf.keras.layers.MultiHeadAttention expects it in the form (batch_size, seq_length, embed_dim).
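One way to bridge that layout difference is to permute the tensor before the Keras call (x below stands for a (seq_length, batch_size, embed_dim) tensor):

# (seq_length, batch_size, embed_dim) -> (batch_size, seq_length, embed_dim)
x_batch_first = tf.transpose(x, perm=[1, 0, 2])

Alternatively, recent PyTorch versions let you construct the layer with nn.MultiheadAttention(..., batch_first=True), which makes its expected layout match the Keras one.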