I'm trying to recreate a transformer that was written in PyTorch and port it to TensorFlow. Everything was going pretty well until each framework's version of multi-head attention started giving extremely different outputs. Both are implementations of the multi-headed attention described in the paper "Attention Is All You Need", so they should be able to produce the same output.
I'm converting
self_attn = nn.MultiheadAttention(dModel, nheads, dropout=dropout)
to
self_attn = MultiHeadAttention(num_heads=nheads, key_dim=dModel, dropout=dropout)
For my tests, dropout is 0.
I'm calling them with:
self_attn(x,x,x)
where x is a tensor with shape (10, 128, 50).
As expected from the documentation, the PyTorch version returns a tuple of two tensors (described in the docs in terms of the target sequence length and embedding dimension), both with dimensions [10, 128, 50].
I'm having trouble getting the TensorFlow version to do the same thing. With TensorFlow I only get one tensor back (shape [10, 128, 50]), and it looks like neither the target-sequence-length nor the embedding-dimension tensor from PyTorch.
Based on the TensorFlow documentation, I should be getting something comparable.
How can I get them to operate the same way? I'm guessing I'm doing something wrong on the TensorFlow side, but I can't figure out what.
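For reference, a minimal sketch of the comparison described above; dModel = 50 is taken from the input shape, while n_heads = 5 is just a placeholder value I picked for illustration:

import torch
import torch.nn as nn
import tensorflow as tf

seq_len, batch_size, d_model, n_heads = 10, 128, 50, 5  # n_heads is a placeholder value

# PyTorch layer and call: returns a tuple (attn_output, attn_output_weights)
torch_attn = nn.MultiheadAttention(d_model, n_heads, dropout=0.0)
x_pt = torch.rand(seq_len, batch_size, d_model)
pt_out, pt_weights = torch_attn(x_pt, x_pt, x_pt)

# TensorFlow layer and call, built as in the snippet above: returns a single tensor
tf_attn = tf.keras.layers.MultiHeadAttention(num_heads=n_heads, key_dim=d_model, dropout=0.0)
x_tf = tf.random.uniform((seq_len, batch_size, d_model))
tf_out = tf_attn(x_tf, x_tf, x_tf)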
Comments (1)
nn.MultiheadAttention outputs by default a tuple with two tensors:

attn_output -- the result of the self-attention operation
attn_output_weights -- the attention weights, averaged (!) over the heads

At the same time, tf.keras.layers.MultiHeadAttention outputs only one tensor by default, attention_output (which corresponds to attn_output in PyTorch). The attention weights of all heads will also be returned if the parameter return_attention_scores is set to True, like this:
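A sketch of that call (mha and x below are placeholder names for the layer and its input):

attention_output, scores = mha(x, x, x, return_attention_scores=True)
# scores has shape (batch_size, num_heads, target_seq_len, source_seq_len)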
The scores tensor should also be averaged over the heads to achieve full correspondence with PyTorch:
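For example (axis 1 of the Keras scores tensor is the heads axis):

# average over the heads axis to match PyTorch's head-averaged attn_output_weights
scores = tf.math.reduce_mean(scores, axis=1)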
While rewriting, keep in mind that by default (as in the snippet in the question) nn.MultiheadAttention expects its input in the form (seq_length, batch_size, embed_dim), but tf.keras.layers.MultiHeadAttention expects it in the form (batch_size, seq_length, embed_dim).
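One way to bridge that layout difference is to permute the tensor before the Keras call (x below stands for a (seq_length, batch_size, embed_dim) tensor):

# (seq_length, batch_size, embed_dim) -> (batch_size, seq_length, embed_dim)
x_batch_first = tf.transpose(x, perm=[1, 0, 2])

Alternatively, recent PyTorch versions let you construct the layer with nn.MultiheadAttention(..., batch_first=True), which makes its expected layout match the Keras one.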