create_padding_mask in the Transformer code uses the encoder input sequence to create the padding mask for the 2nd attention block in the decoder
I am going through the Transformer code on tensorflow.org - https://www.tensorflow.org/text/tutorials/transformer
def create_masks(self, inp, tar):
    # Encoder padding mask (Used in the 2nd attention block in the decoder too.)
    padding_mask = create_padding_mask(inp)

    # Used in the 1st attention block in the decoder.
    # It is used to pad and mask future tokens in the input received by
    # the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    look_ahead_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return padding_mask, look_ahead_mask
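For reference, the two helpers used above are defined earlier in the tutorial roughly as follows (a sketch reproduced from that version of the tutorial, with the resulting shapes noted in comments):

import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 wherever the token id equals 0 (the pad id in the tutorial), 0.0 elsewhere.
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # Extra dimensions let the mask broadcast onto the attention logits:
    # (batch_size, 1, 1, seq_len)
    return seq[:, tf.newaxis, tf.newaxis, :]

def create_look_ahead_mask(size):
    # Strictly upper-triangular 1s: position i may not attend to positions j > i.
    # Shape: (seq_len, seq_len)
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)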
The Transformer class has a method called create_masks which creates the padding and look-ahead masks.
I understand that the encoder's padding mask should be built from the input sequence (the input to the encoder). What I do not understand is why the encoder's input sequence is also used to create the padding mask for the second attention block of the decoder (the first line of the method body). I would have thought the decoder's padding mask should be created from the target sequence (which is fed to the decoder).
Please help me understand why this is done.
1 Answer
First, you should understand that the padding mask in the encoder is applied along the key axis of the attention scores (if you view the attention scores as a 2-D matrix, that is the j, or second, axis).
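Concretely, in the tutorial the mask is added (scaled by a large negative number) to the attention logits, whose last axis is the key axis, so every 1 in the mask silences one key position for all queries at once. A minimal sketch with made-up shapes:

import tensorflow as tf

# Toy attention scores: batch=1, heads=1, 3 query positions, 4 key positions.
scores = tf.random.normal((1, 1, 3, 4))      # (..., seq_len_q, seq_len_k)
mask = tf.constant([[[[0., 0., 1., 1.]]]])   # (batch, 1, 1, seq_len_k): last two keys are padding

# A large negative value on masked key positions drives their softmax
# weight to ~0 for every query row.
scores += mask * -1e9
weights = tf.nn.softmax(scores, axis=-1)
print(weights[0, 0])                         # each row ends in two ~0 entries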
Then, coming to the decoder: the encoder output provides the keys (and values) to the decoder's second attention layer, while the target input only provides the queries, through the decoder's first attention layer. Since the padding mask is applied along the key axis, it must have the same length as the source sentence.
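So in the decoder's cross-attention block the queries come from the target side while the keys come from the encoder output, which is why the mask must match the source length, not the target length. A sketch with made-up lengths (inp_len=7, tar_len=5) showing how the mask built from inp lines up with the key axis of the cross-attention logits:

import tensorflow as tf

batch, num_heads, depth = 2, 2, 8
inp_len, tar_len = 7, 5                # hypothetical source and target lengths

# Source token ids with trailing padding (0 is the pad id in the tutorial).
inp = tf.constant([[5, 4, 3, 2, 1, 0, 0],
                   [9, 8, 7, 0, 0, 0, 0]])
padding_mask = tf.cast(tf.math.equal(inp, 0), tf.float32)[:, tf.newaxis, tf.newaxis, :]
# padding_mask shape: (batch, 1, 1, inp_len)

# Cross-attention: queries come from the decoder, keys from the encoder output.
q = tf.random.normal((batch, num_heads, tar_len, depth))
k = tf.random.normal((batch, num_heads, inp_len, depth))
logits = tf.matmul(q, k, transpose_b=True)   # (batch, num_heads, tar_len, inp_len)

# The mask's last axis matches the key axis (inp_len), so it hides padded
# *source* positions for every target position.
logits += padding_mask * -1e9
weights = tf.nn.softmax(logits, axis=-1)
print(weights.shape)                         # (2, 2, 5, 7)

Because the mask broadcasts over the head and target-position axes, the single (batch, 1, 1, inp_len) mask returned by create_padding_mask(inp) is all the cross-attention layer needs.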