Do the multiple heads in multi-head attention actually lead to more parameters or different outputs?

I am trying to understand Transformers. While I understand the concept of the encoder-decoder structure and the idea behind self-attention, what I am stuck on is the "multi-head" part of the MultiheadAttention layer.

Looking at this explanation https://jalammar.github.io/illustrated-transformer/, which I generally found very good, it appears that multiple weight matrices (one set of weight matrices per head) are used to transform the original input into the query, key and value, which are then used to calculate the attention scores and the actual output of the MultiheadAttention layer. I also understand the idea behind multiple heads: the individual attention heads can focus on different parts (as depicted in the link).

However, this seems to contradict other observations I have made:

  1. In the original paper https://arxiv.org/abs/1706.03762, it is stated that the input is split into parts of equal size per attention head.
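
For reference, the formulation in the paper (my own transcription) is:

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
```

so with h heads, each head works with projections of size d_k = d_v = d_model / h rather than with the full embedding dimension.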

So, for example I have:

batch_size = 1
sequence_length = 12
embed_dim = 512 (I assume that the dimensions for `query`, `key` and `value` are equal)
Then the shape of my query, key and value would each be [1, 12, 512]
We assume we have two heads, so num_heads = 2
This results in a dimension per head of 512/2 = 256. According to my understanding this should result in the shape [1, 12, 256] for each attention head.
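
Here is a minimal sketch of what I mean by this split (the variable names are just illustrative, not taken from any particular implementation):

```python
import torch

batch_size, sequence_length, embed_dim, num_heads = 1, 12, 512, 2
head_dim = embed_dim // num_heads  # 512 / 2 = 256

# A projected query of shape [batch, seq, embed_dim] ...
query = torch.randn(batch_size, sequence_length, embed_dim)

# ... is viewed as num_heads slices of size head_dim, so each head works on 256 dimensions.
per_head = query.view(batch_size, sequence_length, num_heads, head_dim).transpose(1, 2)
print(per_head.shape)  # torch.Size([1, 2, 12, 256])

# Folding the heads into the batch dimension gives the [2, 12, 256] shape mentioned further below.
print(per_head.reshape(batch_size * num_heads, sequence_length, head_dim).shape)
```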

So, am I correct in assuming that this depiction https://jalammar.github.io/illustrated-transformer/ just does not display this factor appropriately?

  2. Does the splitting of the input into different heads actually lead to different calculations in the layer, or is it just done to make computations faster?

I have looked at the implementation in torch.nn.MultiheadAttention and printed out the shapes at various stages during the forward pass through the layer. To me it appears that the operations are conducted in the following order (a rough sketch of these steps follows the list):

  1. Use the in_projection weight matrices to get the query, key and value from the original inputs. After this the shape for query, key and value is [1, 12, 512]. From my understanding the weights in this step are the parameters that are actually learned in the layer during training.
  2. Then the shape is modified for the multiple heads into [2, 12, 256].
  3. After this the dot product between query and key is calculated, etc. The output of this operation has the shape [2, 12, 256].
  4. Then the output of the heads is concatenated which results in the shape [12, 512].
  5. The attention_output is multiplied by the output projection weight matrix and we get [12, 1, 512] (the batch size and the sequence length are sometimes switched around). Again, here we have weights that are trained inside the matrix.
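
Here is a rough, batch-first sketch of these five steps as I understand them (biases omitted; this is my paraphrase, not the actual torch.nn.MultiheadAttention code):

```python
import torch
import torch.nn.functional as F

batch, seq, embed_dim, num_heads = 1, 12, 512, 2
head_dim = embed_dim // num_heads

x = torch.randn(batch, seq, embed_dim)                   # the same input is used as query, key and value
in_proj_weight = torch.randn(3 * embed_dim, embed_dim)   # stands in for the learned [1536, 512] matrix
out_proj_weight = torch.randn(embed_dim, embed_dim)      # stands in for the learned [512, 512] matrix

# 1. One big input projection, then split into q, k, v -> each [1, 12, 512]
q, k, v = F.linear(x, in_proj_weight).chunk(3, dim=-1)

# 2. Reshape for the heads -> [batch * num_heads, 12, 256]
def split_heads(t):
    return t.view(batch, seq, num_heads, head_dim).transpose(1, 2).reshape(batch * num_heads, seq, head_dim)

q, k, v = (split_heads(t) for t in (q, k, v))

# 3. Scaled dot product between query and key, softmax, weighted sum of values -> [2, 12, 256]
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
heads = attn @ v

# 4. Concatenate the heads again -> [1, 12, 512]
concat = heads.reshape(batch, num_heads, seq, head_dim).transpose(1, 2).reshape(batch, seq, embed_dim)

# 5. Output projection -> [1, 12, 512]
out = F.linear(concat, out_proj_weight)
print(out.shape)
```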

I printed the shape of the parameters in the layer for different values of num_heads, and the number of parameters does not change (a minimal version of this check is sketched after the list):

  1. First parameter: [1536,512] (The input projection weight matrix, I assume, 1536=3*512)
  2. Second parameter: [1536] (The input projection bias, I assume)
  3. Third parameter: [512,512] (The output projection weight matrix, I assume)
  4. Fourth parameter: [512] (The output projection bias, I assume)
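
For reference, this check can be reproduced with something like the following (the parameter names are the ones exposed by torch.nn.MultiheadAttention):

```python
import torch.nn as nn

for num_heads in (1, 2, 8):
    mha = nn.MultiheadAttention(embed_dim=512, num_heads=num_heads)
    shapes = {name: tuple(p.shape) for name, p in mha.named_parameters()}
    total = sum(p.numel() for p in mha.parameters())
    print(num_heads, total, shapes)
# The shapes (and therefore the total parameter count) are the same for every value of num_heads.
```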

On this website https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853, it is stated that this is only a "logical split". This seems to fit my own observations using the PyTorch implementation.

So does the number of attention heads actually change the values that are outputted by the layer and the weights learned by the model? The way I see it, the weights are not influenced by the number of heads.
Then how can multiple heads focus on different parts (similar to the filters in convolutional layers)?
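
To make the question concrete, one could load identical weights into two layers that differ only in num_heads and compare the outputs (a hypothetical sketch, assuming batch_first=True is available in the installed PyTorch version):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 12, 512)  # [batch, seq, embed_dim]

mha_2 = nn.MultiheadAttention(embed_dim=512, num_heads=2, batch_first=True)
mha_8 = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
mha_8.load_state_dict(mha_2.state_dict())  # identical weights, different head count

out_2, _ = mha_2(x, x, x)
out_8, _ = mha_8(x, x, x)
print(torch.allclose(out_2, out_8))  # whether the outputs coincide despite equal weights
```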
