ECAPA-TDNN Speaker Embedding (Voiceprint) Model

Published 2023-07-23 00:06:14

I. Model Structure

1. Conv1D + ReLU + BN

The Conv1D + ReLU + BN stack is simply a TDNN block, where TDNN stands for time-delay neural network.

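# Imports and thin wrappers needed by the excerpts in this post
# (as in speechbrain's ECAPA_TDNN.py)
import torch
import torch.nn as nn
import torch.nn.functional as F
from speechbrain.dataio.dataio import length_to_mask
from speechbrain.nnet.CNN import Conv1d as _Conv1d
from speechbrain.nnet.linear import Linear
from speechbrain.nnet.normalization import BatchNorm1d as _BatchNorm1d


# Operate directly on [N, C, L] tensors, skipping the internal transpose
class Conv1d(_Conv1d):
  def __init__(self, *args, **kwargs):
    super().__init__(skip_transpose=True, *args, **kwargs)


class BatchNorm1d(_BatchNorm1d):
  def __init__(self, *args, **kwargs):
    super().__init__(skip_transpose=True, *args, **kwargs)


class TDNNBlock(nn.Module):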
  """An implementation of TDNN.
  Arguments
  ----------
  in_channels : int
    Number of input channels.
  out_channels : int
    The number of output channels.
  kernel_size : int
    The kernel size of the TDNN blocks.
  dilation : int
    The dilation of the TDNN block.
  activation : torch class
    A class for constructing the activation layers.

  Example
  -------
  >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)   # [8, 64, 120]
  >>> layer = TDNNBlock(64, 64, kernel_size=3, dilation=1)  # [8, 64, 120]
  >>> out_tensor = layer(inp_tensor).transpose(1, 2)
  >>> out_tensor.shape
  torch.Size([8, 120, 64])
  """

  def __init__(
    self,
    in_channels,
    out_channels,
    kernel_size,
    dilation,
    activation=nn.ReLU,
  ):
    super(TDNNBlock, self).__init__()
    self.conv = Conv1d(
      in_channels=in_channels,
      out_channels=out_channels,
      kernel_size=kernel_size,
      dilation=dilation,
    )
    self.activation = activation()
    self.norm = BatchNorm1d(input_size=out_channels)

  def forward(self, x):
    return self.norm(self.activation(self.conv(x)))
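
To make the role of dilation concrete, here is a minimal sketch of the same Conv1d -> ReLU -> BatchNorm1d stack in plain PyTorch. The class name `PlainTDNNBlock` is hypothetical, and the zero "same" padding is an assumption made so that the frame count is preserved; it is not claimed to match the padding mode of the speechbrain wrapper.

```python
import torch
import torch.nn as nn


class PlainTDNNBlock(nn.Module):
    """Conv1d -> ReLU -> BatchNorm1d on [N, C, L] tensors (hypothetical name)."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation):
        super().__init__()
        # Assumption: zero padding sized so that odd kernels keep L unchanged
        padding = dilation * (kernel_size - 1) // 2
        self.conv = nn.Conv1d(
            in_channels, out_channels, kernel_size,
            dilation=dilation, padding=padding,
        )
        self.activation = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_channels)

    def forward(self, x):
        return self.norm(self.activation(self.conv(x)))


x = torch.rand(8, 64, 120)  # [N, C, L]
shapes = {}
for dilation in (1, 2, 3):
    block = PlainTDNNBlock(64, 64, kernel_size=3, dilation=dilation)
    context = dilation * (3 - 1) + 1  # input frames seen by one output frame
    shapes[dilation] = tuple(block(x).shape)
    print(dilation, context, shapes[dilation])
```

With a kernel size of 3, raising the dilation from 1 to 3 widens the context of each output frame from 3 to 7 input frames without adding parameters.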

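2. SE-Res2Block

Res2Net represents multi-scale features at a granular level and enlarges the receptive field of each layer (see the Res2Net paper).

For background, see "Beyond ResNet: Nankai proposes Res2Net, a full performance upgrade with no extra computational load" and "Res2Net reading notes".

Res2Net can be combined with SENet by attaching the SE block to the end of the module.

In ECAPA-TDNN, the Res2Block is implemented as follows: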

class Res2NetBlock(torch.nn.Module):
  """An implementation of Res2NetBlock w/ dilation.

  Arguments
  ---------
  in_channels : int
    The number of channels expected in the input.
  out_channels : int
    The number of output channels.
  scale : int
    The scale of the Res2Net block.
  dilation : int
    The dilation of the Res2Net block.

  Example
  -------
  >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)   # [8, 64, 120]
  >>> layer = Res2NetBlock(64, 64, scale=4, dilation=3)
  >>> out_tensor = layer(inp_tensor).transpose(1, 2)
  >>> out_tensor.shape
  torch.Size([8, 120, 64])
  """

  def __init__(self, in_channels, out_channels, scale=8, dilation=1):
    super(Res2NetBlock, self).__init__()
    assert in_channels % scale == 0
    assert out_channels % scale == 0

    in_channel = in_channels // scale
    hidden_channel = out_channels // scale

    self.blocks = nn.ModuleList(
      [
        TDNNBlock(
          in_channel, hidden_channel, kernel_size=3, dilation=dilation
        )
        for i in range(scale - 1)
      ]
    )
    self.scale = scale

  def forward(self, x):
    y = []
    for i, x_i in enumerate(torch.chunk(x, self.scale, dim=1)):
      if i == 0:
        y_i = x_i
      elif i == 1:
        y_i = self.blocks[i - 1](x_i)
      else:
        y_i = self.blocks[i - 1](x_i + y_i)
      y.append(y_i)
    y = torch.cat(y, dim=1)
    return y
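
The "larger receptive field" claim can be checked by counting how many dilated convolutions each channel group passes through in the forward pass above. The helper below is a hypothetical illustration, not part of speechbrain: group 0 is passed through unchanged, and group i receives the output of group i-1 before its own convolution, so it has gone through i stacked convolutions.

```python
def res2net_receptive_fields(scale, kernel_size=3, dilation=1):
    """Frames of temporal context seen by each channel group of the block."""
    span = (kernel_size - 1) * dilation  # context added by one dilated conv
    # Group i has passed through i stacked convolutions (group 0 through none)
    return [1 + i * span for i in range(scale)]


print(res2net_receptive_fields(scale=4, dilation=3))  # [1, 7, 13, 19]
print(res2net_receptive_fields(scale=8, dilation=2))
```

A single block therefore mixes several temporal scales in its output channels, which is the multi-scale behavior described above.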

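SENet (Squeeze-and-Excitation Networks):

References: "The ImageNet 2017 champion model: SENet explained" and "SENet".

In computer vision:

① Squeeze

The original feature map has dimensions H*W*C, where H is the height, W is the width, and C is the number of channels.

H*W*C is compressed to 1*1*C, that is, each H*W plane is reduced to a single value. In practice this is usually done with global average pooling.

Each two-dimensional H*W channel is thus compressed into one real number. That number has, in a sense, a global receptive field over the whole H*W plane, and it characterizes the global distribution of responses on that feature channel.

② Excitation

Once the squeeze has produced the 1*1*C representation, a fully connected (FC) stage is added to predict the importance of each channel.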

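Note: with $z$ denoting the squeezed 1*1*C vector, the excitation is $s = \sigma(W_2\,\delta(W_1 z))$. Here $W_1 z$ is a fully connected layer whose weight $W_1$ has dimensions C/r * C, where r is a reduction ratio (16 in the paper) that lowers the channel count and therefore the computation. After the ReLU $\delta$, the 1*1*C feature has become 1*1*C/r. It is then multiplied by $W_2$, a second fully connected layer with dimensions C * C/r, so the output is again 1*1*C. Finally the sigmoid $\sigma$ gives the channel weights $s$.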

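Reweight: once the importance of each channel is known, the weights are applied (excitation) to the corresponding channels of the original feature map, which recalibrates the original features along the channel dimension.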
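
The three steps (squeeze, excitation, reweight) can be sketched for image feature maps as follows. The class name `SEBlock2d` is hypothetical, and PyTorch's [N, C, H, W] layout is used instead of the H*W*C order in the text.

```python
import torch
import torch.nn as nn


class SEBlock2d(nn.Module):
    """Squeeze-and-excitation over [N, C, H, W] feature maps (hypothetical name)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # C -> C/r
        self.fc2 = nn.Linear(channels // reduction, channels)  # C/r -> C

    def forward(self, x):
        z = x.mean(dim=(2, 3))           # squeeze: global average pooling, [N, C]
        s = torch.relu(self.fc1(z))      # [N, C/r]
        s = torch.sigmoid(self.fc2(s))   # channel weights in (0, 1), [N, C]
        return x * s[:, :, None, None]   # reweight every channel of the input


feature_map = torch.rand(2, 64, 7, 7)
se = SEBlock2d(64, reduction=16)
out = se(feature_map)
print(out.shape)  # torch.Size([2, 64, 7, 7])
```

Because the weights come out of a sigmoid, each channel is only ever scaled by a factor between 0 and 1.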

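In ECAPA-TDNN:

Squeeze: the input feature has shape [N, C, L], where N is the batch size, L is the number of feature frames, and C is the number of channels. Global average pooling over the time axis compresses it to [N, C, 1].

Excitation: [N, C, 1] -> [N, R, 1] -> [N, C, 1], followed by the reweighting [N, C, 1] * [N, C, L] -> [N, C, L].

Reference code: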

class SEBlock(nn.Module):
  """An implementation of squeeze-and-excitation block.

  Arguments
  ---------
  in_channels : int
    The number of input channels.
  se_channels : int
    The number of output channels after squeeze.
  out_channels : int
    The number of output channels.

  Example
  -------
  >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)   # [8, 64, 120]
  >>> se_layer = SEBlock(64, 16, 64)
  >>> lengths = torch.rand((8,))
  >>> out_tensor = se_layer(inp_tensor, lengths).transpose(1, 2)
  >>> out_tensor.shape
  torch.Size([8, 120, 64])
  """

  def __init__(self, in_channels, se_channels, out_channels):
    super(SEBlock, self).__init__()

    self.conv1 = Conv1d(
      in_channels=in_channels, out_channels=se_channels, kernel_size=1
    )
    self.relu = torch.nn.ReLU(inplace=True)
    self.conv2 = Conv1d(
      in_channels=se_channels, out_channels=out_channels, kernel_size=1
    )
    self.sigmoid = torch.nn.Sigmoid()

  def forward(self, x, lengths=None):
    L = x.shape[-1]
    if lengths is not None:
      mask = length_to_mask(lengths * L, max_len=L, device=x.device)
      mask = mask.unsqueeze(1)
      total = mask.sum(dim=2, keepdim=True)
      s = (x * mask).sum(dim=2, keepdim=True) / total
    else:
      s = x.mean(dim=2, keepdim=True)   # [8, 64, 1]

    s = self.relu(self.conv1(s))  # [8, 16, 1]
    s = self.sigmoid(self.conv2(s))   # [8, 64, 1]

    return s * x   # [8, 64, 120]
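
The `lengths` branch matters when a batch contains zero-padded utterances: a plain mean over time would be diluted by the padding. The sketch below illustrates this with a hypothetical stand-in, `relative_length_mask`, for speechbrain's `length_to_mask`; it assumes that lengths are given as fractions of the longest utterance.

```python
import torch


def relative_length_mask(lengths, max_len):
    """Binary mask [N, max_len]; lengths are fractions of max_len in (0, 1].

    Hypothetical stand-in for length_to_mask(lengths * L, max_len=L).
    """
    frames = torch.arange(max_len).unsqueeze(0)                  # [1, L]
    return (frames < (lengths * max_len).unsqueeze(1)).float()   # [N, L]


x = torch.ones(2, 3, 4)          # [N, C, L]
x[1, :, 2:] = 0.0                # second utterance: last two frames are padding
lengths = torch.tensor([1.0, 0.5])

mask = relative_length_mask(lengths, max_len=4).unsqueeze(1)     # [N, 1, L]
masked_mean = (x * mask).sum(dim=2, keepdim=True) / mask.sum(dim=2, keepdim=True)
plain_mean = x.mean(dim=2, keepdim=True)

print(masked_mean[1, 0].item())  # 1.0, padding ignored
print(plain_mean[1, 0].item())   # 0.5, padding pulls the mean down
```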

SE-Res2Block
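
As a rough sketch of how the pieces fit together, the block below chains a 1x1 TDNN layer, the Res2Net split, a second 1x1 TDNN layer, and the SE reweighting, with a skip connection around the whole unit. It is written in plain PyTorch with hypothetical names (`tdnn`, `SERes2BlockSketch`), assumes zero padding, and ignores utterance lengths, so it should be read as an illustration rather than as the speechbrain implementation.

```python
import torch
import torch.nn as nn


def tdnn(in_ch, out_ch, kernel_size=1, dilation=1):
    """Conv1d -> ReLU -> BatchNorm1d with length-preserving zero padding."""
    pad = dilation * (kernel_size - 1) // 2
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation, padding=pad),
        nn.ReLU(),
        nn.BatchNorm1d(out_ch),
    )


class SERes2BlockSketch(nn.Module):
    def __init__(self, channels, scale=8, se_channels=128, dilation=1):
        super().__init__()
        assert channels % scale == 0
        width = channels // scale
        self.scale = scale
        self.tdnn_in = tdnn(channels, channels)     # 1x1 conv
        self.res2 = nn.ModuleList(
            [tdnn(width, width, 3, dilation) for _ in range(scale - 1)]
        )
        self.tdnn_out = tdnn(channels, channels)    # 1x1 conv
        self.se = nn.Sequential(                    # applied to [N, C, 1]
            nn.Conv1d(channels, se_channels, 1), nn.ReLU(),
            nn.Conv1d(se_channels, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        residual = x
        x = self.tdnn_in(x)
        outs = []
        for i, x_i in enumerate(torch.chunk(x, self.scale, dim=1)):
            if i == 0:
                y_i = x_i                           # first group is passed through
            elif i == 1:
                y_i = self.res2[i - 1](x_i)
            else:
                y_i = self.res2[i - 1](x_i + y_i)   # reuse previous group's output
            outs.append(y_i)
        x = self.tdnn_out(torch.cat(outs, dim=1))
        x = x * self.se(x.mean(dim=2, keepdim=True))  # channel reweighting
        return x + residual                           # skip connection


block = SERes2BlockSketch(channels=64, scale=4, se_channels=16, dilation=3)
out = block(torch.rand(8, 64, 120))
print(out.shape)  # torch.Size([8, 64, 120])
```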

3. Attentive Statistics Pooling

Attentive Statistics Pooling for Deep Speaker Embedding

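In ECAPA-TDNN, this layer essentially computes an attention-weighted mean and an attention-weighted standard deviation for each channel.

Reference code: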
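
For reference, the weighted statistics can be written out as below. This is a sketch in assumed notation: $h_t$ is the frame-level feature at frame $t$, $h_{t,c}$ its value in channel $c$, $f$ a nonlinearity, and $W$, $b$, $v_c$, $k_c$ learned attention parameters. None of these symbol names come from the text above.

```latex
\begin{aligned}
e_{t,c} &= v_c^{\top} f(W h_t + b) + k_c \\
\alpha_{t,c} &= \frac{\exp(e_{t,c})}{\sum_{\tau=1}^{T} \exp(e_{\tau,c})} \\
\tilde{\mu}_c &= \sum_{t=1}^{T} \alpha_{t,c}\, h_{t,c} \\
\tilde{\sigma}_c &= \sqrt{\sum_{t=1}^{T} \alpha_{t,c}\, h_{t,c}^{2} - \tilde{\mu}_c^{2}}
\end{aligned}
```

Because the weights $\alpha_{t,c}$ sum to 1 over $t$, the last line equals $\sqrt{\sum_t \alpha_{t,c}\,(h_{t,c} - \tilde{\mu}_c)^2}$, the centered form of the weighted standard deviation.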

class AttentiveStatisticsPooling(nn.Module):
  """This class implements an attentive statistic pooling layer for each channel.
  It returns the concatenated mean and std of the input tensor.

  Arguments
  ---------
  channels: int
    The number of input channels.
  attention_channels: int
    The number of attention channels.

  Example
  -------
  >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)   # [8, 64, 120]
  >>> asp_layer = AttentiveStatisticsPooling(64)
  >>> lengths = torch.rand((8,))
  >>> out_tensor = asp_layer(inp_tensor, lengths).transpose(1, 2)
  >>> out_tensor.shape
  torch.Size([8, 1, 128])
  """

  def __init__(self, channels, attention_channels=128, global_context=True):
    super().__init__()

    self.eps = 1e-12
    self.global_context = global_context
    if global_context:
      self.tdnn = TDNNBlock(channels * 3, attention_channels, 1, 1)
    else:
      self.tdnn = TDNNBlock(channels, attention_channels, 1, 1)
    self.tanh = nn.Tanh()
    self.conv = Conv1d(
      in_channels=attention_channels, out_channels=channels, kernel_size=1
    )

  def forward(self, x, lengths=None):
    """Calculates mean and std for a batch (input tensor).

    Arguments
    ---------
    x : torch.Tensor
      Tensor of shape [N, C, L].
    """
    L = x.shape[-1]

    def _compute_statistics(x, m, dim=2, eps=self.eps):
      mean = (m * x).sum(dim)
      std = torch.sqrt(
        (m * (x - mean.unsqueeze(dim)).pow(2)).sum(dim).clamp(eps)
      )
      return mean, std

    if lengths is None:
      lengths = torch.ones(x.shape[0], device=x.device)

    # Make binary mask of shape [N, 1, L]
    mask = length_to_mask(lengths * L, max_len=L, device=x.device)
    mask = mask.unsqueeze(1)

    # Expand the temporal context of the pooling layer by allowing the
    # self-attention to look at global properties of the utterance.
    if self.global_context:
      # torch.std is unstable for backward computation
      # https://github.com/pytorch/pytorch/issues/4320
      total = mask.sum(dim=2, keepdim=True).float()
      mean, std = _compute_statistics(x, mask / total)
      mean = mean.unsqueeze(2).repeat(1, 1, L)
      std = std.unsqueeze(2).repeat(1, 1, L)
      attn = torch.cat([x, mean, std], dim=1)
    else:
      attn = x

    # Apply layers
    attn = self.conv(self.tanh(self.tdnn(attn)))

    # Filter out zero-paddings
    attn = attn.masked_fill(mask == 0, float("-inf"))

    attn = F.softmax(attn, dim=2)
    mean, std = _compute_statistics(x, attn)
    # Append mean and std of the batch
    pooled_stats = torch.cat((mean, std), dim=1)
    pooled_stats = pooled_stats.unsqueeze(2)

    return pooled_stats
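
A quick numerical check helps to see what the weighted statistics mean. The sketch below reuses the same formula as `_compute_statistics` in a standalone helper (the name `weighted_stats` is hypothetical) and shows that uniform attention weights reduce it to ordinary mean and standard deviation pooling. The random scores only stand in for a learned attention.

```python
import torch


def weighted_stats(x, weights, eps=1e-12):
    """Weighted mean and std over time; weights sum to 1 along dim 2."""
    mean = (weights * x).sum(dim=2)
    var = (weights * (x - mean.unsqueeze(2)).pow(2)).sum(dim=2)
    return mean, torch.sqrt(var.clamp(min=eps))


torch.manual_seed(0)
x = torch.rand(8, 64, 120)                       # [N, C, L]

# Uniform attention: every frame gets weight 1 / L
uniform = torch.full_like(x, 1.0 / x.shape[2])
mean_u, std_u = weighted_stats(x, uniform)

# Non-uniform attention: softmax over time of (here random) scores
attn = torch.softmax(torch.randn(8, 64, 120), dim=2)
mean_a, std_a = weighted_stats(x, attn)

pooled = torch.cat((mean_a, std_a), dim=1).unsqueeze(2)
print(pooled.shape)  # torch.Size([8, 128, 1])
```

The output has twice as many channels as the input because the mean and the standard deviation are concatenated along the channel axis.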

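Finally, a fully connected layer projects the pooled statistics to a lower-dimensional feature, which is the speaker embedding extracted by the model.

II. Loss Function: AAM-Softmax

The embedding first passes through a classifier: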

class Classifier(torch.nn.Module):
  """This class implements the cosine similarity on the top of features.

  Arguments
  ---------
  device : str
    Device used, e.g., "cpu" or "cuda".
  lin_blocks : int
    Number of linear layers.
  lin_neurons : int
    Number of neurons in linear layers.
  out_neurons : int
    Number of classes.

  Example
  -------
  >>> classify = Classifier(input_size=2, lin_neurons=2, out_neurons=2)
  >>> outputs = torch.tensor([ [1., -1.], [-9., 1.], [0.9, 0.1], [0.1, 0.9] ])
  >>> outputs = outputs.unsqueeze(1)
  >>> cos = classify(outputs)
  >>> (cos < -1.0).long().sum()
  tensor(0)
  >>> (cos > 1.0).long().sum()
  tensor(0)
  """

  def __init__(
    self,
    input_size,
    device="cpu",
    lin_blocks=0,
    lin_neurons=192,
    out_neurons=1211,
  ):

    super().__init__()
    self.blocks = nn.ModuleList()

    for block_index in range(lin_blocks):
      self.blocks.extend(
        [
          _BatchNorm1d(input_size),
          Linear(input_size=input_size, n_neurons=lin_neurons),
        ]
      )
      input_size = lin_neurons

    # Final Layer
    self.weight = nn.Parameter(
      torch.FloatTensor(out_neurons, input_size, device=device)
    )
    nn.init.xavier_uniform_(self.weight)

  def forward(self, x):
    """Returns the output probabilities over speakers.

    Arguments
    ---------
    x : torch.Tensor
      Torch tensor.
    """
    for layer in self.blocks:
      x = layer(x)

    # Need to be normalized
    x = F.linear(F.normalize(x.squeeze(1)), F.normalize(self.weight))
    return x.unsqueeze(1)

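The line x = F.linear(F.normalize(x.squeeze(1)), F.normalize(self.weight)) L2-normalizes both x and weight, so the output is the cosine similarity between the embedding and each class weight vector. These cosines are then passed to the AAM-Softmax loss.

Paper: ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (arxiv.org)

For the code, see the ECAPA-TDNN implementation in speechbrain; all code shown in this post comes from speechbrain: speechbrain/speechbrain (github.com)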
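
To show what happens to these cosines, here is a minimal sketch of an additive angular margin (AAM) softmax loss. The function name, the margin of 0.2, and the scale of 30 are illustrative assumptions; the sketch uses plain cross-entropy and leaves out the special handling that fuller implementations apply when the angle plus the margin exceeds pi.

```python
import math

import torch
import torch.nn.functional as F


def aam_softmax_loss(cosine, targets, margin=0.2, scale=30.0):
    """Cross-entropy on scaled cosines with an angular margin on the target class.

    cosine  : [N, num_classes] cosine similarities in [-1, 1]
    targets : [N] integer class labels
    """
    sine = torch.sqrt((1.0 - cosine.pow(2)).clamp(min=0.0))
    # cos(theta + m) = cos(theta) cos(m) - sin(theta) sin(m)
    phi = cosine * math.cos(margin) - sine * math.sin(margin)
    one_hot = F.one_hot(targets, num_classes=cosine.shape[1]).bool()
    # Only the target-class logit is penalized by the margin
    logits = scale * torch.where(one_hot, phi, cosine)
    return F.cross_entropy(logits, targets)


torch.manual_seed(0)
embeddings = torch.randn(4, 192)
weight = torch.randn(10, 192)        # one row per speaker (10 hypothetical classes)
cosine = F.linear(F.normalize(embeddings), F.normalize(weight))
targets = torch.tensor([0, 3, 5, 9])

loss_no_margin = aam_softmax_loss(cosine, targets, margin=0.0)
loss_margin = aam_softmax_loss(cosine, targets, margin=0.2)
print(loss_no_margin.item(), loss_margin.item())
```

The margin makes the target class harder to satisfy during training, which pushes embeddings of the same speaker closer together in angle than ordinary softmax would.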
