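ECAPA-TDNN Voiceprint (Speaker Embedding) Model
I. Model Architecture
1. Conv1D + ReLU + BN
The Conv1D + ReLU + BN combination is simply a TDNN block, where TDNN stands for time-delay neural network. In the code, Conv1d and BatchNorm1d are speechbrain's own layers (note the input_size argument), not the torch.nn ones.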
    class TDNNBlock(nn.Module):
        """An implementation of TDNN.

        Arguments
        ---------
        in_channels : int
            Number of input channels.
        out_channels : int
            The number of output channels.
        kernel_size : int
            The kernel size of the TDNN blocks.
        dilation : int
            The dilation of the TDNN block.
        activation : torch class
            A class for constructing the activation layers.

        Example
        -------
        >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)  # [8, 64, 120]
        >>> layer = TDNNBlock(64, 64, kernel_size=3, dilation=1)  # [8, 64, 120]
        >>> out_tensor = layer(inp_tensor).transpose(1, 2)
        >>> out_tensor.shape
        torch.Size([8, 120, 64])
        """

        def __init__(
            self,
            in_channels,
            out_channels,
            kernel_size,
            dilation,
            activation=nn.ReLU,
        ):
            super(TDNNBlock, self).__init__()
            self.conv = Conv1d(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=kernel_size,
                dilation=dilation,
            )
            self.activation = activation()
            self.norm = BatchNorm1d(input_size=out_channels)

        def forward(self, x):
            # Conv1D -> ReLU -> BN, all on tensors of shape [N, C, L]
            return self.norm(self.activation(self.conv(x)))
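To try the block without installing speechbrain, the same Conv1D + ReLU + BN chain can be sketched with plain torch.nn layers. This is only an approximation under stated assumptions: `PlainTDNNBlock` and `receptive_field` are hypothetical names introduced here, and zero "same" padding is assumed so that the frame count is preserved (speechbrain's own Conv1d may pad differently).

```python
import torch
import torch.nn as nn


class PlainTDNNBlock(nn.Module):
    """Conv1d -> ReLU -> BatchNorm1d on tensors of shape [N, C, L]."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation):
        super().__init__()
        # Zero "same" padding for odd kernel sizes, so L is preserved (assumption).
        padding = dilation * (kernel_size - 1) // 2
        self.conv = nn.Conv1d(
            in_channels,
            out_channels,
            kernel_size,
            dilation=dilation,
            padding=padding,
        )
        self.activation = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_channels)

    def forward(self, x):
        return self.norm(self.activation(self.conv(x)))


def receptive_field(kernel_size, dilation):
    """Number of input frames spanned by one output frame of a dilated conv."""
    return dilation * (kernel_size - 1) + 1


x = torch.rand(8, 64, 120)  # [N, C, L]
out = PlainTDNNBlock(64, 64, kernel_size=3, dilation=2)(x)
print(out.shape, receptive_field(3, 2))
```

The "time delay" in TDNN corresponds to the dilated 1-D convolution: with kernel size 3 and dilation d, each output frame sees frames t - d, t and t + d.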
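2. SE-Res2Block
Res2Net represents multi-scale features at a granular level and enlarges the receptive field of each layer, as described in the Res2Net paper.
For background, see: "Beyond ResNet: Nankai University proposes Res2Net, a full performance upgrade with no extra computational load" and "Res2Net reading notes".
Res2Net can be combined with SENet by attaching the SE block at the end of the module.
In ECAPA-TDNN, the Res2Block is implemented as follows: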
    class Res2NetBlock(torch.nn.Module):
        """An implementation of Res2NetBlock w/ dilation.

        Arguments
        ---------
        in_channels : int
            The number of channels expected in the input.
        out_channels : int
            The number of output channels.
        scale : int
            The scale of the Res2Net block.
        dilation : int
            The dilation of the Res2Net block.

        Example
        -------
        >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)  # [8, 64, 120]
        >>> layer = Res2NetBlock(64, 64, scale=4, dilation=3)
        >>> out_tensor = layer(inp_tensor).transpose(1, 2)
        >>> out_tensor.shape
        torch.Size([8, 120, 64])
        """

        def __init__(self, in_channels, out_channels, scale=8, dilation=1):
            super(Res2NetBlock, self).__init__()
            assert in_channels % scale == 0
            assert out_channels % scale == 0

            in_channel = in_channels // scale
            hidden_channel = out_channels // scale

            # One TDNN block per chunk, except the first chunk (passed through)
            self.blocks = nn.ModuleList(
                [
                    TDNNBlock(
                        in_channel, hidden_channel, kernel_size=3, dilation=dilation
                    )
                    for i in range(scale - 1)
                ]
            )
            self.scale = scale

        def forward(self, x):
            y = []
            # Split the channels into `scale` chunks of equal width
            for i, x_i in enumerate(torch.chunk(x, self.scale, dim=1)):
                if i == 0:
                    y_i = x_i
                elif i == 1:
                    y_i = self.blocks[i - 1](x_i)
                else:
                    # Add the previous chunk's output before convolving
                    y_i = self.blocks[i - 1](x_i + y_i)
                y.append(y_i)
            y = torch.cat(y, dim=1)
            return y
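The forward pass above can be written compactly. Assuming 1-based indexing as in the Res2Net paper, with x_i the i-th channel chunk, K_i the TDNNBlock applied to chunk i (self.blocks[i - 2] in the code) and s the scale, a sketch of the recurrence is:

```latex
y_i =
\begin{cases}
x_i, & i = 1 \\
K_i(x_i), & i = 2 \\
K_i(x_i + y_{i-1}), & 2 < i \le s
\end{cases}
```

Because every K_i is a kernel-size-3 convolution with dilation d, chunk i (for i >= 2) has passed through i - 1 stacked convolutions and therefore sees 1 + 2d(i - 1) input frames. The concatenated output mixes several receptive-field sizes, which is the "multi-scale" property.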
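SENet (Squeeze-and-Excitation Networks):
The winning model of ImageNet 2017; see "SENet explained" and "SENet".
In computer vision:
① Squeeze
The original feature map has dimensions H*W*C, where H is the height, W is the width, and C is the number of channels.
Squeeze compresses H*W*C into 1*1*C, i.e. each H*W plane collapses to a single value; in practice this is usually done with global average pooling.
Each two-dimensional H*W feature channel becomes one real number. In a sense this number has a global receptive field over the whole H*W plane, and it characterizes the global distribution of responses for that channel.
② Excitation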
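Given the 1*1*C squeeze output $z$, fully connected (FC) layers predict the importance of each channel: $s = \sigma(W_2\,\delta(W_1 z))$.
Note: $W_1 z$ is a fully connected layer with $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, where $r$ is a reduction ratio (16 in the paper) whose purpose is to reduce the number of channels and hence the computation. A ReLU $\delta$ follows, so the 1*1*C feature becomes 1*1*C/r. The result is then multiplied by $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, which is another fully connected layer, so the output dimension is 1*1*C again. Finally, a sigmoid $\sigma$ yields the channel weights $s$.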
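Reweight: once the importance of each channel is known, it is applied (as the excitation) to the corresponding channel of the original feature map, recalibrating the original features along the channel dimension.
In ECAPA-TDNN:
Squeeze: the input feature has shape [N, C, L], where N is the batch size, L is the number of frames, and C is the number of channels; global average pooling over time compresses it to [N, C, 1].
Excitation: [N, C, 1] -> [N, R, 1] -> [N, C, 1], then [N, C, 1] * [N, C, L] -> [N, C, L], where R is the bottleneck width (se_channels in the code).
Reference code: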
    class SEBlock(nn.Module):
        """An implementation of squeeze-and-excitation block.

        Arguments
        ---------
        in_channels : int
            The number of input channels.
        se_channels : int
            The number of output channels after squeeze.
        out_channels : int
            The number of output channels.

        Example
        -------
        >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)  # [8, 64, 120]
        >>> se_layer = SEBlock(64, 16, 64)
        >>> lengths = torch.rand((8,))
        >>> out_tensor = se_layer(inp_tensor, lengths).transpose(1, 2)
        >>> out_tensor.shape
        torch.Size([8, 120, 64])
        """

        def __init__(self, in_channels, se_channels, out_channels):
            super(SEBlock, self).__init__()
            self.conv1 = Conv1d(
                in_channels=in_channels, out_channels=se_channels, kernel_size=1
            )
            self.relu = torch.nn.ReLU(inplace=True)
            self.conv2 = Conv1d(
                in_channels=se_channels, out_channels=out_channels, kernel_size=1
            )
            self.sigmoid = torch.nn.Sigmoid()

        def forward(self, x, lengths=None):
            L = x.shape[-1]
            if lengths is not None:
                # Average over the valid (non-padded) frames only
                mask = length_to_mask(lengths * L, max_len=L, device=x.device)
                mask = mask.unsqueeze(1)
                total = mask.sum(dim=2, keepdim=True)
                s = (x * mask).sum(dim=2, keepdim=True) / total
            else:
                s = x.mean(dim=2, keepdim=True)  # [8, 64, 1]

            s = self.relu(self.conv1(s))  # [8, 16, 1]
            s = self.sigmoid(self.conv2(s))  # [8, 64, 1]

            return s * x  # [8, 64, 120]
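The squeeze, excitation and reweight steps can also be sketched with plain torch.nn layers. The names below (`PlainSEBlock`, `fc1`, `fc2`) are hypothetical, and the mask is passed in directly instead of being built from relative lengths with speechbrain's length_to_mask. The example also illustrates why the masked average matters: the gate does not change when padded frames change.

```python
import torch
import torch.nn as nn


class PlainSEBlock(nn.Module):
    """Squeeze-and-excitation over the time axis of a [N, C, L] tensor."""

    def __init__(self, channels, se_channels):
        super().__init__()
        # Kernel-size-1 convolutions act as fully connected layers on [N, C, 1].
        self.fc1 = nn.Conv1d(channels, se_channels, kernel_size=1)
        self.fc2 = nn.Conv1d(se_channels, channels, kernel_size=1)

    def forward(self, x, mask=None):
        if mask is None:
            s = x.mean(dim=2, keepdim=True)  # squeeze: [N, C, 1]
        else:
            # mask: [N, 1, L], 1 for real frames and 0 for padding.
            total = mask.sum(dim=2, keepdim=True)
            s = (x * mask).sum(dim=2, keepdim=True) / total
        # Excitation: bottleneck, ReLU, expand, sigmoid -> gate in (0, 1).
        gate = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))  # [N, C, 1]
        # Reweight: broadcast the per-channel gate over all L frames.
        return gate * x, gate


torch.manual_seed(0)
x = torch.rand(8, 64, 120)
mask = torch.ones(8, 1, 120)
mask[:, :, 100:] = 0  # pretend the last 20 frames are padding

se = PlainSEBlock(64, 16)
out, gate = se(x, mask)

# Overwrite the padded frames: the gate should stay the same.
x_perturbed = x.clone()
x_perturbed[:, :, 100:] = 5.0
_, gate_perturbed = se(x_perturbed, mask)
```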
(Figure: the SE-Res2Block.)
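3. Attentive Statistics Pooling
Paper: Attentive Statistics Pooling for Deep Speaker Embedding
ECAPA-TDNN uses a channel-dependent variant of it; in essence, it computes a weighted mean and a weighted standard deviation over the frames.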
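For reference, the attention weights and weighted statistics can be written as follows. The notation is assumed to follow the ECAPA-TDNN paper: h_t is the frame-level activation vector at time step t, h_{t,c} its c-th channel, f a non-linearity, and W, b, v_c, k_c learned parameters. Treat this as a restatement, not a derivation.

```latex
\begin{aligned}
e_{t,c} &= v_c^{\top} f(W h_t + b) + k_c \\
\alpha_{t,c} &= \frac{\exp(e_{t,c})}{\sum_{\tau=1}^{T} \exp(e_{\tau,c})} \\
\tilde{\mu}_c &= \sum_{t=1}^{T} \alpha_{t,c}\, h_{t,c} \\
\tilde{\sigma}_c &= \sqrt{\sum_{t=1}^{T} \alpha_{t,c}\, h_{t,c}^{2} - \tilde{\mu}_c^{2}}
\end{aligned}
```

Since the weights \alpha_{t,c} sum to 1 over t, the last line equals \sqrt{\sum_t \alpha_{t,c} (h_{t,c} - \tilde{\mu}_c)^2}, which is the form the code computes.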
Reference code:
    class AttentiveStatisticsPooling(nn.Module):
        """This class implements an attentive statistic pooling layer for each channel.
        It returns the concatenated mean and std of the input tensor.

        Arguments
        ---------
        channels: int
            The number of input channels.
        attention_channels: int
            The number of attention channels.

        Example
        -------
        >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)  # [8, 64, 120]
        >>> asp_layer = AttentiveStatisticsPooling(64)
        >>> lengths = torch.rand((8,))
        >>> out_tensor = asp_layer(inp_tensor, lengths).transpose(1, 2)
        >>> out_tensor.shape
        torch.Size([8, 1, 128])
        """

        def __init__(self, channels, attention_channels=128, global_context=True):
            super().__init__()

            self.eps = 1e-12
            self.global_context = global_context
            if global_context:
                self.tdnn = TDNNBlock(channels * 3, attention_channels, 1, 1)
            else:
                self.tdnn = TDNNBlock(channels, attention_channels, 1, 1)
            self.tanh = nn.Tanh()
            self.conv = Conv1d(
                in_channels=attention_channels, out_channels=channels, kernel_size=1
            )

        def forward(self, x, lengths=None):
            """Calculates mean and std for a batch (input tensor).

            Arguments
            ---------
            x : torch.Tensor
                Tensor of shape [N, C, L].
            """
            L = x.shape[-1]

            def _compute_statistics(x, m, dim=2, eps=self.eps):
                # m holds weights that sum to 1 along `dim`
                mean = (m * x).sum(dim)
                std = torch.sqrt(
                    (m * (x - mean.unsqueeze(dim)).pow(2)).sum(dim).clamp(eps)
                )
                return mean, std

            if lengths is None:
                lengths = torch.ones(x.shape[0], device=x.device)

            # Make binary mask of shape [N, 1, L]
            mask = length_to_mask(lengths * L, max_len=L, device=x.device)
            mask = mask.unsqueeze(1)

            # Expand the temporal context of the pooling layer by allowing the
            # self-attention to look at global properties of the utterance.
            if self.global_context:
                # torch.std is unstable for backward computation
                # https://github.com/pytorch/pytorch/issues/4320
                total = mask.sum(dim=2, keepdim=True).float()
                mean, std = _compute_statistics(x, mask / total)
                mean = mean.unsqueeze(2).repeat(1, 1, L)
                std = std.unsqueeze(2).repeat(1, 1, L)
                attn = torch.cat([x, mean, std], dim=1)
            else:
                attn = x

            # Apply layers
            attn = self.conv(self.tanh(self.tdnn(attn)))

            # Filter out zero-paddings
            attn = attn.masked_fill(mask == 0, float("-inf"))

            attn = F.softmax(attn, dim=2)
            mean, std = _compute_statistics(x, attn)
            # Append mean and std of the batch
            pooled_stats = torch.cat((mean, std), dim=1)
            pooled_stats = pooled_stats.unsqueeze(2)

            return pooled_stats
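The core of the layer is the weighted mean and standard deviation. The sketch below isolates that step in plain PyTorch; `weighted_stats` is a hypothetical helper that mirrors _compute_statistics, and the attention scores are random stand-ins rather than the output of a trained attention network. With uniform weights the result should reduce to the ordinary (biased) mean and standard deviation.

```python
import torch


def weighted_stats(x, w, eps=1e-12):
    """Weighted mean and std over time.

    x: [N, C, L] features.
    w: [N, C, L] or [N, 1, L] weights that sum to 1 over the last axis.
    """
    mean = (w * x).sum(dim=2)
    var = (w * (x - mean.unsqueeze(2)).pow(2)).sum(dim=2)
    std = torch.sqrt(var.clamp(min=eps))
    return mean, std


torch.manual_seed(0)
x = torch.rand(8, 64, 120)

# Uniform weights: every frame counts equally.
uniform = torch.full((8, 1, 120), 1.0 / 120)
mean_u, std_u = weighted_stats(x, uniform)

# Attention weights: softmax over the time axis of (here random) scores.
scores = torch.randn(8, 64, 120)
attn = torch.softmax(scores, dim=2)
mean_a, std_a = weighted_stats(x, attn)

# Concatenate mean and std along the channel axis: [N, 2C, 1].
pooled = torch.cat((mean_a, std_a), dim=1).unsqueeze(2)
```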
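Finally, a fully connected layer maps the pooled statistics to a lower-dimensional feature, which is the speaker embedding extracted by the model.
II. Loss Function: AAM-Softmax
The embedding first passes through a classifier: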
    class Classifier(torch.nn.Module):
        """This class implements the cosine similarity on the top of features.

        Arguments
        ---------
        input_size : int
            Expected size of the input dimension.
        device : str
            Device used, e.g., "cpu" or "cuda".
        lin_blocks : int
            Number of linear layers.
        lin_neurons : int
            Number of neurons in linear layers.
        out_neurons : int
            Number of classes.

        Example
        -------
        >>> classify = Classifier(input_size=2, lin_neurons=2, out_neurons=2)
        >>> outputs = torch.tensor([ [1., -1.], [-9., 1.], [0.9, 0.1], [0.1, 0.9] ])
        >>> outputs = outputs.unsqueeze(1)
        >>> cos = classify(outputs)
        >>> (cos < -1.0).long().sum()
        tensor(0)
        >>> (cos > 1.0).long().sum()
        tensor(0)
        """

        def __init__(
            self,
            input_size,
            device="cpu",
            lin_blocks=0,
            lin_neurons=192,
            out_neurons=1211,
        ):
            super().__init__()
            self.blocks = nn.ModuleList()

            for block_index in range(lin_blocks):
                self.blocks.extend(
                    [
                        _BatchNorm1d(input_size),
                        Linear(input_size=input_size, n_neurons=lin_neurons),
                    ]
                )
                input_size = lin_neurons

            # Final Layer
            self.weight = nn.Parameter(
                torch.FloatTensor(out_neurons, input_size, device=device)
            )
            nn.init.xavier_uniform_(self.weight)

        def forward(self, x):
            """Returns the cosine similarity between x and each speaker weight.

            Arguments
            ---------
            x : torch.Tensor
                Torch tensor.
            """
            for layer in self.blocks:
                x = layer(x)

            # Need to be normalized
            x = F.linear(F.normalize(x.squeeze(1)), F.normalize(self.weight))
            return x.unsqueeze(1)
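The key line is
    x = F.linear(F.normalize(x.squeeze(1)), F.normalize(self.weight))
which L2-normalizes both x and weight, so the output is the cosine similarity between the embedding and each class weight vector. The AAM-Softmax loss is then computed on these cosines.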
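What happens to those cosines can be sketched as follows. This is a minimal additive-angular-margin softmax written from the general ArcFace-style definition, not speechbrain's implementation. `aam_softmax_loss` is a hypothetical name, and margin=0.2 and scale=30 are illustrative defaults assumed here. The sketch also skips the special handling that production code applies when the angle plus the margin exceeds pi.

```python
import torch
import torch.nn.functional as F


def aam_softmax_loss(cosine, targets, margin=0.2, scale=30.0):
    """Additive angular margin softmax on cosine logits.

    cosine: [N, num_classes] cosine similarities in [-1, 1].
    targets: [N] integer class labels.
    """
    # Clamp before acos to keep the angle and its gradient finite.
    theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
    one_hot = F.one_hot(targets, num_classes=cosine.shape[1]).bool()
    # Add the margin to the target-class angle only: cos(theta_y + m).
    logits = torch.where(one_hot, torch.cos(theta + margin), cosine)
    # Scale the logits and apply ordinary softmax cross-entropy.
    return F.cross_entropy(scale * logits, targets)


torch.manual_seed(0)
emb = F.normalize(torch.randn(4, 192), dim=1)      # 4 embeddings
weight = F.normalize(torch.randn(10, 192), dim=1)  # 10 speaker classes
cosine = F.linear(emb, weight)                     # [4, 10]
targets = torch.tensor([0, 3, 5, 9])

loss_margin = aam_softmax_loss(cosine, targets, margin=0.2)
loss_plain = aam_softmax_loss(cosine, targets, margin=0.0)
```

The margin lowers the target-class logit, so for the same embeddings the loss with a margin is larger than without it. This forces the network to pull each embedding closer in angle to its own class weight.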
For the full code, see speechbrain's implementation of ECAPA-TDNN; all code shown in this article comes from speechbrain: speechbrain/speechbrain (github.com)