How do I change the input dimensions of my neural network?
I want to implement the following model for verifying a speaker from speech. After reading the wav files from a folder, I extract features as log Mel filterbank energies. Now I want to use these features as input, but as shown in the figure, the input must have 80 features, while mine has 1430. Do I have to split the features by batch size here, or do I have to use a dimensionality reduction technique? (I am using Python and PyTorch for the implementation.)
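For reference, a minimal sketch of my extraction step (assuming torchaudio; the file name and parameter values here are illustrative, not my exact code):

import torch
import torchaudio

# Load one utterance (the file name is just an example).
waveform, sample_rate = torchaudio.load("speaker1.wav")

# Log Mel filterbank energies; n_mels sets the per-frame feature dimension.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-6)    # shape: channels x n_mels x T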
The code for my first 4 layers:
class NeuralNetwork(nn.Module):
    def __init__(self, num_class):
        super(NeuralNetwork, self).__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 80, T),
                                   BatchNorm2d(4),
                                   ReLU(inplace=True),
                                   MaxPool2d(kernel_size=2, stride=2))
        self.conv2 = nn.Sequential(nn.Conv2d(128, 40, T),
                                   BatchNorm2d(4),
                                   ReLU(inplace=True),
                                   MaxPool2d(kernel_size=2, stride=1))
        self.conv3 = nn.Sequential(nn.Conv2d(128, 40, T),
                                   BatchNorm2d(4),
                                   ReLU(inplace=True),
                                   MaxPool2d(kernel_size=2, stride=1))
        self.conv4 = nn.Sequential(nn.Conv2d(128, 40, T),
                                   BatchNorm2d(4),
                                   ReLU(inplace=True),
                                   MaxPool2d(kernel_size=2, stride=1))
        self.conv5 = nn.Sequential(nn.Conv2d(128, 20, T = flatten),
                                   BatchNorm2d(4),
                                   ReLU(inplace=True),
                                   MaxPool2d(kernel_size=2, stride=2))
But the dimension of the features extracted from the audio is not 80, and I don't know how to change it to 80.
1 Answer
I am not as familiar with PyTorch; I know more about TensorFlow. But I can help you read the documentation: https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html. It appears PyTorch needs to know the input and output dimensions. In general, conv2d(I, O, (k1, k2)) takes input data of shape B x I x H x W and turns it into data of shape B x O x (H-k1+1) x (W-k2+1) (with the default stride of 1 and no padding), where B is the batch size.
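For example, a quick shape check (the values here are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 80, 200)      # B x I x H x W
conv = nn.Conv2d(1, 16, (3, 5))     # I=1, O=16, (k1, k2) = (3, 5)
y = conv(x)
print(y.shape)                      # torch.Size([1, 16, 78, 196]) = B x O x (H-k1+1) x (W-k2+1)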
You appear to have variables in the wrong place (for example, T appears to be the width), and you are missing the BK layer (I don't know what that is or what it does). It also appears that the paper keeps the dimensions the same, so you need to use some padding: padding="same".
Notice also that the kernel size does affect the output dimensions (which makes sense; look up how to blur an image by convolving it with a Gaussian kernel). Your layers should look something like the sketch below.
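Here is a rough sketch (the kernel size of 3 and the uniform 128-channel width are assumptions on my part, taken from your conv2 input; adjust them to match the paper):

import torch.nn as nn

# padding="same" keeps H and W unchanged through each convolution,
# so an input of 1 x 80 x T stays 80 x T in the spatial dimensions.
conv1 = nn.Conv2d(1, 128, 3, padding="same")      # 1 x 80 x T   -> 128 x 80 x T
conv2 = nn.Conv2d(128, 128, 3, padding="same")    # 128 x 80 x T -> 128 x 80 x T
conv3 = nn.Conv2d(128, 128, 3, padding="same")
conv4 = nn.Conv2d(128, 128, 3, padding="same")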
I don't know how the ECAPA-TDNN interaction reaches S x 1 dimensions, but it appears to involve some flattening, and it sounds like it involves a merge layer with multiple stacks. I left out the activation layers for simplicity. I highly recommend reading about "convolving" over an image with a kernel. Convolutional layers make much more sense once you understand how convolving over an image lets you detect edges or blur the image. The difference between "Gaussian blurring" and an "NN convolution layer" is that one is fixed (typically a button in GIMP or Photoshop) while the other is dynamic, with an evolving kernel that changes as the network learns.
As you can see, the input is expected to be 1 x 80 x T. I am unsure how you expect to fit your 1430 into that. There are a number of ways to get the form you want, from adding padding to dropping values, using an additional layer, and so on. For example, if 1430 were padded with 10 zeros at the end, it would be 1440, which can be reshaped to 1 x 80 x 18.
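A sketch of that padding-and-reshape option (the 80 x 18 target is just one possible layout):

import torch
import torch.nn.functional as F

feat = torch.randn(1430)         # your flat feature vector
feat = F.pad(feat, (0, 10))      # append 10 zeros -> 1440 values
x = feat.view(1, 80, 18)         # reshape to 1 x 80 x 18
x = x.unsqueeze(0)               # add a batch dimension: B x 1 x 80 x 18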