How do I change the input dimension of my neural network?

Posted on 2025-01-20 06:39:38


I want to implement the model below for speaker verification from speech. After reading the wav files from a folder, I use log Mel filter bank energies to extract features. Now I want to use these features as input, but as described in the figure, the input feature dimension must be 80, whereas mine is 1430. Do I have to use the batch size to segment the features, or do I have to use a dimensionality reduction technique? (I'm using Python and PyTorch for the implementation.)
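For reference, if the filter bank is configured with 80 Mel filters, the feature matrix already comes out as (num_frames, 80), and it only needs to be transposed and unsqueezed into the (batch, channels, 80, T) layout a Conv2d expects. A minimal NumPy sketch; the frame count here is an arbitrary assumption, and random values stand in for real filter bank energies:

```python
import numpy as np

# Assume log Mel filter bank energies computed with 80 filters
# (e.g. nfilt=80 in python_speech_features.logfbank) over T frames,
# giving a feature matrix of shape (T, 80).
T = 200                         # hypothetical number of frames
feats = np.random.randn(T, 80)  # stand-in for real features

# Conv2d expects (batch, channels, height, width) = (1, 1, 80, T):
x = feats.T[np.newaxis, np.newaxis, :, :]
print(x.shape)  # (1, 1, 80, 200)
```

With this layout, the "80" in the figure is the number of Mel filters (the height axis), and T is free to vary with the utterance length.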

[figure: speaker verification model]

My code for the first layers:

class NeuralNetwork(nn.Module):
    def __init__(self, num_class):
        super(NeuralNetwork, self).__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 80, T),
              BatchNorm2d(4),
              ReLU(inplace=True),
              MaxPool2d(kernel_size=2, stride=2))

        self.conv2 = nn.Sequential(nn.Conv2d(128, 40, T),
              BatchNorm2d(4),
              ReLU(inplace=True),
              MaxPool2d(kernel_size=2, stride=1))

        self.conv3 = nn.Sequential(nn.Conv2d(128, 40, T),
              BatchNorm2d(4),
              ReLU(inplace=True),
              MaxPool2d(kernel_size=2, stride=1))

        self.conv4 = nn.Sequential(nn.Conv2d(128, 40, T),
              BatchNorm2d(4),
              ReLU(inplace=True),
              MaxPool2d(kernel_size=2, stride=1))

        self.conv5 = nn.Sequential(nn.Conv2d(128, 20, T = flatten),
              BatchNorm2d(4),
              ReLU(inplace=True),
              MaxPool2d(kernel_size=2, stride=2))

But the dimension of the features extracted from the audio is not 80, and I don't know how to change it to 80.


Comments (1)

余厌 2025-01-27 06:39:38


I am not as familiar with PyTorch; I know more about TensorFlow. But I can help you with reading the documentation https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html. It appears PyTorch requires you to specify the input and output channel counts. In general, a conv2d(I, O, (k1, k2)) takes input data of dimensions B x I x H x W and turns it into data like B x O x (H-k1+1) x (W-k2+1), where B is the batch size. You appear to have variables in the wrong place (for example, T appears to be the width) and are missing the BK layer (I don't know what that is or what it does). It also appears that the paper keeps the dimensions the same, so you need to use some padding, padding="same". Notice also that the kernel size does affect the output dimensions (which makes sense; look up how to blur an image by convolving with a Gaussian kernel). Your layers should look something like:

# (B x 1 x 80 x T)
nn.Conv2d(1, C, (3,3), padding="same")
# (B x C x 80 x T)
BK_layer # I don't know what this does but it appears to cut the height in half.
# (B x C x 40 x T)
nn.Conv2d(C, C, (3,3), padding="same")
# (B x C x 40 x T)
ResBlock # I don't know what this does but it appears to not change the dimensions. 
# (B x C x 40 x T)
nn.Conv2d(C, C, (3,3), padding="same")
# (B x C x 40 x T)
ResBlock 
# (B x C x 40 x T)
nn.Conv2d(C, C, (3,3), padding="same")
# (B x C x 40 x T)
ResBlock 
# (B x C x 40 x T)
nn.Conv2d(C, C, (3,3), padding="same")
# (B x C x 40 x T)
BK_layer 
# (B x C x 20 x T) # got cut in half again
ECAPA-TDNN
# S x 1
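As an aside on the shape arithmetic: without padding, a k1 x k2 kernel shrinks H x W to (H-k1+1) x (W-k2+1), which is why padding="same" is needed to keep the figure's dimensions. A naive single-channel sketch that checks this; `conv2d_valid` is a toy helper written for illustration, not a PyTorch function:

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Naive 'valid' 2D cross-correlation: output is (H-k1+1, W-k2+1)."""
    H, W = x.shape
    k1, k2 = kernel.shape
    out = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output pixel is the sum of an elementwise product
            # between the kernel and a k1 x k2 patch of the input.
            out[i, j] = np.sum(x[i:i + k1, j:j + k2] * kernel)
    return out

x = np.ones((80, 18))   # one channel of a hypothetical (1, 80, T) input
k = np.ones((3, 3))
y = conv2d_valid(x, k)
print(y.shape)  # (78, 16): height and width each shrink by kernel_size - 1
```

With padding="same", PyTorch pads the input so the output keeps the full 80 x T spatial size instead.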

I don't know how the ECAPA-TDNN stage reaches the S x 1 dimensions, but it appears to involve some flattening, and it sounds like it involves a merge layer over multiple stacks.

I left out the activation layers for simplicity. I highly recommend reading about "convolving" over an image with a kernel. Convolutional layers make a lot more sense once you understand how convolving over an image lets you detect edges or blur images. The difference between "Gaussian blurring" and an "NN convolution layer" is that one kernel is fixed (typically a button in GIMP or Photoshop) while the other is dynamic: an evolving kernel that changes as the network learns.
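To make the Gaussian-blur comparison concrete: blurring is just convolving with a fixed, hand-chosen kernel, whereas a Conv2d kernel starts random and is updated by backprop. A small sketch of such a fixed kernel (the size and sigma here are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """A fixed Gaussian kernel, like the one an image editor's blur uses."""
    ax = np.arange(size) - size // 2          # e.g. [-2, -1, 0, 1, 2]
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()                        # normalize to preserve brightness

k = gaussian_kernel()
print(k.shape)       # (5, 5)
print(k[2, 2] == k.max())  # True: weight peaks at the center
```

Convolving an image with `k` blurs it; a learned Conv2d kernel has the same mechanics but its weights are free parameters.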

As you can see, the input is expected to be 1 x 80 x T. I am unsure how you expect to fit your 1430 into that. There are a number of ways to get the form you want, from adding padding, to dropping values, to using an additional layer, etc. For example, if 1430 were padded with 10 zeros at the end, it would be 1440, which can be reshaped to 1 x 80 x 18.
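That pad-and-reshape idea can be sketched as follows (whether a flat 1430-element vector is the right view of your features is an assumption; random values stand in for real ones):

```python
import numpy as np

feats = np.random.randn(1430)     # hypothetical flat feature vector
padded = np.pad(feats, (0, 10))   # append 10 zeros: 1430 + 10 = 1440
x = padded.reshape(1, 80, 18)     # 80 * 18 = 1440
print(x.shape)  # (1, 80, 18)
```

If the 1430 is actually a number of frames rather than a flat feature length, the cleaner fix is the one earlier in this thread: configure the filter bank with 80 filters so the height axis is 80 by construction.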
