Speeding up 1D convolution in PyTorch

Posted 2025-01-18 03:25:40

For my project I am using PyTorch as a linear algebra backend. For the performance-critical part of my code, I need to compute 1D convolutions of 2 small (length between 2 and 9) vectors (1D tensors) a very large number of times. My code allows for batch processing of inputs, so I can stack a couple of input vectors into matrices that can then all be convolved at the same time. Since torch.conv1d does not allow convolving along a single dimension of a 2D input, I had to write my own convolution function, called convolve. This new function, however, consists of a double for-loop and is therefore very slow.

Question: how can I make the convolve function perform faster through better code design, while still letting it deal with batched inputs (= 2D tensors)?

Partial answer: somehow avoid the double for-loop

Below are three Jupyter notebook cells that recreate a minimal example. Note that you need line_profiler and the %%writefile magic command to make this work!

%%writefile SO_CONVOLVE_QUESTION.py
import torch

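# Full 1D convolution of two 1D tensors via torch.conv1d. torch.conv1d computes a
# cross-correlation, so the kernel is flipped first; padding = len(v) - 1 yields the
# full output of length len(a) + len(v) - 1.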
def conv1d(a, v):
    padding = v.shape[-1] - 1
    return torch.conv1d(
        input=a.view(1, 1, -1), weight=v.flip(0).view(1, 1, -1), padding=padding, stride=1
    ).squeeze()

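# Batched convolution without torch.conv1d: build the outer products a[:, i] * v[:, j],
# then accumulate all entries with the same i + j (the anti-diagonals) into the output.
# The double for-loop below is the slow part.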
def convolve(a, v):
    if a.ndim == 1:
        a = a.view(1, -1)
        v = v.view(1, -1) 

    nrows, vcols = v.shape
    acols = a.shape[1]

    expanded = a.view((nrows, acols, 1)) * v.view((nrows, 1, vcols))
    noutdim = vcols + acols - 1
    out = torch.zeros((nrows, noutdim))
    for i in range(acols):  
        for j in range(vcols):
            out[:, i+j] += expanded[:, i, j]  
    return out.squeeze()
    
x = torch.randn(5)
y = torch.randn(7)

I write the code to SO_CONVOLVE_QUESTION.py because that is necessary for line_profiler and so the module can serve as the setup for timeit.timeit.

Now we can evaluate the output and performance of the code above on non-batch input (x, y) and batched input (x_batch, y_batch):

from SO_CONVOLVE_QUESTION import *
import timeit

# The setup string passed to timeit below is not shown in the original post;
# a definition along these lines is assumed:
setup = 'from SO_CONVOLVE_QUESTION import *'

# Without batch processing
res1 = conv1d(x, y)
res = convolve(x, y)
print(torch.allclose(res1, res)) # True

# With batch processing, NB first dimension!
x_batch = torch.randn(5, 5)
y_batch = torch.randn(5, 7)

results = []
for i in range(5):
    results.append(conv1d(x_batch[i, :], y_batch[i, :]))
res1 = torch.stack(results)
res = convolve(x_batch, y_batch)
print(torch.allclose(res1, res))  # True

print(timeit.timeit('convolve(x, y)', setup=setup, number=10000)) # 4.83391789999996
print(timeit.timeit('conv1d(x, y)', setup=setup, number=10000))   # 0.2799923000000035

In the block above you can see that performing the convolution 5 times with the conv1d function produces the same result as convolve on the batched inputs. We can also see that convolve (≈ 4.8 s) is much slower than conv1d (≈ 0.28 s). Below we profile the slow part of the convolve function WITHOUT batch processing, using line_profiler:

%load_ext line_profiler
%lprun -f convolve convolve(x, y)  # evaluated without batch-processing!

Output:

Timer unit: 1e-07 s

Total time: 0.0010383 s
File: C:\python_projects\pysumo\SO_CONVOLVE_QUESTION.py
Function: convolve at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           def convolve(a, v):
    10         1         68.0     68.0      0.7      if a.ndim == 1:
    11         1        271.0    271.0      2.6          a = a.view(1, -1)
    12         1         44.0     44.0      0.4          v = v.view(1, -1) 
    13                                           
    14         1         28.0     28.0      0.3      nrows, vcols = v.shape
    15         1         12.0     12.0      0.1      acols = a.shape[1]
    16                                           
    17         1       4337.0   4337.0     41.8      expanded = a.view((nrows, acols, 1)) * v.view((nrows, 1, vcols))
    18         1         12.0     12.0      0.1      noutdim = vcols + acols - 1
    19         1        127.0    127.0      1.2      out = torch.zeros((nrows, noutdim))
    20         6         32.0      5.3      0.3      for i in range(acols):  
    21        40        209.0      5.2      2.0          for j in range(vcols):
    22        35       5194.0    148.4     50.0              out[:, i+j] += expanded[:, i, j]  
    23         1         49.0     49.0      0.5      return out.squeeze()

Clearly the double for-loop and the line creating the expanded tensor are the slowest. Can these parts be avoided with better code design?

Comments (2)

墨小沫ゞ 2025-01-25 03:25:40

PyTorch's functional module, torch.nn.functional, provides a conv1d function (and obviously a 2D version as well, and much more). We will use conv1d.

Suppose you want to convolve the 100 vectors given in v1 with one other vector given in v2. v1 needs dimensions (minibatch, in_channels, iW), with 1 channel by default. The weight v2 needs dimensions (out_channels, in_channels/groups, kW). You are using 1 channel and therefore 1 group, so v1 and v2 are given by:

import torch
from torch.nn import functional as F

num_vectors = 100
len_vectors = 9
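# v1: 100 single-channel input vectors of length 9; v2: one single-channel kernel of length 6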
v1 = torch.rand((num_vectors, 1, len_vectors))
v2 = torch.rand(1, 1, 6)

Now we can simply compute the necessary padding via

padding = torch.min(torch.tensor([v1.shape[-1], v2.shape[-1]])).item() - 1

and the convolution can be done using

conv_result = F.conv1d(v1, v2, padding=padding)

I did not time it but it should be considerably faster than your initial double for loop.
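
A quick way to check this (a sketch, not part of the original answer; it reuses v1, v2, padding and num_vectors from the cell above) is to time the single batched call against a Python loop over the individual vectors:

import timeit

batched = lambda: F.conv1d(v1, v2, padding=padding)                                       # one call over the whole batch
looped = lambda: [F.conv1d(v1[i:i+1], v2, padding=padding) for i in range(num_vectors)]   # one call per vector

print(timeit.timeit(batched, number=1000))
print(timeit.timeit(looped, number=1000))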

月棠 2025-01-25 03:25:40

It turns out that there is a way to do it without for-loops, by grouping the inputs along a dimension:

out = torch.conv1d(x_batch.unsqueeze(0), y_batch.unsqueeze(1).flip(2), padding=y_batch.size(1)-1, groups=x_batch.size(0))
print(torch.allclose(out, res1))  # True
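
Note that the grouped call returns the batch with an extra leading dimension of size 1 (coming from the unsqueeze(0) on x_batch), so its shape is (1, 5, 11) for the example above, while convolve returns (5, 11). A small sketch (not part of the original answer) to recover the same layout:

out_2d = out.squeeze(0)               # drop the leading dimension of size 1
print(out_2d.shape)                   # torch.Size([5, 11])
print(torch.allclose(out_2d, res1))   # True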