How to compute the theoretical inference time of a GPU-based network?



I am trying to estimate how long a GPU would take to run inference on a DL network. However, when testing the method, the theoretical and real computing times turn out to be completely different.

Here is what I am currently doing:

I obtained the network's FLOPs by using https://github.com/Lyken17/pytorch-OpCounter as follows:

from thop import profile  # pytorch-OpCounter

macs, params = profile(model, inputs=(image, ))
tera_flop = macs * 10 ** -12 * 2  # 1 MAC ≈ 2 FLOPs, scaled to TFLOPs

This gives 0.0184295 TFLOPs. Then I calculated the peak FLOPS of my GPU (NVIDIA RTX A3000):

4096 CUDA cores * 1560 MHz * 2 FLOPs/cycle (FMA) * 10^-6 = 12.7795 TFLOPS

This gave me a theoretical inference time of:

0.0184 TFLOPs / 12.7795 TFLOPS = 0.00144 s
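
For reference, here is the same back-of-the-envelope arithmetic as a minimal sketch; the core count, boost clock and 2 FLOPs per cycle (one FMA) are the assumed values used above, not numbers queried from the device:

# Back-of-the-envelope peak throughput and theoretical time (values from above)
cuda_cores = 4096            # RTX A3000 CUDA cores (assumed from the spec sheet)
boost_clock_mhz = 1560       # boost clock in MHz (assumed)
flops_per_cycle = 2          # one fused multiply-add = 2 FLOPs per core per cycle
peak_tflops = cuda_cores * boost_clock_mhz * flops_per_cycle * 1e-6   # ≈ 12.78 TFLOPS

model_tflop = 0.0184295      # TFLOPs per forward pass, from pytorch-OpCounter above
theoretical_time_s = model_tflop / peak_tflops                        # ≈ 0.00144 s
print(f"{peak_tflops:.4f} TFLOPS peak, {theoretical_time_s:.5f} s theoretical")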

Then, I measured the real inference time by applying the following:

import numpy as np
import torch

model.eval()
model.to(device)
image = image.unsqueeze(0).to(device)

start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
reps = 300
timings = np.zeros((reps, 1))

# GPU warm-up so the first measured runs are not penalised by lazy initialisation
for _ in range(10):
    _ = model(image)

# Measure performance
with torch.no_grad():
    for rep in range(reps):
        start.record()
        _ = model(image)
        end.record()
        # Wait for the GPU to finish before reading the timer
        torch.cuda.synchronize()
        curr_time = start.elapsed_time(end)   # milliseconds
        timings[rep] = curr_time

mean_syn = np.sum(timings) * 10 ** -3 / reps  # mean time per forward pass, in seconds

This gave me a real computing time of 0.028 s, roughly 20 times the theoretical estimate.

Could you please help me figure out what I am doing wrong here?


1 Answer

时间海 2025-01-19 09:34:57


I believe such a simple calculation will never give the correct result; the inference time of an ML model depends on:

  • Computing power/performance
  • Model architecture (i.e. in a NN, operations are performed layer by layer — see the sketch below)
  • Synchronous/asynchronous execution on the CPU and GPU

In conclusion, the best way to estimate the time is to run some experiments/trials.
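
As an illustration of the layer-by-layer point above, here is a minimal sketch that times each top-level child module with CUDA events attached via forward hooks. The `model`, `image` and `device` names are assumed to match the question's setup, and the per-layer numbers are only approximate because kernel launches are asynchronous:

import torch

def attach_layer_timers(model):
    # Attach CUDA-event timers to each direct child module via forward hooks.
    records = []   # filled with (layer name, start event, end event) per forward call
    handles = []

    def make_hooks(name):
        start_evt = torch.cuda.Event(enable_timing=True)
        end_evt = torch.cuda.Event(enable_timing=True)

        def pre_hook(module, inputs):
            start_evt.record()

        def post_hook(module, inputs, output):
            end_evt.record()
            records.append((name, start_evt, end_evt))

        return pre_hook, post_hook

    for name, child in model.named_children():
        pre, post = make_hooks(name)
        handles.append(child.register_forward_pre_hook(pre))
        handles.append(child.register_forward_hook(post))
    return records, handles

# Usage, assuming model, image and device are set up as in the question:
records, handles = attach_layer_timers(model)
with torch.no_grad():
    _ = model(image)
torch.cuda.synchronize()   # make sure all recorded events have completed
for name, s, e in records:
    print(f"{name}: {s.elapsed_time(e):.3f} ms")
for h in handles:
    h.remove()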

Useful resources:
https://deci.ai/blog/measure-inference-time-deep-neural-networks/
