How to compute the theoretical inference time of a GPU-based network?



I am trying to estimate how long a GPU would take to run inference on a DL network. However, when testing the method, the theoretical and real computing times turn out to be completely different.

Here is what I am currently doing:

I obtained the network's FLOPs by using https://github.com/Lyken17/pytorch-OpCounter as follows:

from thop import profile  # pytorch-OpCounter

macs, params = profile(model, inputs=(image, ))
tera_flop = macs * 10 ** -12 * 2  # 1 MAC ≈ 2 FLOPs, scaled to TFLOPs

This gives 0.0184295 TFLOPs. Then I calculated the peak FLOPS of my GPU (NVIDIA RTX A3000):

4096 CUDA cores * 1560 MHz * 2 FLOPs/cycle (FMA) * 10^-6 = 12.7795 TFLOPS

This gave me a theoretical inference time of:

0.0184 TFLOPs / 12.7795 TFLOPS = 0.00144 s
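
For reference, here is the same back-of-the-envelope arithmetic as a minimal sketch; the core count, boost clock and 2 FLOPs per cycle (one FMA) are the assumed values used above, not numbers queried from the device:

# Back-of-the-envelope peak throughput and theoretical time (values from above)
cuda_cores = 4096            # RTX A3000 CUDA cores (assumed from the spec sheet)
boost_clock_mhz = 1560       # boost clock in MHz (assumed)
flops_per_cycle = 2          # one fused multiply-add = 2 FLOPs per core per cycle
peak_tflops = cuda_cores * boost_clock_mhz * flops_per_cycle * 1e-6   # ≈ 12.78 TFLOPS

model_tflop = 0.0184295      # TFLOPs per forward pass, from pytorch-OpCounter above
theoretical_time_s = model_tflop / peak_tflops                        # ≈ 0.00144 s
print(f"{peak_tflops:.4f} TFLOPS peak, {theoretical_time_s:.5f} s theoretical")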

Then, I measured the real inference time by applying the following:

import numpy as np
import torch

model.eval()
model.to(device)
image = image.unsqueeze(0).to(device)

start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
reps = 300
timings = np.zeros((reps, 1))

# GPU warm-up so the first measured runs are not penalised by lazy initialisation
for _ in range(10):
    _ = model(image)

# Measure performance
with torch.no_grad():
    for rep in range(reps):
        start.record()
        _ = model(image)
        end.record()
        # Wait for the GPU to finish before reading the timer
        torch.cuda.synchronize()
        curr_time = start.elapsed_time(end)   # milliseconds
        timings[rep] = curr_time

mean_syn = np.sum(timings) * 10 ** -3 / reps  # mean time per forward pass, in seconds

This gave me a real computing time of 0.028 s, roughly 20 times the theoretical estimate.

Could you please help me figure out what I am doing wrong here?


1 Answer

时间海 2025-01-19 09:34:57


I believe such a simple calculation will never give the correct result; the inference time of an ML model depends on:

  • Computing power/performance
  • Model architecture (i.e. in a NN, operations are performed layer by layer — see the sketch below)
  • Synchronous/asynchronous execution on the CPU and GPU

In conclusion, the best way to estimate the time is to run some experiments/trials.
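
As an illustration of the layer-by-layer point above, here is a minimal sketch that times each top-level child module with CUDA events attached via forward hooks. The `model`, `image` and `device` names are assumed to match the question's setup, and the per-layer numbers are only approximate because kernel launches are asynchronous:

import torch

def attach_layer_timers(model):
    # Attach CUDA-event timers to each direct child module via forward hooks.
    records = []   # filled with (layer name, start event, end event) per forward call
    handles = []

    def make_hooks(name):
        start_evt = torch.cuda.Event(enable_timing=True)
        end_evt = torch.cuda.Event(enable_timing=True)

        def pre_hook(module, inputs):
            start_evt.record()

        def post_hook(module, inputs, output):
            end_evt.record()
            records.append((name, start_evt, end_evt))

        return pre_hook, post_hook

    for name, child in model.named_children():
        pre, post = make_hooks(name)
        handles.append(child.register_forward_pre_hook(pre))
        handles.append(child.register_forward_hook(post))
    return records, handles

# Usage, assuming model, image and device are set up as in the question:
records, handles = attach_layer_timers(model)
with torch.no_grad():
    _ = model(image)
torch.cuda.synchronize()   # make sure all recorded events have completed
for name, s, e in records:
    print(f"{name}: {s.elapsed_time(e):.3f} ms")
for h in handles:
    h.remove()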

Useful resources:
https://deci.ai/blog/measure-inference-time-deep-neural-networks/
