How do I calculate the theoretical inference time of a GPU-based network?
I am trying to estimate how long a GPU would take to run inference on a DL network. However, when testing the method, the theoretical and measured computing times turn out to be completely different.
Here is what I am currently doing:
I obtained the network's FLOPs by using https://github.com/Lyken17/pytorch-OpCounter as follows:
from thop import profile  # pip install thop (pytorch-OpCounter)

macs, params = profile(model, inputs=(image,))
tera_flop = macs * 10 ** -12 * 2  # thop reports MACs; 1 MAC = 2 FLOPs, scaled to TFLOPs
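For reference, a self-contained sketch of this step (resnet50 and the random 224x224 input below are only placeholders standing in for the actual model and image):

import torch
from torchvision.models import resnet50  # placeholder model, not the actual network
from thop import profile

model = resnet50()                     # stand-in for the real model
image = torch.randn(1, 3, 224, 224)    # stand-in for the real preprocessed image
macs, params = profile(model, inputs=(image,))
tera_flop = macs * 10 ** -12 * 2       # MACs -> FLOPs (x2), then scale to TFLOPs
print(tera_flop, params)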
For my model and input this gives 0.0184295 TFLOPs. Then I calculated the peak FLOPS of my GPU (NVIDIA RTX A3000):
4096 CUDA cores * 1560 MHz * 2 FLOPs/cycle * 10^-6 = 12.78 TFLOPS
This gave me a theoretical inference time of:
0.0184 TFLOPs / 12.7795 TFLOPS = 0.00144 s
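Written out as a small script, the back-of-envelope estimate is plain arithmetic; the TFLOP count comes from the thop output above and the core count and clock are the RTX A3000 specs quoted above:

model_tflop = 0.0184295        # work for one forward pass, in TFLOP (from thop)
cuda_cores = 4096              # RTX A3000 CUDA cores
clock_mhz = 1560               # boost clock in MHz
flops_per_cycle = 2            # one FMA = 2 FLOPs per CUDA core per cycle

peak_tflops = cuda_cores * clock_mhz * flops_per_cycle * 10 ** -6   # ~= 12.78 TFLOPS
theoretical_time_s = model_tflop / peak_tflops                      # ~= 0.00144 s
print(peak_tflops, theoretical_time_s)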
Then, I measured the actual inference time with the following code:
import numpy as np
import torch

model.eval()
model.to(device)
image = image.unsqueeze(0).to(device)

# CUDA events record timestamps on the GPU itself, so asynchronous kernel
# launches do not distort the measurement
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
reps = 300
timings = np.zeros((reps, 1))

# GPU warm-up: the first few calls pay one-off CUDA initialisation costs
for _ in range(10):
    _ = model(image)

# Measure performance
with torch.no_grad():
    for rep in range(reps):
        start.record()
        _ = model(image)
        end.record()
        # Wait for the GPU to finish before reading the elapsed time
        torch.cuda.synchronize()
        curr_time = start.elapsed_time(end)  # milliseconds
        timings[rep] = curr_time

mean_syn = np.sum(timings) * 10 ** -3 / reps  # mean time per inference, in seconds
This gave me a measured computing time of 0.028 s.
Could you please help me figure out what I am doing wrong here?
1 Answer
I believe such a simple calculation will never give the correct result; the inference time of an ML model depends on many factors beyond peak arithmetic throughput, including the work that happens on both the CPU and the GPU.
In conclusion, the best way to estimate the time is to run some experiments/trials.
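To see how large the gap is, you can invert the numbers quoted in the question and compute the throughput the GPU actually achieved (plain arithmetic, using only the figures given above):

model_tflop = 0.0184295    # work per forward pass, from the thop count in the question
measured_s = 0.028         # measured mean inference time from the question
peak_tflops = 12.78        # theoretical peak computed in the question

achieved_tflops = model_tflop / measured_s     # ~= 0.66 TFLOPS
utilization = achieved_tflops / peak_tflops    # ~= 0.05, i.e. about 5% of peak
print(achieved_tflops, utilization)

At batch size 1, a network typically spends most of its time on memory traffic, kernel launches and framework overhead rather than on arithmetic, so reaching only a few percent of peak FLOPS is expected, and a peak-FLOPS estimate will be far too optimistic.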
Useful resources:
https://deci.ai/blog/measure-inference-time-deep-neural-networks/