使用 DistributedDataParallel 的多节点处理给我一个权限被拒绝的错误

发布于 2025-01-18 09:10:18 字数 1606 浏览 2 评论 0原文

我正在尝试在大学超级计算机上使用 Pytorch 的 DistributedDataParallel 实现多节点作业,我使用端口 22 通过 ssh 登录。 " rel="nofollow noreferrer">教程,当我设置MASTER_PORT=12340 或 SLURM 脚本上的其他数字,我显然没有得到任何响应,因为上面没有发生任何事情。如果我设置 MASTER_PORT=22,当代码到达 dist.init_process_group() 方法时,我会得到权限被拒绝:

dist.init_process_group(backend=opt.dist_backend, init_method=opt.dist_url,
                                world_size=opt.world_size, rank=opt.rank)

给我

Traceback (most recent call last):
  File "train_dist.py", line 262, in <module>
    main()
  File "train_dist.py", line 220, in main
    world_size=opt.world_size, rank=opt.rank)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 232, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:22 (errno: 13 - Permission denied). The server socket has failed to bind to 0.0.0.0:22 (errno: 13 - Permission denied).

我也尝试将端口 22 流量重新路由到其他端口(例如 65000),但我也得到即使尝试此重新路由也被拒绝。我不确定此时我还可以尝试做什么,有人有任何建议吗?或者这是我需要向管理员询问的事情吗?

I'm trying to implement a multi-node job using Pytorch's DistributedDataParallel on a University supercomputer where I'm logging in via ssh using port 22. Following this tutorial, when I set MASTER_PORT=12340 or some other number on the SLURM script, I obviously get no response since there's nothing happening on it. If I set MASTER_PORT=22, I get permission denied when the code reaches the dist.init_process_group() method:

dist.init_process_group(backend=opt.dist_backend, init_method=opt.dist_url,
                                world_size=opt.world_size, rank=opt.rank)

GIVES ME

Traceback (most recent call last):
  File "train_dist.py", line 262, in <module>
    main()
  File "train_dist.py", line 220, in main
    world_size=opt.world_size, rank=opt.rank)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 232, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:22 (errno: 13 - Permission denied). The server socket has failed to bind to 0.0.0.0:22 (errno: 13 - Permission denied).

I have also tried to re-route the port 22 traffic to some other port (eg. 65000) but I also get permission denied for even attempting this rerouting. I'm not sure what else I can try to do at this point, anyone has any suggestions or is this something that I need to ask the administrator for?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。
列表为空,暂无数据
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文