Should the HuggingFace transformers TrainingArguments `dataloader_num_workers` argument be set per GPU, or total across all GPUs? And does the answer change depending on whether training runs in DataParallel or DistributedDataParallel mode?

For example, if I have a machine with 4 GPUs and 48 CPUs (running only this training task), would there be any expected value in setting `dataloader_num_workers` greater than 12 (48 / 4)? Or would the workers all start contending over the same resources?

As I understand it, when running in DDP mode (with `torch.distributed.launch` or similar), one training process manages each device, whereas in the default DP mode a single lead process manages everything. So maybe the answer is 12 for DDP but ~47 for DP?
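To make the arithmetic concrete, here is a minimal sketch of what I mean, assuming the 4-GPU / 48-CPU machine above. The two candidate values just mirror the 48 / 4 = 12 vs ~47 reasoning in the question; they are not values I know to be recommended.

```python
import os
import torch
from transformers import TrainingArguments

# Sketch of the setup described above (4 GPUs, 48 CPUs, nothing else running).
num_gpus = torch.cuda.device_count()           # 4 in this example
num_cpus = os.cpu_count() or 1                 # 48 in this example

# DDP: one training process per GPU, so split the CPUs evenly per process.
workers_if_ddp = num_cpus // max(num_gpus, 1)  # 48 // 4 = 12

# DP: a single lead process feeds all GPUs, so (the assumption here)
# it could use nearly all CPUs for data loading.
workers_if_dp = max(num_cpus - 1, 1)           # ~47

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    # dataloader_num_workers is passed to the PyTorch DataLoader that
    # Trainer builds; whether it is interpreted per process or in total
    # is exactly what this question is asking.
    dataloader_num_workers=workers_if_ddp,
)
```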