

Fixing the PyTorch distributed launch watchdog timeout error

[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803170 milliseconds before timing out

Ubuntu 20.04

Observed while training the BSRGAN and HAT models

 

https://github.com/cszn/BSRGAN

 


 

https://github.com/XPixelGroup/HAT

 


 

https://github.com/WongKinYiu/yolov7/issues/714

 

Watchdog caught collective operation timeout · Issue #714 · WongKinYiu/yolov7


 

Training runs fine, but during the validation process it sometimes terminates abruptly with the following error:

 

[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803170 milliseconds before timing out

 

This happens because the default timeout of torch.distributed.init_process_group, the function PyTorch uses to set up distributed training, is 1800 seconds (30 minutes).

https://pytorch.org/docs/stable/distributed.html

 

Distributed communication package - torch.distributed — PyTorch 1.13 documentation

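To confirm the cause, the default can be inspected directly. A quick sketch, assuming the constant lives in torch.distributed.constants (where recent PyTorch versions keep it):

from torch.distributed.constants import default_pg_timeout

print(default_pg_timeout)  # datetime.timedelta(seconds=1800), i.e. 30 minutes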

 

Therefore, modify the dist.init_process_group(backend=backend, **kwargs) call on the last line of the _init_dist_pytorch helper shown below so that it passes a longer timeout:

 

 

import os, datetime
import torch
import torch.distributed as dist

def _init_dist_pytorch(backend, **kwargs):
    rank = int(os.environ['RANK'])
    num_gpus = torch.cuda.device_count()
    torch.cuda.set_device(rank % num_gpus)
    # timeout raised from the default 1800 s to 18000 s (5 hours)
    dist.init_process_group(backend=backend, timeout=datetime.timedelta(seconds=18000), **kwargs)
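For context, here is a minimal sketch of how the modified helper would be used. It assumes the script was started with torchrun or python -m torch.distributed.launch (which export the RANK environment variable read above); the backend name and the tensor are only illustrative:

import torch
import torch.distributed as dist

# Assumes a launch such as:  python -m torch.distributed.launch --nproc_per_node=4 train.py
# (or the torchrun equivalent), which sets RANK / WORLD_SIZE for every worker.
_init_dist_pytorch('nccl')

# The kind of collective that was timing out. With the longer timeout, it can
# wait up to 18000 s for slow ranks, e.g. a long validation pass on rank 0.
x = torch.ones(1, device='cuda')
dist.all_reduce(x)

The exact value is not critical; it only needs to be longer than the slowest gap between collective calls, and 18000 seconds is simply ten times the default.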