multi nodes GPU training issue: Connection reset by peer · facebookresearch/maskrcnn-benchmark#315

27 comments · 0 reactions · 0 assignees · Python · 9,161 stars · 2,574 forks

Labels: contributions welcome, enhancement, help wanted

Description

❓ Questions and Help

I followed the repository's example, and training works both with a single GPU and with multiple GPUs on one machine. However, when I try to train across two machines (two nodes), the following error occurs:

Traceback (most recent call last):
  File "./tools/train_net.py", line 177, in <module>
    main()
  File "./tools/train_net.py", line 146, in main
    backend="nccl", init_method="env://"
  File "/home/ubuntu/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
    group_name, rank)
RuntimeError: Connection reset by peer at /opt/conda/conda-bld/pytorch-nightly_1542100572345/work/torch/lib/THD/process_group/General.cpp:20

(Both local processes print this same traceback; the two copies were interleaved in the output.)

I added the following code to train_net.py on the first machine:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# for multi-node training
os.environ['MASTER_ADDR'] = '192.168.43.100'
os.environ['MASTER_PORT'] = '9997'
os.environ['WORLD_SIZE'] = '4'
os.environ['RANK'] = '0'

I added the following code to train_net.py on the second machine:

Then I set up the same environment on both machines and ran the following command on each of them:

python -m torch.distributed.launch --nproc_per_node=2 ./tools/train_net.py --config-file "./configs/e2e_mask_rcnn_R_101_FPN_1x.yaml"
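The command above passes only --nproc_per_node. torch.distributed.launch also accepts --nnodes, --node_rank, --master_addr and --master_port, and, as far as its documentation describes, it then exports WORLD_SIZE and RANK to each worker itself. The sketch below only illustrates that rank layout for two nodes with two processes each; `launcher_env` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
def launcher_env(nnodes, nproc_per_node, node_rank, local_rank):
    """Return the WORLD_SIZE / RANK pair one worker would be expected to get.

    Assumes the usual layout: global rank = node_rank * nproc_per_node + local_rank.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return {
        "WORLD_SIZE": str(nnodes * nproc_per_node),
        "RANK": str(node_rank * nproc_per_node + local_rank),
    }


# Two machines, two GPUs each, as in this issue.
for node_rank in range(2):
    for local_rank in range(2):
        env = launcher_env(2, 2, node_rank, local_rank)
        print(f"node {node_rank}, local process {local_rank}: {env}")
```

Under this layout every one of the four processes gets a distinct RANK from 0 to 3, and all of them share WORLD_SIZE 4.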

On the machine with node rank 0, the script waits indefinitely and never starts training. The output looks like this:

....
    PRE_NMS_TOP_N_TRAIN: 2000
    RPN_HEAD: SingleConvRPNHead
    STRADDLE_THRESH: 0
    USE_FPN: True
  RPN_ONLY: False
  WEIGHT: /home/share/maskrcnn-benchmark/R-101.pkl
OUTPUT_DIR: .
PATHS_CATALOG: /home/share/maskrcnn-benchmark/maskrcnn_benchmark/config/paths_catalog.py
SOLVER:
  BASE_LR: 0.0001
  BIAS_LR_FACTOR: 2
  CHECKPOINT_PERIOD: 2500
  GAMMA: 0.1
  IMS_PER_BATCH: 8
  MAX_ITER: 90000
  MOMENTUM: 0.9
  STEPS: (60000, 80000)
  WARMUP_FACTOR: 0.3333333333333333
  WARMUP_ITERS: 500
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0
TEST:
  EXPECTED_RESULTS: []
  EXPECTED_RESULTS_SIGMA_TOL: 4
  IMS_PER_BATCH: 8

The machine with node rank 1 is the one that fails with the "Connection reset by peer" error shown above.

Does anyone know how to solve this problem, and how to run distributed training across several machines?
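One thing that may be worth ruling out first: if the os.environ assignments shown earlier run in every process started by the launcher, both local processes on the first machine end up with RANK 0. Whether that is the cause here is an assumption, not something the logs confirm. The sketch below is a small consistency check over per-process settings; `find_rank_problems` is a hypothetical helper, not part of PyTorch or maskrcnn-benchmark.

```python
def find_rank_problems(envs):
    """Return a list of human-readable problems in per-process env settings.

    `envs` holds one dict per process with the WORLD_SIZE and RANK strings
    that process would see.
    """
    world_sizes = {env.get("WORLD_SIZE") for env in envs}
    if None in world_sizes or len(world_sizes) != 1:
        return ["WORLD_SIZE is missing or differs between processes"]

    world_size = int(world_sizes.pop())
    problems = []
    seen = set()
    for index, env in enumerate(envs):
        if "RANK" not in env:
            problems.append(f"process {index}: RANK is not set")
            continue
        rank = int(env["RANK"])
        if not 0 <= rank < world_size:
            problems.append(f"process {index}: RANK {rank} is outside [0, {world_size})")
        elif rank in seen:
            problems.append(f"process {index}: RANK {rank} is already used by another process")
        seen.add(rank)
    return problems


# The two local processes on the first machine, if both pick up the
# hard-coded values from train_net.py.
hard_coded = {"WORLD_SIZE": "4", "RANK": "0"}
print(find_rank_problems([dict(hard_coded), dict(hard_coded)]))
```

A set of processes with distinct ranks inside the world size yields an empty list.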

Contributor guide

Tech stack: python, pytorch
Area: machine learning
Issue type: bug
Difficulty: 4
Estimated time: 3-5 days
Activity status: stale
Clarity: needs investigation
Prerequisites: PyTorch basics, Distributed training concepts, Network configuration
Beginner friendliness: 10
Investigation approach: The issue reports a "Connection reset by peer" error during multi-node distributed training with the NCCL backend. Investigation should focus on verifying network connectivity between the two machines (check that MASTER_ADDR is reachable and review firewall rules), ensuring that both machines run the same PyTorch version with compatible NCCL builds, and reviewing the environment variable settings (MASTER_PORT, WORLD_SIZE, RANK). Refer to the traceback pointing at General.cpp and to similar resolved issues in the repository. The user has provided logs; test with a minimal distributed script to isolate the problem.
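The connectivity check mentioned above can be sketched with the Python standard library alone. This is a minimal sketch, assuming it is run on the second machine while the rank 0 process is waiting in init_process_group, since the master port is presumably only open during that time. `can_connect` is a hypothetical helper, and the demo uses a throwaway listener on localhost rather than the real master address.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Local demonstration: listen on a free port, probe it, close it, probe again.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
demo_port = server.getsockname()[1]

print("listener open:  ", can_connect("127.0.0.1", demo_port))
server.close()
print("listener closed:", can_connect("127.0.0.1", demo_port, timeout=1.0))

# On the second machine, with the values from this issue:
# can_connect("192.168.43.100", 9997)
```

A False result for the real address would point toward a firewall, routing, or wrong-interface problem rather than toward PyTorch itself; a True result only shows that the TCP port is reachable, not that NCCL communication will work.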