facebookresearch/maskrcnn-benchmark

multi nodes GPU training issue: Connection reset by peer

Open

#315 opened on January 2, 2019

Labels: contributions welcome, enhancement, help wanted

Description

❓ Questions and Help

I followed the repository's example, and training works fine on a single GPU and on multiple GPUs within one machine. However, when I try to train across two machines (two nodes), the following error occurs:

    Traceback (most recent call last):
      File "./tools/train_net.py", line 177, in <module>
        main()
      File "./tools/train_net.py", line 146, in main
        backend="nccl", init_method="env://"
      File "/home/ubuntu/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
        group_name, rank)
    RuntimeError: Connection reset by peer at /opt/conda/conda-bld/pytorch-nightly_1542100572345/work/torch/lib/THD/process_group/General.cpp:20

(This traceback is printed twice, with the two copies interleaved.)

I added the following code to train_net.py on the first machine:

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    # for multi-node training
    os.environ['MASTER_ADDR'] = '192.168.43.100'
    os.environ['MASTER_PORT'] = '9997'
    os.environ['WORLD_SIZE'] = '4'
    os.environ['RANK'] = '0'

I added the following code to train_net.py on the second machine:

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    # for multi-node training
    os.environ['MASTER_ADDR'] = '192.168.43.100'
    os.environ['MASTER_PORT'] = '9997'
    os.environ['WORLD_SIZE'] = '4'
    os.environ['RANK'] = '1'
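For context, with init_method="env://" every process needs its own RANK in the range 0 to WORLD_SIZE - 1. With two processes per machine and WORLD_SIZE set to 4, a single hard-coded RANK per machine would give both local processes the same rank if those assignments run before init_process_group. The sketch below shows the arithmetic that torch.distributed.launch is understood to apply when it sets these variables itself; the helper name launcher_env is hypothetical, and this illustrates the rank layout rather than a confirmed fix for this issue.

```python
def launcher_env(node_rank, local_rank, nproc_per_node, nnodes,
                 master_addr, master_port):
    """Environment one worker process would need for init_method="env://".

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # Total number of processes across all machines.
        "WORLD_SIZE": str(nnodes * nproc_per_node),
        # Global rank: unique per process, not per machine.
        "RANK": str(node_rank * nproc_per_node + local_rank),
    }


# Two machines with two GPUs each, as in the setup described above.
for node_rank in range(2):
    for local_rank in range(2):
        env = launcher_env(node_rank, local_rank, 2, 2, "192.168.43.100", 9997)
        print(node_rank, local_rank, env["RANK"], env["WORLD_SIZE"])
```

Under this layout the second machine's processes get ranks 2 and 3, not 1.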

I then set up the same environment on both machines and ran the following command on each:

    python -m torch.distributed.launch --nproc_per_node=2 ./tools/train_net.py --config-file "./configs/e2e_mask_rcnn_R_101_FPN_1x.yaml"
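The command above passes only --nproc_per_node. torch.distributed.launch also accepts --nnodes, --node_rank, --master_addr and --master_port, which describe a multi-machine job to the launcher directly. The sketch below builds one command line per machine from the address and port given in this issue; launch_command is a hypothetical helper, and whether this resolves the error reported here is an assumption, not something the issue confirms.

```python
import shlex


def launch_command(node_rank, nnodes=2, nproc_per_node=2,
                   master_addr="192.168.43.100", master_port=9997):
    """Build the launcher argv for one machine (hypothetical helper)."""
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nnodes={nnodes}",                  # number of machines
        f"--node_rank={node_rank}",            # 0 on the first machine, 1 on the second
        f"--nproc_per_node={nproc_per_node}",  # processes (GPUs) per machine
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "./tools/train_net.py",
        "--config-file", "./configs/e2e_mask_rcnn_R_101_FPN_1x.yaml",
    ]


# Print the command to run on each machine.
for node_rank in (0, 1):
    print(" ".join(shlex.quote(arg) for arg in launch_command(node_rank)))
```

With the launcher configured this way, hard-coded MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK assignments inside train_net.py would presumably be unnecessary, since the launcher exports those variables for each process.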

On the machine with node rank 0, the script hangs waiting to start training, like this:

    ....
        PRE_NMS_TOP_N_TRAIN: 2000
        RPN_HEAD: SingleConvRPNHead
        STRADDLE_THRESH: 0
        USE_FPN: True
      RPN_ONLY: False
      WEIGHT: /home/share/maskrcnn-benchmark/R-101.pkl
    OUTPUT_DIR: .
    PATHS_CATALOG: /home/share/maskrcnn-benchmark/maskrcnn_benchmark/config/paths_catalog.py
    SOLVER:
      BASE_LR: 0.0001
      BIAS_LR_FACTOR: 2
      CHECKPOINT_PERIOD: 2500
      GAMMA: 0.1
      IMS_PER_BATCH: 8
      MAX_ITER: 90000
      MOMENTUM: 0.9
      STEPS: (60000, 80000)
      WARMUP_FACTOR: 0.3333333333333333
      WARMUP_ITERS: 500
      WARMUP_METHOD: linear
      WEIGHT_DECAY: 0.0001
      WEIGHT_DECAY_BIAS: 0
    TEST:
      EXPECTED_RESULTS: []
      EXPECTED_RESULTS_SIGMA_TOL: 4
      IMS_PER_BATCH: 8

On the machine with node rank 1, the Connection reset by peer error shown above occurs.

Does anyone know how to solve this problem, and how to run distributed training across several machines?
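One basic check when debugging this kind of setup is whether the second machine can open a TCP connection to MASTER_ADDR:MASTER_PORT while the rank 0 machine is waiting. The sketch below uses only the Python standard library; can_reach is a hypothetical helper. A successful connection shows only that the port is reachable (no firewall or routing problem at that level) and says nothing about whether the rank assignment or NCCL itself is configured correctly.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example, run on the second machine while rank 0 is waiting:
#     can_reach("192.168.43.100", 9997)
```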
