I'm fine-tuning ResNet50 on the CIFAR10 dataset using tf.slim's train_image_classifier.py script:
python train_image_classifier.py \
  --train_dir=${TRAIN_DIR}/all \
  --dataset_name=cifar10 \
  --dataset_split_name=train \
  --dataset_dir=${DATASET_DIR} \
  --checkpoint_path=${TRAIN_DIR} \
  --model_name=resnet_v1_50 \
  --max_number_of_steps=3000 \
  --batch_size=32 \
  --num_clones=4 \
  --learning_rate=0.0001 \
  --save_interval_secs=10 \
  --save_summaries_secs=10 \
  --log_every_n_steps=10 \
  --optimizer=sgd
Running this for 3k steps on a single GPU (Tesla M40) takes around 30 minutes, while running it on 4 GPUs takes 50+ minutes. (The accuracy is similar in both cases: ~75% vs. ~78%.)
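For reference, this is how I am turning those wall-clock numbers into per-step and per-image rates (a quick back-of-the-envelope script; I am assuming --batch_size is per clone, which is how I read train_image_classifier.py — please correct me if that is wrong):

steps = 3000
batch_size = 32
num_clones = 4

# Per-step time from the wall-clock numbers above.
print('1 GPU : %.2f s/step' % (30 * 60.0 / steps))   # ~0.60 s/step
print('4 GPUs: %.2f s/step' % (50 * 60.0 / steps))   # ~1.00 s/step

# Images/sec, under my assumption that each clone processes a full --batch_size per step.
print('1 GPU : %.0f images/sec' % (steps * batch_size / (30 * 60.0)))               # ~53
print('4 GPUs: %.0f images/sec' % (steps * batch_size * num_clones / (50 * 60.0)))  # ~128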
I know that one common cause of slowdowns in multi-GPU setups is the input pipeline (loading and decoding the images), but as far as I can tell, tf.slim builds that part of the graph on the CPU (sketch below).
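This is roughly the pattern I see in train_image_classifier.py (a simplified sketch, not the actual code: the real dataset provider and preprocessing are replaced by random tensors, and I am assuming it is run from inside the slim models repo so the deployment import resolves):

import tensorflow as tf
from deployment import model_deploy  # ships with the slim models repo

slim = tf.contrib.slim

deploy_config = model_deploy.DeploymentConfig(num_clones=4)

# The input pipeline (dataset provider, preprocessing, batching) is built
# under the device returned by deploy_config.inputs_device(), i.e. the CPU.
with tf.device(deploy_config.inputs_device()):  # '/device:CPU:0'
  images = tf.random_uniform([32, 32, 32, 3])                 # stand-in for the real provider
  labels = tf.random_uniform([32], maxval=10, dtype=tf.int32) # stand-in for the real labels
  batch_queue = slim.prefetch_queue.prefetch_queue(
      [images, labels], capacity=2 * deploy_config.num_clones)

# Only the model clones themselves get placed on the GPUs by model_deploy.

Any ideas of what could be the issue? Thank you!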