I'm fine-tuning ResNet50 on the CIFAR10 dataset using tf.slim's train_image_classifier.py script:
python train_image_classifier.py \
  --train_dir=${TRAIN_DIR}/all \
  --dataset_name=cifar10 \
  --dataset_split_name=train \
  --dataset_dir=${DATASET_DIR} \
  --checkpoint_path=${TRAIN_DIR} \
  --model_name=resnet_v1_50 \
  --max_number_of_steps=3000 \
  --batch_size=32 \
  --num_clones=4 \
  --learning_rate=0.0001 \
  --save_interval_secs=10 \
  --save_summaries_secs=10 \
  --log_every_n_steps=10 \
  --optimizer=sgd
Running this for 3k steps on a single GPU (Tesla M40) takes around 30 minutes, while running it on 4 GPUs takes 50+ minutes. (The accuracy is similar in both cases: ~75% vs. ~78%.)
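For reference, this is how I am turning those wall-clock numbers into per-step and per-image rates (a quick back-of-the-envelope script; I am assuming --batch_size is per clone, which is how I read train_image_classifier.py — please correct me if that is wrong):

steps = 3000
batch_size = 32
num_clones = 4

# Per-step time from the wall-clock numbers above.
print('1 GPU : %.2f s/step' % (30 * 60.0 / steps))   # ~0.60 s/step
print('4 GPUs: %.2f s/step' % (50 * 60.0 / steps))   # ~1.00 s/step

# Images/sec, under my assumption that each clone processes a full --batch_size per step.
print('1 GPU : %.0f images/sec' % (steps * batch_size / (30 * 60.0)))               # ~53
print('4 GPUs: %.0f images/sec' % (steps * batch_size * num_clones / (50 * 60.0)))  # ~128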
I know that one common cause of slowdowns in multi-GPU setups is the input pipeline (loading and decoding the images), but as far as I can tell, tf.slim builds that part of the graph on the CPU (sketch below).
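This is roughly the pattern I see in train_image_classifier.py (a simplified sketch, not the actual code: the real dataset provider and preprocessing are replaced by random tensors, and I am assuming it is run from inside the slim models repo so the deployment import resolves):

import tensorflow as tf
from deployment import model_deploy  # ships with the slim models repo

slim = tf.contrib.slim

deploy_config = model_deploy.DeploymentConfig(num_clones=4)

# The input pipeline (dataset provider, preprocessing, batching) is built
# under the device returned by deploy_config.inputs_device(), i.e. the CPU.
with tf.device(deploy_config.inputs_device()):  # '/device:CPU:0'
  images = tf.random_uniform([32, 32, 32, 3])                 # stand-in for the real provider
  labels = tf.random_uniform([32], maxval=10, dtype=tf.int32) # stand-in for the real labels
  batch_queue = slim.prefetch_queue.prefetch_queue(
      [images, labels], capacity=2 * deploy_config.num_clones)

# Only the model clones themselves get placed on the GPUs by model_deploy.

Any ideas of what could be the issue? Thank you!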