I am trying to benchmark three different applications. All of them are written in C++ using MPI and OpenMP and are compiled with GCC 7.1 and OpenMPI 3.0. I use a cluster with several nodes, each with two Intel CPUs and 24 cores. One process runs per node, and parallelization within a node is done with OpenMP.
Edit: This is the shortest benchmark; I was testing custom reduction operations:
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // EXIT_SUCCESS
int process_id = -1;
std::vector<double> values(268435456, 0.1);  // 2^28 doubles, 2 GiB
void sum(void *in, void *inout, int *len, MPI_Datatype *dptr){
    double* inv = static_cast<double*>(in);
    double* inoutv = static_cast<double*>(inout);
    // A user-defined MPI op must reduce all *len elements, not only the first.
    for (int i = 0; i < *len; ++i) {
        inoutv[i] += inv[i];
    }
}
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int mpi_world_size = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &mpi_world_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &process_id);
  // The identity of a sum is 0.0; initializing with omp_priv = omp_orig
  // would count the original value of tmp_result once per thread.
  #pragma omp declare reduction(sum : double : omp_out = omp_out + omp_in) initializer(omp_priv = 0.0)
  MPI_Op sum_mpi_op;
  MPI_Op_create( sum, 1, &sum_mpi_op );  // summation is commutative
  double tmp_result = 0.0;
  double result = 0.0;
  std::chrono::high_resolution_clock::time_point timer_start = std::chrono::high_resolution_clock::now();
  #pragma omp parallel for simd reduction(sum:tmp_result)
  for(size_t counter = 0; counter < values.size(); ++counter){
    tmp_result = tmp_result + values[counter];
  }
  // Reduce one MPI_DOUBLE, not sizeof(double) MPI_BYTEs.
  MPI_Allreduce(&tmp_result, &result, 1, MPI_DOUBLE, sum_mpi_op, MPI_COMM_WORLD);
  std::chrono::high_resolution_clock::time_point timer_end = std::chrono::high_resolution_clock::now();
  double seconds = std::chrono::duration<double>(timer_end - timer_start).count();
  if(process_id == 0){
    printf("Result: %.5f; Execution time: %.5fs\n", result, seconds);
  }
  MPI_Finalize();
  return EXIT_SUCCESS;
}
I observe that the execution time of each benchmark alternates between two distinct values. For Benchmark A, out of 10 runs, 5 take about 0.6s and 5 take about 0.73s (give or take a little). For Benchmark B the pattern is the same, but the execution time is either 77s or 85s (again +/-), and Benchmark C behaves equivalently. There is nothing in between. I measure the time with std::chrono::high_resolution_clock:
std::chrono::high_resolution_clock::time_point timer_start = std::chrono::high_resolution_clock::now();
// do something
std::chrono::high_resolution_clock::time_point timer_end = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration<double>(timer_end - timer_start).count();
Slurm is used as a batch system, and I use the exclusive option to make sure that no other jobs are running on the nodes. For the Slurm job I basically use the following file:
 #!/bin/bash
 #SBATCH --ntasks 4
 #SBATCH --nodes 4
 #SBATCH --ntasks-per-node 1
 #SBATCH --exclusive
 #SBATCH --cpus-per-task 24
 export OMP_NUM_THREADS=24
 RUNS=10
 for ((i=1;i<=RUNS;i++)); do
   srun /path/bench_a
 done
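One thing I plan to rule out is thread placement: without explicit binding, the OpenMP threads may land on different cores or NUMA domains from run to run. A sketch of the binding settings I would add to the job script (OMP_PLACES/OMP_PROC_BIND are OpenMP 4.0 features; the exact values here are assumptions to test, not a known fix):

```shell
# Pin OpenMP threads to physical cores and make the runtime report placement.
export OMP_NUM_THREADS=24
export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=close   # pack threads onto nearby cores
export OMP_DISPLAY_ENV=true  # OpenMP runtime prints its effective settings
echo "binding: ${OMP_PROC_BIND} on ${OMP_PLACES}"
```

The launch line would then become `srun --cpu-bind=cores /path/bench_a` so that Slurm's binding matches the OpenMP placement.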
For building the code I use CMake and set the flags
-O3 -DNDEBUG -march=haswell -DMPICH_IGNORE_CXX_SEEK -std=c++14
Since the behaviour is the same for all benchmarks, I don't believe the cause lies in the implementations, but rather in the way I build the code or start the job.
Do you have any idea what I should be looking for to explain this behaviour? Thank you.