I'm writing a program that should run in both serial and parallel versions. Once I got it to actually do what it is supposed to do, I started trying to parallelize it with OpenMP (which is compulsory for this assignment).
The thing is, I can't find documentation or references on when to use which #pragma, so I'm doing my best at guessing and testing. But testing is not going well with nested loops.
How would you parallelize a series of nested loops like these:
for(int i = 0; i < 3; ++i){
    for(int j = 0; j < HEIGHT; ++j){
        for(int k = 0; k < WIDTH; ++k){
            switch(i){
                case 0:
                    matrix[j][k].a = matrix[j][k] * someValue1;
                    break;
                case 1:
                    matrix[j][k].b = matrix[j][k] * someValue2;
                    break;
                case 2:
                    matrix[j][k].c = matrix[j][k] * someValue3;
                    break;
            }
        }
    }
}
- HEIGHT and WIDTH are usually the same size in the tests I have to run. Some test examples are 32x32 and 4096x4096.
- matrix is an array of custom structs with attributes a, b and c
- someValue1, someValue2 and someValue3 are doubles
I know that OpenMP is not always good for nested loops but any help is welcome.
[UPDATE]:
So far I've tried unrolling the loops. It boosts performance, but am I adding unnecessary overhead here? Am I reusing threads? I tried getting the id of the threads used in each for, but didn't get it right.
#pragma omp parallel
{
    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            //my previous code here
        }
    }
    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            //my previous code here
        }
    }
    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            //my previous code here
        }
    }
}
[UPDATE 2]
Apart from unrolling the loop, I have tried parallelizing the outermost loop (a worse performance boost than unrolling) and collapsing the two innermost loops (more or less the same performance boost as unrolling). These are the times I am getting:
- Serial: ~130 ms
- Loop unrolling: ~49 ms
- Collapsing two innermost loops: ~55 ms
- Parallel outermost loop: ~83 ms
What do you think is the safest option? I mean, which one should generally be the best for most systems, not only my computer?