Consider the following code:
#include <stdio.h>
#include <omp.h>
#define ARRAY_SIZE  (1024)
float A[ARRAY_SIZE];
float B[ARRAY_SIZE];
float C[ARRAY_SIZE];
int main(void)
{   
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        A[i] = i * 2.3;
        B[i] = i + 4.6;
    }
    double start = omp_get_wtime();
    for (int loop = 0; loop < 1000000; loop++)
    {
        #pragma omp simd
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            C[i] = A[i] * B[i];
        }
    }
    double end = omp_get_wtime();
    printf("Work consumed %f seconds\n", end - start);
    return 0;
}
When I build and run it on my machine, it outputs:
$ gcc -fopenmp parallel.c
$ ./a.out
Work consumed 2.084107 seconds
If I comment out "#pragma omp simd", then build and run it again:
$ gcc -fopenmp parallel.c
$ ./a.out
Work consumed 2.112724 seconds
So "#pragma omp simd" gives no significant performance gain without optimization. But if I add the -O2 option, without "#pragma omp simd":
$ gcc -O2 -fopenmp parallel.c
$ ./a.out
Work consumed 0.446662 seconds
With "#pragma omp simd":
$ gcc -O2 -fopenmp parallel.c
$ ./a.out
Work consumed 0.126799 seconds
This time the pragma makes a big difference: the run is about 3.5 times faster. But if I use -O3, without "#pragma omp simd":
$ gcc -O3 -fopenmp parallel.c
$ ./a.out
Work consumed 0.127563 seconds
With "#pragma omp simd":
$ gcc -O3 -fopenmp parallel.c
$ ./a.out
Work consumed 0.126727 seconds
We can see the results are similar again.
Why does "#pragma omp simd" give a big performance improvement only at -O2 with gcc?
 
    