I was writing a simple Fortran program that computes the dot product of a vector with itself, parallelized with an OpenMP reduction. However, its result differs noticeably from both the intrinsic dot_product and a plain (non-OpenMP) do-loop sum. Please see the code below:
program main
  use iso_fortran_env
  implicit none
  real(real64), allocatable  :: x(:)
  real(real64)               :: x_dot
  integer(int32)             :: i, n
  n = 1000000
  allocate(x(n))
  x = 0.3_real64
  print *, "sum:     ", dot_product(x,x)
  x_dot = 0.0_real64
  do i = 1, n
    x_dot = x_dot + x(i)*x(i)
  enddo
  print *, "loop:    ", x_dot
  x_dot = 0.0_real64
  !$omp parallel do private(i) reduction(+:x_dot)
  do i = 1, n
    x_dot = x_dot + x(i)*x(i)
  enddo
  !$omp end parallel do
  print *,"omp red: ", x_dot
end program main
The output:
sum:        89999.999997827967
loop:       89999.999997827967
omp red:    90000.000000129017
As you can see, the OpenMP reduction differs from the other two at roughly single-precision level, even though everything is real64. Am I hitting a race condition somewhere?
gfortran 7.5
compilation flags: -O3 -fopenmp
OMP_NUM_THREADS=4