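I don't know how closely your sample code matches your application, but if you are looping over rows like that, you are almost certainly running into cache problems. Each array[i] is a separate allocation, so reading array[i][0] for successive i jumps to a different cache line (and usually a different page) on every access, whereas array[0][i] walks through memory sequentially. If I code your loops in row-major and column-major order, I see drastic performance differences.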
With nrow=1000000 and ncol=1000, if I use array[i][0], I get a runtime of about 1.9 s (wall-clock time for the whole program, allocation included). If I use array[0][i], it drops to about 0.05 s.
If it's possible for you to transpose your data in this way, you should see a large performance boost.
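#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Sizes are the ones quoted above; the declarations are my guess at
       what the surrounding program looks like. */
    int nrow = 1000000, ncol = 1000;
    double **array;
    double sum = 0.0;
    int i;

#ifdef COL_MAJOR
    /* nrow separately allocated rows; only element 0 of each row is used,
       so every access lands in a different allocation. */
    array = (double **)malloc(nrow * sizeof(double *));
    for(i=0; i<nrow; i++) {
        array[i] = (double *)malloc(ncol * sizeof(double));
        array[i][0] = i;
    }
    for(i=0; i<nrow; i++) {
        sum += array[i][0];
    }
    for(i=0; i<nrow; i++) {
        array[i][0] /= sum;
    }
#else
    /* Transposed layout: ncol arrays of nrow doubles each; the loops walk
       array[0] sequentially. */
    array = (double **)malloc(ncol * sizeof(double *));
    for(i=0; i<ncol; i++) {
        array[i] = (double *)malloc(nrow * sizeof(double));
    }
    for(i=0; i<nrow; i++) {
        array[0][i] = i;
    }
    for(i=0; i<nrow; i++) {
        sum += array[0][i];
    }
    for(i=0; i<nrow; i++) {
        array[0][i] /= sum;
    }
#endif
    printf("%f\n", sum);
    /* Allocations are neither checked nor freed; this is only a timing test. */
    return 0;
}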
$ gcc -DCOL_MAJOR -O2 -o normed normed.c
$ time ./normed
499999500000.000000
real    0m1.904s
user    0m0.325s
sys     0m1.575s
$ time ./normed
499999500000.000000
real    0m1.874s
user    0m0.304s
sys     0m1.567s
$ time ./normed
499999500000.000000
real    0m1.873s
user    0m0.296s
sys     0m1.573s
$ gcc -O2 -o normed normed.c
$ time ./normed
499999500000.000000
real    0m0.051s
user    0m0.017s
sys     0m0.024s
$ time ./normed
499999500000.000000
real    0m0.050s
user    0m0.017s
sys     0m0.023s
$ time ./normed
499999500000.000000
real    0m0.051s
user    0m0.014s
sys     0m0.022s
$
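
If you control how the data is allocated, one way to get the transposed layout is to put the whole matrix in a single block and index it so that each column is contiguous. The following is only a rough sketch of that idea, not something taken from your code: the names colmat, colmat_init, colmat_col and colmat_normalize_col are made up for illustration, and I am assuming the values you sum and normalize together all belong to one column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type: a matrix stored so that each column is contiguous. */
typedef struct {
    size_t nrow, ncol;
    double *data;            /* element (i, j) lives at data[j * nrow + i] */
} colmat;

/* Allocate a zero-filled nrow x ncol matrix in one block.
   Returns 0 on success, -1 on overflow or allocation failure. */
int colmat_init(colmat *m, size_t nrow, size_t ncol)
{
    m->nrow = nrow;
    m->ncol = ncol;
    m->data = NULL;
    if (ncol != 0 && nrow > SIZE_MAX / ncol)
        return -1;
    m->data = calloc(nrow * ncol, sizeof *m->data);
    return m->data ? 0 : -1;
}

/* Pointer to the first element of column j; the next nrow doubles
   are that column, laid out back to back. */
double *colmat_col(colmat *m, size_t j)
{
    assert(j < m->ncol);
    return m->data + j * m->nrow;
}

/* Divide column j by its sum, using two sequential passes over memory.
   Returns the sum; the column is left unchanged if the sum is 0. */
double colmat_normalize_col(colmat *m, size_t j)
{
    double *col = colmat_col(m, j);
    double sum = 0.0;
    size_t i;

    for (i = 0; i < m->nrow; i++)
        sum += col[i];
    if (sum != 0.0) {
        for (i = 0; i < m->nrow; i++)
            col[i] /= sum;
    }
    return sum;
}
```

With this layout, colmat_normalize_col(&m, 0) does the same work as the array[0][i] loops above, and one calloc replaces the per-row mallocs. Keep in mind that with nrow=1000000 and ncol=1000 the block is about 8 GB, so check the return value of colmat_init.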