Measuring memory bandwidth from the dot product of two arrays

Question

The dot product of two arrays

for(int i=0; i<n; i++) {
    sum += x[i]*y[i];
}

does not reuse data so it should be a memory bound operation. Therefore, I should be able to measure the memory bandwidth from the dot product.

Using the code at why-vectorizing-the-loop-does-not-have-performance-improvement I get a bandwidth of 9.3 GB/s for my system. However, when I attempt to calculate the bandwidth using the dot product I get over twice the rate for a single thread and over three time the rate using multiple threads (my system has four cores/eight hyper-threads). This makes no sense to me since a memory bound operation should not benefit from multiple threads. Here is the output from the code below:

Xeon E5-1620, GCC 4.9.0, Linux kernel 3.13
dot 1 thread:      1.0 GB, sum 191054.81, time 4.98 s, 21.56 GB/s, 5.39 GFLOPS
dot_avx 1 thread   1.0 GB, sum 191043.33, time 5.16 s, 20.79 GB/s, 5.20 GFLOPS
dot_avx 2 threads: 1.0 GB, sum 191045.34, time 3.44 s, 31.24 GB/s, 7.81 GFLOPS
dot_avx 8 threads: 1.0 GB, sum 191043.34, time 3.26 s, 32.91 GB/s, 8.23 GFLOPS

Can somebody please explain to me why I get over twice the bandwidth for one thread and over three times the bandwidth using more than one thread?

Here is the code I used:

//g++ -O3 -fopenmp -mavx -ffast-math dot.cpp
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>
#include <omp.h>

extern "C" inline float horizontal_add(__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a,a);
    __m256 t2 = _mm256_hadd_ps(t1,t1);
    __m128 t3 = _mm256_extractf128_ps(t2,1);
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3);
    return _mm_cvtss_f32(t4);
}

extern "C" float dot_avx(float * __restrict x, float * __restrict y, const int n) {
    x = (float*)__builtin_assume_aligned (x, 32);
    y = (float*)__builtin_assume_aligned (y, 32);
    float sum = 0;
    #pragma omp parallel reduction(+:sum)
    {
        __m256 sum1 = _mm256_setzero_ps();
        __m256 sum2 = _mm256_setzero_ps();
        __m256 sum3 = _mm256_setzero_ps();
        __m256 sum4 = _mm256_setzero_ps();
        __m256 x8, y8;
        #pragma omp for
        for(int i=0; i<n; i+=32) {
            x8 = _mm256_loadu_ps(&x[i]);
            y8 = _mm256_loadu_ps(&y[i]);
            sum1 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum1);
            x8 = _mm256_loadu_ps(&x[i+8]);
            y8 = _mm256_loadu_ps(&y[i+8]);
            sum2 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum2);
            x8 = _mm256_loadu_ps(&x[i+16]);
            y8 = _mm256_loadu_ps(&y[i+16]);
            sum3 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum3);
            x8 = _mm256_loadu_ps(&x[i+24]);
            y8 = _mm256_loadu_ps(&y[i+24]);
            sum4 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum4);
        }
        sum += horizontal_add(_mm256_add_ps(_mm256_add_ps(sum1,sum2),_mm256_add_ps(sum3,sum4)));
    }
    return sum; 
}

extern "C" float dot(float * __restrict x, float * __restrict y, const int n) {
    x = (float*)__builtin_assume_aligned (x, 32);
    y = (float*)__builtin_assume_aligned (y, 32);
    float sum = 0;
    for(int i=0; i<n; i++) {
        sum += x[i]*y[i];
    }
    return sum;
}

int main(){
    uint64_t LEN = 1 << 27;
    float *x = (float*)_mm_malloc(sizeof(float)*LEN,64);
    float *y = (float*)_mm_malloc(sizeof(float)*LEN,64);
    for(uint64_t i=0; i<LEN; i++) { x[i] = 1.0*rand()/RAND_MAX - 0.5; y[i] = 1.0*rand()/RAND_MAX - 0.5;}

    uint64_t size = 2*sizeof(float)*LEN;

    volatile float sum = 0;
    double dtime, rate, flops;  
    int repeat = 100;

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) sum += dot(x,y,LEN);
    dtime = omp_get_wtime() - dtime;
    rate = 1.0*repeat*size/dtime*1E-9;
    flops = 2.0*repeat*LEN/dtime*1E-9;
    printf("%f GB, sum %f, time %f s, %.2f GB/s, %.2f GFLOPS\n", 1.0*size/1024/1024/1024, sum, dtime, rate,flops);

    sum = 0;
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) sum += dot_avx(x,y,LEN);
    dtime = omp_get_wtime() - dtime;
    rate = 1.0*repeat*size/dtime*1E-9;
    flops = 2.0*repeat*LEN/dtime*1E-9;

    printf("%f GB, sum %f, time %f s, %.2f GB/s, %.2f GFLOPS\n", 1.0*size/1024/1024/1024, sum, dtime, rate,flops);
}

I just downloaded, complied, and ran STREAM as suggested by Jonathan Dursi and here are the results:

One thread

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       14292.1657       0.0023       0.0022       0.0023
Scale:      14286.0807       0.0023       0.0022       0.0023
Add:        14724.3906       0.0033       0.0033       0.0033
Triad:      15224.3339       0.0032       0.0032       0.0032

Eight threads

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       24501.2282       0.0014       0.0013       0.0021
Scale:      23121.0556       0.0014       0.0014       0.0015
Add:        25263.7209       0.0024       0.0019       0.0056
Triad:      25817.7215       0.0020       0.0019       0.0027

How many physical CPUs do you have? How are your memory channels populated? — David Schwartz, Aug 07 '14 at 10:12
@DavidSchwartz, I have a single socket system (i.e. it's not a NUMA system) with a Xeon E5-1620 processor which has four cores (eight hyper threads). I don't know how the memory channels are populated. — Z boson, Aug 07 '14 at 11:13
@DavidSchwartz, my system has 32 GB of memory. It has two 16 GB DDR3, registered, ECC, 1600 MHz, PC3-12,800 modules. — Z boson, Aug 07 '14 at 11:22
I hope you write this whole project up at some point. Here, the issue is just that one thread isn't completely saturating the memory subsystem - which isn't necessarily the same as saying that there's still room for improving the performance of single-thread performance. With prefetching, and having multiple memory requests in flight at once, there may be operands that are ready to be dot-producted but which aren't the ones that the first thread are expecting. You've probably already looked at [this ref](http://www.akkadia.org/drepper/cpumemory.pdf) - it's a little old now but comprehensive. — Jonathan Dursi, Aug 07 '14 at 12:12
@JonathanDursi, as I understand it my dot product code has to read 1 GB of memory and it's doing it much faster than the code using `memset` at https://stackoverflow.com/questions/18159455/why-vectorizing-the-loop-does-not-have-performance-improvement/18159503#18159503. So either that code does not measure the bandwidth correctly and my real bandwidth is much higher or I don't understand something. From [wikipedia](https://en.wikipedia.org/wiki/List_of_device_bit_rates#Dynamic_random_access_memory) the bandwidth for my memory is 12.8 GB/s which is close to what the `memset` code gets. — Z boson, Aug 07 '14 at 12:29
One big difference between my dot product code and the `memset` code is that my code only reads memory and the `memset` code only writes memory. So maybe the bandwidth to read is higher. Though that does not explain why multiple threads help it could explain why reading one 1GB is faster than writing 1GB. — Z boson, Aug 07 '14 at 12:33
There's two issues here - the absolute numbers, and why they aren't pegged at one thread. I was thinking the question was more the second here in my previous comments. As to the absolute numbers - the 12.8GB/s is the bandwidth for one module; you have two, so the total available bandwidth out of the DIMMs is potentially twice that. If you run the STREAM benchmark on your system, what numbers do you get? — Jonathan Dursi, Aug 07 '14 at 12:36
While memset is good for a quick test, there's no reason why it has to be a best-possible number (that's why I ask about stream). — Jonathan Dursi, Aug 07 '14 at 12:46
@JonathanDursi, I added the results of STREAM to the end of my question. The rate is about 25 GB/s. That's better than the rate I get with one thread but still less than with 2 or more threads. — Z boson, Aug 07 '14 at 12:49
@JonathanDursi, but when I run mbw (Memory BandWidth benchmark) it reports about 9.4 GB/s. — Z boson, Aug 07 '14 at 12:58
So these numbers don't seem inconsistent; stream triad is like your dot product, but with doubles (and it's a(i) = b(i) + q*c(i)), it's not inconceivable to me that there'd be a slowdown due to DP operations and not being a simple FMA. Your peak of ~30GB/s is about the bandwidth of the integrated memory controller, so it's not like the numbers are physically impossible. — Jonathan Dursi, Aug 07 '14 at 13:36
...to say nothing of the fact that it's also doing the writes (which as you mention are _slightly_ slower than the reads, but not enormously). — Jonathan Dursi, Aug 07 '14 at 13:49
@JonathanDursi, I guess I need to read "What Every Programmer Should Know About Memory". I have tried going through it a few times in the past but it's 114 pages... — Z boson, Aug 07 '14 at 14:03
I'm going to try to distill some of this conversation into an answer... — Jonathan Dursi, Aug 07 '14 at 15:32
I've also found that memory bandwidth is more difficult to predict and measure. First you have a clear difference between read and write bandwidth. On some systems you can get the full bandwidth on both since they use different channels. Then it also matters whether or not you stream. If you don't stream writes, they will also incur a read cost. And unlike caches and other internal CPU bottlenecks, scaling up the demand on bandwidth doesn't result in "cliffs" in the performance graph. You see smooth diminishing returns instead. — Mysticial, Aug 07 '14 at 18:30

score 13 · Accepted Answer · answered Aug 07 '14 at 16:20

There's a few things going on here, that come down to:

You have to work fairly hard to get every last bit of performance out of the memory subsystem; and
Different benchmarks measure different things.

The first helps explain why you need multiple threads to saturate the available memory bandwidth. There is a lot of concurrency in the memory system, and it taking advantage of that will often require some concurrency in your CPU code. One big reason that multiple threads of execution help is latency hiding - while one thread is stalled waiting for data to arrive, another thread may be able to take advantage of some other data that has just become available.

The hardware helps you a lot on a single thread in this case - because the memory access is so predictable, the hardware can prefetch the data ahead of when you need it, giving you some of the advantage of latency hiding even with one thread; but there are limits to what prefetch can do. The prefetcher won't take it upon itself to cross page boundaries, for instance. The canonical reference for much of this is What Every Programmer Should Know About Memory by Ulrich Drepper, which is now old enough that some gaps are starting to show (Intel's Hot Chips overview of your Sandy Bridge processor is here - note in particular the tighter integration of the memory management hardware with the CPU).

As to the question about comparing with memset, mbw or STREAM, comparing across benchmarks will always cause headaches, even benchmarks that claim to be measuring the same thing. In particular, "memory bandwidth" isn't a single number - performance varies quite a bit depending on the operations. Both mbw and Stream do some version of a copy operation, with STREAMs operations being spelled out here (taken straight from the web page, all operands are double-precision floating points):

------------------------------------------------------------------
name        kernel                  bytes/iter      FLOPS/iter
------------------------------------------------------------------
COPY:       a(i) = b(i)                 16              0
SCALE:      a(i) = q*b(i)               16              1
SUM:        a(i) = b(i) + c(i)          24              1
TRIAD:      a(i) = b(i) + q*c(i)        24              2
------------------------------------------------------------------

so roughly 1/2-1/3 of the memory operations in these cases are writes (and everything's a write in the case of memset). While individual writes can be a little slower than reads, the bigger issue is that it's much harder to saturate the memory subsystem with writes because of course you can't do the equivalent of prefetching a write. Interleaving the reads and writes helps, but your dot-product example which is essentially all reads is going to be about the best-possible case for pegging the needle on memory bandwidth.

In addition, the STREAM benchmark is (intentionally) written completely portably, with only some compiler pragmas to suggest vectorization, so beating the STREAM benchmark isn't necessarily a warning sign, especially when what you're doing is two streaming reads.

I guess I have my own benchmark now: the dot product :-) I must admit that I'm surprised that multiples threads help in this case. I have observed this several times in the past but did not believe the results because it conflicted with my naïve view of how a CPU works. I assumed that the CPU was waiting for data and another CPU would not help. But if the one CPU is waiting for a particular set of data (and not any set) and the other CPU another particular set then I can understand how multiple threads could help. — Z boson, Aug 08 '14 at 08:41
I made my own memory bandwidth benchmarking code https://github.com/zboson/bandwidth. I posted some results to an answer to my question. — Z boson, Aug 08 '14 at 11:57

score 4 · Answer 2 · answered Aug 08 '14 at 11:51

I made my own memory benchmark code https://github.com/zboson/bandwidth

Here are the current results for eight threads:

write:    0.5 GB, time 2.96e-01 s, 18.11 GB/s
copy:       1 GB, time 4.50e-01 s, 23.85 GB/s
scale:      1 GB, time 4.50e-01 s, 23.85 GB/s
add:      1.5 GB, time 6.59e-01 s, 24.45 GB/s
mul:      1.5 GB, time 6.56e-01 s, 24.57 GB/s
triad:    1.5 GB, time 6.61e-01 s, 24.37 GB/s
vsum:     0.5 GB, time 1.49e-01 s, 36.09 GB/s, sum -8.986818e+03
vmul:     0.5 GB, time 9.00e-05 s, 59635.10 GB/s, sum 0.000000e+00
vmul_sum:   1 GB, time 3.25e-01 s, 33.06 GB/s, sum 1.910421e+04

Here are the currents results for 1 thread:

write:    0.5 GB, time 4.65e-01 s, 11.54 GB/s
copy:       1 GB, time 7.51e-01 s, 14.30 GB/s
scale:      1 GB, time 7.45e-01 s, 14.41 GB/s
add:      1.5 GB, time 1.02e+00 s, 15.80 GB/s
mul:      1.5 GB, time 1.07e+00 s, 15.08 GB/s
triad:    1.5 GB, time 1.02e+00 s, 15.76 GB/s
vsum:     0.5 GB, time 2.78e-01 s, 19.29 GB/s, sum -8.990941e+03
vmul:     0.5 GB, time 1.15e-05 s, 468719.08 GB/s, sum 0.000000e+00
vmul_sum:   1 GB, time 5.72e-01 s, 18.78 GB/s, sum 1.910549e+04

write: writes a constant (3.14159) to an array. This should be like memset.
copy, scale, add, and triad are defined the same as in STREAM
mul: a(i) = b(i) * c(i)
vsum: sum += a(i)
vmul: sum *= a(i)
vmul_sum: sum += a(i)*b(i) // the dot product

My results are consistent with STREAM. I get the highest bandwidth for vsum. The vmul method does not work currently (once the value is zero it finishes early). I can get slightly better results (by about 10%) using intrinsics and unrolling the loop which I will add later.

I get a bit better results by binding the threads (`export OMP_PROC_BIND=true`) and by setting the number of threads to the number of physical cores (i.e. not using hyper-threading) e.g. vsum goes to nearlly 39 GB/s (from 36 GB/s). — Z boson, Aug 08 '14 at 12:54

Measuring memory bandwidth from the dot product of two arrays

2 Answers2

Linked