c++ speed comparison iterator vs index

Question

I am currently writing a linalg library in c++, for educational purposes and personal use. As part of it I implemented a custom matrix class with custom row and column iterators. While providing very nice feature to work with std::algorithm and std::numeric functions, I performed a speed comparison for a matrix multiplication between an index and iterator/std::inner_product approach. The results differ significantly:

// used later on for the custom iterator
template<class U>
struct EveryNth {
    bool operator()(const U& ) { return m_count++ % N == 0; }
    EveryNth(std::size_t i) : m_count(0), N(i) {}
    EveryNth(const EveryNth& element) : m_count(0), N(element.N) {}

private:
    int m_count;
    std::size_t N;
};

template<class T, 
         std::size_t rowsize, 
         std::size_t colsize>  
class Matrix
{

private:

    // Data is stored in a MVector, a modified std::vector
    MVector<T> matrix;

    std::size_t row_dim;                  
    std::size_t column_dim;

public:

    // other constructors, this one is for matrix in the computation
    explicit Matrix(MVector<T>&& s): matrix(s), 
                                     row_dim(rowsize), 
                                     column_dim(colsize){
    }    

    // other code...

    typedef boost::filter_iterator<EveryNth<T>, 
                                   typename std::vector<T>::iterator> FilterIter;

    // returns an iterator that skips elements in a range
    // if "to" is to be specified, then from has to be set to a value
    // @ param "j" - j'th column to be requested
    // @ param "from" - starts at the from'th element
    // @ param "to" - goes from the from'th element to the "to'th" element
    FilterIter  begin_col( std::size_t j,
                           std::size_t from = 0, 
                           std::size_t to = rowsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
            EveryNth<T>( cols() ), 
            matrix.Begin() + index( from, j ), 
            matrix.Begin() + index( to, j )
            );
    }

    // specifies then end of the iterator
    // so that the iterator can not "jump" past the last element into undefines behaviour
    FilterIter end_col( std::size_t j, 
                        std::size_t to = rowsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
            EveryNth<T>( cols() ), 
            matrix.Begin() + index( to, j ), 
            matrix.Begin() + index( to, j )
            );
    }

    FilterIter  begin_row( std::size_t i,
                           std::size_t from = 0,
                           std::size_t to = colsize ){
         return boost::make_filter_iterator<EveryNth<T> >(
            EveryNth<T>( 1 ), 
            matrix.Begin() + index( i, from ), 
            matrix.Begin() + index( i, to )
            );
    }

    FilterIter  end_row( std::size_t i,
                         std::size_t to = colsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
            EveryNth<T>( 1 ), 
            matrix.Begin() + index( i, to ), 
            matrix.Begin() + index( i, to )
            );
    }

    // other code...

    // allows to access an element of the matrix by index expressed
    // in terms of rows and columns
    // @ param "r" - r'th row of the matrix
    // @ param "c" - c'th column of the matrix
    std::size_t index(std::size_t r, std::size_t c) const {
        return r*cols()+c; 
    }

    // brackets operator
    // return an elements stored in the matrix
    // @ param "r" - r'th row in the matrix
    // @ param "c" - c'th column in the matrix
    T& operator()(std::size_t r, std::size_t c) { 
        assert(r < rows() && c < matrix.size() / rows());
        return matrix[index(r,c)]; 
    }

    const T& operator()(std::size_t r, std::size_t c) const {
        assert(r < rows() && c < matrix.size() / rows()); 
        return matrix[index(r,c)]; 
    }

    // other code...

    // end of class
};

Now in the main function in run the following:

int main(int argc, char *argv[]){


    Matrix<int, 100, 100> a = Matrix<int, 100, 100>(range<int>(10000));


    std::clock_t begin = clock();
    double b = 0;
    for(std::size_t i = 0; i < a.rows(); i++){
        for (std::size_t j = 0; j < a.cols(); j++) {
                std::inner_product(a.begin_row(i), a.end_row(i), 
                                   a.begin_column(j),0);        
        }
    }

    // double b = 0;
    // for(std::size_t i = 0; i < a.rows(); i++){
    //     for (std::size_t j = 0; j < a.cols(); j++) {
    //         for (std::size_t k = 0; k < a.rows(); k++) {
    //             b += a(i,k)*a(k,j);
    //         }
    //     }
    // }


    std::clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << elapsed_secs << std::endl;

    std::cout << "--- End of test ---" << std::endl;

    std::cout << std::endl;
    return 0;
}

For the std::inner_product/iterator approach it takes:

bash-3.2$ ./main

3.78358
--- End of test ---

and for the index (// out) approach:

bash-3.2$ ./main

0.106173
--- End of test ---

which is almost 40 times faster then the iterator approach. Do you see anything in the code that could slow down iterator computation that much? I should mention that I tried out both methods and they produce correct results.

Thank you for your ideas.

It might belong to codereview ... you should also add your compiler, and all compiler options. — quantdev, Oct 11 '14 at 12:17
It looks like you are not using your iterators to iterate but are simply creating new ones each time. Or did I read it wrong? — Galik, Oct 11 '14 at 12:21
@Galik: updated the post. There occurred an error while copying. — Vincent, Oct 11 '14 at 12:31
clang++ -Wall -std=c++11 -stdlib=libc++ -o main main.cpp Vector.cpp Matrix.cpp on a mac osx mavericks — Vincent, Oct 11 '14 at 12:33
@Vincent please post your result after turning optimizations on. — , Oct 11 '14 at 13:07

score 2 · Accepted Answer · answered Oct 11 '14 at 18:10

What you have to understand is that matrix operations are VERY well-understood, and compilers are VERY good at doing optimizations of the things that are involved in matrix operations.

Consider C = AB, where C is MxN, A is MxQ, B is QxN.

double a[M][Q], b[Q][N], c[M][N];
for(unsigned i = 0; i < M; i++){
  for (unsigned j = 0; j < N; j++) {
    double temp = 0.0;
    for (unsigned k = 0; k < Q; k++) {
      temp += a[i][k]*b[k][j];
    }
    c[i][j] = temp;
  }
}

(You would not believe how tempted I was to write the above in FORTRAN IV.)

The compiler looks at this, and notices that what is really happening is that he is walking through a and c with a stride of 1 and b with a stride of Q. He eliminates the multiplications in the subscript calculations and does straight indexing.

At that point, the inner loop is of the form:

temp += a[r1] * b[r2];
r1 += 1;
r2 += Q;

And you have loops around that to (re)initialize r1 and r2 for each pass.

That is the absolute minimum computation you can do to do a straightforward matrix multiplication. You cannot do any less than that, because you have to do those multiplications and additions and index adjustments.

All you can do is add overhead.

That's what the iterator and std::inner_product() approach does: it adds metric tonnes of overhead.

Well I expected the index approach to be faster, just not as much. So I started to get worried. — Vincent, Oct 11 '14 at 21:11
Bear in mind that the semantics of the C/C++ for-loop require the compiler to emit code that re-evaluates the loop termination condition, in its entirety, on every pass through the loop. The compiler doesn't know that the loop body didn't reshape the matrix, so it has to check. If you know that the matrix size is fixed for the duration of the multiplication (and it darned well should be!), you can and should cache the values of, say, a.rows() and a.cols() in the initializer part of the for-stmt. — John R. Strohm, Oct 14 '14 at 21:09
for(std::size_t i = 0, std::size_t i_limit = a.rows(); i < i_limit; i++){} Note the comma between the two assignments in the init part, and the changed test. — John R. Strohm, Oct 15 '14 at 13:15

score 2 · Answer 2 · answered Oct 11 '14 at 23:33

This is just some additional information and general advice for low-level code optimization.

To conclusively find out where time is spent in low-level code (tight loops and hotspots),

You must be able to implement multiple versions of the code for computing the same result, using different implementation strategies.
- You will need broad mathematical and computational knowledge to do this.
You must inspect the disassembly (machine code).
You must also run your code under an instruction-level sampling profiler to see which part of the machine code is executed the most heavily (i.e. the hotspots).
- In order to collect a sufficient number of profiler samples, you will need to run the code in a tight loop, in millions or billions of times.
You must compare the disassembly of the hotspots between different versions of code (from different implementation strategies).
Based on the information above, you can arrive at the conclusion that some implementation strategies are less efficient (more wasteful or redundant) than others.
- If you arrive at this step, you can now publish and share your findings with others.

Some possibilities:

Using boost::filter_iterator to implement an iterator that skips every N element is wasteful. The internal implementation must increment by one at a time. If N is large, visiting the next element via boost::filter_iterator becomes an O(N) operation, as opposed to a simple iterator arithmetic which would be an O(1) operation.
Your boost::filter_iterator implementation uses the modulo operator. Although integer division and modulo operations are fast on modern CPUs, it is still not as fast as simple integer arithmetic.

Simply speaking,

Increments, decrements, additions and subtractions are fastest, for both integers and floating point.
Multiplication and bit shifts are slightly slower.
Divisions and modulo operations will bet yet slower.
Finally, floating point trigonometric and transcendental functions, especially those that require calling out to standard mathematical library functions, will be the slowest.

Thank you very much for your comment. I am still in the early process of development so in fact I am quite content that the iterator is working as it is. However I believe that this test revealed that there is quite a bit of work still ahead. I will post the results and code review when the time comes. — Vincent, Oct 12 '14 at 11:01

c++ speed comparison iterator vs index

2 Answers2

Linked