Original Problem:
I have written some code to experiment with threads and do some testing.
The code should generate some numbers and then find their mean.
I think it is easiest to show you what I have so far. With two threads I expected the code to run about twice as fast. Measured with a stopwatch, it seems to run about 6 times slower! EDIT: I am now timing it on the computer with the clock() function.
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);
int main(int argn, char** argv)
{
    // Program entry point
    std::cout << "Generating data..." << std::endl;
    // Create a vector containing many variables
    std::vector<double> data;
    for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);
    // Calculate mean using 1 core
    double mean = 0;
    std::cout << "Calculating mean, 1 Thread..." << std::endl;
    findmean(&data, 0, data.size(), &mean);
    mean /= (double)data.size();
    // Print result
    std::cout << "  Mean=" << mean << std::endl;
    // Repeat, using two threads
    std::vector<std::thread> thread;
    std::vector<double> result;
    result.push_back(0.0);
    result.push_back(0.0);
    std::cout << "Calculating mean, 2 Threads..." << std::endl;
    // Run threads
    uint32_t halfsize = data.size() / 2;
    uint32_t A = 0;
    uint32_t B, C, D;
    // Split the data into two blocks
    if(data.size() % 2 == 0)
    {
        B = C = D = halfsize;
    }
    else
    {
        B = C = halfsize;
        // The second block takes the extra element when the size is odd
        D = halfsize + 1;
    }
    // Run with two threads
    thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
    thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));
    // Join threads
    thread[0].join();
    thread[1].join();
    // Calculate result
    mean = result[0] + result[1];
    mean /= (double)data.size();
    // Print result
    std::cout << "  Mean=" << mean << std::endl;
    // Return
    return EXIT_SUCCESS;
}
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    for(uint32_t i = 0; i < length; i ++) {
        *result += (*datavec).at(start + i);
    }
}
I don't think this code is exactly wonderful; if you could suggest ways of improving it, I would be grateful for that too.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
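void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    // Accumulate into a local variable instead of writing through 'result' on every iteration.
    // Note: 'register' is only a hint; it was deprecated in C++11 and removed in C++17.
    register double holding = *result;
    for(uint32_t i = 0; i < length; i ++) {
        holding += (*datavec).at(start + i);
    }
    *result = holding;
}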
I can now report that the code runs in almost the same time with two threads as with a single thread. That is a big improvement (about 6x), but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization level to -O2. I will create a table with the results.
Results so far:
Original code with no optimization or register variable: 1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with the register variable added: 1 thread: 4.76 seconds, 2 threads: 4.76 seconds
With the register variable and -O2 optimization: 1 thread: 0.43 seconds, 2 threads: 0.6 seconds (2 threads is now slower?)
With Dameon's suggestion of putting a large block of memory between the two result variables: 1 thread: 0.42 seconds, 2 threads: 0.64 seconds
With TAS's suggestion of using iterators to access the contents of the vector: 1 thread: 0.38 seconds, 2 threads: 0.56 seconds
Same as above on a Core i7 920 (single-channel memory, 4GB): 1 thread: 0.31 seconds, 2 threads: 0.56 seconds
Same as above on a Core i7 920 (dual-channel memory, 2x2GB): 1 thread: 0.31 seconds, 2 threads: 0.35 seconds