Why is allocation on the heap faster than allocation on the stack?

Question

As far as my knowledge on resource management goes, allocating something on the heap (operator new) should always be slower than allocating on the stack (automatic storage), because the stack is a LIFO-based structure, thus it requires minimal bookkeeping, and the pointer of the next address to allocate is trivial.

So far, so good. Now look at the following code:

/* ...includes... */

using std::cout;
using std::cin;
using std::endl;

int bar() { return 42; }

int main()
{
    auto s1 = std::chrono::steady_clock::now();
    std::packaged_task<int()> pt1(bar);
    auto e1 = std::chrono::steady_clock::now();

    auto s2 = std::chrono::steady_clock::now();
    auto sh_ptr1 = std::make_shared<std::packaged_task<int()> >(bar);
    auto e2 = std::chrono::steady_clock::now();

    auto first = std::chrono::duration_cast<std::chrono::nanoseconds>(e1-s1);
    auto second = std::chrono::duration_cast<std::chrono::nanoseconds>(e2-s2);

    cout << "Regular: " << first.count() << endl
         << "Make shared: " << second.count() << endl;

    pt1();
    (*sh_ptr1)();

    cout << "As you can see, both are working correctly: " 
         << pt1.get_future().get() << " & " 
         << sh_ptr1->get_future().get() << endl;

    return 0;
}

The results seem to contradict the stuff explained above:

Regular: 6131

Make shared: 843

As you can see, both are working correctly: 42 & 42

Program ended with exit code: 0

In the second measurement, apart from the call of operator new, the constructor of the std::shared_ptr (auto sh_ptr1) has to finish. I can't seem to understand why is this faster then regular allocation.

What is the explanation for this?

Did you try swapping the order of actions? – Petr Jul 24 '15 at 15:04 — Petr, Jul 24 '15 at 15:04

score 29 · Accepted Answer · edited May 23 '17 at 12:07

The problem is that the first call to the constructor of std::packaged_task is responsible for initializing a load of per-thread state that is then unfairly attributed to pt1. This is a common problem of benchmarking (particularly microbenchmarking) and is alleviated by warmup; try reading How do I write a correct micro-benchmark in Java?

If I copy your code but run both parts first, the results are the same to within the limits of the resolution of the system clock. This demonstrates another issue of microbenchmarking, that you should run small tests multiple times to allow total time to be measured accurately.

With warmup and running each part 1000 times, I get the following (example):

Regular: 132.986
Make shared: 211.889

The difference (approx 80ns) accords well with the rule of thumb that malloc takes 100ns per call.

score 8 · Answer 2 · answered Jul 24 '15 at 15:11

It is a problem with your micro-benchmark: if you swap the order in which you measure the timing, you would get opposite results (demo).

It looks like the first-time call of std::packaged_task constructor causes a big hit. Adding an untimed

std::packaged_task<int()> ignore(bar);

before measuring the time fixes this problem (demo):

Regular: 505
Make shared: 937

score 5 · Answer 3 · answered Jul 24 '15 at 15:11

I've tried your example at ideone and got a result similar to yours:

Regular: 67950 
Make shared: 696

Then I reversed the order of tests:

auto s2 = std::chrono::steady_clock::now();
auto sh_ptr1 = std::make_shared<std::packaged_task<int()> >(bar);
auto e2 = std::chrono::steady_clock::now();

auto s1 = std::chrono::steady_clock::now();
std::packaged_task<int()> pt1(bar);
auto e1 = std::chrono::steady_clock::now();

and found an opposite result:

Regular: 548
Make shared: 68065

So that's not difference of stack vs heap, but difference of first and second call. Maybe you need to look into the internals of std::packaged_task.

Why is allocation on the heap faster than allocation on the stack?

3 Answers3