I've been told several times that I should use std::async for fire-and-forget tasks, with the std::launch::async policy (so that it preferably works its magic on a new thread of execution).
Encouraged by these statements, I wanted to see how std::async compares to:
- sequential execution
- a simple detached std::thread
- my simple async "implementation"
My naive async implementation looks like this:
template <typename F, typename... Args>
auto myAsync(F&& f, Args&&... args) -> std::future<decltype(f(args...))>
{
    std::packaged_task<decltype(f(args...))()> task(std::bind(std::forward<F>(f), std::forward<Args>(args)...));
    auto future = task.get_future();
    std::thread thread(std::move(task));
    thread.detach();
    return future;
}
Nothing fancy here: it packs the functor f and its arguments into an std::packaged_task, launches that task on a new std::thread which is immediately detached, and returns the std::future obtained from the task.
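For reference, a minimal usage sketch of myAsync (my own example, not part of the benchmark; it assumes <future>, <thread>, <functional> and <iostream> are included and that the myAsync template above is visible):
int add(int a, int b) { return a + b; }

int main()
{
    // add(2, 3) runs on a freshly created, detached thread
    std::future<int> result = myAsync(add, 2, 3);
    // get() blocks until the detached thread has produced the value
    std::cout << result.get() << std::endl; // prints 5
    return 0;
}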
And now the code measuring execution time with std::chrono::high_resolution_clock:
int main(void)
{
    constexpr unsigned short TIMES = 1000;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        someTask();
    }
    auto dur = std::chrono::high_resolution_clock::now() - start;
    auto tstart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        std::thread t(someTask);
        t.detach();
    }
    auto tdur = std::chrono::high_resolution_clock::now() - tstart;
    std::future<void> f;
    auto astart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        f = std::async(std::launch::async, someTask);
    }
    auto adur = std::chrono::high_resolution_clock::now() - astart;
    auto mastart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        f = myAsync(someTask);
    }
    auto madur = std::chrono::high_resolution_clock::now() - mastart;
    std::cout << "Sequential: " << std::chrono::duration_cast<std::chrono::microseconds>(dur).count() << std::endl
              << "Threaded: " << std::chrono::duration_cast<std::chrono::microseconds>(tdur).count() << std::endl
              << "std::async: " << std::chrono::duration_cast<std::chrono::microseconds>(adur).count() << std::endl
              << "My async: " << std::chrono::duration_cast<std::chrono::microseconds>(madur).count() << std::endl;
    return EXIT_SUCCESS;
}
Here someTask() is a simple function that waits a little, simulating some work being done:
void someTask()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
Finally, my results (in microseconds):
- Sequential: 1263615
- Threaded: 47111
- std::async: 821441
- My async: 30784
Could anyone explain these results? It seems like std::async is much slower than my naive implementation, or even plain detached std::threads. Why is that? Given these results, is there any reason to use std::async?
(Note that I ran this benchmark with both clang++ and g++, and the results were very similar.)
UPDATE:
After reading Dave S's answer I updated my little benchmark as follows:
std::future<void> f[TIMES];
auto astart = std::chrono::high_resolution_clock::now();
for (int i = 0; i < TIMES; ++i)
{
    f[i] = std::async(std::launch::async, someTask);
}
auto adur = std::chrono::high_resolution_clock::now() - astart;
So the std::futures are now not destroyed (and thus joined) on every iteration. After this change in the code, std::async produces results similar to my implementation and to detached std::threads.
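To make the original difference concrete, here is a small sketch (my own addition, not from the original benchmark) showing that the std::future returned by std::async with std::launch::async blocks in its destructor until the task has finished, which is exactly the join the first version of the loop paid for on every iteration:
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

void someTask()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

int main()
{
    auto start = std::chrono::high_resolution_clock::now();
    {
        auto f = std::async(std::launch::async, someTask);
    } // ~future() waits here until someTask() has completed
    auto dur = std::chrono::high_resolution_clock::now() - start;
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(dur).count()
              << " us, including the blocking destructor" << std::endl;
    return 0;
}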