Under the ARM procedure call standard (AAPCS) for the 32-bit architecture, a sufficiently small struct, such as one with a single pointer-sized data member, can be returned in a register rather than in memory on the stack.
A C++ unique_ptr (with the default deleter) is only the size of a single pointer, so I would expect it to be returned in a register, too. But, at least with gcc, that's not the case.
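To make the comparison concrete, here is a minimal sketch of the two cases. `RawHolder`, `make_raw` and `make_owned` are hypothetical names introduced only for illustration. The size checks assume a common standard library implementation (libstdc++, libc++ or the MSVC STL), since the standard itself does not guarantee the size of unique_ptr. The type-trait checks show one observable difference between the two types, which may or may not be the whole explanation.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical single-member struct: no user-provided special member functions.
struct RawHolder {
    int* p;
};

// Both types are pointer-sized (assumption: default deleter, common library).
static_assert(sizeof(RawHolder) == sizeof(int*), "single-member struct");
static_assert(sizeof(std::unique_ptr<int>) == sizeof(int*), "unique_ptr<int>");

// Observable difference: the plain struct is trivially copyable,
// while unique_ptr has a non-trivial destructor and move constructor.
static_assert(std::is_trivially_copyable<RawHolder>::value, "trivial");
static_assert(!std::is_trivially_copyable<std::unique_ptr<int>>::value, "non-trivial");
static_assert(!std::is_trivially_destructible<std::unique_ptr<int>>::value, "non-trivial");

// Candidate for being returned in a register.
RawHolder make_raw(int v) {
    return RawHolder{new int(v)};
}

// The case from the question: same size, but gcc returns it in memory.
std::unique_ptr<int> make_owned(int v) {
    return std::unique_ptr<int>(new int(v));
}
```

Compiling both functions for a 32-bit ARM target with something like `-O2 -S` and comparing the generated assembly should show the difference in how the result is handed back.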
I fail to understand what it is about unique_ptr that defeats this optimization. Could unique_ptr be implemented in such a way as to take advantage of the small struct optimization, or is there an insurmountable obstacle?
The most important downside of having unique_ptr returned on the stack is that an additional register is consumed in the procedure call to hold the address where the result is to be stored. Hence I conjecture that there is room for some optimization, at least on the ARM platform, and possibly on others, too.
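The cost described above can be pictured by writing the "return in memory" convention out by hand. This is only a sketch of what the compiler effectively does, not actual generated code: `make_owned_into` and its `result_slot` parameter are hypothetical, and the explicit pointer argument stands in for the hidden result address that the caller passes in a register.

```cpp
#include <cassert>
#include <memory>
#include <new>

using IntPtr = std::unique_ptr<int>;

// Hand-written stand-in for a function returning unique_ptr<int> in memory:
// the caller supplies the address of uninitialized storage for the result,
// and the callee constructs the unique_ptr directly in that storage.
IntPtr* make_owned_into(void* result_slot, int v) {
    return ::new (result_slot) IntPtr(new int(v));
}
```

A caller would reserve suitably aligned storage, pass its address, and later destroy the object it receives:

    alignas(IntPtr) unsigned char slot[sizeof(IntPtr)];
    IntPtr* p = make_owned_into(slot, 42);
    p->~IntPtr();

The extra pointer argument is the register that a return in a register would not need.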