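The difference between a macro and an inlined function is that a macro is expanded by the preprocessor, as plain text substitution, before the compiler proper ever sees it. An inlined function is a real function that the compiler chooses to expand at the call site.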
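For reference, here is a minimal sketch of the two forms being compared. The exact definitions are an assumption on my part (a `square` function taking an `int`, which matches the mangled name `_Z6squarei` in the listings, and a hypothetical `SQUARE` macro with the usual defensive parentheses); yours may differ.

```cpp
#include <cassert>

// Assumed macro version: the preprocessor pastes the argument text in twice.
#define SQUARE(x) ((x) * (x))

// Assumed function version: an ordinary function the optimiser may inline.
int square(int x) {
    return x * x;
}

// Small sanity check: for side-effect-free arguments both give the same answer.
void demo() {
    assert(square(7) == 49);
    assert(SQUARE(7) == 49);
    assert(SQUARE(1 + 2) == 9);  // the parentheses in the macro matter here
}
```

Building something like this with and without `-O3` and disassembling the result is roughly how the listings in this answer were produced.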
On my compiler (clang++), without optimisation flags, the square function won't be inlined. The code it generates looks like this:
4009f0:       55                      push   %rbp
4009f1:       48 89 e5                mov    %rsp,%rbp
4009f4:       89 7d fc                mov    %edi,-0x4(%rbp)
4009f7:       8b 7d fc                mov    -0x4(%rbp),%edi
4009fa:       0f af 7d fc             imul   -0x4(%rbp),%edi
4009fe:       89 f8                   mov    %edi,%eax
400a00:       5d                      pop    %rbp
400a01:       c3                      retq   
The imul is the assembly instruction doing the work; the rest is moving data around.
The code that calls it looks like this:
  400969:       e8 82 00 00 00          callq  4009f0 <_Z6squarei>
If I add the -O3 flag to inline it, that imul shows up directly in the main function, at the point where square is called in the C++ code:
0000000000400a10 <main>:
400a10:       41 56                   push   %r14
400a12:       53                      push   %rbx
400a13:       50                      push   %rax
400a14:       48 8b 7e 08             mov    0x8(%rsi),%rdi
400a18:       31 f6                   xor    %esi,%esi
400a1a:       ba 0a 00 00 00          mov    $0xa,%edx
400a1f:       e8 9c fe ff ff          callq  4008c0 <strtol@plt>
400a24:       48 89 c3                mov    %rax,%rbx
400a27:       0f af db                imul   %ebx,%ebx
It's worth getting a basic handle on assembly language for your machine and using gcc -S on your source, or objdump -D on your binary (as I did here), to see exactly what is going on.
Using the macro instead of the inlined function produces something very similar:
0000000000400a10 <main>:
400a10:       41 56                   push   %r14
400a12:       53                      push   %rbx
400a13:       50                      push   %rax
400a14:       48 8b 7e 08             mov    0x8(%rsi),%rdi
400a18:       31 f6                   xor    %esi,%esi
400a1a:       ba 0a 00 00 00          mov    $0xa,%edx
400a1f:       e8 9c fe ff ff          callq  4008c0 <strtol@plt>
400a24:       48 89 c3                mov    %rax,%rbx
400a27:       0f af db                imul   %ebx,%ebx
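Note one of the many dangers here with macros: what does this do?
x = 5; std::cout << SQUARE(++x) << std::endl; 
36? Nope. On my compiler it prints 42. It becomes 
std::cout << ++x * ++x << std::endl; 
which here was evaluated as 6 * 7. Strictly speaking, modifying x twice in one expression with nothing sequencing the two increments is undefined behaviour in C++, so another compiler is free to print something else entirely.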
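The double evaluation can be shown without relying on undefined behaviour by routing the side effect through a function call. This is only a sketch: `bump`, `macro_result` and `function_result` are hypothetical helper names I made up for illustration, and `SQUARE` and `square` are assumed to be defined as shown.

```cpp
#include <cassert>

// Assumed definitions of the macro and the function being compared.
#define SQUARE(x) ((x) * (x))
int square(int x) { return x * x; }

// Hypothetical helper with a visible side effect:
// increments n and returns the new value.
int bump(int& n) { return ++n; }

// Expands to ((bump(n)) * (bump(n))), so bump runs twice.
int macro_result(int& n) { return SQUARE(bump(n)); }

// The argument is evaluated once, then passed by value.
int function_result(int& n) { return square(bump(n)); }

void demo() {
    int a = 5;
    assert(macro_result(a) == 42);     // 6 * 7, in whichever order the calls run
    assert(a == 7);                    // incremented twice

    int b = 5;
    assert(function_result(b) == 36);  // 6 * 6
    assert(b == 6);                    // incremented once
}
```

Because the two calls to `bump` are function calls, they cannot interleave, so the macro result is well defined here even though the order of the calls is unspecified. The function version evaluates its argument exactly once, which is the behaviour most people expect.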
Don't be put off by people telling you not to care about optimisation. Using C or C++ as your language is an optimisation in itself. Just try to work out if you're wasting time with it and be sensible.