This attempt to microbenchmark is too naive in almost every way possible for you to get any meaningful results.
Even if you fixed the surface problems (so the code didn't optimize away), there are major deep problems before you can conclude anything about when your asm would be better than the C * operator.
(Hint: probably never.  Compilers already know how to optimally multiply integers, and understand the semantics of that operation.  Forcing it to use imul instead of auto-vectorizing or doing other optimizations is going to be a loss.)
Both timed regions are empty because both multiplies can optimize away.  (The asm is not asm volatile, and you don't use the result.)  You're only measuring clock() overhead, plus noise and/or CPU frequency ramp-up to max turbo.
And even if they weren't, a single imul instruction is basically unmeasurable next to the overhead of a function call like clock().  Maybe if you serialized with lfence to force the CPU to wait for imul to retire before rdtsc...  See "RDTSCP in NASM always returns the same value".
Or you compiled with optimization disabled, which is pointless.
You basically can't measure a C * operator vs. inline asm without some kind of context involving a loop.  And then it will be for that context, dependent on what optimizations you defeated by using inline asm.  (And what if anything you did to stop the compiler from optimizing away work for the pure C version.)
Measuring only one number for a single x86 instruction doesn't tell you much about it.  You need to measure latency, throughput, and front-end uop cost to properly characterize its cost.  Modern x86 CPUs are superscalar, out-of-order, and pipelined, so the total cost of 2 instructions depends on whether they're dependent on each other, and on the surrounding context.  See "How many CPU cycles are needed for each assembly instruction?"
The stand-alone definitions of the functions are identical after your change to let the compiler pick registers, and your asm could inline somewhat efficiently, but it's still optimization-defeating.  gcc knows that 5*4 = 20 at compile time, so if you did use the result, multiply(4,5) could optimize to an immediate 20.  But gcc doesn't know what the asm does, so it has to feed it the inputs at least once.  (Being non-volatile means it can CSE the result if you used asmMultiply(4,5) in a loop, though.)
So among other things, inline asm defeats constant propagation.  This matters even if only one of the inputs is a constant, and the other is a runtime variable.  Multiplies by many small integer constants can be implemented with one or two LEA instructions or a shift (with lower latency than the 3-cycle imul on modern x86).
https://gcc.gnu.org/wiki/DontUseInlineAsm
The only use-case I could imagine asm helping is if a compiler used 2x LEA instructions in a situation that's actually front-end bound, where imul $constant, %[src], %[dst] would let it copy-and-multiply with 1 uop instead of 2.  But your asm removes the possibility of using immediates (you only allowed register constraints), and GNU C inline asm can't use a different template for an immediate vs. a register arg.  Maybe if you used multi-alternative constraints and a matching register constraint for the register-only part?  But no, you'd still have to have something like asm("imul %2, %1, %0" :...), and that 3-operand form doesn't exist for reg,reg,reg.
You could use if(__builtin_constant_p(a)) { asm using imul-immediate } else { return a*b; }, which would work with GCC to let you defeat LEA.  Or just require a constant multiplier anyway, since you'd only ever want to use this for a specific gcc version to work around a specific missed-optimization.  (i.e. it's so niche that in practice you wouldn't ever do this.)
Your code on the Godbolt compiler explorer, with clang7.0 -O3 for the x86-64 System V calling convention:
# clang7.0 -O3   (The functions both inline and optimize away)
main:                                   # @main
    push    rbx
    sub     rsp, 16
    call    clock
    mov     rbx, rax                 # save the return value
    call    clock
    sub     rax, rbx                 # end - start time
    cvtsi2sd        xmm0, rax
    divsd   xmm0, qword ptr [rip + .LCPI2_0]
    movsd   qword ptr [rsp + 8], xmm0 # 8-byte Spill
    call    clock
    mov     rbx, rax
    call    clock
    sub     rax, rbx             # same block again for the 2nd group.
    xorps   xmm0, xmm0
    cvtsi2sd        xmm0, rax
    divsd   xmm0, qword ptr [rip + .LCPI2_0]
    movsd   qword ptr [rsp], xmm0   # 8-byte Spill
    mov     edi, offset .L.str
    mov     al, 1
    movsd   xmm0, qword ptr [rsp + 8] # 8-byte Reload
    call    printf
    mov     edi, offset .L.str.1
    mov     al, 1
    movsd   xmm0, qword ptr [rsp]   # 8-byte Reload
    call    printf
    xor     eax, eax
    add     rsp, 16
    pop     rbx
    ret
TL:DR: if you want to understand inline asm performance on this fine-grained level of detail, you need to understand how compilers optimize in the first place.