It is entirely possible that in most implementations, the cost of a memmove() function call will not be significantly greater than that of memcpy() in any scenario in which the behavior of both is defined.  There are two points not yet mentioned, though:
- In some implementations, the determination of address overlap may be expensive.  There is no way in standard C to determine whether the source and destination pointers point into the same allocated area of memory, and thus no way that the greater-than or less-than operators can be used upon them without spontaneously causing cats and dogs to get along with each other (or invoking other Undefined Behavior).  It is likely that any practical implementation will have some efficient means of determining whether or not the two regions overlap, but the standard doesn't require that such a means exist.  A memmove() function written entirely in portable C would on many platforms probably take at least twice as long to execute as would a memcpy() also written entirely in portable C.
- Implementations are allowed to expand functions in-line when doing so would not alter their semantics.  On an 80x86 compiler, if the ESI and EDI registers don't happen to hold anything important, a memcpy(dest, src, 1234) could generate code:
  mov esi,[src]
  mov edi,[dest]
  mov ecx,1234/4 ; Compiler could notice it's a constant
  cld
  rep movsd      ; Copy 308 dwords (1232 bytes)
  movsw          ; Copy the remaining two bytes
 This would take roughly the same amount of in-line code as, but run much faster than:
  push dword 1234
  push [src]
  push [dest]
  call _memcpy
  add  esp,12  ; Caller removes the arguments (cdecl)
  ...
_memcpy:
  push ebp
  mov  ebp,esp
  mov  ecx,[ebp+numbytes]
  test ecx,3   ; See if it's a multiple of four
  jz   multiple_of_four
  ...          ; Handling of other sizes omitted
multiple_of_four:
  push esi ; Can't know if caller needs this value preserved
  push edi ; Can't know if caller needs this value preserved
  mov  esi,[ebp+src]
  mov  edi,[ebp+dest]
  shr  ecx,2   ; Convert byte count to dword count
  cld
  rep movsd
  pop  edi
  pop  esi
  pop  ebp
  ret
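
To make the first point above concrete, here is a minimal sketch of a memmove() written using only operations that standard C defines for arbitrary pointers.  The name portable_memmove is hypothetical, and this is just one possible approach, not how any particular library does it.  Since == is defined for any two object pointers while < and > are not, it scans the source range for the destination's start address before copying, which is where the extra pass over the data comes from:

```c
#include <stddef.h>

/* Hypothetical name; a sketch, not a library implementation. */
void *portable_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t i;
    int dest_inside_src = 0;

    /* Equality comparison is defined even for pointers into unrelated
       objects, so look for the destination's start inside the source.
       This costs an extra O(n) pass before any byte is copied. */
    for (i = 1; i < n; i++) {
        if (s + i == d) {
            dest_inside_src = 1;
            break;
        }
    }

    if (dest_inside_src) {
        /* Destination starts inside the source: copy backward. */
        for (i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    } else {
        /* No overlap, or destination precedes source: copy forward. */
        for (i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dest;
}
```

A real library would almost certainly use an implementation-specific address comparison instead of the scan; the sketch only shows what staying within the standard costs.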
 
Quite a number of compilers will perform such in-line expansion with memcpy().  I don't know of any that will do it with memmove(), although in some cases an optimized version of memcpy() may offer the same semantics as memmove().  For example, if numbytes were 20:
; Assuming values in eax, ebx, ecx, edx, esi, and edi are not needed
  mov esi,[src]
  mov eax,[esi]
  mov ebx,[esi+4]
  mov ecx,[esi+8]
  mov edx,[esi+12]
  mov edi,[esi+16]
  mov esi,[dest]
  mov [esi],eax
  mov [esi+4],ebx
  mov [esi+8],ecx
  mov [esi+12],edx
  mov [esi+16],edi
This will work correctly even if the address ranges overlap, since it effectively makes a copy (in registers) of the entire region to be moved before any of it is written.  In theory, a compiler could process memmove() by seeing if treating it as memcpy() would yield an implementation that would be safe even if the address ranges overlap, and call _memmove only in those cases where substituting the memcpy() implementation would not be safe.  I don't know of any that do such optimization, though.
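
The same load-everything-then-store-everything idea can be written in C.  The sketch below uses a hypothetical helper named copy20 with a fixed size of 20 bytes to match the assembly above; the temporary array stands in for the registers, and whether a given compiler actually keeps it in registers is an assumption, not a guarantee:

```c
#include <string.h>

/* Hypothetical helper: copies exactly 20 bytes and is safe for
   overlapping ranges, because the whole source is read before
   anything is written to the destination. */
void copy20(void *dest, const void *src)
{
    unsigned char tmp[20];      /* plays the role of the registers */

    memcpy(tmp, src, sizeof tmp);   /* read the entire source first */
    memcpy(dest, tmp, sizeof tmp);  /* only then write the destination */
}
```

Neither memcpy() call here ever sees overlapping arguments, since tmp is a distinct local object, so the function has memmove() semantics for its fixed size.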