I've made a function to calculate the length of a C string (I'm trying to beat clang's optimizer using -O3). I'm running macOS.
_string_length1:
    push rbp
    mov rbp, rsp
    xor rax, rax
.body:
    cmp byte [rdi], 0
    je .exit
    inc rdi
    inc rax
    jmp .body
.exit:
    pop rbp
    ret
This is the C function I'm trying to beat:
size_t string_length2(const char *str) {
  size_t ret = 0;
  while (str[ret]) {
    ret++;
  }
  return ret;
}
And it disassembles to this:
string_length2:
    push rbp
    mov rbp, rsp
    mov rax, -1
LBB0_1:
    cmp byte ptr [rdi + rax + 1], 0
    lea rax, [rax + 1]
    jne LBB0_1
    pop rbp
    ret
Every C function sets up a stack frame using push rbp and mov rbp, rsp, and breaks it using pop rbp. But I'm not using the stack in any way here, I'm only using processor registers.  It worked without using a stack frame (when I tested on x86-64), but is it necessary?
 
     
     
    