To actually answer your fs:0 question: the x86_64 ABI requires that %fs:0 contains the address "pointed to" by %fs itself, i.e. the thread pointer stores a pointer to itself. An access such as %fs:-4 therefore loads the value stored 4 bytes below the address held in %fs:0. This self-pointer is necessary because user code cannot easily get the address pointed to by %fs; traditionally that means going through kernel code (on Linux, arch_prctl with ARCH_GET_FS). Having the address stored at %fs:0 thus makes working with thread-local storage much more efficient.
You can see this in action when you take the address of a thread local variable:
static __thread int test = 0;

int *f(void) {
    return &test;
}

int g(void) {
    return test;
}
compiles to
f:
    movq %fs:0, %rax
    leaq -4(%rax), %rax
    retq
g:
    movl %fs:-4, %eax
    retq
i686 does the same but with %gs. On aarch64 this is not necessary because the thread pointer lives in a system register (TPIDR_EL0) that user code can read directly, e.g. with mrs.
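If you want to poke at this yourself, here is a minimal sketch that reads the thread pointer both ways. It assumes Linux with GCC or Clang inline assembly; the names tls_value, thread_pointer and tls_value_offset are made up for illustration and are not part of any ABI or library.

```c
#include <stdint.h>

/* Hypothetical thread-local variable, standing in for `test` above. */
static __thread int tls_value = 0;

/* Return the thread pointer of the calling thread. */
uintptr_t thread_pointer(void) {
    uintptr_t tp;
#if defined(__x86_64__) && defined(__linux__)
    /* %fs cannot be read as an address, so load the self-pointer at %fs:0. */
    __asm__ volatile ("movq %%fs:0, %0" : "=r"(tp));
#elif defined(__aarch64__) && defined(__linux__)
    /* The thread pointer register can be read directly. */
    __asm__ volatile ("mrs %0, tpidr_el0" : "=r"(tp));
#else
#error "this sketch only covers Linux on x86_64 and aarch64"
#endif
    return tp;
}

/* Signed distance in bytes from the thread pointer to tls_value. */
intptr_t tls_value_offset(void) {
    return (intptr_t)((uintptr_t)&tls_value - thread_pointer());
}
```

On x86_64 Linux tls_value_offset() should come out negative, because thread-local data is placed below the thread pointer there (which is why the assembly above uses -4). On aarch64 the data sits above the thread pointer, so the offset should be positive.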