The problem lies deep in the bowels of GAS, the GNU assembler, and how it generates DWARF debug information.
The compiler, GCC, is responsible for generating a specific sequence of instructions for a position-independent thread-local access. The sequence is documented in ELF Handling for Thread-Local Storage, page 22, section 4.1.6: x86-64 General Dynamic TLS Model. This sequence is:
0x00 .byte 0x66
0x01 leaq  x@tlsgd(%rip),%rdi
0x08 .word 0x6666
0x0a rex64
0x0b call __tls_get_addr@plt
It is the way it is because the 16 bytes it occupies leave room for backend/assembler/linker optimizations. Indeed, your compiler generates the following assembler for threadMain():
threadMain:
.LFB2:
        .file 1 "thread.c"
        .loc 1 14 0
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movq    %rdi, -8(%rbp)
        .loc 1 15 0
        .byte   0x66
        leaq    obj@tlsgd(%rip), %rdi
        .value  0x6666
        rex64
        call    __tls_get_addr@PLT
        movl    $1, (%rax)
        .loc 1 16 0
        ...
The linker then relaxes this code, which contains a function call (!), down to just two instructions in the final binary. These are:
- a mov having an fs: segment override, and
- a lea.
Between them they occupy 16 bytes in total (9 for the mov, 7 for the lea), demonstrating why the General Dynamic Model instruction sequence is designed to require 16 bytes.
(gdb) disas/r threadMain
Dump of assembler code for function threadMain:
   0x00000000004007f0 <+0>:     55      push   %rbp
   0x00000000004007f1 <+1>:     48 89 e5        mov    %rsp,%rbp
   0x00000000004007f4 <+4>:     48 83 ec 10     sub    $0x10,%rsp
   0x00000000004007f8 <+8>:     48 89 7d f8     mov    %rdi,-0x8(%rbp)
   0x00000000004007fc <+12>:    64 48 8b 04 25 00 00 00 00      mov    %fs:0x0,%rax
   0x0000000000400805 <+21>:    48 8d 80 f8 ff ff ff    lea    -0x8(%rax),%rax
   0x000000000040080c <+28>:    c7 00 01 00 00 00       movl   $0x1,(%rax)
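As a quick check of the byte counts, here is a minimal Python sketch. The per-entry sizes of the General Dynamic sequence are read off the offsets in the sequence listing (0x00, 0x01, 0x08, 0x0a, 0x0b) plus a 5-byte call (opcode and rel32), and the other encodings are copied from the disas/r output. The variable names are purely illustrative.

```python
# Sizes of each entry of the General Dynamic sequence, in bytes.
general_dynamic = [
    (".byte 0x66", 1),                 # data16 prefix
    ("leaq x@tlsgd(%rip),%rdi", 7),    # occupies offsets 0x01..0x07
    (".word 0x6666", 2),               # two more 0x66 bytes
    ("rex64", 1),                      # REX.W prefix
    ("call __tls_get_addr@plt", 5),    # opcode + 32-bit displacement
]

# Relaxed sequence, bytes copied from the disassembly.
relaxed = [
    ("mov %fs:0x0,%rax", bytes.fromhex("64488b042500000000")),
    ("lea -0x8(%rax),%rax", bytes.fromhex("488d80f8ffffff")),
]

# Function prologue, bytes copied from the disassembly.
prologue = [
    ("push %rbp", bytes.fromhex("55")),
    ("mov %rsp,%rbp", bytes.fromhex("4889e5")),
    ("sub $0x10,%rsp", bytes.fromhex("4883ec10")),
    ("mov %rdi,-0x8(%rbp)", bytes.fromhex("48897df8")),
]

gd_size = sum(size for _, size in general_dynamic)
relaxed_size = sum(len(enc) for _, enc in relaxed)
prologue_size = sum(len(enc) for _, enc in prologue)

print(gd_size, relaxed_size, prologue_size)
```

Both sequences come to 16 bytes, and the prologue comes to 12 bytes, which matches the <+12> offset at which the mov appears in the dump.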
So far, everything has been done correctly. The problem now begins as GAS generates DWARF debug information for your particular assembler code.
- While parsing line-by-line in binutils-x.y.z/gas/read.c, function void read_a_source_file (char *name), GAS encounters .loc 1 15 0, the statement that begins the next line, and runs the handler void dwarf2_directive_loc (int dummy ATTRIBUTE_UNUSED) in dwarf2dbg.c. Unfortunately, the handler does not unconditionally emit debug information for the current offset within the "fragment" (frag_now) of machine code it is currently building. It could have done this by calling dwarf2_emit_insn(0), but the .loc handler currently only does so if it sees multiple .loc directives consecutively. Instead, in our case it continues on to the next line, leaving the debug information unemitted.
 
- On the next line it sees the .byte 0x66 directive of the General Dynamic sequence. This is not, in and of itself, part of an instruction, despite representing the data16 instruction prefix in x86 assembly. GAS acts upon it with the handler cons_worker(), and the fragment grows from 12 bytes to 13.
 
- On the next line it sees a true instruction, leaq, which is parsed by calling the macro assemble_one() that maps to void md_assemble (char *line) in gas/config/tc-i386.c. At the very end of that function, output_insn() is called, which itself finally calls dwarf2_emit_insn(0) and causes debug information to be emitted at last. A new Line Number Statement (LNS) is begun that claims that line 15 began at function-start-address plus previous fragment size, but since we passed over the .byte statement before doing so, the fragment is 1 byte too large, and the computed offset for the first instruction of line 15 is therefore 1 byte off.
 
- Some time later, the linker relaxes the General Dynamic sequence to the final instruction sequence that starts with mov %fs:0x0,%rax. The code size and all offsets remain unchanged because both sequences of instructions are 16 bytes. The debug information is unchanged, and still wrong.
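The steps above can be illustrated with a toy model in Python. This is not GAS's code: line_table() and its statement tuples are hypothetical, and the only behaviour modelled is that a .loc is remembered and attached to the start of the next real instruction, while data directives grow the fragment without emitting a line entry. Treating rex64 as an instruction is an assumption that does not affect the result. The eager_loc flag models the alternative mentioned above, where the .loc handler would emit immediately.

```python
def line_table(statements, eager_loc=False):
    """Return [(line, offset)] as recorded by a toy assembler.

    statements is a list of (kind, arg) tuples:
      ("loc", n)   -> .loc directive for source line n
      ("data", k)  -> data directive (.byte, .value) emitting k bytes
      ("insn", k)  -> real instruction emitting k bytes
    """
    size = 0          # current fragment size in bytes
    pending = None    # line number waiting for an instruction
    table = []
    for kind, arg in statements:
        if kind == "loc":
            if eager_loc:
                table.append((arg, size))   # emit at the current offset
            else:
                pending = arg               # defer until the next instruction
        elif kind == "data":
            size += arg                     # fragment grows, no line entry
        elif kind == "insn":
            if pending is not None:
                table.append((pending, size))  # offset at start of this insn
                pending = None
            size += arg
    return table


# threadMain as emitted by the compiler, with sizes taken from the listings.
thread_main = [
    ("loc", 14),
    ("insn", 1),   # pushq %rbp
    ("insn", 3),   # movq %rsp, %rbp
    ("insn", 4),   # subq $16, %rsp
    ("insn", 4),   # movq %rdi, -8(%rbp)
    ("loc", 15),
    ("data", 1),   # .byte 0x66
    ("insn", 7),   # leaq obj@tlsgd(%rip), %rdi
    ("data", 2),   # .value 0x6666
    ("insn", 1),   # rex64 (assumed to be handled as an instruction)
    ("insn", 5),   # call __tls_get_addr@PLT
    ("insn", 6),   # movl $1, (%rax)
]

deferred = line_table(thread_main)
eager = line_table(thread_main, eager_loc=True)
print(deferred)
print(eager)
```

In the deferred model, line 15 is recorded at offset 13 instead of 12, which is the 1-byte error described above. In the eager model it lands at offset 12.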
 
GDB, when it reads the Line Number Statements, is told that the prologue of threadMain(), which is associated with line 14 (where its signature is found), ends where line 15 begins. GDB dutifully plants a breakpoint at that location, but unfortunately it is 1 byte too far.
When run without a breakpoint, the program runs normally, and the CPU sees:
64 48 8b 04 25 00 00 00 00      mov    %fs:0x0,%rax
Correctly placing the breakpoint would involve saving the first byte of the instruction and replacing it with int3 (opcode 0xcc), leaving:
cc                              int3
48 8b 04 25 00 00 00 00         mov    (0x0),%rax
The normal step-over sequence would then involve restoring the first byte of the instruction, setting the program counter (rip on x86-64) back to the address of that breakpoint, single-stepping, re-inserting the breakpoint, then continuing the program.
However, when GDB plants the breakpoint at the incorrect address 1 byte too far, the program sees instead
64 cc                           fs:int3
8b 04 25 00 00 00 00            <garbage>
which is a weird but still valid breakpoint. That's why you didn't see SIGILL (illegal instruction).
Now, when GDB attempts to step over, it restores the instruction byte, sets the PC to the address of the breakpoint, and this is what it sees now:
64                              fs:                # CPU DOESN'T SEE THIS!
48 8b 04 25 00 00 00 00         mov    (0x0),%rax  # <- CPU EXECUTES STARTING HERE!
# BOOM! SEGFAULT!
Because GDB restarted execution one byte too far, the CPU does not decode the fs: instruction prefix byte, and instead executes mov (0x0),%rax with the default segment, which is ds: (data). This immediately results in a read from address 0, the null pointer. The SIGSEGV promptly follows.
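The byte-level effect can be reproduced with a small Python sketch. It only manipulates the nine bytes of the mov instruction taken from the disassembly; plant() and step_over() are hypothetical helpers that mimic the save/patch/restore steps described above, not GDB's implementation.

```python
INT3 = 0xCC

# Bytes of "mov %fs:0x0,%rax"; index 0 corresponds to address 0x4007fc.
MOV_FS = bytes.fromhex("64488b042500000000")


def plant(code, offset):
    """Write int3 at offset; return (patched code, saved original byte)."""
    patched = bytearray(code)
    saved = patched[offset]
    patched[offset] = INT3
    return bytes(patched), saved


def step_over(patched, offset, saved):
    """Restore the saved byte and return the bytes decoded from offset on."""
    restored = bytearray(patched)
    restored[offset] = saved
    return bytes(restored[offset:])


good_patched, good_saved = plant(MOV_FS, 0)   # breakpoint at the right address
bad_patched, bad_saved = plant(MOV_FS, 1)     # breakpoint 1 byte too far

good_resume = step_over(good_patched, 0, good_saved)
bad_resume = step_over(bad_patched, 1, bad_saved)

print(good_patched.hex(" "))   # int3 replaces the fs: prefix
print(bad_patched.hex(" "))    # fs: prefix followed by int3
print(good_resume.hex(" "))    # full instruction, prefix included
print(bad_resume.hex(" "))     # instruction without the 0x64 prefix
```

With the misplaced breakpoint, the bytes decoded on resume begin at 0x48, so the 0x64 (fs:) prefix is never seen by the CPU.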
All due credits to Mark Plotnick for essentially nailing this.
The solution that was retained is to binary-patch cc1, gcc's actual C compiler, to emit data16 instead of .byte 0x66. This results in GAS parsing the prefix and instruction combination as a single unit, yielding the correct offset in the debug information.