
I am having difficulties moving data from one place in memory to another in the .bss section. My implementation somewhat works; however, after the move some weird characters appear in the first couple of bytes, and half of my string is missing while the other half is fine.

This is the value I get when I print out `message`: �F@elcome to the new life

I need all the help I can get; what am I missing? I've gone over my code a hundred times.

    section .data
hello:       db  "Hello, Welcome to the new life! Lets begin the journey.",10
hello_len:   equ  $ - hello

    section .bss
message: resb 255

    section .text

mov rdi, hello
mov rsi, message

msg_into_message:
    cmp byte [rdi], 10          ; hello ends with a new line char
    je end_count
    mov al, byte [rdi]
    mov byte [rsi], al
    inc rsi
    inc rdi
    jmp msg_into_message

end_count:
    mov [message], rsi
    ret

    ; Prints message
    mov rsi, message
    mov rdx, hello_len
    call pre_print
    syscall
  • Why are you not using [movsb](http://faydoc.tripod.com/cpu/movsb.htm) to move from RSI to RDI? – David C. Rankin Oct 27 '19 at 02:17
  • I know that's probably not their first concern, but wouldn't movsb be dog slow? Because I know if I tell a compiler to generate code for void f(char *d, char *s) { *d = *s; } it's gonna do mov al, [rsi]; mov [rdi], al – Gabriel Ravier Oct 27 '19 at 02:22
  • @GabrielRavier: Yup, `movsb` on Skylake is 5 uops (> 4 thus needing the microcode sequencer), with a throughput of 1 per 4 cycles if used back-to-back (https://agner.org/optimize/). And it wouldn't leave the value in a register for the loop condition. `rep movsb` using the assemble-time-constant `hello_len` might be better than a byte-at-a-time loop, though; the string isn't trivially short. Of course 4x `movdqu` loads+stores (with partial overlap for the last one) can cover the whole 56-byte string so that would be more efficient even than `rep movsb` startup, the string isn't long either. – Peter Cordes Oct 27 '19 at 02:35
  • @GabrielRavier: And BTW, gcc would actually do `movzx eax, byte [rsi]` to avoid a false dependency. No reason to ask the CPU to merge a byte into the old value of RAX if you aren't going to read it. `movzx` loads on x86 are equivalent to `lbu` or `ldrb` loads on MIPS or ARM. `mov al, [mem]` loads are CISC partial-register weirdness that you generally want to avoid. – Peter Cordes Oct 27 '19 at 02:51
  • @PeterCordes: I wonder why gcc is alone in doing this when clang and icc don't do it. I just checked the performance for this on as many chips as I could (all those supported by llvm-mca) and this is the result: https://godbolt.org/z/RMacpG Seems like the movzx is faster by 2% on silvermont-tremont (where you'd have the "CISC weirdness"), the mov is 2% faster on bdver1-2 and btver2 (maybe because it's smaller?), and on all other arches they're equal – Gabriel Ravier Oct 27 '19 at 06:03
  • @GabrielRavier: Those LLVM-MCA numbers for 2 isolated insns aren't really meaningful I don't think. On most CPUs, the main potential downside is a false dependency coupling this load/store onto the end of whatever last used RAX in program order. (Many compilers do write partial regs sometimes so it needs to be supported somewhat efficiently. But on Haswell and later `mov al, [mem]` is a micro-fuse load + ALU-merge, instead of renaming `al` separately from RAX. [see this Q&A](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake) – Peter Cordes Oct 27 '19 at 06:17
  • @PeterCordes The thing is, gcc's behaviour is constant across all arches and clang's and icc's behaviour is constant across all arches, so do they just not have these optimized for specific architectures? – Gabriel Ravier Oct 27 '19 at 18:27
  • @GabrielRavier: Intel CPUs before Haswell did rename AL separately from RAX, avoiding this risk, so it's a recent change. I haven't benchmarked in detail, but I'm pretty sure clang's code-gen could cause problems if it gets unlucky and the full register *was* the end of a long dep chain (e.g. cache-miss loads). Usually it's not. So it's basically saving a byte of code size but taking a risk of false dependencies. When it would cost another instruction to break deps, like `pxor xmm0, xmm0` / `cvtsi2ss xmm0, eax`, that makes some sense, but not for 1 byte for `mov al` vs. `movzx eax` IMO. – Peter Cordes Oct 27 '19 at 18:38
  • @GabrielRavier: Was just playing with some code in clang where it uses `movzx` loads for `char`: https://godbolt.org/z/3ISZHc. clang3.9 and newer use `movzx`, clang 3.8 and older use `mov` with a possible false dependency. (Regardless of `-march=sandybridge` vs. `-march=haswell`. I think compilers (and many compiler devs) aren't super aware of *exactly* how partial registers behave.) – Peter Cordes Oct 27 '19 at 20:09
  • @PeterCordes Yeah, the lack of the generation for them changing depending on march seems like something I'd want to send to their bugzillas – Gabriel Ravier Oct 28 '19 at 12:49

1 Answer


Okay, first things first: the s and d in rsi and rdi stand for source and destination. It may work the other way (as you have it), but you'll upset a lot of CDO people like myself (a) :-)

But, for your actual problem, look here:

end_count:
    mov [message], rsi

I assume that's meant to copy the final newline byte (10) into the destination, but there are two problems:

  1. message is the start of the buffer, not the position where the byte should go.
  2. You're storing the full eight-byte rsi register there, not the single byte you need.

Those two points mean that you're putting some weird value into the first few bytes: the eight bytes of the rsi pointer land on top of "Hello, W", which is exactly why your output starts with a few garbage characters and then picks up again at "elcome to the new life", as your symptoms suggest.

Perhaps a better way to do it would be as follows:

    mov rsi, hello            ; as Gordon Moore intended :-)
    mov rdi, message

put_str_into_message:
    mov al, byte [rsi]        ; get byte, increment src ptr.
    inc rsi

    mov byte [rdi], al        ; put byte, increment dst ptr.
    inc rdi

    cmp al, 10                ; continue until newline.
    jne put_str_into_message

    ret
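
As a sanity check (not part of the original question), here is one way you might call that routine and print the copied bytes with the write system call on Linux x86-64. The _start label, the length calculation, and the exit are illustrative additions, and it assumes hello, message and put_str_into_message are defined exactly as above:

    global _start

_start:
    mov  rsi, hello               ; source
    mov  rdi, message             ; destination
    call put_str_into_message     ; on return, rdi points one past the last byte stored

    mov  rdx, rdi
    sub  rdx, message             ; rdx = number of bytes actually copied (newline included)
    mov  rax, 1                   ; sys_write
    mov  rdi, 1                   ; fd 1 = stdout
    mov  rsi, message
    syscall

    mov  rax, 60                  ; sys_exit
    xor  edi, edi
    syscall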

For completeness, if you didn't want the newline copied (though this is pretty much what you have now, just with the errant buffer-damaging mov taken away) (b):

put_str_into_message:
    mov al, byte [rsi]        ; get byte.
    cmp al, 10                ; stop before newline.
    je  stop_str

    mov byte [rdi], al        ; put byte, increment pointers.
    inc rsi
    inc rdi

    jmp put_str_into_message

stop_str:
    ret

(a) CDO is obsessive-compulsive disorder, but with the letters arranged correctly :-)


(b) Or the don't-copy-newline loop can be done more efficiently, while still having a single branch at the bottom.

Looping one byte at a time is still very inefficient (x86-64 has SSE2 which lets you copy and check 16 bytes at a time). Since you have the length as an assemble-time constant hello_len, you could use that to efficiently copy in wide chunks (possibly needing special handling at the end if your buffer size is not a multiple of 16), or with rep movsb.
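
For example, a rough sketch of the rep movsb route (an illustration, not code from the question) could be as simple as the following; it copies the whole string, trailing newline included, because hello_len counts it:

    mov  rsi, hello          ; source
    mov  rdi, message        ; destination
    mov  ecx, hello_len      ; byte count, known at assemble time (zero-extends into rcx)
    cld                      ; direction flag clear = copy forwards (the SysV ABI already guarantees this)
    rep  movsb               ; copy rcx bytes from [rsi] to [rdi]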

But the loop below demonstrates an efficient loop structure: using movzx avoids the false dependency of merging a new AL into the bottom of RAX, which lets out-of-order exec run ahead and "see" the loop-exit branch earlier.

strcpy_newline_end:
    movzx  eax, byte [rsi]    ; get byte (without false dependency).
    cmp    al, 10
    je    copy_done           ; bail out if the very first byte is a newline.

copy_loop:                    ; do:
    mov    [rdi], al          ;    put byte.
    inc    rsi                ;    increment both pointers.
    inc    rdi
    movzx  eax, byte [rsi]    ;    get next byte.
    cmp    al, 10
    jne   copy_loop           ; until you get a newline.

; After falling out of the loop (or jumping here from the top)
; we've loaded but *not* stored the terminating newline

copy_done:
    ret

You should also be aware there are other tricks you can use to save instructions inside the loop, such as addressing one string relative to the other (with an indexed addressing mode for the load, only incrementing one pointer).

However, we won't cover them in detail here, as it risks making the answer more complex than it needs to be; a quick sketch of the indexed-addressing idea follows for the curious.
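
A minimal sketch of that one-pointer idea, assuming the same hello and message labels as above (and copying the newline, like the first loop):

    mov   rdi, message             ; destination pointer (the only one we advance)
    mov   rcx, hello
    sub   rcx, rdi                 ; rcx = constant offset from destination to source

copy_one_pointer:
    movzx eax, byte [rdi + rcx]    ; load the source byte via the offset
    mov   [rdi], al                ; store it to the destination
    inc   rdi                      ; advancing rdi advances both effective addresses
    cmp   al, 10
    jne   copy_one_pointer         ; single conditional branch at the bottom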

paxdiablo
  • The don't-copy-newline loop can be "skewed" or whatever it's called so you can still put the conditional branch at the bottom. Peel the first load and put the store at the top. (Or start by jumping to the load + compare at the bottom of the loop.) – Peter Cordes Oct 27 '19 at 02:27
  • @Peter, does that gain us anything? We still have a conditional and unconditional jump, just in different places. – paxdiablo Oct 27 '19 at 02:29
  • *Inside the loop* you only have one conditional jump, at the bottom. Looking for an example, I was thinking I'd included one in [Why are loops always compiled into "do...while" style (tail jump)?](//stackoverflow.com/q/47783926) but no, I didn't. – Peter Cordes Oct 27 '19 at 02:39
  • Added an example of what I was talking about, along with using movzx. To keep code-size down, you'd just `jmp .loop_entry` like `gcc -O1` does IIRC, instead of peeling the first load + cmp/je. Also mentioned that looping 1 byte at a time is total garbage on x86-64, to hopefully pique the curiosity of some readers. – Peter Cordes Oct 27 '19 at 03:09
  • BTW, the "special handling" for a non-overlapping memcpy is easy when you know the source is at least 16 bytes: leave the main loop while there are 1..16 bytes left to copy. Do a `movdqu` load of the *last* 16-byte chunk of the buffer, potentially overlapping with a previous vector. Store buffer + cache can easily absorb the overlap, making that more efficient than more branchy cleanup using shorter vectors. Especially if you had 15 bytes left = non-overlapping cleanup of 8+4+2+1 bytes. (Leaving this in a comment; it would bloat the answer too much to explain that level of detail.) – Peter Cordes Oct 27 '19 at 04:35
  • Thanks, @Peter, you've certainly provided detail over and above what was needed (a simple "remove the `mov` statement"). I do appreciate the education (though I footnoted it so it wouldn't interfere) but I think you're right in that the answer's probably complicated enough for now :-) – paxdiablo Oct 27 '19 at 04:37