Arm Assembly for RPI3 b+: why make xor on register for counter?

Question

I was trying to write a program to blink an LED on a RPI3 b+ in Armv7 assembly and noticed that it wasn't working with this code for the delay function:

delay:
    b loop

loop:
    add r10, r10, #1
    cmp r10, r4
    bne loop
    beq return

return:
    mov r10, #0
    bx lr

r10 is the register used for the counter, and r4 contains the value that r10 needs to reach to stop and return to the main code. After looking at a tutorial, I found that they perform an xor operation on the counter register. I added the correction, and now the code looks like this:

delay:
    eor r10, r10, r10
    b loop

loop:
    add r10, r10, #1
    cmp r10, r4
    bne loop
    beq return

return:
    mov r10, #0
    bx lr

I compiled it, loaded it onto the RPI3, and now it works. But why did I have to add that line? I know what an xor gate does, but if the two inputs are equal, it'll return the exact same value. What is the sense of this operation?

Then you misunderstood `xor`. If the inputs are the same it produces a zero. It is common practice to use that to zero a register. Since you want `r10` to count from zero, you should zero it. You don't need to use `xor` for it of course, you can just as well `mov r10, #0`. Also the `b loop` is useless (it's the next instruction anyway). It's unclear why `r10` is zeroed at the end before returning but that is probably a bad idea. — Jester, Aug 18 '23 at 15:14
It's common practice to use xor to zero a register **on x86**, because of various quirks and history that led to it being more efficient than the obvious `mov` instruction. None of that applies to ARM, so you can and should use `mov r10, #0` which is clearer and at least as efficient, possibly more so because there is no false read dependency on `r10`. — Nate Eldredge, Aug 18 '23 at 15:19
To add to the code reviewing, two conditional branches in a row are unnecessary and so poor practice; a conditional branch to the very next instruction is also pointless. — Erik Eidt, Aug 18 '23 at 16:01

score 3 · Accepted Answer · answered Aug 19 '23 at 02:26

TL;DR: XOR same,same is similar to sub same,same, producing zero.

This tutorial is not good, and neither is XOR-zeroing on ARM, or any RISC ISA. Only use it in x86 asm (and 8080), not in asm for other ISAs, and not in high-level languages.

> but if the two inputs are equal, it'll return the exact same value.

No, that would be regular non-exclusive OR. XOR gives you the bits that were different. When both inputs are the same, the result is 0.
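A quick way to check this for yourself is with a few one-line C functions. C is used here only because it is easy to run on any machine, and the function names are made up for this illustration:

```c
#include <stdint.h>

/* XOR keeps only the bit positions where the two inputs differ. */
uint32_t xor_bits(uint32_t a, uint32_t b) { return a ^ b; }

/* XOR of a value with itself: no bits differ, so the result is 0. */
uint32_t xor_self(uint32_t x) { return x ^ x; }

/* Plain OR of a value with itself returns the value unchanged. */
uint32_t or_self(uint32_t x) { return x | x; }

/* Subtracting a value from itself also produces 0, like XOR. */
uint32_t sub_self(uint32_t x) { return x - x; }
```

For any input, xor_self and sub_self return 0, while or_self returns its argument, which is the behaviour the question attributed to XOR.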

XOR-zeroing is good only on x86. (See "What is the best way to set a register to zero in x86 assembly: xor, mov or and?" for details on why.) None of those reasons apply on ARM: mov reg, #0 is the same size in machine code as eor reg,reg,reg, so there was no historical reason to support EOR as a "zeroing idiom" that's special cased by modern CPUs.

(This is true even in Thumb code, although in that case you want movs reg, #0 for the smaller encoding, at least with r0-r7. r8-r14 need a 4-byte Thumb2 encoding regardless of setting flags or not.)

In fact an ARM CPU isn't even architecturally allowed to optimize eor dst, same,same to break the false dependency, because memory dependency-ordering rules require EOR and other operations to carry a dependency. (e.g. for using the result of a std::memory_order_consume load.) Not that they'd bother spending transistors and power on it, since there's no reason for ARM machine code to use that in the first place when mov reg, #0 works perfectly well.

So eor r10, r10, r10 is clearly worse than mov r10, #0.

Never use it unless you want a 0 that has a dependency on the old value of R10. If you don't know what that means, you don't want it; it would only be useful in multithreaded code on a load result like a data_ready flag, or in microbenchmark experiments to test out-of-order scheduling, or latency vs. throughput by generating a constant value with a data dependency on some result.

On x86 it saved a byte of machine-code size vs. mov ax, 0, and 3 bytes in 32-bit mode, so real world code used it everywhere. Later CPUs evolved to make it still efficient even with out-of-order execution, where reading the old value of the register as an input would otherwise be a problem. (Unlike with mov reg, 0 which we expect not to have a false dependency even without any special support. mov is always dependency-breaking; the special casing of xor same,same on x86 merely makes it equal in that way. xor-zeroing is better in other ways on x86.)

This "tutorial" was clearly written as a learning exercise by another beginner (which is common for random tutorials you find on the Internet; it's a lot of work to write a good one).

It's not an example of good, efficient code, given that bug (failing to zero the loop counter) and the two useless b next_instruction instructions. Execution falls through to the next instruction anyway, even without the b loop or beq return.
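To see why the missing zeroing matters, here is a rough C model of the question's loop. It is only a sketch: the function names are hypothetical, start stands for whatever r10 happened to hold on entry, and limit stands for r4. Because the loop exits only on equality, a stale counter that is already past the limit has to wrap around 2^32 before the loop ends.

```c
#include <stdint.h>

/* Hypothetical C model of the question's loop (not the tutorial's code):
   step the counter from an arbitrary starting value until it equals limit. */
uint64_t simulate_loop(uint32_t start, uint32_t limit)
{
    uint32_t counter = start;        /* whatever r10 held on entry */
    uint64_t iterations = 0;
    do {
        counter += 1;                /* add r10, r10, #1 (wraps modulo 2^32) */
        iterations += 1;
    } while (counter != limit);      /* cmp r10, r4 / bne loop */
    return iterations;
}

/* Closed form of the same count, without actually spinning. */
uint64_t predicted_iterations(uint32_t start, uint32_t limit)
{
    uint32_t distance = limit - start;   /* unsigned subtraction wraps */
    return distance != 0 ? distance : (UINT64_C(1) << 32);
}
```

With start = 0 and limit = 1000 the loop runs 1000 times as intended. With start = 5 and limit = 3 it runs 4294967294 times, which on real hardware would look like a blink that never happens.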

Most conditional branches should just be a compare and one branch, with the other path of execution being the fall-through. It's somewhat of an anti-pattern for beginner code to put another branch with the opposite condition one after the other. Or to make the bottom of a loop a while(1) { if(cond)break } instead of just do{}while(cond); - in your loop at least the useless branch is outside the loop. But it's a delay loop that exists only to waste time anyway, so really it's just wasting code size and changing the cycles-per-count delay factor.
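In C terms, the two loop shapes look roughly like the sketch below. The function names are illustrative only. An optimizing compiler will usually turn both into the same machine code, so the difference matters mainly when you write the asm by hand and each source-level branch becomes a real instruction.

```c
#include <stdint.h>

/* Shape to avoid when hand-translating to asm: the exit test is a
   separate "if ... break" inside an unconditional loop, which maps to
   a conditional branch plus an extra unconditional branch back to the top. */
uint32_t count_with_break(uint32_t limit)
{
    uint32_t counter = 0;
    while (1) {
        counter++;
        if (counter == limit)
            break;
    }
    return counter;
}

/* Preferred shape: the loop condition sits at the bottom, which maps to
   a single compare and one conditional branch, then falls through. */
uint32_t count_do_while(uint32_t limit)
{
    uint32_t counter = 0;
    do {
        counter++;
    } while (counter != limit);
    return counter;
}
```

Both functions count up to limit and return the same value; only the placement of the exit test differs.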

If you need execution to go somewhere else in both cases (i.e. both possible targets are after other code that should fall through into it), then the second branch should be an unconditional b. And you should never write a branch that jumps to the next instruction in source order, because execution would go there anyway even if there was no branch.

Thanks man. Learning assembly is really confusing after using only higher level programming languages — jack07Code, Aug 19 '23 at 13:58
@jack07Code, start with what you already know from other languages, and see how the same is done in assembly: for, while loops; if-then/if-then-else; arrays and indexing; pointers; function calls. Each of those has a relatively direct and simple translation in assembly. — Erik Eidt, Aug 19 '23 at 15:37
@jack07Code: As a follow-up to Erik's comment, see [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) for how to write simple C functions that compile to just the asm you want to see. — Peter Cordes, Aug 19 '23 at 18:45

artless noise · Answer 2 · 2023-08-19T15:15:23.960

I think eor vs. mov is a red herring. You are clearing r10 at the start of the routine and at the end of the routine. According to the ABI, r10 is a callee-saved register. You cannot know that r10 will be zero upon returning. Just move the mov.

Here is a routine that can be called from 'C'.

@ Put the count in r0 and count down.
delay:
    @ You can add 'nop' instructions here to increase the loop time.
    subs r0, r0, #1   @ subtract and set condition codes
    bne delay         @ branch back if the result is not zero
    bx lr             @ return to caller

mov Rx, #0 and eor Rx, Rx, Rx are functionally equivalent in that Rx is zero afterwards. Timing, condition codes and other things may differ. But this is unlikely to be why your delay does not work.

It can be called from 'C' like delay(20);. If your entire code base is in assembler, it is likely that some register is clobbered elsewhere and you need to show the complete example (or give a link to the tutorial).
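On the C side, the routine would need a prototype such as void delay(uint32_t count);, assuming the standard ARM calling convention where the first integer argument arrives in r0, and the label would have to be exported (for example with .global delay in GNU as). The stand-in below is a hypothetical C model of the same count-down structure, useful only for reasoning about how many passes the loop makes. The name delay_model is not part of the answer's code.

```c
#include <stdint.h>

/* Hypothetical C model of the count-down routine: returns how many
   times the loop body ran for a given starting count. */
uint32_t delay_model(uint32_t count)
{
    uint32_t passes = 0;
    do {
        count -= 1;           /* subs r0, r0, #1 */
        passes += 1;
    } while (count != 0);     /* bne delay */
    return passes;
}
```

A count of 20 gives 20 passes. Note that a count of 0 is not "no delay": the first subtraction wraps to 0xFFFFFFFF, so the loop runs 2^32 times before reaching zero.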

There are better examples of a delay that make the time constant (branch vs no branch), but this example is sufficient for learning.

Oldtimers code is probably better to read than whatever tutorial you are looking at. https://github.com/dwelch67 [DaveSpace](https://www.davespace.co.uk/arm/introduction-to-arm/) is also a good resource. [This book is cheap](https://www.thriftbooks.com/w/arm-architecture-reference-manual-2nd-edition/334876/) and very good. All different daves. — artless noise, Aug 19 '23 at 15:21
