Count of negative numbers in assembly

Question

dosseg
.model small
.stack 100h
.data
    array db -1, -2, -3, -4, 1,2, 3, -5
.code
main PROC 
    mov ax, @data
    mov ds, ax
    xor ax, ax
    xor dx, dx ; reset dx 
    lea si, array
    mov cx, 8
    back: 
      mov bl, [si]
      cmp al, bl
      jc continue ; carry will be generated if number in bl is positive
      inc dx
      continue: 
        inc si
        clc
    loop back
    mov ah, 4ch
    int 21h
main ENDP 
end main

I wrote the above program to find the number of negative integers in an array.
Debugging showed that when SI is pointing at -1, the carry flag becomes 1 but it should not as the value at that instant in BL is FFh (negative) and in AL is 00h, so subtracting negative number from 0 should not generate a carry. What am I doing wrong?

Edit: I replaced the erroneous part with :

 test bl, bl 
 jns continue

and now it works as expected but I still don't know why the cmp method did not work.

`cmp` method didn't work becase you've chosen incorrect predicate which followed the compare instruction (`jc` instead of `jle`). — vitsoft, Apr 29 '22 at 19:22
@vitsoft jc = jump if carry, so the subtraction should generate a carry if bl has a positive value; shouldn't it? — learner, Apr 29 '22 at 19:30
Yes, it does set CF, if by *positive value* you mean values between 0x01 and 0xFF. Values between 0x80 and 0xFF are treated as **unsigned positive** when the compare is followed by `jc`, `ja`, `jb` . When you want to interpret them as **signed**, use `js`, `jg`, `jl`. — vitsoft, Apr 29 '22 at 21:06
The fun way since you have exactly 8 bytes: `movq xmm0, qword ptr [array]` (load 8 bytes) / `pmovmskb eax, xmm0` (pack the sign bit of each byte into an integer register) / `popcnt eax,eax` (count set bits). Would also work easily for 4 or 16 byte arrays, or for other compile-time-constant sizes do 2 loads and shift out overlapping bytes. With more you'd want to start accumulating counts in vector regs, like `pcmpgtb` to compare for `0 > x` / `psubb xmm1, xmm0` to do `total -= (0 or -1)`, up to 255 iterations of 16 bytes. Then accumulate with `psadbw` against zero. — Peter Cordes, Apr 30 '22 at 07:32

vitsoft · Answer 1 · 2022-04-29T19:17:20.700

When you compare al=0 with bl, carry flag (alias Below flag) will be set for any value in bl except for bl=0, because 0 is below any unsigned number in the range 0x01..0xFF.
Your array contains 8bit signed integer numbers. When we compare signed numbers, instead of adverbs below|above we use lower|greater which take into account Signum flag and Overflow flag.

Instead of jc continue ; carry will be generated if number in bl is positive use
jle continue ; Jump if al=bl or if SF<>OF.

Better readable solution is to replace cmp al,bl with
test bl,bl
jns continue ; Skip incrementing dx when signum of bl is not set.

See also Jcc. You might output the result from DX using the returned errorlevel, just mov al,dl before int 21h.

Peter Cordes · Answer 2 · 2022-05-01T09:09:45.630

If you just want to branch, use signed or signed-compare conditions

test reg,reg / jns non_negative (not sign-bit-set) or jnl non_negative (not less-than) are equivalent after a compare with zero.

That uses the FLAGS and conditions for their normal semantic meaning, i.e. doing a normal signed compare.
(test same,same is equivalent to cmp against zero, always clearing OF and CF, and is a well-known optimization for cmp reg, 0)

What you're doing doesn't set CF in a way that reflects the sign-bit, so a jc (jump if CF set) isn't useful. You're counting non-zero numbers, ones where 0U < (unsigned)x is true.

Getting the carry flag set according to the MSB

It's only interesting to get your condition into CF if you're going to take advantage of that
by using adc dx, 0 or sbb dx, -1 to conditionally increment DX (when CF is 1 or 0, respectively.)

The sbb version is like dx -= -1 + CF so CF either cancels out the -1, or you subtract -1, i.e. add 1.

One way to get CF set according to the sign bit of a byte is simply to shift it out, e.g. shl bl, 1, if you don't mind destroying the value in BL. Equivalently, add bl,bl is also a 2-byte instruction but can run on more execution units on modern CPUs. (They both set FLAGS the same way, including CF).

It's not possible with a compare against zero. 0 - x always has a borrow (CF=1) for any non-zero x, and x - 0 never has carry-out.

Without modifying the register value, it is possible with cmp, though: 0x7f - x has unsigned wrapping (i.e. borrow output that sets CF) for x>=0x80 unsigned. i.e. for values with their MSB set.

   xor dx, dx              ; count = 0
   mov si, OFFSET array    ; LEA takes more bytes than mov-immediate.  Never use LEA without a register, except for x86-64 RIP-relative

;;;  The interesting part
   mov  al, 0x7f           ; outside the loop
back:                      ; do {
   cmp  al, [si]             ; CF = 0x7F <(unsigned)[SI].  i.e. MSB set in [si]
   adc  dx, 0                ; count negative values
;;;  then the rest of the loop

   inc  si
   cmp  si, OFFSET array+8   ; the LOOP instruction isn't fast on most modern CPUs, and we're hard-coding the array length anyway.  Or just put a label at the end of it and use that.
   jne  back                ; }while(p != endp)

You don't need clc in this or your version. CF isn't "sticky"; anything that updates its value sets it to 0 or 1 regardless of the old value. And it's not an input for cmp.

We can't set CF=1 for bl < 0 (aka bl >= 0x80U) with cmp bl, constant, unfortunately. It only works the way you're doing it, setting another register to compare against. (cmp reg, 123 exists, cmp 123,reg doesn't; most 2-operand instructions modify their destination and wouldn't make sense with an immediate destination, so it would be a special case to have yet another opcode for cmp in the other direction.)

But you can do cmp bl, 0x80 to clear CF when bl < 0x80, i.e. when its sign bit isn't set.

   cmp  byte ptr [si], 0x80        ; CF = [si] < (unsigned)0x80, i.e. non-negative
   sbb  dx, -1                     ; count when CF=0, negative values

Loading the value into a register with mov bl, [si] can be helpful for debugging, making it show up in your debugger's window of registers instead of having to examine memory. But that's not necessary; cmp works with reg or memory operands (or an immediate), saving an instruction.

As a further optimization for code-size inside the loop, scasb is equivalent to cmp al, es:[di] / inc di (but the inc part doesn't set FLAGS.) And it's actually dec di if DF is set, so you'd want cld somewhere in your program before a loop using "string" instructions to make sure they go in the forward direction.

Using scasb means you need to use AL for that. Without scasb, you could count into AL inside the loop, where it could be the exit status for your DOS call. (Perhaps that's why you were trying to use AL=0, if you wanted to exit(0) instead of returning a value.)

scasb isn't particularly fast on modern CPUs, but it is on real 8086; so is the loop instruction, because they're both compact code-size. loop is a special-case optimization for dec cx/jnz (but also without affecting FLAGS).

Or with 386 instructions, bt word ptr [si], 7 to Bit Test that bit, putting the result in CF where you can add dx, 0. bt is slow on modern CPUs with bt mem, reg (like 10 uops) because it can index outside the word indexed by the addressing mode. So it would be less efficient put bt word ptr [array], cx in a loop with cx initially = 7 and incrementing with add cx, 8 inside the loop. But that would work.

bt is not too bad with bt mem, imm, only 2 uops on most modern Intel and 1 on some AMD (https://uops.info/). It's only a single uop for bt reg, imm or bt reg,reg, like cmp, if you want to load first. (It can't macro-fuse with branches into a single uops, so if branching instead of adc, a cmp/jle would be more efficient as well as more readable.) On AMD, bts/btr/btc to also modify the bit are slower than bt even for reg,reg, decoding to extra uops.

SSE2 + popcnt to check 4, 8, or 16 bytes at once

The extra fun way, since you have exactly 8 bytes, uses SSE2 and popcnt. (Yes this can work in 16-bit real mode, unlike AVX. In a bootloader and maybe DOS you'd have to manually enable the control-register bits that make SSE instructions not fault. Of course it only works on CPUs with popcnt, like Nehalem and later from 2008 or so, otherwise use pcmpgtb / psadbw / movq for just SSE2, or SSE1 using MMX registers.)

  movq      xmm0, qword ptr [array]  ; load 8 bytes (zero-extending to a 16-byte XMM reg)
  pmovmskb  ax, xmm0                 ; pack the sign bit of each byte into an integer reg
  popcnt    ax, ax                   ; count set bits = sign bits of the bytes

Would also work easily for 4 or 16 byte arrays, or for other compile-time-constant sizes, do 2 loads and shift out overlapping bytes.

For other element sizes, there's movmskps (dword) and movmskpd (qword)

With a larger array, you'd want to start accumulating counts in vector regs, like pcmpgtb to compare for 0 > x / psubb xmm1, xmm0 to do total -= (0 or -1), up to 255 iterations of 16 bytes. Then accumulate with psadbw against zero. Same problem as How to count character occurrences using SIMD but replacing pcmpeqb with pcmpgtb.

Count of negative numbers in assembly

2 Answers2

If you just want to branch, use signed or signed-compare conditions

Getting the carry flag set according to the MSB

SSE2 + popcnt to check 4, 8, or 16 bytes at once