Using 2x idiv has a fundamental problem: we need the 2nd division to produce the low half of the quotient, which is unsigned and can be anything from 0 to 0xffff.
Only the highest word of a multi-word integer contains the sign bit, all bits below that have positive place-value. idiv's quotient range is -2^15 .. 2^15-1, not 0 .. 65535. Yes, idiv can produce all required values, but not from inputs we can get from simple fixups of the first division results. For example, 0:ffff / 1 will result in a #DE exception with idiv because the quotient doesn't fit in a signed 16-bit integer.
So the second division instruction has to be div, using the absolute value of the divisor, and an appropriate high half. (div requires both its inputs to be unsigned, so a signed remainder from a first idiv would be a problem, too.)
It might still be possible to use idiv for the first division, but only with fixups for its results before div, on top of still having to take the absolute value of the divisor and first-remainder to feed unsigned div. It's an interesting possibility, but in practice it's going to be cheaper to save and reapply signs around an unsigned division.
As @Martin points out, the first division of +1 / -1 with naive idiv gives the wrong high half quotient (0 / -1 = 0 not -1), and the wrong input for the 2nd division (0 % -1 = 0, not -1). TODO: figure out what fixups would actually be needed. Maybe just a conditional +-1, but we know the magnitude of the remainder can't get larger than the divisor, because high_half < divisor is necessary and sufficient for div to not fault.
Your -131,076/2 = -2 is (perhaps by coincidence) only off by 1 in one half of its result:
it should be 0xfffefffe = -2:-2 not -1:-2.
Optimized version of @rkhb's function, with DIV32 inlined.
We record input signs and then do unsigned division on the absolute values, and restore sign later. (If remainder sign isn't needed, we can simplify; the quotient sign only depends on xor dividend,divisor)
Or if the dividend is small enough, we can use one idiv. We have to avoid the -2^15 / -1 overflow case, though, so the fast check for DX being the sign extension of AX not only misses some safe cases (with larger divisors), but tries that unsafe case. If small numbers are common (like is the case in most computer programs), this fast test based on cwd is still maybe a good idea, with another test after the absolute-value calculation.
Branches are cheap-ish on 286, so I mostly kept the branching instead of using branchless abs(). (e.g. for a single register: with cdq (or sar reg,15) / xor / sub, like compilers make, based on the 2's complement identity that -x = ~x + 1). And mov/neg/cmovl of course isn't available until P6-family. If you need compat with 286 but mostly care about performance on modern CPUs, you might make different choices. But it turns out that a 32-bit branchless ABS is smaller code size than branching. However, it's probably slower than branchy for positive inputs, where some instructions would have been skipped. Assembler 8086 divide 32 bit number in 16 bit number has an interesting idea of creating 0/-1 integers for both dividend and divisor, and then for later re-applying the signs you can XOR those together and use the same XOR/SUB 2's complement bithack to sign-flip the results or not.
Style: local labels (within the function) prefixed with @@. I think this is normal for TASM, like NASM .label local labels.
; signed 32/16 => 32-bit division, using 32/16 => 16-bit division as the building block
; clobbers: CX, SI
IDIV32 PROC ; DX:AX / BX = DX/AX rem BX
;global IDIV32_16 ; for testing with NASM under Linux
;IDIV32_16:
; 99 / 5 = 19 rem 4
; 99 / -5 = -19 rem 4
; -99 / 5 = -19 rem -4
; -99 / -5 = 19 rem -4
mov cx, dx ; save high half before destroying it
;;; Check for simple case
cwd ; sign-extend AX into DX:AX
cmp cx, dx ; was it already correctly sign-extended?
jne @@dividend_32bit
; BUG: bx=-1 AX=0x8000 overflows with #DE
; also, this check rejects larger dividends with larger divisors
idiv bx
mov bx, dx
cwd
ret
;;; Full slow case: divide CX:AX by BX
@@dividend_32bit:
mov si, ax ; save low half
mov ax, cx ; high half to AX for first div
; CH holds dividend sign
mov cl, bh ; CL holds divisor sign
;;; absolute value of inputs
; dividend in AX:SI
cwd ; 0 or -1
xor si, dx ; flip all the bits (or not)
xor ax, dx
sub si, dx ; 2's complement identity: -x = ~x - (-1)
sbb ax, dx ; AX:SI = abs(dividend)
test bx, bx ; abs(divisor)
jnl @@abs_divisor
neg bx
@@abs_divisor:
;;; Unsigned division of absolute values
xor dx, dx
div bx ; high half / divisor
; dx = remainder = high half for next division
xchg ax, si
div bx
;;; absolute result: rem=DX quot=SI:AX
mov bx, dx
mov dx, si
;;; Then apply signs to the unsigned results
test cx,cx ; remainder gets sign of dividend
jns @@remainder_nonnegative
neg bx
@@remainder_nonnegative:
xor cl, ch ; quotient is negative if opposite signs
jns @@quotient_nonnegative
neg dx
neg ax ; subtract DX:AX from 0
sbb dx, 0 ; with carry
@@quotient_nonnegative:
ret
IDIV32 ENDP
Optimizations:
simpler sign-saving and sign-testing, using x86's built-in Sign Flag, set from the MSB of a result, and the js jump if SF==1. Avoids shifting the sign bit down to the bottom of an 8-bit register. Testing for same-signs can be done with xor/jns, because same signs will "cancel" and leave SF=0 whether it was both-0 or both-1. (In general, XOR can be used for comparison for equality, but it's usually only useful that way for bitwise cases where you care about one bit but not others).
Avoid writing CH by itself, for the benefit of modern Intel CPUs that do partial-register renaming for that case. This function never gets CH renamed separately from the rest of ECX. (On older CPUs like 286, there's no downside to mov cx,dx vs. mov ch,dh). We also avoid reading high-8 partial registers (e.g. test cx,cx instead of test ch,ch) because that has higher latency on recent Intel Sandybridge-family CPUs. (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent). On P6-family, writing low-8 partial registers will rename them separately from the full register, so there it's probably best to just read 8-bit partial regs after writing any.
Of course, on modern CPUs 16-bit registers like cx are partial registers, even in 16-bit mode (because 32-bit registers are available there), so even mov cx,dx has a false dependency on the old value of ECX.
On 386+
Obviously on a 386+ where 32-bit registers / operand-size are available, you would use that even in 16-bit mode:
;; i386 version
;; inputs: DX:AX / BX
shl edx, 16
mov dx, ax ; pack DX:AX into EDX
mov eax, edx
movsx ebx, bx ; sign-extend the inputs to 32 bit EBX
cdq ; and 64-bit EDX:EAX
idiv ebx
; results: quotient in EAX, remainder in EDX
mov ebx, edx ; remainder -> bx
mov edx, eax
sar edx, 16 ; extract high half of quotient to DX
;; result: quotient= DX:AX, remainder = BX
This can #DE from BX=0, or on overflow from DX:AX=-2^31 and BX=-1 (LONG_MIN/-1)
Test harness:
NASM wrapper to call from 32-bit mode
%if __BITS__ = 32
global IDIV32
IDIV32:
push esi
push ebx
push edi ; not actually clobbered in this version
movzx eax, word [esp+4 + 12]
movzx edx, word [esp+6 + 12]
movzx ebx, word [esp+8 + 12]
call IDIV32_16
shl edx, 16
mov dx, ax
mov eax, edx
movsx edx, bx ; pack outputs into EDX:EAX "int64_t"
pop edi
pop ebx
pop esi
ret
%endif
C program, compile as 32-bit and link with the asm:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
// returns quotient in the low half, remainder in the high half (sign extended)
int64_t IDIV32(int32_t dxax, int16_t bx);
static int test(int a, short b) {
// printf("\ntest %d / %d\n", a, b);
int64_t result = IDIV32(a,b);
int testrem = result>>32;
int testquot = result;
if (b==0 || (a==INT_MIN && b==-1)) {
printf("successfully called with inputs which overflow in C\n"
"%d/%d gave us %d rem %d\n",
a,b, testquot, testrem);
return 1;
}
int goodquot = a/b, goodrem = a%b;
if (goodquot != testquot || goodrem != testrem) {
printf("%d/%d = %d rem %d\t but we got %d rem %d\n",
a,b, goodquot, goodrem, testquot, testrem);
printf("%08x/%04hx = %08x rem %04hx\t but we got %08x rem %04hx\n"
"%s quotient, %s remainder\n",
a,b, goodquot, goodrem, testquot, testrem,
goodquot == testquot ? "good" : "bad",
goodrem == testrem ? "good" : "bad");
return 0;
}
return 1;
}
int main(int argc, char*argv[])
{
int a=1234, b=1;
if(argc>=2) a = strtoll(argv[1], NULL, 0); // 0x80000000 becomes INT_MIN instead of saturating to INT_MAX in 32-bit conversion
if(argc>=3) b = strtoll(argv[2], NULL, 0);
test(a, b);
test(a, -b);
test(-a, b);
test(-a, -b);
if(argc>=4) {
int step=strtoll(argv[3], NULL, 0);
while ( (a+=step) >= 0x7ffe) { // don't loop through the single-idiv fast path
// printf("testing %d / %d\n", a,b);
test(a, b);
test(-a, -b);
test(a, -b);
test(-a, b);
}
return 0;
}
}
(This is sloppy between int and int32_t because I only care about it running on x86 Linux where that's the same type.)
Compile with
nasm -felf32 div32-16.asm &&
gcc -g -m32 -Wall -O3 -march=native -fno-pie -no-pie div32-test.c div32-16.o
Run with ./a.out 131076 -2 -1 to test all dividends from that to 0x7ffe (step=-1) with divisor=-2. (For all combinations of -a / -b, a / -b etc.)
I didn't do nested loops for quotient and divisor; you could do that with the shell. I also didn't do anything clever for testing some dividends around the max and some near the bottom of the range.