Computing the factorial of 10 using 8086 assembly

Question

I have been trying to solve this using assembly language. The thing is I cannot store the 10! In al and my code works for finding factorial of 5. How do I store my result of 10! In a register? When I find factorial of 5 I can see the result clearly in al because 120 can be stored in al.

Any help would be appreciated.

Here's my code for 5!

 org 100h
.DATA
ANS DB ? 
.CODE
MAIN PROC
    MOV AX,@DATA
    MOV DS,AX
    MOV AL,5                     
    MOV CL,4
    MOV BL,AL
    SUB BL,1
    L:
    MUL BL
    SUB  BL,1
    LOOP L

    MOV ANS,AL
    END MAIN
ret

Well, indeed, 10! cannot be represented by a 16-bit-wide register. You need to either use some sort of 32-bit-related facilities of 8086 (if the ISA has any), or alternatively - compute the lower and the higher 16-bits separately, in two registers. — einpoklum, Feb 01 '20 at 09:20
10! = 0x375f00, and even 9! is 0x58980, so you couldn't just loop upward. But 8! is only `0x9d80` so you could `inc bx` / `mul bx` inside a loop that ends with `DX:AX = 0x58980` as the output from `9 * 0x9d80` with a 16-bit `mul`. So you can do that and then you're one extended-precision multiply away from the right answer if you loop as far as you can with `inc bx` / `mul bx`. You can do the times 10 as `(x << 1) + (x<<3)`. e.g. see discussion in comments on [Shift right by half a bit](//codegolf.stackexchange.com/posts/comments/472043) for a 64-bit (in 16-bit mode) multiply by 10. — Peter Cordes, Feb 01 '20 at 09:32
BTW, your current code is pointlessly complicated, and if you need large number obviously you should be using full-width registers. If you start with CX = AX-1, a loop that does `mul cx` / `loop` would work for small inputs. No need to use another down-counter as well. (But like I said, you want the last multiply to be by a larger number so the double-width result of the last `mul` might not overflow.) — Peter Cordes, Feb 01 '20 at 09:35
@MarkPlotnick: The answers there are over-complicated because they avoid the hardware `mul` instruction for no apparent reason. A 16 x 32 => 32-bit product on x86 only takes 2 `mul` instructions for low * low and low * high, and one add. (And more `mov` or `xchg` instructions because we don't have `imul reg, reg` for the high half.) — Peter Cordes, Feb 01 '20 at 09:37
@PeterCordes OK, I've deleted the comment with the link to that question/answer. — Mark Plotnick, Feb 01 '20 at 09:47
@MarkPlotnick: I looked but didn't have much luck finding a 32x32 => 32 Q&A for 8086, only 64x64 => 64 or 128 for 32-bit mode. Just for fun I'm writing a `mul` answer for [Multiplying 32 bit two numbers on 8086 microprocessor](//stackoverflow.com/q/29246857) — Peter Cordes, Feb 01 '20 at 09:49

rkhb · Answer 1 · 2020-02-01T13:08:39.727

I tried to explain the algorithm to multiply a dword by a word in a 16-bit environment here: https://stackoverflow.com/a/28683170/3512216

An implementation that computes 10! and works also in EMU8086:

.MODEL SMALL
.STACK 1000h

.DATA
decstr DB 16 DUP ('$')                  ; String is $-terminated

.CODE

main PROC
    mov ax, @DATA                       ; Initialize DS
    mov ds, ax

    mov bx, 10                          ; Factorial 10! = 3.628.800
    xor dx, dx                          ; DX:AX=1 (first multiplicand)
    mov ax, 1                           ; Begin with 1

    ; for (dx:ax = 1, cx = 2; cx <= 10; cx++)
    mov cx, 2                           ; Incrementing multiplicator
    L1:
    call mul_dword_word                 ; DX:AX * CX -> DX:AX
    inc cx
    cmp cx, bx
    jbe L1                              ; While cx <= 10

    ; Print result
    mov di, OFFSET decstr
    call dword_to_dec
    mov dx, OFFSET decstr
    mov ah, 9
    int 21h

    ; Exit
    mov ax, 4C00h
    int 21h
main ENDP

mul_dword_word PROC                     ; DX:AX multiplicand, CX multiplier
    push dx

    mul cx                              ; AX * CX -> DX:AX
    mov si, dx                          ; Store high result
    mov di, ax                          ; Low result won't be changed anymore

    pop ax                              ; High word
    mul cx                              ; AX * CX -> DX:AX
    add ax, si                          ; Add high result from last mul to low result here
    adc dx, 0

    mov si, dx                          ; SI:DX:AX return value
    mov dx, ax
    mov ax, di
    ret                                 ; RET: SI:DX:AX result
mul_dword_word ENDP

dword_to_dec PROC                       ; ARG DX:AX DWORD, DI: offset of string

    mov cs:target, di
    mov si, ax
    mov di, dx

    ; First Loop: get digits and push them
    mov cs:counter, 0
    mov bx, 10
    LL1:
    inc cs:counter
    xor dx, dx
    mov ax, di                          ; High WORD
    mov cx, ax
    div bx                              ; DX:AX / BX -> AX Remainder DX
    mov di, ax                          ; Store new high word
    mul bx                              ; AX * BX -> DX:AX
    sub cx, ax                          ; sub highest CX-divisible value

    mov dx, cx
    mov ax, si                          ; Low WORD
    div bx                              ; DX:AX / BX -> AX Remainder DX
    or dl, 30h                          ; Convert remainder to ASCII
    push dx                             ; Store remainder
    mov si, ax                          ; Store new low WORD

    or ax, di                           ; Anything more to process?
    jnz LL1                             ; yes: jump to LL1 above

    ; Second Loop: get back digits in reversed order
    mov di, cs:target
    mov cx, cs:counter
    LL2:
    pop ax
    mov [di], al
    inc di
    loop LL2
    mov BYTE PTR [di], '$'              ; Terminator for INT 21h/09h

    ret
    counter dw 0
    target dw 0
dword_to_dec ENDP

You don't need a 48-bit product to hold `10!`. And taking an arg in DX:AX is actually *less* convenient. I found when writing [Multiplying 32 bit two numbers on 8086 microprocessor](//stackoverflow.com/a/60015995) that it seems easier to do the low x high cross product(s) first so you can do the low x low last. For your loop I think that would save on `mov` instructions, and you already produce the result in regs other than si:dx:ax so you need 3 `mov` at the end which you could avoid with a different choice. — Peter Cordes, Feb 01 '20 at 13:23
@PeterCordes: `mul_dword_word` is a reusable (and reused by me) function. 32-bit multiplied by 16-bit can result in a 48-bit integer. It would not be serious if the multiplication returns only a part of the possible results. Of course the function can be trimmed for the respective purpose - but that should be the task of the OP. At the moment you can compute with it up to 17!. A screen output for an 48-bit number would be interesting. — rkhb, Feb 01 '20 at 15:35
@PeterCordes: The next step would be a BigNumber implementation. According to my feeling it is then easier to start with the lowest values. If you think you have a better algorithm, you are free to post another answer. I myself have found a lot of theory but hardly any useful examples so far. — rkhb, Feb 01 '20 at 15:36

Brendan · Answer 2 · 2021-02-14T17:50:37.413

10! is 10*9*8*7*6*5*4*3*2*1.

For speed you can minimize the size of intermediate values by doing them in "big * little" pairs, like (1*10) * (2*9) * (3*8) * (4*7) * (5*6).

You can describe these pairs as the expression temp = x * (11 - x) (with values of x from 1 to 5). If you know the result for x (e.g. temp = 10 if x == 1) then the result of the next pair can be derived from the result of the previous pair. E.g.:

temp1 = x * (11 - x)

temp2 = (x+1) * (11 - (x+1))
      = (x+1) * (11 - x - 1)
      = x * (11 - x - 1) + (11 - x - 1)
      = x * (11 - x) - x + (11 - x - 1)
      = temp1 - x + (11 - x - 1)
      = temp1 - x - x + 11 -1
      = temp1 - (x << 1) + 10

In other words; if you know that the result of "pair 1" (or 1*10) will be 10, then you can calculate the result of the next pair (or 2*9) by doing (10) - (1<<1) + 10 = 18;, then calculate the result of the next pair by doing (18) - (2<<1) + 10 = 24, then...

Note that there is no multiplication involved for calculating the pairs this way. After calculating the intermediate results from pairs (with add/sub alone), you end up with "10! = 10*18*24*28*30".

For more fun; you can prove that all of these intermediate results will always be even. That means you can cheat and say "10! = 10*18*24*28*30 = 5*9*12*14*15 << (1+1+1+1+1) = 5*9*12*14*15 << 5"; which is important when you're looking at doing more expensive "too big" multiplications later.

Because these intermediate values fit in 8 bits, we can do "pairs of pairs" just with two 16-bit multiplications. "10! = (5*9) * (12*14) * 15 << 5 = 45 * 168 * 15 << 5".

By multiplying the smallest intermediate values the result will still fit in 16 bits. "10! = 45 * 168 * 15 << 5 = (45*15) * 168 << 5 = 675 * 168 << 5".

Now you only need to do one multiplication where the result is 32 bits but both operands are 16 bit. Real mode handles this perfectly fine with a single mul instruction so that is no problem; it's just that the result will be stored in dx:ax. Specifically, for 675 * 168 the result in dx:ax will be 113400 (or 0x1BAF8, or 0x0001 in dx and 0xBAF8 in ax).

This gives us "10! = 113400 << 5". For this it'll be a little awkward on an old "16-bit only" CPU (where you can't just use 32-bit instructions in real mode, including 32-bit mul). The best way for 8086 is probably using another register for a "right shift then or" to get the bits from ax into dx; like:

mov cl,5
mov bx,ax
shl dx,cl
shl ax,cl
mov cl,16-5
shr bx,cl
or dx,bx

For 80186 and later you can do it slightly better using "shift by immediate" to avoid using cl, like shl dx,5, but that isn't supported for 8086.

The result will be 3628800, which is the right answer.

Note that this same approach can be tweaked to handle other factorials - for odd factorials you can ignore the *1 that doesn't actually do anything to ensure the intermediate results from pairs are still even (e.g. "11! = 1 * (2*11) * (3*10) * (4*9) * (5*8) * (6*7) = (2*11) * (3*10) * (4*9) * (5*8) * (6*7)"). For all factorials (n!); the intermediate calculation of pairs can be described as temp2 = temp1 - (x << 1) + (n & (~1)) (I think - didn't actually check); and the final shift will always be n/2 rounded down (e.g. for 11! the shift count will be "(int)11/2 = (int)5.5 = 5).

With 16-bit multiplication alone it should be good up until 15!. For 16! and larger, the same techniques work to minimize the number of 32-bit multiplications needed.

`shld` is a 386plus-only instruction. – ecm Feb 14 '21 at 17:26 — ecm, Feb 14 '21 at 17:26
@ecm: Ah - you're right (thanks!) - fixing – Brendan Feb 14 '21 at 17:43 — Brendan, Feb 14 '21 at 17:43

score -1 · Answer 3 · edited Feb 15 '21 at 06:35

It's because AL, BL, and CL are of 8 bits, but what you require is a 16-bit register. This can be done by replacing all L with their X counterparts. Although doing so won't help in storing 10! (which requires even more than 16-bits), but we can find 8!, for example (i.e. anything which can be stored in 16-bits).

org 100h

.DATA                                                                                                                                                                                                                                                                                                                                                                                                           
ANS DB ?
.CODE



MAIN PROC
  MOV AX, @DATA
  MOV AX, 8
  MOV CX, 7
  MOV BX, AX
  SUB BX, 1

  LOOP_LB:
  MUL BX
  SUB BX, 1
  LOOP LOOP_LB
END MAIN

ret

The screenshot of the output:

The `loop` instruction and a separate CX counter is kind of pointless; you could just use `jnz` to keep looping while BX != 0. Or maybe better, `dec bx` / `cmp bx, 1` / `ja LOOP_LB` to not re-run the loop for BX=1, saving an iteration and avoiding truncating the last DX:AX multiply result to 16-bit. — Peter Cordes, Feb 15 '21 at 01:34

Computing the factorial of 10 using 8086 assembly

3 Answers3