I'm trying to create the application described in the title. Almost everything works fine, except multithreading with the ASM routine. When I multiply using multiple threads, some elements in the first column of the first row of the result matrix are sometimes 0. I have tried everything and I can't find my mistake, so I'd like to ask for your help. Here is the code written in C++:
void multiply(int* resultRow, int* row, int* column, int size)
    {
        int* startCol = column;
        int* start = row;
        for (int i = 0; i < size; i++)
        {
            column = startCol;
            column += i;
            (*resultRow) = 0;
            row = start;
            for (int j = 0; j < size; j++)
            {
                (*resultRow) += ((*row) * (*column));
                row++;
                column += size;
            }
            resultRow++;
        }
    }
This function works fine even with multithreading (resultRow is the address of the i-th row of the result matrix; row and column are the addresses of the row and column in the matrices I want to multiply).
Here is the ASM code:
         .CODE
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
AsmMultiplication PROC loopCount: qword, secondLoopCount: qword, startColAddress : qword, startRowAddress : qword, count : qword, matrixSize : qword
                        ; resultRow in RCX
                        ; rowToMultiply in RDX
                        ; colToMultiply in R8
                        ; size in R9
mov matrixSize, R9
mov loopCount, R9
mov secondLoopCount, R9
mov count, 0
mov R10, RDX
mov R9, RCX
mov startColAddress, R8
mov startRowAddress, R10
loop1:
mov R8, startColAddress             ; column = startColAddress
mov R10, startRowAddress            ; row = startRowAddress
mov RAX, count                      ; |
mov RCX, 4                          ; |
mul RCX                             ; |
add R8, RAX                         ; | column += i
xor RAX, RAX                        ; |
mov [R9], RAX                       ; (*resultRow) = 0
mov RAX, matrixSize                 ; |
mov loopCount, RAX                  ; |
pxor xmm2, xmm2                     ; | preparing for multiplying in loop2
inc count
    
            loop2:
            movq xmm0, qword ptr [R10]          ;move actual row element to vector
            movq xmm1, qword ptr [R8]           ;move actual column element to vector
            pmuludq xmm0, xmm1                  ;multiply vectors
            paddq xmm2, xmm0                    ;add result to third vector
            add R10, 4                          ; row++
            mov RAX, matrixSize                 ; |
            mov RDX, 4                          ; |
            mul RDX                             ; |
            add R8, RAX                         ; | column += size
            mov RDX, loopCount                  ; |
            dec RDX                             ; | decrementing loop counter
            mov loopCount, RDX                  ; | 
            jnz loop2                           ; | if loopCount == 0 break
movq RAX, xmm2                      ; |
mov [R9], RAX                       ; | *resultRow = row * column
add R9, 4                           ; | resultRow++
mov RDX, secondLoopCount            ; |
dec RDX                             ; |
mov secondLoopCount, RDX            ; |
jnz loop1                           ; | if secondLoopCount == 0 break
ret
AsmMultiplication ENDP
end
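A note on operand sizes: the listing above moves data in 8-byte units (`movq ... qword ptr [...]`, `mov [R9], RAX`), while the C++ version reads and writes 4-byte `int`s. The following is a minimal C++ sketch, not part of the original program, of what an 8-byte store to a 4-byte slot does in a row-major matrix on little-endian x86-64: at the last element of a row, the upper four bytes land in the first element of the next row. The helper names `store_qword`, `store_dword`, and `next_row_first_after_store` are hypothetical, and whether this effect is what produces the zeros here is an assumption that would need to be checked in a debugger.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: mimics an 8-byte store such as `mov [R9], RAX`
// to the address of a 4-byte element (little-endian x86-64 assumed).
void store_qword(std::int32_t* dst, std::uint64_t value)
{
    std::memcpy(dst, &value, sizeof value); // touches dst[0] AND dst[1]
}

// Hypothetical helper: mimics a 4-byte store such as `mov dword ptr [R9], EAX`.
void store_dword(std::int32_t* dst, std::uint64_t value)
{
    const std::uint32_t low = static_cast<std::uint32_t>(value);
    std::memcpy(dst, &low, sizeof low);     // touches dst[0] only
}

// Builds a 2 x size row-major matrix filled with `fill`, stores `value`
// into the LAST element of row 0 with the chosen store width, and returns
// the FIRST element of row 1 afterwards.
std::int32_t next_row_first_after_store(int size, bool wide,
                                        std::int32_t fill, std::uint64_t value)
{
    assert(size > 0);
    std::vector<std::int32_t> m(2 * static_cast<std::size_t>(size), fill);
    std::int32_t* lastOfRow0 = m.data() + (size - 1);
    if (wide)
        store_qword(lastOfRow0, value);
    else
        store_dword(lastOfRow0, value);
    return m[size];
}
```

With a small product such as 42, the upper half of the 64-bit value is zero, so the wide store would zero the neighbouring element while the narrow store leaves it alone. In a single-threaded run the neighbouring row is computed afterwards and overwrites that zero; with one thread per row the ordering is no longer guaranteed.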
Here is how I'm using multithreading:
    public void ThreadedFunction(int size, int rows)
            {
                unsafe
                {
                    fixed (int* resultRow = &m3.matrix[rows, 0])
                    fixed (int* rowToMultiply = &m1.matrix[rows, 0])
                    fixed (int* colToMultiply = &m2.matrix[0, 0])
                        if(Asm == false)
                        {
                            MatrixMultiplication.App.multiply(resultRow, rowToMultiply, colToMultiply, size);
                        }
                        else
                        {
                            MatrixMultiplication.App.AsmMultiplication(resultRow, rowToMultiply, colToMultiply, size);
                        }
                        
                }
            }
    ...
 for (int i = 0; i < threadsCount; i++)
                {
                    threads[i] = this.StartTheThread(size, rows);
                    rows++;
Here is a sample result matrix, written to a .txt file, computed with multithreading.
Correct output: [image]
Incorrect output: [image]
I have no idea why the output is sometimes correct and sometimes not; the mistake appears only in the first row of the result matrix. Could anyone explain what's wrong? I know that all n threads use the same rows and columns of the matrices, but then why does the C++ code work fine?
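On the shared inputs: reading the same rows and columns from several threads is not a data race by itself; a race needs at least one thread writing to memory that another thread also accesses. Below is a minimal C++ sketch of the scheme described above (one thread per result row), assuming square row-major matrices held in `std::vector<int>`. The names `multiply_row` and `multiply_threaded` are hypothetical and are not the original `MatrixMultiplication.App` API.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Same computation as the C++ multiply function in the question, written
// with indices: one row of A times the whole matrix B gives one row of C.
void multiply_row(int* resultRow, const int* row, const int* matrixB, int size)
{
    for (int i = 0; i < size; i++) {
        int sum = 0;
        for (int j = 0; j < size; j++)
            sum += row[j] * matrixB[j * size + i];
        resultRow[i] = sum; // 4-byte store, stays inside this thread's row
    }
}

// Hypothetical driver: starts one thread per result row and waits for all
// of them. Inputs are only read; each thread writes a disjoint row of C.
std::vector<int> multiply_threaded(const std::vector<int>& a,
                                   const std::vector<int>& b, int size)
{
    const std::size_t n = static_cast<std::size_t>(size);
    assert(a.size() == n * n && b.size() == n * n);

    std::vector<int> c(n * n, -1);
    std::vector<std::thread> threads;
    for (int r = 0; r < size; r++) {
        threads.emplace_back(multiply_row,
                             c.data() + r * size,  // row r of the result
                             a.data() + r * size,  // row r of A
                             b.data(),             // all of B (read-only)
                             size);
    }
    for (auto& t : threads)
        t.join();
    return c;
}
```

Because every write goes through a 4-byte `int` store inside the thread's own row, the threads never touch each other's output, which is consistent with the C++ version behaving correctly.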
