I'm trying to improve the performance of my matrix multiplication code, but when I added threads, performance dropped.
First version:
public int[][] calculate(int[][] matriz1, int[][] matriz2, int matrixSize) {
    int[][] matrix = new int[matrixSize][matrixSize];
    for(int i = 0; i < matrixSize; i++){
        for(int k = 0; k < matrixSize; k++){
            for(int j = 0; j < matrixSize; j++){
                matrix[i][j] = matrix[i][j] + matriz1[i][k] * matriz2[k][j];
            }
        }
    }
    return matrix;
}
Second version (it needs java.util.ArrayList, java.util.List and java.util.concurrent.CountDownLatch):
public int[][] calculate(int[][] matriz1, int[][] matriz2, int matrixSize) {
    final int[][] matrix = new int[matrixSize][matrixSize];
    CountDownLatch latchA = new CountDownLatch((int) (Math.pow(matrixSize, 3)));
    List<Thread> threads = new ArrayList<>();
    for (int i = 0; i < matrixSize; i++) {
        // The lambda can only capture an effectively final local, so copy i first.
        final int finalI = i;
        Thread thread1 = new Thread(() -> {
            for (int k = 0; k < matrixSize; k++) {
                for (int j = 0; j < matrixSize; j++) {
                    matrix[finalI][j] = matrix[finalI][j] + matriz1[finalI][k] * matriz2[k][j];
                    latchA.countDown();
                }
            }
        });
        thread1.start();
        threads.add(thread1);
        if (threads.size() % 100 == 0) {
            waitForThreads(threads);
        }
    }
    try {
        latchA.await();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    return matrix;
}
private void waitForThreads(List<Thread> threads) {
    for (Thread thread : threads) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    threads.clear();
}
I also tried moving the work into a separate class that implements the Runnable interface, but performance dropped even further.
The running times of the two versions are:
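For reference, a Runnable-based variant along these lines can be sketched as follows. This is only a minimal sketch, not the code from the repository: the names ParallelMatrixSketch and RowRangeTask are hypothetical, and it assumes square matrices, one thread per available processor, and a contiguous block of rows per thread, with no latch, since joining the threads is enough to wait for the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParallelMatrixSketch {

    // Hypothetical worker: computes only rows [fromRow, toRow) of the result.
    static final class RowRangeTask implements Runnable {
        private final int[][] a;
        private final int[][] b;
        private final int[][] result;
        private final int fromRow;
        private final int toRow;

        RowRangeTask(int[][] a, int[][] b, int[][] result, int fromRow, int toRow) {
            this.a = a;
            this.b = b;
            this.result = result;
            this.fromRow = fromRow;
            this.toRow = toRow;
        }

        @Override
        public void run() {
            int n = b.length;
            // Same i-k-j loop order as the first version, restricted to this task's rows.
            for (int i = fromRow; i < toRow; i++) {
                int[] resultRow = result[i];
                for (int k = 0; k < n; k++) {
                    int aik = a[i][k];
                    int[] bRow = b[k];
                    for (int j = 0; j < n; j++) {
                        resultRow[j] += aik * bRow[j];
                    }
                }
            }
        }
    }

    static int[][] calculate(int[][] matriz1, int[][] matriz2, int matrixSize) {
        int[][] matrix = new int[matrixSize][matrixSize];
        // Assumption: one worker per available processor, never more workers than rows.
        int processors = Runtime.getRuntime().availableProcessors();
        int workers = Math.max(1, Math.min(matrixSize, processors));
        int rowsPerWorker = (matrixSize + workers - 1) / workers; // ceiling division

        List<Thread> threads = new ArrayList<>();
        for (int from = 0; from < matrixSize; from += rowsPerWorker) {
            int to = Math.min(matrixSize, from + rowsPerWorker);
            Thread thread = new Thread(new RowRangeTask(matriz1, matriz2, matrix, from, to));
            thread.start();
            threads.add(thread);
        }
        // join() also makes each worker's writes visible to the calling thread.
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(calculate(a, b, 2))); // prints [[19, 22], [43, 50]]
    }
}
```

Each thread writes to its own rows of the result, so the workers never touch the same element and need no locking. Whether this actually beats the single-threaded version would still have to be measured, especially for small matrices where starting threads may cost more than the multiplication itself.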
First version: 0.0094
Second version: 1.5917
I'm studying how to take advantage of the CPU cache, and so far the first version has performed best of everything I have tried.
The repository is https://github.com/Borges360/matrix-multiplication.
In C, running the loops on multiple threads improves performance a lot.
The explanation for C: https://www.youtube.com/watch?v=o7h_sYMk_oc
 
    