When computing A @ a with numpy, where A is a random N by N matrix and a is a vector of N random elements, the computation time jumps by an order of magnitude at N=100. Is there a particular reason for this? For comparison, the same operation using torch on the CPU shows a much more gradual increase.
I tried it with Python 3.10, 3.9 and 3.7 with the same behavior.
Code used for generating the numpy part of the plot:
import numpy as np
from tqdm.notebook import tqdm
import pandas as pd
import time
import sys
def sym(A):
    # Symmetrize A (return its symmetric part)
    return .5 * (A + A.T)
results = []
for n in tqdm(range(2, 500)):
    for trial_idx in range(10):
        A = sym(np.random.randn(n, n))
        a = np.random.randn(n)        
        
        # perf_counter has better resolution than time.time for short intervals
        t = time.perf_counter()
        for i in range(1000):
            A @ a
        t = time.perf_counter() - t
        results.append({
            'n': n,
            'time': t,
            'method': 'numpy',
        })
results = pd.DataFrame(results)
from matplotlib import pyplot as plt
fig, ax = plt.subplots(1, 1)
ax.semilogy(results.n.unique(), results.groupby('n').time.mean(), label="numpy")
ax.set_title(f'A @ a timings (1000 times)\nPython {sys.version.split(" ")[0]}')
ax.legend()
ax.set_xlabel('n')
ax.set_ylabel('avg. time')
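Since the behavior seems to depend on the BLAS backend numpy is linked against rather than on numpy itself, a quick diagnostic is to check which backend is in use and how many threads its pool has. This is a sketch: threadpoolctl is a separate optional package, so its import is guarded.

```python
import numpy as np

# Show which BLAS/LAPACK implementation this numpy build links against;
# the size threshold behind the timing jump lives in that library.
np.show_config()

# threadpoolctl (a separate package) reports the live thread pools,
# including how many threads the BLAS pool will use. Guard the import
# since it may not be installed.
try:
    from threadpoolctl import threadpool_info
    for pool in threadpool_info():
        print(pool["internal_api"], pool["num_threads"])
except ImportError:
    pass
```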
Update
Adding
import os
os.environ["MKL_NUM_THREADS"] = "1" 
os.environ["NUMEXPR_NUM_THREADS"] = "1" 
os.environ["OMP_NUM_THREADS"] = "1" 
before import numpy gives the expected, more gradual scaling; see this answer for details: https://stackoverflow.com/a/74662135/5043576
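The environment variables above must be set before numpy is imported. As an alternative, threadpoolctl (a separate package, so this is a hedged sketch with a guarded import) can cap the BLAS thread pool at runtime, scoped to a single block:

```python
import numpy as np

A = np.random.randn(200, 200)
a = np.random.randn(200)

try:
    from threadpoolctl import threadpool_limits

    # Cap the BLAS thread pool at one thread for this block only; unlike
    # the environment variables, this works after numpy has been imported.
    with threadpool_limits(limits=1, user_api="blas"):
        result = A @ a
except ImportError:
    # threadpoolctl not installed; fall back to the default thread count.
    result = A @ a

print(result.shape)
```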