Recently I've discovered a case in which matrix multiplication with NumPy shows very strange performance (at least to me). To illustrate it, I've created an example of such matrices and a simple script that demonstrates the timings. Both can be downloaded from the repo; I don't include the script here because it is of little use without the data.
The script multiplies two pairs of matrices (each pair has the same shapes and dtypes; only the data differs) in different ways, using both the `dot` function and `einsum`. I've noticed several anomalies:
- The first pair (`A * B`) is multiplied much faster than the second one (`C * D`).
- When I convert all matrices to `float64`, the times become the same for both pairs: longer than it took to multiply `A * B`, but shorter than `C * D`.
- These effects persist for both `einsum` (a pure-NumPy implementation, as I understand it) and `dot` (which uses BLAS on my machine).

For the sake of completeness, here is the output of this script on my laptop:
```
With np.dot:
A * B: 0.142910003662 s
C * D: 4.9057161808 s
A * D: 0.20524597168 s
C * B: 4.20220398903 s
A * B (to float32): 0.156805992126 s
C * D (to float32): 5.11792707443 s
A * B (to float64): 0.52608704567 s
C * D (to float64): 0.484733819962 s
A * B (to float64 to float32): 0.255760908127 s
C * D (to float64 to float32): 4.7677090168 s

With einsum:
A * B: 0.489732980728 s
C * D: 7.34477996826 s
A * D: 0.449800014496 s
C * B: 4.05954909325 s
A * B (to float32): 0.411967992783 s
C * D (to float32): 7.32073783875 s
A * B (to float64): 0.80580997467 s
C * D (to float64): 0.808521032333 s
A * B (to float64 to float32): 0.414498090744 s
C * D (to float64 to float32): 7.32472801208 s
```
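For reference, a minimal sketch of the kind of timing harness the script uses. The shapes, dtypes, and random contents below are my assumptions; the actual matrices come from the repo data, and synthetic random matrices like these do not reproduce the slowdown:

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the matrices in the repo: shapes and dtypes
# are assumptions; the real data that triggers the slow case is not here.
A = rng.standard_normal((1000, 1000)).astype(np.float32)
B = rng.standard_normal((1000, 1000)).astype(np.float32)

def bench(label, f):
    """Time one call of f() and print it in the same style as the output above."""
    start = time.time()
    f()
    print("%s: %s s" % (label, time.time() - start))

bench("A * B", lambda: np.dot(A, B))
bench("A * B (to float64)",
      lambda: np.dot(A.astype(np.float64), B.astype(np.float64)))
bench("A * B (einsum)", lambda: np.einsum("ij,jk->ik", A, B))
```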
How can such results be explained, and how can I make `C * D` multiply as fast as `A * B`?