Recently I've discovered a case in which matrix multiplication with NumPy shows very strange performance (at least to me). To illustrate it, I've created an example of such matrices and a simple script that demonstrates the timings. Both can be downloaded from the repo; I don't include the script here because it is of little use without the data.
The script multiplies two pairs of matrices (each pair has the same shapes and dtypes; only the data differs) in different ways, using both the `dot` function and `einsum`. I've noticed several anomalies:
- The first pair (`A * B`) is multiplied much faster than the second one (`C * D`).
- When I convert all matrices to `float64`, the times become the same for both pairs: longer than it took to multiply `A * B`, but shorter than `C * D`.
- These effects persist for both `einsum` (a pure-NumPy implementation, as I understand it) and `dot` (which uses BLAS on my machine).

For the sake of completeness, here is the output of this script on my laptop:
```
With np.dot:
A * B: 0.142910003662 s
C * D: 4.9057161808 s
A * D: 0.20524597168 s
C * B: 4.20220398903 s
A * B (to float32): 0.156805992126 s
C * D (to float32): 5.11792707443 s
A * B (to float64): 0.52608704567 s
C * D (to float64): 0.484733819962 s
A * B (to float64 to float32): 0.255760908127 s
C * D (to float64 to float32): 4.7677090168 s

With einsum:
A * B: 0.489732980728 s
C * D: 7.34477996826 s
A * D: 0.449800014496 s
C * B: 4.05954909325 s
A * B (to float32): 0.411967992783 s
C * D (to float32): 7.32073783875 s
A * B (to float64): 0.80580997467 s
C * D (to float64): 0.808521032333 s
A * B (to float64 to float32): 0.414498090744 s
C * D (to float64 to float32): 7.32472801208 s
```
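For reference, a minimal sketch of the kind of timing harness the script uses. The shapes, dtypes, and random contents below are my assumptions; the actual matrices come from the repo data, and synthetic random matrices like these do not reproduce the slowdown:

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the matrices in the repo: shapes and dtypes
# are assumptions; the real data that triggers the slow case is not here.
A = rng.standard_normal((1000, 1000)).astype(np.float32)
B = rng.standard_normal((1000, 1000)).astype(np.float32)

def bench(label, f):
    """Time one call of f() and print it in the same style as the output above."""
    start = time.time()
    f()
    print("%s: %s s" % (label, time.time() - start))

bench("A * B", lambda: np.dot(A, B))
bench("A * B (to float64)",
      lambda: np.dot(A.astype(np.float64), B.astype(np.float64)))
bench("A * B (einsum)", lambda: np.einsum("ij,jk->ik", A, B))
```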
How can such results be explained, and how can I make `C * D` multiply as fast as `A * B`?