Fused fast conversion from int16 to [-1.0, 1.0] float32 range in NumPy

Question

I'm looking for the fastest and most memory-economical conversion routine from int16 to float32 in NumPy. My use case is conversion of audio samples, so real-world arrays are easily in the 100K-1M element range.

I came up with two ways. The first converts int16 to float32 and then does the division in place. This requires at least two passes over the memory.

The second uses divide directly and requests a float32 result. Theoretically this should make only one pass over the memory, and thus be a bit faster.

My questions:

Does the second way use float32 for division directly? (I hope it does not use float64 as an intermediate dtype)
In general, is there a way to do division in a specified dtype?
Do I need to specify some casting argument?
Same question about converting back from [-1.0, 1.0] float32 into int16

Thanks!

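import numpy

a = numpy.array([1, 2, 3], dtype='int16')

# first: cast to float32, then divide in place (two passes over memory)
b = a.astype(numpy.float32)
c = numpy.divide(b, numpy.float32(32767.0), out=b)

# second: divide directly, requesting a float32 result (ideally one pass)
d = numpy.divide(a, numpy.float32(32767.0), dtype='float32')

print(c, d)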

You don't have to divide the whole array; you can also multiply it by (1/32767), which is a tiny bit faster. I also tried it with Numba: single-threaded it is about the same as NumPy, while the multi-threaded version is about 60% faster for not-too-tiny arrays (1_000_000 elements) and as fast as np.copy(a) — max9111, Aug 13 '20 at 15:02

score 0 · Accepted Answer · answered Aug 13 '20 at 19:12

Does the second way use float32 for division directly? (I hope it does not use float64 as an intermediate dtype)

Yes. You can check that by looking at the code or, more directly, by scanning hardware events, which clearly show that single-precision floating-point arithmetic instructions are executed (at least with NumPy 1.18).
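A quick sanity check from Python, as a minimal sketch: it only inspects the dtypes NumPy reports and the loops registered on the ufunc, so it does not replace looking at hardware events. The variable names are just for illustration.

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int16)
scale = np.float32(32767.0)

# Dtype that NumPy's promotion rules pick for an int16 array and a float32 scalar
promoted = np.result_type(a, scale)
print(promoted)

# Dtype of the actual result when a float32 computation is requested
d = np.divide(a, scale, dtype=np.float32)
print(d.dtype)

# np.divide has a dedicated float32 inner loop ('ff->f' in its type signatures)
has_float32_loop = 'ff->f' in np.divide.types
print(has_float32_loop)
```

If both dtypes come out as float32, no float64 array is involved, which is consistent with the measurement described above.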

In general, is there a way to do division in a specified dtype?

AFAIK, not directly with NumPy. Type promotion rules always apply. However, with Numba it is possible to perform the conversion cell by cell, which is much more efficient than using an intermediate array (costly to allocate and to read/write).

Do I need to specify some casting argument?

This is not needed here since there is no loss of precision in this case. In the first version, both the input operands and the result are of type float32. In the second version, the type promotion rule is applied automatically and a is implicitly cast to float32 before the division (probably more efficiently than in the first method, since no intermediate array needs to be created). The casting argument lets you control the level of safety (for ufuncs the default is same_kind): for example, you can set it to no to make sure that no cast occurs (for both operands and the result; an error is raised if a cast is needed). See the documentation of can_cast for more information.
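To make the casting levels concrete, here is a small sketch using np.can_cast and the casting argument of np.divide. The variable names are made up for illustration, and the exact exception class may differ between NumPy versions (it should be a subclass of TypeError).

```python
import numpy as np

# int16 -> float32 loses nothing, so it counts as a "safe" cast
int16_to_float32 = np.can_cast(np.int16, np.float32)
# float32 -> int16 is not safe, and is only allowed under "unsafe"
float32_to_int16 = np.can_cast(np.float32, np.int16)
float32_to_int16_unsafe = np.can_cast(np.float32, np.int16, casting="unsafe")
print(int16_to_float32, float32_to_int16, float32_to_int16_unsafe)

a = np.array([1, 2, 3], dtype=np.int16)
try:
    # casting="no" forbids even the implicit int16 -> float32 cast of `a`
    np.divide(a, np.float32(32767.0), dtype=np.float32, casting="no")
    cast_was_rejected = False
except TypeError as exc:
    cast_was_rejected = True
    print("rejected:", exc)
```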

Same question about converting back from [-1.0, 1.0] float32 into int16

Similar answers apply. However, you should pay attention to the type promotion rules, since float32 * int16 -> float32. The result of the multiplication therefore has to be cast to int16, which causes a loss of accuracy. You can use the casting argument to enable unsafe casts (now deprecated), possibly with better performance.
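A minimal sketch of the reverse direction, using an explicit astype for the final cast. The clipping step is my own addition (an assumption, not something the question asked for) to keep out-of-range input from wrapping around; note that astype truncates toward zero rather than rounding.

```python
import numpy as np

signal = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)

# float32 * float32 stays in float32; no float64 intermediate is created
scaled = np.multiply(signal, np.float32(32767.0))

# Assumed safety step: clamp in place so values outside [-1, 1] cannot overflow int16
np.clip(scaled, -32767.0, 32767.0, out=scaled)

# Explicit (lossy) cast back to int16; fractional parts are truncated
pcm = scaled.astype(np.int16)
print(pcm, pcm.dtype)
```

If rounding to nearest is preferred, np.rint(scaled, out=scaled) can be applied before the cast.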

Notes & Advice:

I advise you to use Numba's @njit to perform the operation efficiently.
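A rough sketch of what such a fused kernel could look like. The function name s2f_fused is hypothetical, and the fallback decorator is only there so the snippet still runs (slowly) when Numba is not installed. It multiplies by the reciprocal, as suggested in the comments on the question, so results may differ from a true division in the last bit.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumed fallback: a no-op decorator so the sketch runs without Numba
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit
def s2f_fused(samples, out):
    # Cast and scale each element in one pass; no intermediate array is created
    scale = np.float32(1.0 / 32767.0)
    for i in range(samples.shape[0]):
        out[i] = np.float32(samples[i]) * scale
    return out


samples = np.array([-32767, 0, 16384, 32767], dtype=np.int16)
out = np.empty(samples.shape, dtype=np.float32)
s2f_fused(samples, out)
print(out, out.dtype)
```

The output array is passed in by the caller so it can be allocated once and reused across calls.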

Note that modern processors are able to perform such operations very quickly if SIMD instructions are used. Consequently, the memory bandwidth and the cache allocation policy should be the two main limiting factors. Fast conversions can be achieved by preallocating buffers, by avoiding the creation of new temporary arrays, and by avoiding unnecessary copies of (large) arrays.
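In plain NumPy, the preallocation advice could be sketched as follows. The helper s2f_into and the buffer size are hypothetical; as far as I know NumPy casts the int16 input in small internal chunks here rather than materializing a full float32 copy, but treat that as an assumption.

```python
import numpy as np

CHUNK = 4                                     # illustrative size; real audio chunks are much larger
SCALE = np.float32(32767.0)
out_buf = np.empty(CHUNK, dtype=np.float32)   # allocated once, reused for every chunk


def s2f_into(samples, out):
    # Write the float32 result straight into the caller's buffer
    return np.divide(samples, SCALE, out=out)


chunk = np.array([-32767, 0, 16384, 32767], dtype=np.int16)
result = s2f_into(chunk, out_buf)
print(result, result is out_buf)
```

The returned array is the buffer itself, so its contents are overwritten by the next call.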

Thanks for the detailed response! Figuring out how to look at the executed assembly would be very interesting as well. About Numba: does NumPy by itself not do this in SIMD (division + cast)? I'm doing this in the data loading step for neural net training, so caching a buffer is not very easy, since controlling data loader thread-local buffers is currently not very easy in PyTorch, but I agree it's something worth trying for further optimization — Vadim Kantorov, Aug 13 '20 at 23:25
For now I got `f2s_numpy = lambda signal: np.multiply(signal, np.float32(32767), dtype = 'int16')` and `s2f_numpy = lambda signal: np.divide(signal, np.float32(32767), dtype = 'float32')`. Any comments — Vadim Kantorov, Aug 13 '20 at 23:33
NumPy [does not use SIMD instructions for all operations](https://numpy.org/neps/nep-0038-SIMD-optimizations.html), but [it does for simple ones](https://stackoverflow.com/q/44944367) such as division (I checked on my PC and it used SSE although my processor supports AVX). You can analyze NumPy with [`perf`](https://perf.wiki.kernel.org/index.php/Tutorial) on Linux. `s2f_numpy` looks fine but not `f2s_numpy`: I think you need to use `np.multiply(signal, np.float32(32767)).astype('int16')` (sadly not very fast) since `dtype='int16'` is not accepted and `casting='unsafe'` gives wrong results. — Jérôme Richard, Aug 22 '20 at 23:09
Thanks for your correction! I created an issue about this: https://github.com/numpy/numpy/issues/17196 — Vadim Kantorov, Aug 31 '20 at 11:19
