Does the second way use float32 for division directly? (I hope it does not use float64 as an intermediate dtype)
Yes. You can check that by looking at the code or, more directly, by scanning hardware events, which clearly show that single-precision floating-point arithmetic instructions are executed (at least with Numpy 1.18).
In general, is there a way to do division in a specified dtype?
AFAIK, not directly with Numpy. Type promotion rules always apply. However, it is possible with Numba to perform conversions cell by cell, which is much more efficient than using an intermediate array (costly to allocate and to read/write).
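As a rough illustration of the cell-by-cell approach, here is a minimal sketch. The function name int16_to_float32 and the 32768 scale factor are assumptions made for the example, and the try/except fallback is only there so the snippet still runs (slowly) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for this sketch): run as plain Python without Numba.
    def njit(func):
        return func

@njit
def int16_to_float32(src, dst):
    # Each cell is converted and divided in float32; no temporary array is created.
    scale = np.float32(32768.0)
    for i in range(src.size):
        dst[i] = np.float32(src[i]) / scale

a = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
res = np.empty(a.shape, dtype=np.float32)
int16_to_float32(a, res)
print(res)
```

The output buffer is passed in by the caller, so it can be allocated once and reused across calls.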
Do I need to specify some casting argument?
This is not needed here since there is no loss of precision in this case. Indeed, in the first version, both the input operands and the result are of type float32. In the second version, the type promotion rule is applied automatically and a is implicitly cast to float32 before the division (probably more efficiently than in the first method, since no intermediate array needs to be created). The casting argument lets you control the level of safety here (the default for ufuncs is same_kind): for example, you can set it to no to be sure that no cast occurs (for both operands and the result; an error is raised if a cast is needed). You can see the documentation of can_cast for more information.
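To make the casting rules concrete, here is a small sketch. The array a and the 32768 divisor are assumptions for illustration, since the original snippets are not shown here.

```python
import numpy as np

a = np.array([-32768, 0, 16384], dtype=np.int16)
scale = np.float32(32768)

# int16 -> float32 loses nothing, so it is a "safe" cast.
safe_up = np.can_cast(np.int16, np.float32, casting="safe")
# float32 -> int16 is not even a "same_kind" cast.
same_kind_down = np.can_cast(np.float32, np.int16, casting="same_kind")

# With the default casting, a is promoted to float32 implicitly.
res = np.divide(a, scale)

# With casting="no", the implicit int16 -> float32 cast is refused.
try:
    np.divide(a, scale, casting="no")
    cast_refused = False
except TypeError:
    cast_refused = True

print(safe_up, same_kind_down, res.dtype, cast_refused)
```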
Same question about converting back from [-1.0, 1.0] float32 into int16
Similar answers apply. However, you should pay attention to the type promotion rules, as float32 * int16 -> float32. Thus, the result of the multiplication has to be cast to int16, and a loss of accuracy occurs. As a result, you can use the casting argument to explicitly enable unsafe casts (which are no longer the default) and possibly get better performance.
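A minimal sketch of the reverse conversion follows. The 32767 scale factor is an assumption chosen so that 1.0 does not overflow int16, and the values are simply truncated toward zero by the cast; add rounding or clipping if your data needs it.

```python
import numpy as np

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
out = np.empty(x.shape, dtype=np.int16)

# The product is float32; writing it into an int16 buffer is refused by default.
try:
    np.multiply(x, np.float32(32767), out=out)
    rejected = False
except TypeError:
    rejected = True

# Explicitly allowing the unsafe float32 -> int16 cast makes it work.
np.multiply(x, np.float32(32767), out=out, casting="unsafe")
print(rejected, out)
```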
Notes & Advice:
I advise you to use Numba's @njit to perform the operation efficiently.
Note that modern processors are able to perform such operations very quickly if SIMD instructions are used. Consequently, the memory bandwidth and the cache allocation policy should be the two main limiting factors. Fast conversions can be achieved by preallocating buffers, by avoiding the creation of new temporary arrays, and by avoiding unnecessary copies of (large) arrays.
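For instance, preallocation could look like the following sketch. The buffer name buf and the 32768 divisor are assumptions for the example; the point is only that the out argument lets the result be written into an existing array.

```python
import numpy as np

a = np.array([-32768, -16384, 0, 16384], dtype=np.int16)

# Allocate the output buffer once; it can be reused for every later block of samples.
buf = np.empty(a.shape, dtype=np.float32)

# The result is written directly into buf; no new result array is allocated.
np.divide(a, np.float32(32768), out=buf)
print(buf)
```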