Why did Nvidia put both FP32 and FP64 units in the chip?
I think it's about market penetration, to sell as many chips as possible. Without FP64, scientific researchers couldn't even try a demo of scientifically important GPGPU software that uses FP64 (and even games could use some double precision on occasion). Without FP32, game physics and simulations would be very slow, or the GPU would need a nuclear reactor. Without FP16, there would be no fast neural networks. If there were only FP32, neural-network simulations would run at half speed (compared to FP16), and some FP64 summations wouldn't work at all.
Who knows, maybe in the future there will be dedicated FP_raytrace cores that do ray tracing ultra fast, so no more painful DX9 → DX11 → DX12 upgrades, and better graphics.
Ultimately, I wouldn't say no to an FPGA-based GPU that could convert some of its cores from FP64 to FP32, or into special-function cores, for one application, then convert everything to FP64 for another application, and even convert everything into a single fat core doing sequential work (such as compiling shaders). This would benefit people who do many different things on one computer. For example, I may need more multiplications than additions, and an FPGA could help there. But for now, money talks, and it says "fixed function"; the best income is achieved with a mixture of FP64 and FP32 (and FP16 lately).
Why not just put in FP64 units that are capable of performing 2xFP32 operations per instruction (like the SIMD instruction sets in CPUs)?
SIMD always expects the same operation across multiple data elements, which is less fun for scalar GPGPU kernels. Also, building a unit that can do 2xFP32 out of an FP64 unit would need more transistors than a pure FP64 unit, more heat, and maybe more latency. More transistors also mean a higher probability of production failure, so a GPU with 1024 plain FP32 units could be produced more reliably than one with 512 flexible FP64 units.
Why can't I use all FP32 and FP64 units at the same time?
Mixed-precision computing can be done in CUDA and OpenCL, so you can get even faster by using all the cores, but it is only applicable to non-memory-bottlenecked situations, which are rare and hard to code.
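For instance, here is a minimal CUDA sketch of the idea (the kernel name and data layout are made up for illustration): each thread does independent FP32 and FP64 work, so the schedulers can issue instructions to both unit types.

#include <cuda_runtime.h>

// Hypothetical demo kernel: independent FP32 and FP64 arithmetic in one
// kernel, so the warp schedulers can keep both kinds of units busy.
__global__ void mixedKernel(float *x, double *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float  a = x[i];
        double b = y[i];
        a = a * 1.5f + 2.0f;   // issued to the FP32 units
        b = b * 1.5  + 2.0;    // issued to the FP64 units
        x[i] = a;
        y[i] = b;
    }
}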
Answer to edit 1:
Here is a detailed source: http://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Accelerating_GPU_Computation_Through_Mixed-Precision_Methods.pdf
Long story short, they don't simply add up: there are "diminishing returns" that prevent 100% scaling across all cores, because of the "extra cycles" needed between calculations in different precisions. When the precisions are not mixed, "extra iterations" are needed between blocks, which also prevents 100% scaling. It seems more useful for speeding FP64 up than for slowing FP32 down (though having many FP64 cores should still be beneficial for boosting FP32; you could test this with something like an n-body kernel, which is not memory-bottlenecked). FP64 consumes a lot of memory (and cache lines, and local memory); that's why I suggest the n-body algorithm, which re-uses the same data N times (for N > 64k, for example). My GPU has 1/24 FP64 throughput, so I don't trust my own computer. You have a Titan? You should try it; maybe it has 50% more compute power than its advertised GFLOPS value (but then the advertised TDP value could limit its frequency, or melt it down).
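For reference, here is the kind of compute-bound n-body inner loop I mean, as a hedged CUDA sketch (the names and softening constant are made up): each thread re-reads all n bodies, so it is O(N^2) arithmetic over O(N) data, and the FP units rather than memory become the bottleneck.

#include <cuda_runtime.h>

// Toy n-body force accumulation: compute-bound, so it is suitable for
// probing how a mix of FP32/FP64 units scales.
__global__ void nbodyForces(const float4 *pos, float3 *force, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 f = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {          // every thread walks all bodies
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + 1e-9f;  // softened r^2
        float inv = rsqrtf(r2);
        float s = pj.w * inv * inv * inv;  // mass / r^3
        f.x += s * dx; f.y += s * dy; f.z += s * dz;
    }
    force[i] = f;
}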
This source: http://www.nvidia.com/content/nvision2008/tech_presentations/NVIDIA_Research_Summit/NVISION08-Mixed_Precision_Methods_on_GPUs.pdf
says "outstanding performance and accuracy" but I couldn't find a physics solver for games using FP32 + FP32(truncated FP64), maybe its money talks again, if someone makes this, it would be "outstanding performance and meltdown" on gaming.(maybe worse than furmark exploding gpus)
People even use integers (an integer dot product) on top of floats here: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/
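The integer part of that post is the __dp4a intrinsic from CUDA 8 (Pascal, sm_61 and newer), which treats each int as four packed 8-bit values; a minimal sketch with made-up buffer names:

#include <cuda_runtime.h>

// __dp4a computes a 4-way int8 dot product plus accumulator in a single
// instruction, running alongside whatever the FP units are doing.
__global__ void dotInt8(const int *a, const int *b, int *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]);
}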
In case of CUDA, how is this achieved? Do I just use doubles and floats at the same time in my kernel? Or do I need to pass some kind of flag to NVCC?
Yes: you just use doubles and floats together in the same kernel; no special NVCC flag is needed apart from targeting an architecture that actually has FP64 hardware (sm_13 or newer). An example of iterative refinement using FP64 + FP32 in the same function:
https://www.sciencesmaths-paris.fr/upload/Contenu/HM2012/07-dongarra_part2.pdf
See pages 26-28.
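The shape of those slides' loop, as a hedged sketch (solve32, residual64 and the other names are placeholders I made up, not a real API; any FP32 solver and FP64 residual/update kernels would fill them in): the expensive solve runs at FP32 speed, only the residual and the update run in FP64, yet the result converges toward FP64 accuracy.

#include <vector>

// Hypothetical helpers (not a real API): an FP32 solve of A*out = rhs
// with the result promoted back to double, and an FP64 residual r = b - A*x.
void solve32(const double *A, const double *rhs, double *out, int n);
void residual64(const double *A, const double *b, const double *x, double *r, int n);

// Mixed-precision iterative refinement, per the slides above.
void refine(const double *A, const double *b, double *x, int n, int iters)
{
    std::vector<double> r(n), d(n);
    solve32(A, b, x, n);                    // initial solve, FP32 speed
    for (int it = 0; it < iters; ++it) {
        residual64(A, b, x, r.data(), n);   // FP64: r = b - A*x
        solve32(A, r.data(), d.data(), n);  // FP32: correction A*d ≈ r
        for (int i = 0; i < n; ++i)
            x[i] += d[i];                   // FP64: x += d
    }
}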
For the OpenCL part: here is AMD Evergreen (HD 5000 series), capable of issuing 1 DP FMA + 1 SP (or 1 SF) operation every cycle.
http://www.microway.com/download/whitepaper/gpgpu_architecture_and_performance_comparison_2010.pdf
I'll test something like an n-body kernel tomorrow on my R7 240, whose FP64 rate is 1/24th or 1/26th of its FP32 rate.
Edit: it's working.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable // needed on some devices if the FP64 line below is enabled
__kernel void sumGPU(__global float * a,__global float * b)
{
int idx = get_global_id(0);
float a0=a[idx];
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
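// Toggle: enable exactly one of the next two lines. The first routes this
// one addition through the FP64 unit (convert up, add, convert back);
// the second keeps it on the FP32 unit like all the others.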
//a0=convert_float(convert_double(a0)+2.0);
//a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
a0+=2.0f;
b[idx] = a0;
}
The kernel time switches between 13.02 ms and 12.85 ms depending on which one of those two commented lines is enabled.
Note: the cores are not FP32 units themselves. In fact there are no "cores": there are schedulers binding hardware resources (FP32 units, FP64 units, special-function units, registers) to the kernel instructions of threads. The threads are not real threads either. So when you use FP32, then FP64, then FP32, then an FP64 square root, the scheduler reserves the necessary resources as they are needed; when they are not needed, they remain available for other work items. (But I suspect a single work item can't use more than 1-2 FP32 ALUs at once; I don't know, I made that part up.)
Edit (2018/03): Is FP_raytrace (from the second paragraph of this answer, above) becoming reality?
(NVIDIA)
https://www.geforce.com/whats-new/articles/nvidia-rtx-real-time-game-ray-tracing
(AMD)
https://www.gamingonlinux.com/articles/amd-has-announced-radeon-rays-an-open-source-ray-tracing-sdk-using-vulkan.11461
Or is it another marketing gimmick? If it has a hardware side, then ray-tracing people can work faster, but it won't help a MOBA gamer or a physics simulator that doesn't trace rays. Why would I pay more for those ray tracers if I'm only going to edit some videos? Maybe these will get segmented into separate products too, but probably for more money.