Parallelism when training neural networks can be achieved in two ways:
- Data Parallelism - Split a large batch into smaller chunks and run the same set of operations on each chunk, one chunk per GPU.
- Model Parallelism - Split the model's computations themselves and run different parts on different GPUs.
As you have asked in the question, you want to split the computation itself, which falls into the second category. There is no out-of-the-box way to achieve model parallelism; PyTorch provides the primitives for parallel processing in the torch.distributed package. This tutorial goes through the package in detail, and from it you can put together the model-parallel approach that you need.
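To illustrate the idea (this is a naive sketch, not the torch.distributed approach), you can split a model manually by placing different sub-modules on different GPUs and moving the activations between them. The `TwoGPUModel` below and its layer sizes are made up for the example:

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Hypothetical model split across two GPUs (naive model parallelism)."""
    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0
        self.part1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        # Second half lives on GPU 1
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move the intermediate activations over to the second GPU
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(64, 1024))   # output lives on cuda:1
out.sum().backward()                 # autograd handles the cross-device hops
```

Note that in this form only one GPU is busy at any moment; getting both to work concurrently requires pipelining, which is where the complexity comes in.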
However, model parallelism can be very complex to achieve. The more common approach is data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. With both methods you run the same model on two different GPUs, and one large batch is split into two smaller chunks. With DataParallel, the gradients are gathered onto a single GPU and the optimization step happens there; with DistributedDataParallel, each GPU runs in its own process (via multiprocessing), gradients are averaged across processes, and the optimizer step runs in parallel on every GPU.
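For reference, wrapping a model in DataParallel is essentially a one-liner; here is a minimal sketch with a made-up model, batch size, and optimizer:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10))
model = nn.DataParallel(model, device_ids=[0, 1]).to("cuda:0")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(256, 1024, device="cuda:0")    # one large batch
targets = torch.randint(0, 10, (256,), device="cuda:0")

outputs = model(inputs)                  # batch is scattered across GPU 0 and GPU 1
loss = nn.functional.cross_entropy(outputs, targets)
loss.backward()                          # gradients are gathered back on GPU 0
optimizer.step()                         # optimization happens on GPU 0
```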
In your case, if you use DataParallel, the computation would still take place on two different GPUs. If you notice an imbalance in GPU usage, that is a consequence of how DataParallel is designed (outputs and gradients are gathered on one device). You can try DistributedDataParallel instead, which according to the docs is the fastest way to train on multiple GPUs.
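A minimal DistributedDataParallel sketch, one process per GPU, launched for example with torchrun (the model, batch size, and learning rate are placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK (and the rendezvous env vars) for each process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(1024, 10).to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # In practice a DistributedSampler hands each rank its share of the batch
    inputs = torch.randn(128, 1024, device=local_rank)
    targets = torch.randint(0, 10, (128,), device=local_rank)

    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()        # gradients are all-reduced across processes
    optimizer.step()       # every process applies the identical update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # run with: torchrun --nproc_per_node=2 this_script.py
```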
There are other ways to handle very large batches too. This article goes through them in detail and should be helpful. A few important points:
- Do gradient accumulation to simulate larger batches (see the sketch after this list)
- Use DataParallel
- If that doesn't suffice, go with DistributedDataParallel
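Gradient accumulation simply means calling `backward()` on several small batches before a single optimizer step, so the gradients sum up to those of one large batch. A minimal sketch with a dummy model and dummy micro-batches:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10).to("cuda:0")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 4      # effective batch size = 4 x micro-batch size

# Dummy data loader yielding micro-batches of 32 samples
loader = [(torch.randn(32, 1024), torch.randint(0, 10, (32,))) for _ in range(8)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    inputs, targets = inputs.to("cuda:0"), targets.to("cuda:0")
    loss = nn.functional.cross_entropy(model(inputs), targets)
    (loss / accumulation_steps).backward()   # scale so gradients average correctly
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # update once per accumulated batch
        optimizer.zero_grad()
```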