I have a Scheduler class which contains a list of Client objects, each with its own PyTorch model, parameters, and training functions. I am trying to train multiple clients in parallel, as I have multiple GPUs and each Client is assigned its own GPU.
The basic code structure is like this:
import torch.multiprocessing as mp

class Scheduler:
    def __init__(self, num_clients):
        self.clients = []  # Client1, ..., ClientN

    def client_update(self, client):
        print("Client {}".format(client.id))
        client.train()
        client.evaluate(self.dataset.test_dataloader)

    def train(self, num_rounds):
        for round in range(num_rounds):
            processes = []
            for client in self.clients:
                process = mp.Process(target=self.client_update, args=(client,))
                process.start()
                processes.append(process)
            for process in processes:
                process.join()
The Scheduler class is initialised in the main script and its train function is called there. Inside the if __name__ == '__main__': guard I set mp.set_start_method('spawn', force=True).
This doesn't work: each Process creates a new Client object and I run into an EOFError: Ran out of input, similar to this. Unfortunately I cannot use the same solution as in that thread.
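For reference, the pickling constraint can be reproduced without PyTorch. Under the 'spawn' start method the Process target and every argument are pickled; a bound method such as self.client_update drags the whole Scheduler (dataloaders and all) into the pickle, whereas a module-level worker that receives only picklable state does not. A minimal sketch with a toy Client and a results Queue (all hypothetical names, not the real training code):

```python
import multiprocessing as mp

class Client:
    """Toy stand-in for the real Client (hypothetical)."""
    def __init__(self, cid):
        self.id = cid

    def train(self):
        return self.id * 2  # placeholder for the real training step

def client_update(client, results):
    # Module-level worker: under 'spawn' the function is pickled by
    # reference, and only the (picklable) Client travels to the child.
    results.put((client.id, client.train()))

def run_demo():
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    clients = [Client(i) for i in range(3)]
    procs = [ctx.Process(target=client_update, args=(c, results))
             for c in clients]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(results.get() for _ in clients)

if __name__ == "__main__":
    print(run_demo())
```

The same shape should carry over to the real setup as long as each Client only holds picklable state when it crosses the process boundary (e.g. a device index rather than live CUDA tensors).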
I also tried using a Pool, but unfortunately couldn't get that working either:
import functools

ctx = mp.get_context('forkserver')
pool = ctx.Pool(2)
pool.map(
    functools.partial(self.client_update),
    self.clients)
pool.close()
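The Pool variant likely fails for the same reason: functools.partial(self.client_update) still wraps a bound method, so the whole Scheduler gets pickled. Mapping a module-level function over picklable per-client work items sidesteps that. A minimal sketch with a stand-in job (hypothetical names; shown with the 'spawn' context rather than 'forkserver', since spawn is the CUDA-safe start method from above):

```python
import multiprocessing as mp

def train_one(cid):
    # Stand-in for one client's training job; must return something picklable.
    return cid * cid

def run_pool(client_ids):
    # pool.map pickles train_one by reference and preserves input order
    # in its results.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        return pool.map(train_one, client_ids)

if __name__ == "__main__":
    print(run_pool([0, 1, 2]))
```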
I am unsure what the best approach would be to use the GPUs efficiently and speed up training across the clients.