I'm trying to train a model in PyTorch and I'd like a batch size of 8, but due to memory limitations I can only fit a batch size of at most 4. I've looked around and read a lot about gradient accumulation, and it seems like the solution to my problem.
However, I seem to have trouble implementing it. Every time I run the code I get RuntimeError: Trying to backward through the graph a second time. I don't understand why, since my code looks like all these other examples I've seen (unless I'm just missing something major); I've sketched the basic pattern they all describe just below the links:
- https://stackoverflow.com/a/62076913/1227353
- https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
- https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20
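For reference, here is the basic pattern I understand those examples to be describing (a minimal, self-contained sketch with a toy linear model and random data; ACCUM_STEPS and the numbers are placeholders, not my actual setup):
  import torch
  import torch.nn as nn

  # toy setup just so the sketch runs on its own; my real model and data are different
  model = nn.Linear(10, 1)
  optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
  criterion = nn.MSELoss()

  ACCUM_STEPS = 2  # number of small batches that make up one "virtual" batch

  optimizer.zero_grad()
  for step in range(8):  # stand-in for iterating over a dataloader
    inputs = torch.randn(4, 10)   # micro-batch of 4
    targets = torch.randn(4, 1)
    outputs = model(inputs)
    # scale the loss so the accumulated gradient matches a true batch of 4 * ACCUM_STEPS
    loss = criterion(outputs, targets) / ACCUM_STEPS
    loss.backward()  # gradients add up in each parameter's .grad across iterations
    if (step + 1) % ACCUM_STEPS == 0:
      optimizer.step()       # apply the accumulated gradient
      optimizer.zero_grad()  # reset for the next virtual batch
My understanding is that the gradients from each small backward() simply add up in the parameters' .grad fields until optimizer.step() is called, which is why the loss gets divided by the number of accumulated batches.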
One caveat is that the labels for my images are all of different sizes, so I can't feed the output batch and the label batch into the loss function together; I have to iterate over them in pairs. This is what an epoch looks like (it's been pared down for brevity):
  # labels_batch contains labels of different sizes
  for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)
    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
      output_scaled = interpolate(...)  # make output match label size
      loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
      loss.backward()
    if batch_idx % 2 == 1:
      optimizer.step()
      optimizer.zero_grad()
Is there something I'm missing? If I do the following, I also get an error:
  # labels_batch contains labels of different sizes
  for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)
    # CHANGE: we're gonna accumulate losses manually
    batch_loss = 0
    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
      output_scaled = interpolate(...)  # make output match label size
      loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
      batch_loss += loss  # CHANGE: accumulate!
    # CHANGE: do backprop outside for loop
    batch_loss.backward()
    if batch_idx % 2 == 1:
      optimizer.step()
      optimizer.zero_grad()
The error I get in this case is RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. This happens when the next epoch starts, though... (INCORRECT, SEE EDIT BELOW)
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
Oh, and as a side question: does where I put loss.backward() affect what I need to normalize the loss by, or is it always normalized by BATCH_SIZE * 2?
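For concreteness, this is the arithmetic I'm assuming behind the BATCH_SIZE * 2 divisor (with BATCH_SIZE = 4 and two accumulated loader batches):
  BATCH_SIZE = 4      # samples per loader batch (my memory limit)
  ACCUM_BATCHES = 2   # loader batches accumulated before each optimizer.step()
  effective_batch = BATCH_SIZE * ACCUM_BATCHES  # = 8, the batch size I actually want
  # each per-sample loss is divided by effective_batch, so the gradients that pile up
  # before optimizer.step() should correspond to the mean loss over those 8 samples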
EDIT:
The second code segment was erroring because I was calling torch.set_grad_enabled(phase == 'train') but had forgotten to wrap the call to batch_loss.backward() in an if phase == 'train' check... my bad
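Concretely, the fix was just a check around the backward/step part of the second segment (everything before it is unchanged):
    # ... same loop as the second segment above, right after accumulating batch_loss ...
    if phase == 'train':  # the check I had forgotten
      batch_loss.backward()
      if batch_idx % 2 == 1:
        optimizer.step()
        optimizer.zero_grad()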
So now the second segment of code seems to work and does gradient accumulation, but why doesn't the first bit of code work? It feels equivalent to setting BATCH_SIZE to 1. Furthermore, I'm creating a new loss object each time, so shouldn't the calls to backward() operate on entirely different graphs?
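In case it helps, here is the smallest toy snippet I could come up with that reproduces the same RuntimeError as the first segment, with my model replaced by a single tensor op (toy code, not my actual setup):
  import torch

  w = torch.ones(2, requires_grad=True)  # stand-in for the model's parameters
  outputs = w ** 2                       # one shared forward pass for the whole "batch"
  for o in outputs:                      # one loss per output, like my inner loop over labels
    loss = o / 2
    loss.backward()  # the second iteration raises: Trying to backward through the graph a second time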