I'm creating a video captioning seq2seq model.
My encoder inputs are video features, and my decoder inputs are captions, beginning with a <start> token and padded with <end> tokens.
Problem: during the teacher-forcing training phase, after a few iterations the model only outputs <end> tokens, and it keeps doing so for all the remaining epochs.
My problem is very similar to these Stack Overflow posts:
- Seq2Seq model learns to only output EOS token (<\s>) after a few iterations
- Tensorflow seq2seq chatbot always give the same outputs
 
However, I'm sure that I'm using the right shapes for computing tf.contrib.seq2seq.sequence_loss.
My inputs also seem correct:
- my ground-truth target captions begin with a <start> token and are padded with <end> tokens;
- the predicted captions don't start with a <start> token.
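To make the shapes concrete, here is a minimal sketch of what I mean (the placeholder names are just for illustration):

# Shape sketch for tf.contrib.seq2seq.sequence_loss (TF 1.x):
# logits  [batch_size, max_length, vocab_size]  (decoder output)
# targets [batch_size, max_length]              (ground-truth token ids)
# weights [batch_size, max_length]              (float mask over the time steps)
logits  = tf.placeholder(tf.float32, [batch_size, max_length, vocab_size])
targets = tf.placeholder(tf.int32,   [batch_size, max_length])
weights = tf.placeholder(tf.float32, [batch_size, max_length])
loss = tf.contrib.seq2seq.sequence_loss(logits, targets, weights)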
I tried to:
- use another loss function (the mean of tf.nn.sparse_softmax_cross_entropy_with_logits; see the sketch after this list)
- keep the <end> token at the end of captions but pad with special <pad> tokens, so my captions look like: <start> This is my caption <end> <pad> <pad> ... <pad>. This resulted in NaN logits...
- change my embedding method
- use more data: I trained the model with 512 videos and a batch size of 64
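In case it is unclear, this is roughly what I mean by that alternative loss (a sketch, assuming the same logits / targets shapes as above):

# Sketch of the alternative loss: plain mean of the per-token cross-entropy.
# targets are integer ids [batch_size, max_length], logits are [batch_size, max_length, vocab_size].
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
alt_loss = tf.reduce_mean(crossent)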
 
Here is my simple model:
import tensorflow as tf

# vocab_size, embedding_dims, dec_units, batch_size and max_length are globals.
def decoder(target, hidden_state, encoder_outputs):
  with tf.name_scope("decoder"):
    # Embed the teacher-forced target captions.
    embeddings = tf.keras.layers.Embedding(vocab_size, embedding_dims, name="embeddings")
    decoder_inputs = embeddings(target)
    decoder_gru_cell = tf.nn.rnn_cell.GRUCell(dec_units, name="gru_cell")
    # Projection from the GRU output to the vocabulary.
    output_layer = tf.layers.Dense(vocab_size, kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
    # Training decoder (teacher forcing), initialised with the encoder's final state.
    with tf.variable_scope("decoder"):
      training_helper = tf.contrib.seq2seq.TrainingHelper(decoder_inputs, batch_size * [max_length])
      training_decoder = tf.contrib.seq2seq.BasicDecoder(decoder_gru_cell, training_helper, hidden_state, output_layer)
      training_decoder_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder, maximum_iterations=max_length)
  # These are the logits, shape [batch_size, max_length, vocab_size].
  return training_decoder_outputs.rnn_output
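For context, this is roughly how the decoder is wired into training (simplified; encoder(), video_features, decoder_input_ids, target_ids and target_weights are stand-ins for my actual pipeline, and the optimizer is just illustrative):

# Simplified training wiring; the names below are placeholders, not my exact code.
encoder_outputs, encoder_state = encoder(video_features)
logits = decoder(decoder_input_ids, encoder_state, encoder_outputs)   # [batch, max_length, vocab]
loss = tf.contrib.seq2seq.sequence_loss(logits, target_ids, target_weights)
train_op = tf.train.AdamOptimizer().minimize(loss)                    # optimizer choice is illustrative
predicted_ids = tf.argmax(logits, axis=-1)                            # roughly how the printed captions below are produced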
Here is an example from training.
Real caption:
<start> a girl and boy flirt then eat food <end> <end> <end> <end> 
<end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> 
<end> <end> <end> 
Here are the predictions:
Epoch 1:
Predicted caption:
show show show show show show show show show show show show show show     
show show show show show show show show show show show show show 
Epoch 2:
Predicted caption:
the the the the the the the the the the the the the the the the the the 
the the the the the the the the the 
...
Epoch 7:
Predicted caption:
<end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> 
<end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> 
<end> <end> <end> 
And it stays like Epoch 7 for all the remaining epochs...
Note that my model seems to optimize the loss correctly!