I am using LSTM networks for multivariate, multi-timestep predictions.
So it is basically a seq2seq prediction, where a window of n_inputs time steps is fed into the model in order to predict the next n_outputs time steps of a time series.
My question is how to meaningfully apply Dropout and BatchNormalization, as this appears to be a highly discussed topic for recurrent and therefore LSTM networks. Let's stick to Keras as the framework for the sake of simplicity.
Case 1: Vanilla LSTM
from keras.models import Sequential
from keras.layers import LSTM, Dense, BatchNormalization, Activation

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(Dense(int(n_blocks/2)))
model.add(BatchNormalization())
model.add(Activation(activation))
model.add(Dense(n_outputs))
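For context, the inputs have shape (samples, n_inputs, n_features) and the targets have shape (samples, n_outputs); a training call for this model might look roughly like the following (optimizer, loss, epochs, and batch size are just placeholders, not part of the question):

# Placeholder training call, only to make the seq2seq shapes concrete
# X_train: (samples, n_inputs, n_features), y_train: (samples, n_outputs)
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)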
- Q1: Is it good practice not to use BatchNormalization directly after LSTM layers?
- Q2: Is it good practice to use Dropout inside LSTM layer?
- Q3: Is the usage of BatchNormalization and Dropout between the Dense layers good practice?
- Q4: If I stack multiple LSTM layers, is it a good idea to use BatchNormalization between them (see the sketch below)?
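To make Q4 concrete, the stacking I have in mind would look roughly like this (a sketch only, with the same placeholder hyperparameters as above):

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, return_sequences=True,
               input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(BatchNormalization())  # normalization between the stacked LSTM layers (Q4)
model.add(LSTM(n_blocks, activation=activation, dropout=dropout_rate))
model.add(Dense(n_outputs))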
Case 2: Encoder-decoder-like LSTM with TimeDistributed layers
from keras.models import Sequential
from keras.layers import (LSTM, Dense, BatchNormalization, Activation,
                          Dropout, RepeatVector, TimeDistributed)

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(RepeatVector(n_outputs))
model.add(LSTM(n_blocks, activation=activation, return_sequences=True, dropout=dropout_rate))
model.add(TimeDistributed(Dense(int(n_blocks/2), use_bias=False)))  # bias is redundant before BatchNormalization
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Activation(activation)))
model.add(TimeDistributed(Dropout(dropout_rate)))
model.add(TimeDistributed(Dense(1)))
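Note that with this decoder the targets need a trailing feature dimension, i.e. shape (samples, n_outputs, 1) instead of (samples, n_outputs); a rough sketch (variable names are placeholders):

# TimeDistributed(Dense(1)) emits one value per output step,
# so the targets are reshaped from (samples, n_outputs) to (samples, n_outputs, 1)
y_train_3d = y_train.reshape((y_train.shape[0], n_outputs, 1))
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train_3d, epochs=50, batch_size=32)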
- Q5: Should BatchNormalization and Dropout be wrapped inside TimeDistributed layers when used between TimeDistributed(Dense()) layers, or is it correct to leave them unwrapped?
- Q6: Can or should BatchNormalization be applied after, before, or in between the encoder and decoder LSTM blocks?
- Q7: If a ConvLSTM2D layer is used as the first layer (encoder), would this make a difference in the usage of Dropout and BatchNormalization (see the sketch at the end of the post)?
- Q8: Should the recurrent_dropout argument be used inside LSTM blocks? If yes, should it be combined with the normal dropout argument as in the example, or should it replace it? Thank you very much in advance!
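To make Q7 and Q8 more concrete, the variant I have in mind would look roughly like this (a sketch only: n_seq, n_steps_per_seq, n_filters, and the kernel size are placeholders, and the input would be reshaped to (samples, n_seq, 1, n_steps_per_seq, n_features)):

from keras.models import Sequential
from keras.layers import (ConvLSTM2D, Flatten, RepeatVector, LSTM,
                          TimeDistributed, Dense)

model = Sequential()
# Encoder: ConvLSTM2D over the input window, split into n_seq sub-sequences
# of n_steps_per_seq steps each (Q7)
model.add(ConvLSTM2D(n_filters, kernel_size=(1, 3), activation=activation,
                     input_shape=(n_seq, 1, n_steps_per_seq, n_features)))
model.add(Flatten())
model.add(RepeatVector(n_outputs))
# Decoder: LSTM combining dropout (on the inputs) with recurrent_dropout (on the recurrent state), cf. Q8
model.add(LSTM(n_blocks, activation=activation, return_sequences=True,
               dropout=dropout_rate, recurrent_dropout=dropout_rate))
model.add(TimeDistributed(Dense(1)))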
