Dropout layer before or after LSTM. What is the difference?

Question

Suppose that we have an LSTM model for time series forecasting. Also, this is a multivariate case, so we're using more than one feature for training the model.

ipt   = Input(shape = (shape[0], shape[1]))
x     = Dropout(0.3)(ipt) ## Dropout before LSTM.
x     = CuDNNLSTM(10, return_sequences = False)(x)
out   = Dense(1, activation='relu')(x)

We can add a Dropout layer before the LSTM (as in the code above) or after it.

If we add it before LSTM, is it applying dropout on timesteps (different lags of time series), or different input features, or both of them?
If we add it after the LSTM, given that return_sequences is False, what is the dropout doing?
Is there any difference between the dropout option in LSTM and a Dropout layer before the LSTM layer?

Daniel Möller · Accepted Answer · 2019-11-07T16:20:25.640

10

By default, Dropout creates a random tensor of zeros and ones. No pattern, no privileged axis. So you can't say a specific thing is being dropped, just random coordinates in the tensor. (Well, it drops features, but different features for each step, and differently for each sample.)
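
As a rough illustration of the default behaviour described above, here is a minimal NumPy sketch (not the Keras implementation). The helper name `dropout` and the tensor sizes are made up for the example; the rescaling by 1/(1 - rate) mirrors the usual "inverted dropout" convention.

```python
import numpy as np

def dropout(x, rate, rng):
    """Element-wise inverted dropout: every coordinate gets its own coin flip."""
    keep = rng.random(x.shape) >= rate        # one independent mask entry per coordinate
    return x * keep / (1.0 - rate), keep      # kept values are scaled up by 1 / (1 - rate)

rng = np.random.default_rng(0)
x = np.ones((2, 5, 3))                        # (samples, steps, features), illustrative sizes
y, keep = dropout(x, rate=0.3, rng=rng)
print(keep.astype(int)[0])                    # mask for sample 0: rows are steps, columns are features
```

Printing the mask for another sample (`keep[1]`) generally shows a different pattern, which is the "differently for each sample" part.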

If you want, you can use the noise_shape argument, which defines the shape of the random tensor. With it you can choose whether to drop steps, features, or samples, or some combination.

Dropping time steps: noise_shape = (1,steps,1)
Dropping features: noise_shape = (1,1, features)
Dropping samples: noise_shape = (None, 1, 1)
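
The three noise_shape settings above can be sketched in NumPy as a small mask that is broadcast over the axes of size 1. This is only a simulation of the idea, not the Keras code; the helper `dropout_with_noise_shape` is hypothetical, and `None` is taken to mean "use the real size of that axis".

```python
import numpy as np

def dropout_with_noise_shape(x, rate, noise_shape, rng):
    # Replace None by the actual axis size (as for the batch axis).
    shape = tuple(x.shape[i] if n is None else n for i, n in enumerate(noise_shape))
    keep = rng.random(shape) >= rate          # small mask with the requested shape
    return x * keep / (1.0 - rate)            # broadcast across every axis of size 1

rng = np.random.default_rng(0)
samples, steps, features = 4, 6, 3            # illustrative sizes
x = np.ones((samples, steps, features))

drop_steps    = dropout_with_noise_shape(x, 0.5, (1, steps, 1), rng)     # whole time steps
drop_features = dropout_with_noise_shape(x, 0.5, (1, 1, features), rng)  # whole features
drop_samples  = dropout_with_noise_shape(x, 0.5, (None, 1, 1), rng)      # whole samples

print(drop_steps[0, :, 0])     # one value per step, shared by all samples and features
print(drop_features[0, 0, :])  # one value per feature, shared by all samples and steps
print(drop_samples[:, 0, 0])   # one value per sample, shared by all steps and features
```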

There is also the SpatialDropout1D layer, which uses noise_shape = (input_shape[0], 1, input_shape[2]) automatically. This drops the same feature for all time steps, but treats each sample individually (each sample will drop a different group of features).
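
To make the SpatialDropout1D pattern concrete, here is a small NumPy sketch of a mask with shape (samples, 1, features). It imitates the noise shape quoted above and is not the layer itself; sizes, rate and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
samples, steps, features = 3, 4, 5            # illustrative sizes
rate = 0.4
x = np.ones((samples, steps, features))

# One mask entry per (sample, feature); the size-1 time axis is broadcast.
keep = rng.random((samples, 1, features)) >= rate
y = x * keep / (1.0 - rate)

# For each sample, a feature is either zeroed at every step or kept at every step.
dropped_everywhere = (y == 0).all(axis=1)
dropped_somewhere = (y == 0).any(axis=1)
print(dropped_everywhere.astype(int))         # rows are samples, columns are features
```

The two boolean arrays are identical, which is another way of saying that no feature is ever dropped for only part of a sequence.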

After the LSTM you have shape = (None, 10). So you use Dropout the same way you would in any fully connected network. It drops a different group of features for each sample.

Dropout as an argument to the LSTM works quite differently. It generates 4 different dropout masks, creating a different input for each of the gates. (You can see the LSTMCell code to check this.)

Also, there is the recurrent_dropout option, which also generates 4 dropout masks, but these are applied to the states instead of the inputs, at each step of the recurrent calculations.
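
The per-gate masks from the two paragraphs above can be sketched with a hand-written LSTM step in NumPy. This is a simplified illustration, not the LSTMCell source: the helper names, the weight layout (one matrix per gate, in the order i, f, g, o) and the choice to reuse the same masks at every time step are all assumptions of the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_masks(shape, rate, rng, count=4):
    """One inverted-dropout mask per gate (input, forget, candidate, output)."""
    return [(rng.random(shape) >= rate) / (1.0 - rate) for _ in range(count)]

def lstm_step(x_t, h, c, W, U, b, in_masks, rec_masks):
    """One LSTM step where each gate sees its own masked copy of x_t and h."""
    z = [(x_t * in_masks[k]) @ W[k] + (h * rec_masks[k]) @ U[k] + b[k] for k in range(4)]
    i, f, o = sigmoid(z[0]), sigmoid(z[1]), sigmoid(z[3])
    g = np.tanh(z[2])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

rng = np.random.default_rng(0)
batch, steps, features, units = 2, 5, 3, 10   # illustrative sizes
x = rng.normal(size=(batch, steps, features))
W = rng.normal(scale=0.1, size=(4, features, units))
U = rng.normal(scale=0.1, size=(4, units, units))
b = np.zeros((4, units))

in_masks = make_masks((batch, features), 0.3, rng)   # plays the role of `dropout`
rec_masks = make_masks((batch, units), 0.3, rng)     # plays the role of `recurrent_dropout`

h = np.zeros((batch, units))
c = np.zeros((batch, units))
for t in range(steps):
    h, c = lstm_step(x[:, t], h, c, W, U, b, in_masks, rec_masks)
print(h.shape)  # (2, 10)
```

Compared with a Dropout layer placed before the LSTM, the difference visible here is that the input is masked four times independently, once per gate, and that the hidden state gets its own set of masks.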

edited Nov 07 '19 at 16:20

answered Nov 07 '19 at 16:11

Daniel Möller


+1, didn't know `'dropout'` as an argument generates per-gate masks. I wonder why `CuDNN` implementations found `recurrent_dropout` problematic though, or why there isn't a `CuDNNIndRNN` implementation yet. Guess 'funding' could answer both. – OverLordGoldDragon Nov 07 '19 at 16:20
Hmmm, good point. Maybe `CuDNNLSTM` is different from `LSTM`. I based my answer on `LSTM`. I think it's because the CuDNN version probably doesn't really have recursion (as GPUs are not good for that, they're great for pure parallel calculations) – Daniel Möller Nov 07 '19 at 16:21
How will it BPTT & pass hidden states along timesteps without recursion? From what I've read, `CuDNN` implems use algorithmic tricks to better utilize GPU resources, but it wouldn't be entirely parallelized – OverLordGoldDragon Nov 07 '19 at 16:31
I'm not certain of what I'm saying, it's just a guess. But I believe you can unroll the calculations in a bigger equation that would be solved at once, or in greater steps than the usual recurrent calculations. – Daniel Möller Nov 07 '19 at 16:35
@DanielMöller So based on my understanding, adding `noise_shape = (1,1, features)` to `Dropout` layer before LSTM works like a feature selection technique. Is it correct? – Eghbal Nov 07 '19 at 17:26
Yes, but it will treat all samples equally. (Thus it would require probably more epochs to get a good dropout variety). It would be best to do feature selection with `noise_shape=(None, 1, features)`, this will treat each sample differently, resulting in more variation. The easiest is to just use `SpatialDropout1D`. – Daniel Möller Nov 07 '19 at 17:32
@DanielMöller, Thanks. We know the main point behind dropout is preventing the model from overfitting. Can we expect this behaviour by adding `SpatialDropout1D` before LSTM? I think we also should consider `dropout` as an option inside of `LSTM` and probably using `Relu` as the activation function. So in this case, I think `LSTM` is a better choice compared to `CuDNNLSTM` because of its flexibility. Also, I think adding timestamps (`noise_shape=(None, steps, features)`) is a bit strange. – Eghbal Nov 07 '19 at 17:49
Relu is not recommended for LSTM. It's better to stick with the standard activation. (It's very easy to explode using the same weights recurrently) -- Every dropout will help with overfitting, some may cause a bad effect in learning. I can't say which is best. – Daniel Möller Nov 07 '19 at 18:50
@DanielMöller As you said, if we add `SpatialDropout1D` before LSTM, it works as a feature selection. But it's just generating random numbers so how is it going to find the best combination because optimization algorithm is not optimizing anything in this layer. – Eghbal Nov 08 '19 at 11:55
It's not an "intelligent" feature selection, dropouts will never try to find a good combination. Maybe I didn't understand the question or didn't know what the term "feature selection" could mean. The function of the dropout layer is just to add noise so the model learns to generalize better. It's a regularization technique (maybe also seen as augmentation). I said "feature selection" because it drops the same features for all time steps instead of dropping just anything as the standard dropout. – Daniel Möller Nov 08 '19 at 12:00

score 4 · Answer 2 · answered Nov 07 '19 at 13:48

You are confusing Dropout with its variant SpatialDropoutND (either 1D, 2D or 3D). See the documentation (apparently you can't link to a specific class).

Dropout applies a random binary mask to its input, whatever the shape beyond the first (batch) dimension, so in this case it applies to both features and timesteps.
Here, if return_sequences=False, you only get the output from the last timestep, so it would be of size [batch, 10] in your case. Dropout will randomly drop values along the second dimension.
Yes, there is a difference: the dropout argument of the LSTM is applied at each time step as the sequence is processed (e.g. a sequence of length 10 goes through the unrolled LSTM, and some of the features are dropped before going into the next cell). A Dropout layer would drop random elements (except along the batch dimension). SpatialDropout1D would drop entire channels, so in this case some features would be dropped out for every timestep (in the convolutional case, you could use SpatialDropout2D to drop channels, either at the input or further along the network).
