TensorFlow dynamic RNN not training

Question

Problem statement

I am trying to train a dynamic RNN in TensorFlow v1.0.1 on Linux RedHat 7.3 (problem also manifests on Windows 7), and no matter what I try, I get the exact same training and validation error at every epoch, i.e. my weights are not updating.

I appreciate any help you can offer.

Example

I tried to reduce this to a minimum example that shows my issue, but the minimum example is still pretty large. I based the network structure largely on this gist.

Network definition

import functools
import numpy as np
import tensorflow as tf

def lazy_property(function):
    attribute = '_' + function.__name__

    @property
    @functools.wraps(function)
    def wrapper(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)
    return wrapper

class MyNetwork:
    """
    Class defining an RNN for labeling a time series.
    """

    def __init__(self, data, target, num_hidden=64):
        self.data = data
        self.target = target
        self._num_hidden = num_hidden
        self._num_steps = int(self.target.get_shape()[1])
        self._num_classes = int(self.target.get_shape()[2])
        self._weight_and_bias()  # create weight and bias tensors
        self.prediction
        self.error
        self.optimize

    @lazy_property
    def prediction(self):
        """Defines the recurrent neural network prediction scheme."""

        # Dynamic LSTM.
        network = tf.contrib.rnn.BasicLSTMCell(self._num_hidden)
        output, _ = tf.nn.dynamic_rnn(network, data, dtype=tf.float32)

        # Flatten and apply same weights to all time steps.
        output = tf.reshape(output, [-1, self._num_hidden])
        prediction = tf.nn.softmax(tf.matmul(output, self.weight) + self.bias)
        prediction = tf.reshape(prediction,
                                [-1, self._num_steps, self._num_classes])
        return prediction

    @lazy_property
    def cost(self):
        """Defines the cost function for the network."""

        cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction),
                                       axis=[1, 2])
        cross_entropy = tf.reduce_mean(cross_entropy)
        return cross_entropy

    @lazy_property
    def optimize(self):
        """Defines the optimization scheme."""

        learning_rate = 0.003
        optimizer = tf.train.RMSPropOptimizer(learning_rate)
        return optimizer.minimize(self.cost)

    @lazy_property
    def error(self):
        """Defines a measure of prediction error."""

        mistakes = tf.not_equal(tf.argmax(self.target, 2),
                                tf.argmax(self.prediction, 2))
        return tf.reduce_mean(tf.cast(mistakes, tf.float32))

    def _weight_and_bias(self):
        """Returns appropriately sized weight and bias tensors for the output layer."""

        self.weight = tf.Variable(tf.truncated_normal(
                                         [self._num_hidden, self._num_classes],
                                         mean=0.0,
                                         stddev=0.01,
                                         dtype=tf.float32))
        self.bias = tf.Variable(tf.constant(0.1, shape=[self._num_classes]))

Training

Here is my training process. The all_data class just holds my data and labels, and uses a batch generator class to spit out batches for training when I call all_data.train.next() and all_data.train_labels.next(). You can reproduce with any batch generation scheme you like, and I can add the code if you think it is relevant; I felt like this was getting too long as it is.

tf.reset_default_graph()
data = tf.placeholder(tf.float32,
                      [None, all_data.num_steps, all_data.num_features])
target = tf.placeholder(tf.float32,
                        [None, all_data.num_steps, all_data.num_outputs])
model = MyNetwork(data, target, NUM_HIDDEN)
print('Training the model...')
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print('Initialized.')
    for epoch in range(3):
        print('Epoch {} |'.format(epoch), end='', flush=True)
        for step in range(all_data.train_size // BATCH_SIZE):

            # Generate the next training batch and train.
            d = all_data.train.next()
            t = all_data.train_labels.next()
            sess.run(model.optimize,
                     feed_dict={data: d, target: t})

            # Update the user periodically.
            if step % summary_frequency == 0:
                print('.', end='', flush=True)

        # Show training and validation error at the end of each epoch.
        print('|', flush=True)
        train_error = sess.run(model.error,
                               feed_dict={data: d, target: t})
        valid_error = sess.run(model.error,
                               feed_dict={
                                   data: all_data.valid,
                                   target: all_data.valid_labels
                                   })
        print('Training error: {}%'.format(100 * train_error))
        print('Validation error: {}%'.format(100 * valid_error))

    # Check testing error after everything.
    test_error = sess.run(model.error,
                          feed_dict={
                              data: all_data.test,
                              target: all_data.test_labels
                              })
    print('Testing error after {} epochs: {}%'.format(epoch + 1, 100 * test_error))

For a simple example, I generated random data and labels, where data has shape [num_samples, num_steps, num_features], and each sample has a single label associated with the whole thing:

data = np.random.rand(5000, 1000, 2)
labels = np.random.randint(low=0, high=2, size=[5000])

I then converted my labels to one-hot vectors and tiled them so that the resulting labels tensor was the same size as the data tensor.

Results

No matter what I do, I get results like this:

Training the model...
Initialized.
Epoch  0 |.......................................................|
Training error: 56.25%
Validation error: 53.39999794960022%
Epoch  1 |.......................................................|
Training error: 56.25%
Validation error: 53.39999794960022%
Epoch  2 |.......................................................|
Training error: 56.25%
Validation error: 53.39999794960022%
Testing error after 3 epochs: 49.000000953674316%

Where I have exactly the same error at every epoch. Even if my weights were randomly walking around this should change. For the example shown here, I used random data with random labels, so I do not expect much improvement, but I do expect some change, and I am getting the exact same results every epoch. When I do this with my actual data set, I get the same behavior.

Insight

I hesitate to include this in case it proves to be a red herring, but I believe that my optimizer is calculating cost function gradients of None. When I tried a different optimizer and attempted to clip the gradients, I went ahead and used tf.Print to output the gradients as well. The network crashed with an error that tf.Print could not handle None-type values.

Attempted fixes

I have tried the following things, and the problem persists in all cases:

Using different optimizers, e.g. AdamOptimizer with and without modifications to the gradients (clipping).
Adjusting batch sizes.
Using many more and many fewer hidden nodes.
Running for more epochs.
Initializing my weights with different values assigned to stddev.
Initializing my biases to zeros (using tf.zeros) and to different constants.
Using weights and biases that are defined within the prediction method and are not member variables of the class, and a _weight_and_bias method that is defined as a @staticmethod like in this gist.
Determining logits in the prediction function instead of softmax predictions, i.e. predictions = tf.matmul(output, self.weights) + self.bias, and then using tf.nn.softmax_cross_entropy_with_logits. This requires some reshaping because the method wants its labels and targets given with shape [batch_size, num_classes], so the cost method becomes:

(line added to get code to format...)

@lazy_property
def cost(self):
"""Defines the cost function for the network."""
    targs = tf.reshape(self.target, [-1, self._num_classes])
    logits = tf.reshape(self.predictions, [-1, self._num_classes])
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=targs, logits=logits)
    cross_entropy = tf.reduce_mean(cross_entropy)
    return cross_entropy

Changing which size dimension I leave as None when I create my placeholders as suggested in this answer, which requires a bit of rewriting in the network definition. Basically setting size = [all_data.batch_size, -1, all_data.num_features] and size = [all_data.batch_size, -1, all_data.num_classes].
Using tf.contrib.rnn.DropoutWrapper in my network definition and passing a dropout value set to 0.5 in training and 1.0 in validation and testing.

I can give you some tips for debugging. Just use the simplest possible optimizer and don't worry about adjusting batch sizes, number of hidden nodes, weight initialization or dropout. Those are clearly not the problem. Also you don't have to train for three epochs to fix this. You can just train for one step. You don't even have to load new data every time. Just use the same batch over and over until you can see that your network is actually learning something. — Aaron, Apr 21 '17 at 18:46
That is good advice and actually helped a lot! At the very least, I can see if the network is training much sooner and expedite my process. Thanks. — Engineero, Apr 25 '17 at 22:58

score 1 · Accepted Answer · answered Apr 25 '17 at 22:56

The problem went away when I used

output = tf.contrib.layers.flatten(output)
logits = tf.contrib.layers.fully_connected(output, some_size, activation_fn=None)

instead of flattening my network output, defining weights, and performing the tf.matmul(output, weight) + bias manually. I then used logits (instead of predictions in the question) in my cost function with

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=target,
                                                        logits=logits)

If you want to get the network prediction, you will still need to do prediction = tf.nn.softmax(logits).

I have no idea why this helped, but the network would not train even on random made-up data until I made these changes.