Why prediction on activation values (Softmax) gives incorrect results?

Question

I've implemented a basic neural network from scratch using Tensorflow and trained it on MNIST fashion dataset. It's trained correctly and outputs testing accuracy around ~88-90% over 10 classes.

Now I've written predict() function which predicts the class of given image using trained weights. Here is the code:

def predict(images, trained_parameters):

    Ws, bs = [], []
    parameters = {}

    for param in trained_parameters.keys():
        parameters[param] = tf.convert_to_tensor(trained_parameters[param])

    X = tf.placeholder(tf.float32, [images.shape[0], None], name = 'X')
    Z_L = forward_propagation(X, trained_parameters)

    p = tf.argmax(Z_L) # Working fine
    # p = tf.argmax(tf.nn.softmax(Z_L)) # not working if softmax is applied

    with tf.Session() as session:
        prediction = session.run(p, feed_dict={X: images})

    return prediction

This uses forward_propagation() function which returns the weighted sum of the last layer (Z) and not the activitions (A) because of TensorFlows tf.nn.softmax_cross_entropy_with_logits() requires Z instead of A as it will calculate A by applying softmax Refer this link for details.

Now in predict() function, when I make predictions using Z instead of A (activations) it's working correctly. By if I calculate softmax on Z (which is activations A of the last layer) it's giving incorrect predictions.

Why it's giving correct predictions on weighted sums Z? We are not supposed to first apply softmax activation (and calculate A) and then make predictions?

Here is the link to my colab notebook if anyone wants to look at my entire code: Link to Notebook Gist

So what am I missing here?

rvinas · Accepted Answer · 2019-08-04T17:07:32.360

Most TF functions, such as tf.nn.softmax, assume by default that the batch dimension is the first one - that is a common practice. Now, I noticed in your code that your batch dimension is the second, i.e. your output shape is (output_dim=10, batch_size=?), and as a result, tf.nn.softmax is computing the softmax activation along the batch dimension.

There is nothing wrong in not following the conventions - one just needs to be aware of them. Computing the argmax of the softmax along the first axis should yield the desired results (it is equivalent to taking the argmax of the logits):

p = tf.argmax(tf.nn.softmax(Z_L, axis=0))

Also, I would also recommend computing the argmax along the first axis in case more than one image is fed into the network.

Why prediction on activation values (Softmax) gives incorrect results?

1 Answers1