How to implement the Softmax function in Python

Question

From Udacity's deep learning class, the softmax of y_i is simply the exponential of y_i divided by the sum of the exponentials over the whole Y vector:

S(y_i) = e^(y_i) / sum_j e^(y_j)

where S(y_i) is the softmax function of y_i, e is the exponential, and j runs over the columns of the input vector Y.

I've tried the following:

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(softmax(scores))

which returns:

[ 0.8360188   0.11314284  0.05083836]

But the suggested solution was:

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

which produces the same output as the first implementation, even though the first implementation explicitly takes the difference of each column and the max and then divides by the sum.

Can someone show mathematically why? Is one correct and the other one wrong?

Are the implementations similar in terms of code and time complexity? Which is more efficient?

I'm curious why you attempted to implement it in this way with a max function. What made you think of it in that way? — BBischof, Jan 26 '16 at 01:14
I don't know, i thought treating the maximum as 0 and sort of like moving the graph to the left and clip at 0 helps. Then my range sort of shorten from `-inf to +inf` to `-inf to 0`. I guess I was overthinking. hahahaaa — alvas, Jan 26 '16 at 01:27
I still have one sub-question which doesn't seem to be answered below. What is the significance of `axis = 0` in the suggested answer by Udacity? – Parva Thakkar, Jan 26 '16 at 19:57
if you take a look at the numpy documentation, it discusses what sum(x, axis=0)--and similarly axis=1-- does. In short, it provides the direction in which to sum an array of arrays. In this case, it tells it to sum along the vectors. In this case, that corresponds to the denominators in the softmax function. — BBischof, Jan 26 '16 at 22:24
It's like every other week, there's a more correct answer till the point where my math isn't good enough to decide who's correct =) Any math whiz who didn't provide an answer can help decide which is correct? — alvas, Jul 10 '16 at 23:06
Both solutions are equivalent in terms of math. However, your solution is better because it avoids the potential overflow issue when taking `exp` – Louis Yang, Jun 21 '20 at 23:11
my two pennies here: **do not implement softmax yourself**, unless this is for educational purposes. There are some numerical stability tricks that need to be addressed while implementing, so it is better to use one of the multiple implementations already available (tensorflow, pytorch, scipy, etc) – Rodrigo Laguna, Mar 12 '23 at 06:38

score 179 · Accepted Answer · edited Apr 11 '18 at 15:24

They're both correct, but yours is preferred from the point of view of numerical stability.

You start with

e ^ (x - max(x)) / sum(e ^ (x - max(x)))

By using the fact that a^(b - c) = (a^b)/(a^c) we have

= e ^ x / (e ^ max(x) * sum(e ^ x / e ^ max(x)))

= e ^ x / sum(e ^ x)

Which is what the suggested solution computes. You could replace max(x) with any constant and it would cancel out.
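
The cancellation can also be checked numerically. The following is a minimal sketch, not part of the original answer; the helper names `softmax_plain` and `softmax_shifted` are made up for illustration, and the input is assumed to be a 1-D array.

```python
import numpy as np

def softmax_plain(x):
    """Direct formula: exp(x) / sum(exp(x))."""
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_shifted(x, c):
    """Same formula after subtracting an arbitrary constant c from every score."""
    e_x = np.exp(x - c)
    return e_x / e_x.sum()

scores = np.array([3.0, 1.0, 0.2])

# Any constant cancels between numerator and denominator, not just max(x).
for c in (0.0, scores.max(), -2.5, 7.0):
    assert np.allclose(softmax_plain(scores), softmax_shifted(scores, c))

print(softmax_shifted(scores, scores.max()))
```

Choosing c = max(x) is just the convenient choice that keeps every exponent at or below zero.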

edited Apr 11 '18 at 15:24

Utkan Gezer

answered Jan 23 '16 at 22:00

Trevor Merrifield

Reformatting your answer @TrevorM for further clarification: e ^ (x - max(x)) / sum(e ^ (x - max(x))) using a^(b - c) = (a^b)/(a^c) we have, = e ^ x / {e ^ max(x) * sum(e ^ x / e ^ max(x))} = e ^ x / sum(e ^ x) – shanky_thebearer Jan 28 '16 at 13:44
@Trevor Merrifield, I don't think the first approach had any "unnecessary term". In fact it is better than the second approach. I have added this point as a separate answer. – Shagun Sodhani Feb 08 '16 at 18:15
@Shagun You are correct. The two are mathematically equivalent but I hadn't considered numerical stability. – Trevor Merrifield Feb 08 '16 at 18:30
Hope you don't mind: I edited out "unnecessary term" in case people don't read the comments (or the comments disappear). This page gets quite a bit of traffic from search engines and this is currently the first answer people see. – Alex Riley Mar 04 '17 at 19:54
I wonder why you subtract max(x) and not max(abs(x)) (fix the sign after determining the value). If all your values are below zero and very large in their absolute value, and only one value (the maximum) is close to zero, subtracting the maximum will not change anything. Wouldn't it still be numerically unstable? – Cerno May 08 '17 at 12:00
IMPORTANT (for whoever lands here): As for implementation with matrices, @ChuckFive solution is the correct way to calculate the softmax. – Yuval Atzmon May 24 '17 at 08:18
@TrevorMerrifield would you please say why the latter is numerically stable? – Green Falcon Feb 19 '18 at 11:40
more explanation of softmax https://machinelearningmastery.com/softmax-activation-function-with-python/ – Alex Punnen Jan 07 '22 at 12:22

desertnaut · Answer 2 · 2020-05-11T16:27:22.370

(Well... much confusion here, both in the question and in the answers...)

To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered this if you had also tried the 2-D score array from the example provided in the Udacity quiz.

Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let's try your solution (your_softmax) and one where the only difference is the axis argument:

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

As I said, for a 1-D score array, the results are indeed identical:

scores = [3.0, 1.0, 0.2]
print(your_softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
print(softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
your_softmax(scores) == softmax(scores)
# array([ True,  True,  True], dtype=bool)

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

print(your_softmax(scores2D))
# [[  4.89907947e-04   1.33170787e-03   3.61995731e-03   7.27087861e-02]
#  [  1.33170787e-03   9.84006416e-03   2.67480676e-02   7.27087861e-02]
#  [  3.61995731e-03   5.37249300e-01   1.97642972e-01   7.27087861e-02]]

print(softmax(scores2D))
# [[ 0.09003057  0.00242826  0.01587624  0.33333333]
#  [ 0.24472847  0.01794253  0.11731043  0.33333333]
#  [ 0.66524096  0.97962921  0.86681333  0.33333333]]

The results are different - the second one is indeed identical with the one expected in the Udacity quiz, where all columns indeed sum to 1, which is not the case with the first (wrong) result.

So, all the fuss was actually for an implementation detail - the axis argument. According to the numpy.sum documentation:

The default, axis=None, will sum all of the elements of the input array

while here we want one sum per column (i.e. summing along the rows axis), hence axis=0. For a 1-D array, the sum along the only axis and the sum of all the elements happen to be identical, hence your identical results in that case...
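
To make the effect of the axis argument concrete, here is a small illustrative sketch (the array `a` is made up for the example and is not from the quiz):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(a)               # axis=None: one scalar for the whole array
per_column = np.sum(a, axis=0)  # collapses the rows: one sum per column
per_row = np.sum(a, axis=1)     # collapses the columns: one sum per row

print(total)       # 21
print(per_column)  # [5 7 9]
print(per_row)     # [ 6 15]

# For a 1-D array there is only one axis, so both calls agree.
v = np.array([1, 2, 3])
assert np.sum(v) == np.sum(v, axis=0)
```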

The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function - see here for the justification (numeric stability, also pointed out by some other answers here).

Well, if you are just talking about multi-dimensional arrays, the first solution can be easily fixed by adding the `axis` argument to both `max` and `sum`. However, the first implementation is still better since you can easily overflow when taking `exp` – Louis Yang, Jun 21 '20 at 23:15
@LouisYang I'm not following; which is the "first" solution? Which one does *not* use `exp`? What more has been modified here other than adding an `axis` argument? – desertnaut, Jun 22 '20 at 00:25
The first solution refers to the solution from @alvas. The difference is that the suggested solution in alvas's question is missing the part of subtracting the max. This can easily cause overflow; for example, exp(1000) / (exp(1000) + exp(1001)) vs exp(-1) / (exp(-1) + exp(0)) are the same in math but the first one will overflow. – Louis Yang, Jun 22 '20 at 04:53
@LouisYang still, not sure I understand the necessity of your comment - all this has already been addressed explicitly in the answer. – desertnaut, Jun 22 '20 at 08:46
@LouisYang please do not let the (subsequent) popularity of the thread fool you, and try to imagine the context where my own answer was offered: a puzzled OP ("*both give the same result*"), and a (still!) accepted answer claiming that "*both are correct*" (well, they are *not*). The answer was never meant to be "*that's the most correct & efficient way to compute softmax in general*"; it was just meant to justify **why**, in the *specific* Udacity quiz discussed, the 2 solutions are **not** equivalent. – desertnaut, Jun 22 '20 at 13:27

ChuckFive · Answer 3 · 2016-09-19T13:35:08.363

So, this is really a comment on desertnaut's answer but I can't comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong. The problem is that he first uses a 1-dimensional input and then a 2-dimensional input. Let me show this to you.

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# desertnaut solution (copied from his answer): 
def desertnaut_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis] # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis] # ditto
    return e_x / div

Let's take desertnaut's example:

x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)

This is the output:

your_softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

desertnaut_softmax(x1)
array([[ 1.,  1.,  1.,  1.]])

softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

You can see that desertnaut's version would fail in this situation. (It would not if the input was just one-dimensional, like np.array([1, 2, 3, 6]).)

Let's now use 3 samples, since that's the reason why we use a 2-dimensional input. The following x2 is not the same as the one from desertnaut's example.

x2 = np.array([[1, 2, 3, 6],  # sample 1
               [2, 4, 5, 6],  # sample 2
               [1, 2, 3, 6]]) # sample 1 again(!)

This input consists of a batch with 3 samples. But sample one and three are essentially the same. We now expect 3 rows of softmax activations where the first should be the same as the third and also the same as our activation of x1!

your_softmax(x2)
array([[ 0.00183535,  0.00498899,  0.01356148,  0.27238963],
       [ 0.00498899,  0.03686393,  0.10020655,  0.27238963],
       [ 0.00183535,  0.00498899,  0.01356148,  0.27238963]])


desertnaut_softmax(x2)
array([[ 0.21194156,  0.10650698,  0.10650698,  0.33333333],
       [ 0.57611688,  0.78698604,  0.78698604,  0.33333333],
       [ 0.21194156,  0.10650698,  0.10650698,  0.33333333]])

softmax(x2)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047],
       [ 0.01203764,  0.08894682,  0.24178252,  0.65723302],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

I hope you can see that only my solution satisfies this.

softmax(x1) == softmax(x2)[0]
array([[ True,  True,  True,  True]], dtype=bool)

softmax(x1) == softmax(x2)[2]
array([[ True,  True,  True,  True]], dtype=bool)
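
As a side note, the two `np.newaxis` steps can be avoided with `keepdims=True`. The sketch below is an illustrative variant, not the code from this answer; the name `softmax_rows` is made up, and a 2-D input of shape (n_samples, n_classes) is assumed.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax for a 2-D array of shape (n_samples, n_classes)."""
    z = np.asarray(z, dtype=float)
    # keepdims=True keeps the reduced axis as size 1, so broadcasting works
    # without the explicit [:, np.newaxis] step.
    shifted = z - np.max(z, axis=1, keepdims=True)
    e_z = np.exp(shifted)
    return e_z / np.sum(e_z, axis=1, keepdims=True)

x1 = np.array([[1, 2, 3, 6]])
x2 = np.array([[1, 2, 3, 6],
               [2, 4, 5, 6],
               [1, 2, 3, 6]])

out = softmax_rows(x2)
# Each sample is normalised independently of the rest of the batch.
assert np.allclose(out[0], softmax_rows(x1)[0])
assert np.allclose(out[0], out[2])
assert np.allclose(out.sum(axis=1), 1.0)
```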

Additionally, here are the results of TensorFlow's softmax implementation:

import tensorflow as tf
import numpy as np
batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})

And the result:

array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037045],
       [ 0.01203764,  0.08894681,  0.24178252,  0.657233  ],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037045]], dtype=float32)

np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True) reaches the same result as your softmax function. the steps with s are unnecessary. — PabTorre, Nov 21 '16 at 21:08
In place of `s = s[:, np.newaxis]`, `s = s.reshape(z.shape[0],1)` should also work. – Debashish, Dec 15 '17 at 09:25
so many incorrect/inefficient solutions on this page. Do yourselves a favour and use PabTorre's — tea_pea, Dec 17 '18 at 11:09
@PabTorre did you mean axis=-1? axis=1 won't work for single dimensional input — Nihar Karve, May 04 '20 at 09:18
The "`s`" operations are required to ensure the softmax function is numerically stable. It may be fine for school projects, but it is invaluable for building models in production. — rayryeng, Dec 01 '20 at 06:05
@tea_pea the question was never "*what is the most efficient softmax implementation?*" in the first place; it was if & why the solution by the OP is actually different from the one suggested by the course authors, and which one is correct - nothing more. — desertnaut, Aug 14 '23 at 01:10

score 44 · Answer 4 · answered Feb 08 '16 at 18:13

I would say that while both are correct mathematically, implementation-wise, the first one is better. When computing softmax, the intermediate values may become very large. Dividing two large numbers can be numerically unstable. These notes (from Stanford) mention a normalization trick which is essentially what you are doing.

answered Feb 08 '16 at 18:13

Shagun Sodhani

The effects of catastrophic cancellation cannot be underestimated. – Cesar Jun 07 '16 at 17:49

score 29 · Answer 5 · answered Jul 28 '17 at 07:25

sklearn also offers an implementation of softmax:

from sklearn.utils.extmath import softmax
import numpy as np

x = np.array([[ 0.50839931,  0.49767588,  0.51260159]])
softmax(x)

# output
array([[ 0.3340521 ,  0.33048906,  0.33545884]])

answered Jul 28 '17 at 07:25

Roman Orac

How exactly this answers the specific question, which is about the *implementation* itself and not about the availability in some third-party library? – desertnaut Jul 16 '18 at 15:04
I was looking for a third party implementation to verify the results of both approaches. This is the way this comment helps. – Eugenio F. Martinez Pacheco Jul 30 '18 at 08:02

score 18 · Answer 6 · edited Aug 06 '19 at 11:35

From a mathematical point of view both sides are equal.

And you can easily prove this. Let m = max(x). Now your function softmax returns a vector whose i-th coordinate is equal to

e^(x_i - m) / sum_j e^(x_j - m) = (e^(x_i) / e^m) / (sum_j e^(x_j) / e^m) = e^(x_i) / sum_j e^(x_j)

Notice that this works for any m, because for all (even complex) numbers e^m != 0.

From a computational complexity point of view they are also equivalent: both run in O(n) time, where n is the size of the vector.
From a numerical stability point of view, the first solution is preferred, because e^x grows very fast and overflows even for moderately large values of x (above roughly 709 for float64). Subtracting the maximum value gets rid of this overflow. To experience this in practice, try feeding x = np.array([1000, 5]) into both of your functions. The first will return the correct probabilities, the second will overflow and return nan.
Your solution works only for vectors (the Udacity quiz wants you to calculate it for matrices as well). In order to fix it you need to use sum(axis=0).
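
The overflow experiment with x = np.array([1000, 5]) can be sketched as follows. The function names `softmax_naive` and `softmax_stable` are chosen here for illustration; the warnings are silenced with `np.errstate` only to keep the output readable.

```python
import numpy as np

def softmax_naive(x):
    e_x = np.exp(x)
    return e_x / np.sum(e_x, axis=0)

def softmax_stable(x):
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x, axis=0)

x = np.array([1000.0, 5.0])

# exp(1000) exceeds the float64 range, so the naive version computes inf / inf.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)

stable = softmax_stable(x)

print(naive)   # first entry is nan
print(stable)  # approximately [1. 0.]
```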

When is it useful to be able to calculate softmax on a matrix rather than on a vector? i.e. what models output a matrix? Can it be even more dimensional? – mrgloom, Dec 14 '17 at 04:40
do you mean the *first solution* in "from numerical stability point of view, the second solution is preferred..."? — Dataman, Mar 02 '18 at 13:08

Nolan Conaway · Answer 7 · 2019-01-03T17:26:54.643

EDIT. As of version 1.2.0, scipy includes softmax as a special function:

https://scipy.github.io/devdocs/generated/scipy.special.softmax.html

I wrote a function applying the softmax over any axis:

import numpy as np

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats. 
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the 
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter, 
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p

Subtracting the max, as other users described, is good practice. I wrote a detailed post about it here.
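
To illustrate what a multiplier such as theta does, here is a minimal 1-D sketch. The helper `softmax_theta` is hypothetical and much simpler than the function above; it only assumes that the scores are multiplied by theta before exponentiation, as the docstring describes.

```python
import numpy as np

def softmax_theta(x, theta=1.0):
    """1-D softmax of theta * x (hypothetical helper, not the answer's function)."""
    y = np.asarray(x, dtype=float) * float(theta)
    e_y = np.exp(y - np.max(y))
    return e_y / e_y.sum()

scores = [3.0, 1.0, 0.2]

flat = softmax_theta(scores, theta=0.1)    # small theta: closer to uniform
base = softmax_theta(scores, theta=1.0)    # plain softmax
sharp = softmax_theta(scores, theta=10.0)  # large theta: closer to one-hot

print(flat, base, sharp)
assert flat.max() < base.max() < sharp.max()
```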

score 10 · Answer 8 · edited Jun 29 '16 at 20:09

Here you can find out why they used - max.

From there:

"When you’re writing code for computing the Softmax function in practice, the intermediate terms may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick."

edited Jun 29 '16 at 20:09

Tonechas

answered Jun 29 '16 at 19:09

Sadegh Salehi

score 5 · Answer 9 · answered Feb 15 '18 at 19:38

To offer an alternative solution, consider the cases where your arguments are extremely large in magnitude such that exp(x) would underflow (in the negative case) or overflow (in the positive case). Here you want to remain in log space as long as possible, exponentiating only at the end where you can trust the result will be well-behaved.

import scipy.special as sc
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    return np.exp(x - sc.logsumexp(x))
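
For readers without scipy, the same idea can be sketched with numpy alone. This is an illustrative assumption-laden variant (1-D input, hand-written log-sum-exp, made-up name `log_softmax`), not the answer's code:

```python
import numpy as np

def log_softmax(x):
    """log(softmax(x)) for a 1-D array, computed without leaving log space."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    # log-sum-exp with the max factored out:
    # log(sum(exp(x))) = m + log(sum(exp(x - m)))
    lse = m + np.log(np.sum(np.exp(x - m)))
    return x - lse

x = np.array([1000.0, 1001.0, 999.0])
log_p = log_softmax(x)

print(log_p)          # finite log-probabilities
print(np.exp(log_p))  # exponentiate only at the end
```

Keeping the log-probabilities is often useful in its own right, for example when they feed directly into a cross-entropy loss.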

answered Feb 15 '18 at 19:38

PikalaxALT

To make it equal to the poster's code, you need to add `axis=0` as an argument to `logsumexp`. – Björn Lindqvist Apr 26 '18 at 12:01
Alternatively, one could unpack extra args to pass to logsumexp. – PikalaxALT Apr 27 '18 at 13:40

Rub · Answer 10 · 2020-12-06T11:43:33.827

I was curious to see the performance difference between these:

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmaxv2(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def softmaxv3(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x, axis=0)

def softmaxv4(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)), axis=0)



x=[10,10,18,9,15,3,1,2,1,10,10,10,8,15]

Using

print("----- softmax")
%timeit  a=softmax(x)
print("----- softmaxv2")
%timeit  a=softmaxv2(x)
print("----- softmaxv3")
%timeit  a=softmaxv3(x)
print("----- softmaxv4")
%timeit  a=softmaxv4(x)

Increasing the values inside x (+100 +200 +500...) I get consistently better results with the original numpy version (here is just one test)

----- softmax
The slowest run took 8.07 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 17.8 µs per loop
----- softmaxv2
The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv3
The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv4
10000 loops, best of 3: 23 µs per loop

Until.... the values inside x reach ~800, then I get

----- softmax
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: overflow encountered in exp
  after removing the cwd from sys.path.
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in true_divide
  after removing the cwd from sys.path.
The slowest run took 18.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv2
The slowest run took 4.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.8 µs per loop
----- softmaxv3
The slowest run took 19.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv4
The slowest run took 16.82 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.7 µs per loop

As some said, your version is more numerically stable for large numbers. For small numbers, the plain version could be slightly faster.
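
Outside IPython, a comparable measurement can be sketched with the standard `timeit` module. The function names below are chosen for illustration, the input is converted to a float array up front so list conversion is not part of the timing, and the absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

def softmax_plain(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmax_shifted(x):
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x, axis=0)

x = np.array([10, 10, 18, 9, 15, 3, 1, 2, 1, 10, 10, 10, 8, 15], dtype=float)

# Both variants must agree before their speed is compared.
assert np.allclose(softmax_plain(x), softmax_shifted(x))

for fn in (softmax_plain, softmax_shifted):
    # best of 5 repeats, 2000 calls each; report microseconds per call
    best = min(timeit.repeat(lambda: fn(x), number=2000, repeat=5))
    print(f"{fn.__name__}: {best / 2000 * 1e6:.2f} µs per call")
```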

score 4 · Answer 11 · answered Sep 06 '16 at 20:08

A more concise version is:

def softmax(x):
    return np.exp(x) / np.exp(x).sum(axis=0)

answered Sep 06 '16 at 20:08

Pimin Konstantin Kefaloukos

this can run into arithmetic overflow – minhle_r7 Sep 18 '16 at 14:40

score 3 · Answer 12 · edited Mar 10 '19 at 12:17

I would suggest this:

def softmax(z):
    z_norm=np.exp(z-np.max(z,axis=0,keepdims=True))
    return(np.divide(z_norm,np.sum(z_norm,axis=0,keepdims=True)))

It will work for a single sample (stochastic) as well as for a batch.
For more detail see : https://medium.com/@ravish1729/analysis-of-softmax-function-ad058d6a564d

edited Mar 10 '19 at 12:17

Hossein

answered Aug 18 '18 at 09:44

Ravish Kumar Sharma

score 3 · Answer 13 · answered Jan 20 '19 at 21:13

I needed something compatible with the output of a dense layer from Tensorflow.

The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases:

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x)) # same code
    return e_x / e_x.sum(axis=axis, keepdims=True)

Results:

logits = np.asarray([
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921], # 1
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921]  # 2
])

print(softmax(logits))

#[[0.2492037  0.24858153 0.25393605 0.24827873]
# [0.2492037  0.24858153 0.25393605 0.24827873]]
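
The `keepdims=True` argument matters here. The sketch below is illustrative only: it uses made-up logits with 2 samples and 4 classes and a per-row max (a small variation on the code above) to show what happens when the reduced axis is dropped.

```python
import numpy as np

logits = np.array([[1.0, 2.0, 3.0, 6.0],
                   [2.0, 4.0, 5.0, 6.0]])   # shape (2, 4): 2 samples, 4 classes

e_x = np.exp(logits - np.max(logits, axis=-1, keepdims=True))

# Without keepdims the row sums have shape (2,), which does not broadcast
# against (2, 4).
try:
    e_x / e_x.sum(axis=-1)
    broadcast_ok = True
except ValueError:
    broadcast_ok = False

# With keepdims=True the sums have shape (2, 1) and each row is divided
# by its own sum.
probs = e_x / e_x.sum(axis=-1, keepdims=True)

print(broadcast_ok)        # False
print(probs.sum(axis=-1))  # each row sums to 1
```

For a square input the division without keepdims would not raise, it would silently divide by the wrong sums, which is harder to notice.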

Ref: Tensorflow softmax

answered Jan 20 '19 at 21:13

Lucas Casagrande

Just keep in mind that the answer refers to a *very specific setting* described in the question; it was never meant to be 'how to compute the softmax in general under any circumstances, or in the data format of your liking'... – desertnaut Jan 20 '19 at 22:40
I see, I've put this here because the question refers to "Udacity's deep learning class" and it would not work if you are using Tensorflow to build your model. Your solution is cool and clean but it only works in a very specific scenario. Thanks anyway. – Lucas Casagrande Jan 20 '19 at 23:22

score 1 · Answer 14 · answered Nov 06 '16 at 15:52

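To maintain numerical stability, max(x) should be subtracted. The following is the code for the softmax function:

import numpy as np

def softmax(x):
    # work on a float copy so the caller's array is not modified in place
    x = np.array(x, dtype=float)
    if len(x.shape) > 1:
        # 2-D input: one softmax per row
        tmp = np.max(x, axis = 1)
        x -= tmp.reshape((x.shape[0], 1))
        x = np.exp(x)
        tmp = np.sum(x, axis = 1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        # 1-D input: a single softmax over the whole vector
        tmp = np.max(x)
        x -= tmp
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp

    return x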

score 1 · Answer 15 · answered Dec 15 '17 at 10:04

Already answered in much detail in above answers. max is subtracted to avoid overflow. I am adding here one more implementation in python3.

import numpy as np
def softmax(x):
    mx = np.amax(x,axis=1,keepdims = True)
    x_exp = np.exp(x - mx)
    x_sum = np.sum(x_exp, axis = 1, keepdims = True)
    res = x_exp / x_sum
    return res

x = np.array([[3,2,4],[4,5,6]])
print(softmax(x))

score 1 · Answer 16 · answered Oct 17 '18 at 04:25

Everybody seems to post their solution so I'll post mine:

def softmax(x):
    e_x = np.exp(x.T - np.max(x, axis = -1))
    return (e_x / e_x.sum(axis=0)).T

I get the exact same results as the one imported from sklearn:

from sklearn.utils.extmath import softmax

score 1 · Answer 17 · answered Jul 13 '19 at 14:38

import tensorflow as tf
import numpy as np

def softmax(x):
    return (np.exp(x).T / np.exp(x).sum(axis=-1)).T

logits = np.array([[1, 2, 3], [3, 10, 1], [1, 2, 5], [4, 6.5, 1.2], [3, 6, 1]])

sess = tf.Session()
print(softmax(logits))
print(sess.run(tf.nn.softmax(logits)))
sess.close()

answered Jul 13 '19 at 14:38

King

Welcome to SO. An explanation of how your code answers the question is always helpful. – Nick Jul 13 '19 at 14:56

score 1 · Answer 18 · answered Oct 19 '19 at 12:48

Based on all the responses and CS231n notes, allow me to summarise:

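import numpy as np

def softmax(x, axis):
    # subtract the max on a new array so the caller's x is left untouched
    x = x - np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(x)  # computed once and reused
    return e_x / e_x.sum(axis=axis, keepdims=True)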

Usage:

x = np.array([[1, 0, 2,-1],
              [2, 4, 6, 8], 
              [3, 2, 1, 0]])
softmax(x, axis=1).round(2)

Output:

array([[0.24, 0.09, 0.64, 0.03],
       [0.  , 0.02, 0.12, 0.86],
       [0.64, 0.24, 0.09, 0.03]])

score 1 · Answer 19 · answered Apr 03 '20 at 16:50

The softmax function is an activation function that turns numbers into probabilities which sum to one. It outputs a vector that represents a probability distribution over a list of outcomes. It is also a core element used in deep learning classification tasks.

The softmax function is used when we have multiple classes.

It is useful for finding the class which has the maximum probability.

The softmax function is typically used in the output layer, where we are trying to obtain the probabilities that define the class of each input.

Each output ranges from 0 to 1.

The softmax function turns logits [2.0, 1.0, 0.1] into probabilities of approximately [0.7, 0.2, 0.1], and the probabilities sum to 1. Logits are the raw scores output by the last layer of a neural network, before activation takes place. To understand the softmax function, we must look at the output of the (n-1)th layer.

The softmax function is, in fact, a smooth approximation of the arg max function. That means that it does not return the largest value from the input, but concentrates the probability mass on the position of the largest value.

For example:

Before softmax

X = [13, 31, 5]

After softmax

array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])
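
The relation between softmax and arg max in this example can be sketched as follows. This is an illustration added for comparison; the one-hot vector is built by hand and is not something softmax itself returns.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

x = np.array([13.0, 31.0, 5.0])

probs = softmax(x)
# One-hot vector marking the arg max position, for comparison.
one_hot = np.zeros_like(x)
one_hot[np.argmax(x)] = 1.0

print(probs)    # close to one_hot because 31 dominates the other scores
print(one_hot)  # [0. 1. 0.]

# With closer scores the softmax output stays "soft".
print(softmax(np.array([2.0, 1.0, 0.1])))
```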

Code:

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

score 1 · Answer 20 · answered Aug 26 '20 at 04:35

This also works with np.reshape.

def softmax(scores):
    """
    Compute softmax scores given the raw output from the model

    :param scores: raw scores from the model (N, num_classes)
    :return:
        prob: softmax probabilities (N, num_classes)
    """
    # subtract the largest number in each row, see
    # https://jamesmccaffrey.wordpress.com/2016/03/04/the-max-trick-when-computing-softmax/
    exponential = np.exp(
        scores - np.max(scores, axis=1).reshape(-1, 1)
    )
    prob = exponential / exponential.sum(axis=1).reshape(-1, 1)

    return prob

score 0 · Answer 21 · answered Jul 16 '17 at 02:00

I would like to add a little more understanding of the problem. Subtracting the max of the array is correct here. But if you run the code in the other post, you will find it does not give you the right answer when the array is 2-D or higher-dimensional.

Here I give you some suggestions:

To get the max, take it along the x-axis; you will get a 1-D array.
Reshape your max array so it broadcasts against the original shape, and subtract it.
Use np.exp to get the exponential values.
Use np.sum along the same axis.
Divide to get the final results.

Follow these steps and you will get the correct answer using vectorization. Since it is related to college homework, I cannot post the exact code here, but I am happy to give more suggestions if you don't understand.

It is not related to any college homework, only to an ungraded practice quiz in a non-accredited course, where the correct answer is provided in the next step... — desertnaut, Jul 27 '17 at 17:10

score 0 · Answer 22 · edited Aug 27 '20 at 20:26

The purpose of the softmax function is to preserve the ratio of the vectors as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/- 1 (tanh) or from 0 to 1 (logistical)). This is because it preserves more information about the rate of change at the end-points and thus is more applicable to neural nets with 1-of-N Output Encoding (i.e. if we squashed the end-points it would be harder to differentiate the 1-of-N output class because we can't tell which one is the "biggest" or "smallest" because they got squished.); also it makes the total output sum to 1, and the clear winner will be closer to 1 while other numbers that are close to each other will sum to 1/p, where p is the number of output neurons with similar values.

The purpose of subtracting the max value from the vector is that when you compute the e^y exponents you may get very large values that overflow the float range, giving inf or nan instead of probabilities, which is not the case in this example. Subtracting the max makes every exponent zero or negative; the values shrink rapidly, but the same factor e^(-max) appears in both the numerator and the denominator, so the ratios are unchanged, and the poster's implementation gives the same answer for the 1-D example.

The answer supplied by Udacity is HORRIBLY inefficient. The first thing we need to do is calculate e^y_j for all vector components, KEEP THOSE VALUES, then sum them up, and divide. Where Udacity messed up is they calculate e^y_j TWICE!!! Here is the correct answer:

def softmax(y):
    e_to_the_y_j = np.exp(y)
    return e_to_the_y_j / np.sum(e_to_the_y_j, axis=0)

kingspp · Answer 23 · 2018-10-03T15:22:33.367

Goal was to achieve similar results using Numpy and Tensorflow. The only change from original answer is axis parameter for np.sum api.

Initial approach : axis=0 - This however does not provide intended results when dimensions are N.

Modified approach: axis=len(e_x.shape)-1 - Always sum on the last dimension. This provides similar results as tensorflow's softmax function.

def softmax_fn(input_array):
    """
    | **@author**: Prathyush SP
    |
    | Calculate Softmax for a given array
    :param input_array: Input Array
    :return: Softmax Score
    """
    e_x = np.exp(input_array - np.max(input_array))
    # keepdims=True keeps the summed axis so the division broadcasts per row
    return e_x / e_x.sum(axis=len(e_x.shape)-1, keepdims=True)

mrgloom · Answer 24 · 2019-03-17T15:51:54.727

Here is a generalized solution using numpy, with a comparison for correctness against tensorflow and scipy:

Data preparation:

import numpy as np

np.random.seed(2019)

batch_size = 1
n_items = 3
n_classes = 2
logits_np = np.random.rand(batch_size,n_items,n_classes).astype(np.float32)
print('logits_np.shape', logits_np.shape)
print('logits_np:')
print(logits_np)

Output:

logits_np.shape (1, 3, 2)
logits_np:
[[[0.9034822  0.3930805 ]
  [0.62397    0.6378774 ]
  [0.88049906 0.299172  ]]]

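Softmax using tensorflow (this is the TensorFlow 1.x API; in TensorFlow 2.x `tf.Session` is only available as `tf.compat.v1.Session`):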

import tensorflow as tf

logits_tf = tf.convert_to_tensor(logits_np, np.float32)
scores_tf = tf.nn.softmax(logits_tf, axis=-1)

print('logits_tf.shape', logits_tf.shape)
print('scores_tf.shape', scores_tf.shape)

with tf.Session() as sess:
    scores_np = sess.run(scores_tf)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np,axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

logits_tf.shape (1, 3, 2)
scores_tf.shape (1, 3, 2)
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using scipy:

from scipy.special import softmax

scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.6413727  0.35862732]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using numpy (https://nolanbconaway.github.io/blog/2017/softmax-numpy) :

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats.
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter,
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p


scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.49652317 0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

score 0 · Answer 25 · answered Sep 14 '20 at 19:34

This generalizes and assumes you are normalizing the trailing dimension.

def softmax(x: np.ndarray) -> np.ndarray:
    e_x = np.exp(x - np.max(x, axis=-1)[..., None])
    e_y = e_x.sum(axis=-1)[..., None]
    return e_x / e_y
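
The `[..., None]` indexing in this answer appends a size-1 trailing axis so that the max and the sum broadcast against `x`. A minimal sketch, using made-up logits of shape (1, 3, 2), suggests it is interchangeable with numpy's `keepdims=True` here:

```python
import numpy as np

# Made-up logits with shape (1, 3, 2); the values are for illustration only.
x = np.array([[[0.9, 0.4],
               [0.6, 0.6],
               [0.9, 0.3]]])

# Indexing with [..., None] appends a trailing axis of length 1 ...
max_indexed = np.max(x, axis=-1)[..., None]
# ... which gives the same array as asking numpy to keep the reduced axis.
max_kept = np.max(x, axis=-1, keepdims=True)

print(max_indexed.shape, max_kept.shape)
print(np.array_equal(max_indexed, max_kept))
```

Either spelling works; `keepdims=True` also handles reductions over axes other than the last one without extra indexing.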

user18764

score -1 · Answer 26 · answered May 05 '22 at 06:05

I used these three simple lines:

x_exp=np.exp(x)
x_sum=np.sum(x_exp, axis = 1, keepdims = True)
s=x_exp / x_sum
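
These three lines assume `x` is a 2-D array with one sample per row, since the sum runs over axis 1. A hedged sketch of how they might be wrapped follows; the function name `softmax_rows` and the sample input are hypothetical. Like the original lines, it does not subtract the max, so very large inputs can still overflow.

```python
import numpy as np

def softmax_rows(x):
    # Hypothetical wrapper around the three lines; expects a 2-D array
    # with one sample per row and one class per column.
    x = np.asarray(x, dtype=float)
    if x.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_samples, n_classes)")
    x_exp = np.exp(x)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)
    return x_exp / x_sum

x = np.array([[3.0, 1.0, 0.2],
              [0.0, 0.0, 0.0]])
s = softmax_rows(x)
print(s.shape)        # (2, 3)
print(s.sum(axis=1))  # each row sums to 1
```

A row of equal scores, such as the second one, comes out as a uniform distribution over the classes.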

Ali Ganjbakhsh
