Possible solution
IMO you should use an almost standard categorical_crossentropy and output logits from the network, which are mapped in the loss function to values in [0, 1, 2, 3, 4] using the argmax operation (the same procedure is applied to the one-hot-encoded labels; see the last part of this answer for an example).
Using a weighted crossentropy you can penalize incorrectness differently depending on how far the predicted value is from the correct one, as you indicated in the comments.
All you have to do is take the absolute value of the difference between the correct and predicted values and multiply the loss by it; see the example below:
Let's map each encoding to its ordinal value (once the labels are one-hot-encoded, this is exactly what argmax recovers, as seen later); a short conversion sketch follows the list:
[0, 0, 0, 0] -> 0
[1, 0, 0, 0] -> 1
[1, 1, 0, 0] -> 2
[1, 1, 1, 0] -> 3
[1, 1, 1, 1] -> 4
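For reference, here is a minimal NumPy sketch of that conversion (the array contents are just the example encodings from the list above; to_categorical produces the one-hot labels used in the rest of this answer):

import numpy as np
from keras.utils import to_categorical

# Unary-style encodings from the list above
unary = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
])

values = unary.sum(axis=1)                       # counts the ones -> [0, 1, 2, 3, 4]
one_hot = to_categorical(values, num_classes=5)  # standard one-hot labels; argmax recovers the value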
And let's make some random targets and model predictions to see the essence:
   correct  predicted with Softmax
0        0                       4
1        4                       3
2        3                       3
3        1                       4
4        3                       1
5        1                       0
Now, when you subtract the predicted from the correct values and take the absolute value, you essentially get a weighting column like this (a quick NumPy check follows the table):
   weights
0        4
1        1
2        0
3        3
4        2
5        1
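In NumPy terms, the weight column is just the element-wise absolute difference of the two columns above (using the values from the example tables):

import numpy as np

correct = np.array([0, 4, 3, 1, 3, 1])
predicted = np.array([4, 3, 3, 4, 1, 0])
weights = np.abs(correct - predicted)  # -> [4, 1, 0, 3, 2, 1]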
As you can see, a prediction of 0 while the true target is 4 will be weighted 4 times more than a prediction of 3 with the same target of 4, and that is essentially what you want, IIUC.
As Daniel Möller indicates in his answer, I would also advise you to create a custom loss function, but a little simpler one:
import tensorflow as tf
# Output logits from your network, not the values after softmax activation
def weighted_crossentropy(labels, logits):
    # Weight each example by the absolute distance between predicted and true class index
    return tf.losses.softmax_cross_entropy(
        labels,
        logits,
        weights=tf.abs(tf.argmax(logits, axis=1) - tf.argmax(labels, axis=1)),
    )
And you should use this loss in your model.compile as well; I think there is no need to reiterate points already made.
Disadvantages of this solution:
- For correct predictions the gradient will be equal to zero, which means it will be harder for the network to strengthen connections (maximize/minimize logits towards +inf/-inf)
- The above can be mitigated by adding random noise to each weight (it would act as additional regularization), which might help.
- A better solution might be to exclude the case where the prediction equals the target from the weighting (or set its weight to 1); it would not add randomization to the network's optimization. See the sketch after this list.
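Here is a minimal sketch of that last variant, assuming the same TF1-style tf.losses API as above (the function name weighted_crossentropy_min1 is just illustrative):

import tensorflow as tf

def weighted_crossentropy_min1(labels, logits):
    # Absolute class-index distance, but never below 1, so correct
    # predictions still produce a non-zero gradient
    distance = tf.abs(tf.argmax(logits, axis=1) - tf.argmax(labels, axis=1))
    return tf.losses.softmax_cross_entropy(labels, logits, weights=tf.maximum(distance, 1))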
Advantages of this solution:
- You can easily add weighting for an imbalanced dataset (e.g. certain classes occurring more often); see the sketch after this list
- Maps cleanly to existing API
- Conceptually simple and stays in the classification realm
- Your model cannot predict nonexistent classification values: in your multi-target case it could predict e.g. [1, 0, 1, 0], which cannot happen with the approach above. Fewer degrees of freedom should help it train and remove the chance of nonsensical (if I got your problem description right) predictions.
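For the imbalanced-dataset point, here is a minimal sketch combining the distance weighting with per-class weights (the class_weights values below are hypothetical and would come from your class frequencies):

import tensorflow as tf

# Hypothetical per-class weights for classes 0..4 (e.g. inverse class frequencies)
class_weights = tf.constant([1.0, 2.0, 1.0, 0.5, 1.5])

def imbalance_weighted_crossentropy(labels, logits):
    true_class = tf.argmax(labels, axis=1)
    distance = tf.cast(tf.abs(tf.argmax(logits, axis=1) - true_class), tf.float32)
    # Scale the distance-based weight by the weight of the true class
    weights = distance * tf.gather(class_weights, true_class)
    return tf.losses.softmax_cross_entropy(labels, logits, weights=weights)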
Additional discussion is provided in the chat room in the comments.
Example network with custom loss
Here is an example network with the custom loss function defined above.
Your labels have to be one-hot-encoded in order for it to work correctly.
import keras
import numpy as np
import tensorflow as tf
# You could actually make it a lambda function as well
def weighted_crossentropy(labels, logits):
    # Weight each example by the absolute distance between predicted and true class index
    return tf.losses.softmax_cross_entropy(
        labels,
        logits,
        weights=tf.abs(tf.argmax(logits, axis=1) - tf.argmax(labels, axis=1)),
    )
model = keras.models.Sequential(
    [
        keras.layers.Dense(32, input_shape=(10,)),
        keras.layers.Activation("relu"),
        keras.layers.Dense(10),
        keras.layers.Activation("relu"),
        # Final layer outputs 5 raw logits (no softmax); the loss applies softmax internally
        keras.layers.Dense(5),
    ]
)
# Random example data: 32 samples with 10 features, one-hot labels for 5 classes
data = np.random.random((32, 10))
labels = keras.utils.to_categorical(np.random.randint(5, size=(32, 1)))
model.compile(optimizer="rmsprop", loss=weighted_crossentropy)
model.fit(data, labels, batch_size=32)
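Since the network outputs raw logits, the class values in [0, 1, 2, 3, 4] can be recovered with argmax at prediction time (a short usage sketch with the data defined above):

logits = model.predict(data)
predicted_values = np.argmax(logits, axis=1)  # values in [0, 1, 2, 3, 4]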