Suggested Solution
Reusing the code from the repository you shared, here are some suggested modifications to train a classifier alongside your generator and discriminator (their architectures and other losses are left untouched):
# Keras 1.x imports, matching the original repository; the remaining helpers and constants
# (get_data, generator_model, discriminator_model, combine_images, the custom losses,
#  IN_CH, img_cols, img_rows, etc.) are assumed to come from that repository.
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import Input, merge
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.optimizers import Adagrad
import numpy as np
import cv2
def lenet_classifier_model(nb_classes):
    # Snipped by Fabien Tanc - https://www.kaggle.com/ftence/keras-cnn-inspired-by-lenet-5
    # Replace with your favorite classifier...
    # Note: in_shape should be your input image shape, e.g. (IN_CH, img_cols, img_rows).
    model = Sequential()
    model.add(Convolution2D(12, 5, 5, activation='relu', input_shape=in_shape, init='he_normal'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(25, 5, 5, activation='relu', init='he_normal'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(180, activation='relu', init='he_normal'))
    model.add(Dropout(0.5))
    model.add(Dense(100, activation='relu', init='he_normal'))
    model.add(Dropout(0.5))
    model.add(Dense(nb_classes, activation='softmax', init='he_normal'))
    return model
def generator_containing_discriminator_and_classifier(generator, discriminator, classifier):
    # Combined model used for G's updates only: D and the classifier are frozen inside it.
    inputs = Input((IN_CH, img_cols, img_rows))
    x_generator = generator(inputs)
    merged = merge([inputs, x_generator], mode='concat', concat_axis=1)
    discriminator.trainable = False
    x_discriminator = discriminator(merged)
    classifier.trainable = False
    x_classifier = classifier(x_generator)
    model = Model(input=inputs, output=[x_generator, x_discriminator, x_classifier])
    return model
def train(BATCH_SIZE):
    (X_train, Y_train, LABEL_train) = get_data('train')  # replace with your data here
    X_train = (X_train.astype(np.float32) - 127.5) / 127.5
    Y_train = (Y_train.astype(np.float32) - 127.5) / 127.5
    discriminator = discriminator_model()
    generator = generator_model()
    classifier = lenet_classifier_model(6)
    generator.summary()
    discriminator_and_classifier_on_generator = generator_containing_discriminator_and_classifier(
        generator, discriminator, classifier)
    d_optim = Adagrad(lr=0.005)
    g_optim = Adagrad(lr=0.005)
    generator.compile(loss='mse', optimizer="rmsprop")
    discriminator_and_classifier_on_generator.compile(
        loss=[generator_l1_loss, discriminator_on_generator_loss, "categorical_crossentropy"],
        optimizer="rmsprop")
    discriminator.trainable = True
    discriminator.compile(loss=discriminator_loss, optimizer="rmsprop")
    classifier.trainable = True
    classifier.compile(loss="categorical_crossentropy", optimizer="rmsprop")
    for epoch in range(100):
        print("Epoch is", epoch)
        print("Number of batches", int(X_train.shape[0] / BATCH_SIZE))
        for index in range(int(X_train.shape[0] / BATCH_SIZE)):
            image_batch = Y_train[index * BATCH_SIZE:(index + 1) * BATCH_SIZE]
            label_batch = LABEL_train[index * BATCH_SIZE:(index + 1) * BATCH_SIZE]  # replace with your data here
            generated_images = generator.predict(X_train[index * BATCH_SIZE:(index + 1) * BATCH_SIZE])
            if index % 20 == 0:
                image = combine_images(generated_images)
                image = image * 127.5 + 127.5
                image = np.swapaxes(image, 0, 2)
                cv2.imwrite(str(epoch) + "_" + str(index) + ".png", image)
                # Image.fromarray(image.astype(np.uint8)).save(str(epoch)+"_"+str(index)+".png")
            # Training D:
            real_pairs = np.concatenate((X_train[index * BATCH_SIZE:(index + 1) * BATCH_SIZE, :, :, :], image_batch),
                                        axis=1)
            fake_pairs = np.concatenate(
                (X_train[index * BATCH_SIZE:(index + 1) * BATCH_SIZE, :, :, :], generated_images), axis=1)
            X = np.concatenate((real_pairs, fake_pairs))
            # Placeholder targets shaped like D's patch output; adapt them if your custom
            # discriminator_loss expects explicit real/fake labels instead:
            y = np.zeros((2 * BATCH_SIZE, 1, 64, 64))  # [1] * BATCH_SIZE + [0] * BATCH_SIZE
            d_loss = discriminator.train_on_batch(X, y)
            print("batch %d d_loss : %f" % (index, d_loss))
            discriminator.trainable = False
            # Training C:
            c_loss = classifier.train_on_batch(image_batch, label_batch)
            print("batch %d c_loss : %f" % (index, c_loss))
            classifier.trainable = False
            # Train G:
            g_loss = discriminator_and_classifier_on_generator.train_on_batch(
                X_train[index * BATCH_SIZE:(index + 1) * BATCH_SIZE, :, :, :], 
                [image_batch, np.ones((BATCH_SIZE, 1, 64, 64)), label_batch])
            discriminator.trainable = True
            classifier.trainable = True
            print("batch %d g_loss : %f" % (index, g_loss[1]))
            if index % 20 == 0:
                generator.save_weights('generator', True)
                discriminator.save_weights('discriminator', True)
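One detail to double-check: the classifier is compiled with categorical_crossentropy, so label_batch must contain one-hot vectors. If your LABEL_train holds integer class indices instead (an assumption about your data), a minimal conversion sketch would be:
from keras.utils.np_utils import to_categorical  # keras.utils.to_categorical in Keras 2.x

# Assumption: LABEL_train holds integer class indices in [0, 5];
# categorical_crossentropy expects one-hot vectors of length nb_classes.
LABEL_train = to_categorical(LABEL_train, 6)  # shape (N, 6)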
Theoretical Details
I believe there are some misunderstandings regarding how conditional GANs work and what the discriminator's role is in such schemes.
Role of the Discriminator
In the min-max game that is GAN training [4], the discriminator D plays against the generator G (the network you actually care about) so that, under D's scrutiny, G becomes better at outputting realistic results.
For that, D is trained to tell real samples apart from the samples produced by G, while G is trained to fool D by generating realistic results, i.e. results following the target distribution.
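For reference, this min-max game corresponds to the value function introduced in [4]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$

D tries to maximize V (scoring real samples high and generated ones low), while G tries to minimize it.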
Note: in the case of conditional GANs, i.e. GANs mapping an input sample from one domain A (e.g. real picture) to another domain B (e.g. sketch), D is usually fed with the pairs of samples stacked together and has to discriminate "real" pairs (input sample from A + corresponding target sample from B) from "fake" pairs (input sample from A + corresponding output from G) [1, 2].
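To make this pairing concrete, here is a minimal numpy sketch (the channels-first shapes are illustrative assumptions, chosen to match the code above):
import numpy as np

input_img  = np.zeros((1, 3, 64, 64), dtype=np.float32)  # conditioning sample from domain A
target_img = np.zeros((1, 3, 64, 64), dtype=np.float32)  # real target from domain B
fake_img   = np.zeros((1, 3, 64, 64), dtype=np.float32)  # output of G for input_img

# The conditional discriminator sees input and target stacked along the channel axis:
real_pair = np.concatenate((input_img, target_img), axis=1)  # (1, 6, 64, 64), should be judged "real"
fake_pair = np.concatenate((input_img, fake_img), axis=1)    # (1, 6, 64, 64), should be judged "fake"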
Training a conditional generator against D (as opposed to training G alone with an L1/L2 loss only, e.g. as a denoising auto-encoder) improves the sampling capabilities of G, forcing it to output crisp, realistic results instead of averaging over the target distribution.
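In [1] for instance, the full generator objective combines the adversarial term with an L1 reconstruction term weighted by a factor lambda:

$$G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)$$

which is essentially what the three-loss compilation in the code above reproduces (reconstruction + adversarial + classification terms).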
Even though discriminators can have multiple sub-networks to cover other tasks (see next paragraphs), D should keep at least one sub-network/output to cover its main task: telling real samples from generated ones apart. Asking D to regress further semantic information (e.g. classes) alongside may interfere with this main purpose.
Note: D's output is often not a simple scalar/boolean. It is common to have a discriminator (e.g. PatchGAN [1, 2]) return a matrix of probabilities, evaluating how realistic the patches made from its input are.
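A minimal sketch of such a patch-based head, in the same Keras 1.x / channels-first style as the code above (layer sizes and shapes are illustrative assumptions):
from keras.layers import Input
from keras.layers.advanced_activations import LeakyReLU
from keras.layers.convolutional import Convolution2D
from keras.models import Model

# The final 1-filter convolution yields a grid of real/fake probabilities,
# each cell judging one receptive-field patch of the stacked input pair.
pair = Input((6, 64, 64))  # A sample and B sample stacked along the channel axis
x = Convolution2D(64, 4, 4, subsample=(2, 2), border_mode='same')(pair)
x = LeakyReLU(0.2)(x)
x = Convolution2D(128, 4, 4, subsample=(2, 2), border_mode='same')(x)
x = LeakyReLU(0.2)(x)
patch_probs = Convolution2D(1, 4, 4, border_mode='same', activation='sigmoid')(x)  # shape (1, 16, 16)
patch_d = Model(input=pair, output=patch_probs)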
Conditional GANs
Traditional GANs are trained in an unsupervised manner to generate realistic data (e.g. images) from a random noise vector as input [4].
As previously mentioned, conditional GANs have further input conditions. Along with (or instead of) the noise vector, they take as input a sample from a domain A and return a corresponding sample from a domain B. A can be a completely different modality, e.g. B = sketch image while A = discrete label; B = volumetric data while A = RGB image; etc. [3]
Such GANs can also be conditioned on multiple inputs, e.g. A = real image + discrete label while B = sketch image. A famous work introducing such methods is InfoGAN [5]. It presents how to condition GANs on multiple continuous or discrete inputs (e.g. A = digit class + writing type, B = handwritten digit image), using a more advanced discriminator whose second task is to force G to maximize the mutual information between its conditioning inputs and the corresponding outputs.
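One common way (just one option, sketched here as an illustration) to feed a discrete label alongside an image to a fully-convolutional G is to tile its one-hot vector into constant extra channels:
import numpy as np

nb_classes = 6
image = np.zeros((3, 64, 64), dtype=np.float32)  # conditioning image from domain A
label = 2                                        # conditioning class index

one_hot = np.zeros((nb_classes,), dtype=np.float32)
one_hot[label] = 1.0
label_maps = np.tile(one_hot[:, None, None], (1, 64, 64))        # (6, 64, 64), constant maps
conditioned_input = np.concatenate((image, label_maps), axis=0)  # (9, 64, 64), fed to G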
Maximizing the Mutual Information for cGANs
The InfoGAN discriminator has 2 heads/sub-networks to cover its 2 tasks [5] (a rough sketch follows this list):
- One head D1 does the traditional real/generated discrimination -- G has to minimize this result, i.e. it has to fool D1 so that it can't tell real from generated data apart;
- Another head D2 (also named the Q network) tries to regress the input A information -- G has to maximize this result, i.e. it has to output data which "show" the requested semantic information (c.f. the mutual-information maximization between G's conditioning inputs and its outputs).
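A rough sketch of this two-headed layout, kept in the same Keras 1.x / channels-first style as the code above (layer sizes are illustrative, not InfoGAN's actual architecture):
from keras.layers import Input, Dense, Flatten
from keras.layers.advanced_activations import LeakyReLU
from keras.layers.convolutional import Convolution2D
from keras.models import Model

# Shared convolutional trunk, then two heads: D1 (real/fake) and D2/Q (class posterior).
img = Input((1, 28, 28))  # e.g. a handwritten-digit image
x = Convolution2D(64, 4, 4, subsample=(2, 2), border_mode='same')(img)
x = LeakyReLU(0.2)(x)
x = Convolution2D(128, 4, 4, subsample=(2, 2), border_mode='same')(x)
x = LeakyReLU(0.2)(x)
x = Flatten()(x)
d_out = Dense(1, activation='sigmoid', name='d_head')(x)   # D1: real vs. generated
q_out = Dense(10, activation='softmax', name='q_head')(x)  # D2/Q: estimated class code
d_and_q = Model(input=img, output=[d_out, q_out])
d_and_q.compile(optimizer='rmsprop',
                loss=['binary_crossentropy', 'categorical_crossentropy'])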
You can find a Keras implementation here for instance: https://github.com/eriklindernoren/Keras-GAN/tree/master/infogan.
Several works use similar schemes to improve control over what a GAN generates, by using the provided labels and maximizing the mutual information between these inputs and G's outputs [6, 7]. The basic idea is always the same though:
- Train G to generate elements of domain B, given some inputs of domain(s) A;
- Train D to discriminate "real"/"fake" results -- G has to minimize this;
- Train Q (e.g. a classifier; it can share layers with D) to estimate the original A inputs from B samples -- G has to maximize this.
Wrapping Up
In your case, it seems you have the following training data:
- real images Ia
- corresponding sketch images Ib
- corresponding class labels c
And you want to train a generator G so that given an image Ia and its class label c, it outputs a proper sketch image Ib'. 
All in all, that's a lot of information you have, and you can supervise your training both on the conditioned images and the conditioned labels...
Inspired by the aforementioned methods [1, 2, 5, 6, 7], here is a possible way of using all this information to train your conditional G:
Network G:
- Inputs: Ia + c
- Output: Ib'
- Architecture: up to you (e.g. U-Net, ResNet, ...)
- Losses: L1/L2 loss between Ib' and Ib, -D loss, Q loss (see the compile sketch after these specs)
Network D:
- Inputs: Ia + Ib (real pair), Ia + Ib' (fake pair)
- Output: "fakeness" scalar/matrix
- Architecture: up to you (e.g. PatchGAN)
- Loss: cross-entropy on the "fakeness" estimation
Network Q:
- Inputs: Ib (real sample, for training Q), Ib' (fake sample, when back-propagating through G)
- Output: c' (estimated class)
- Architecture: up to you (e.g. LeNet, ResNet, VGG, ...)
- Loss: cross-entropy between c and c'
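Put together, G's update would go through a combined model mirroring the loss list above. A sketch only, reusing the builder from the first code section (the loss weights are hyper-parameters you would have to tune, not prescribed values):
# D and Q are frozen inside this combined model, so only G's weights are updated:
combined = generator_containing_discriminator_and_classifier(generator, discriminator, classifier)
combined.compile(optimizer='rmsprop',
                 loss=['mae',                        # L1 between Ib' and Ib
                       'binary_crossentropy',        # adversarial term: fool D (targets set to "real")
                       'categorical_crossentropy'],  # Q term: recover the conditioning class c
                 loss_weights=[100., 1., 1.])        # e.g. pix2pix-like weighting [1] (an assumption here)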
Training Phase:
- Train D on a batch of real pairs Ia + Ib, then on a batch of fake pairs Ia + Ib';
- Train Q on a batch of real samples Ib;
- Fix D and Q weights;
- Train G, passing its generated outputs Ib' to D and Q to back-propagate through them (a rough iteration sketch follows this list).
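In pseudo-Keras terms, one training iteration could look like the sketch below; batches, real_labels, fake_labels and combined are placeholders (none of these names come from the original repository), and the arrays are assumed channels-first and properly aligned:
import numpy as np

for Ia, Ib, c in batches:  # real images, target sketches, one-hot class labels
    Ib_fake = generator.predict(Ia)
    # 1. Train D on a batch of real pairs, then on a batch of fake pairs:
    discriminator.train_on_batch(np.concatenate((Ia, Ib), axis=1), real_labels)
    discriminator.train_on_batch(np.concatenate((Ia, Ib_fake), axis=1), fake_labels)
    # 2. Train Q on the real samples Ib:
    classifier.train_on_batch(Ib, c)
    # 3. Fix D and Q weights:
    discriminator.trainable = False
    classifier.trainable = False
    # 4. Train G, back-propagating through the frozen D and Q:
    combined.train_on_batch(Ia, [Ib, real_labels, c])
    discriminator.trainable = True
    classifier.trainable = True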
Note: this is a really rough architecture description. I'd recommend going through the literature ([1, 5, 6, 7] as a good start) to get more details and maybe a more elaborate solution.
References
- [1] Isola, Phillip, et al. "Image-to-Image Translation with Conditional Adversarial Networks." CVPR 2017. http://openaccess.thecvf.com/content_cvpr_2017/papers/Isola_Image-To-Image_Translation_With_CVPR_2017_paper.pdf
- [2] Zhu, Jun-Yan, et al. "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks." ICCV 2017. http://openaccess.thecvf.com/content_ICCV_2017/papers/Zhu_Unpaired_Image-To-Image_Translation_ICCV_2017_paper.pdf
- [3] Mirza, Mehdi, and Simon Osindero. "Conditional Generative Adversarial Nets." arXiv preprint arXiv:1411.1784 (2014). https://arxiv.org/pdf/1411.1784
- [4] Goodfellow, Ian, et al. "Generative Adversarial Nets." NIPS 2014. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
- [5] Chen, Xi, et al. "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets." NIPS 2016. http://papers.nips.cc/paper/6399-infogan-interpretable-representation-learning-by-information-maximizing-generative-adversarial-nets.pdf
- [6] Lee, Minhyeok, and Junhee Seok. "Controllable Generative Adversarial Network." arXiv preprint arXiv:1708.00598 (2017). https://arxiv.org/pdf/1708.00598.pdf
- [7] Odena, Augustus, Christopher Olah, and Jonathon Shlens. "Conditional Image Synthesis with Auxiliary Classifier GANs." ICML 2017. http://proceedings.mlr.press/v70/odena17a/odena17a.pdf