from transformers import AutoModelForSequenceClassification, DistilBertConfig

# task and model_checkpoint are defined earlier in the notebook
num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2

# model1: built from a config alone
preconfig = DistilBertConfig(n_layers=6, num_labels=num_labels)
model1 = AutoModelForSequenceClassification.from_config(preconfig)

# model2: loaded from the pretrained checkpoint
model2 = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
I am modifying this code (the modified code is shown above) to experiment with the transformer layer depth of DistilBERT via from_config. As far as I know, from_pretrained gives a 6-layer model because section 3 of the DistilBERT paper says:
we initialize the student from the teacher by taking one layer out of two
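To make sure I understand that sentence, here is a minimal sketch of what I take "one layer out of two" to mean. It copies every other layer of a pretrained 6-layer DistilBERT into a 3-layer student; this is purely my own illustration (in the paper the teacher is BERT-base, not DistilBERT), not the authors' actual training code:

from transformers import AutoModel, DistilBertConfig, DistilBertModel

# pretrained 6-layer DistilBERT acting as the "teacher" here
teacher = AutoModel.from_pretrained("distilbert-base-uncased")

# randomly initialized 3-layer "student" with otherwise identical dimensions
student = DistilBertModel(DistilBertConfig(n_layers=3))

# copy the embeddings, then one transformer layer out of two (0, 2, 4)
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
for s_idx, t_idx in enumerate(range(0, 6, 2)):
    student.transformer.layer[s_idx].load_state_dict(
        teacher.transformer.layer[t_idx].state_dict()
    )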
What I actually want to test is a range of layer depths. To check whether the two approaches are equivalent, I ran from_config with n_layers=6, since according to the DistilBertConfig documentation n_layers determines the number of transformer blocks. However, when I ran model1 and model2 on the SST-2 dataset, I got the following accuracies:
model1 achieved only 0.8073
model2 achieved 0.901
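Given that gap, I believe a quick check like the following (my own sketch) would reveal whether the two models even start from the same weights:

import torch

# compare one weight tensor from each model
w1 = model1.distilbert.embeddings.word_embeddings.weight
w2 = model2.distilbert.embeddings.word_embeddings.weight
print(torch.allclose(w1, w2))  # prints False if the weights differ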
If both behaved the same, I would expect similar results, but a ~10% drop is significant, so I believe there has to be a difference between the two functions. Is there a reason for the difference (for example, has model1 simply not gone through hyperparameter search?), and is there a way to make both functions behave the same? Thank you!