What changes when we provide labels to a factor field? In the following code I first assigned the labels 0 and 1, and then the labels 0 and 10^6. As far as I know, labels just provide alternate names for the categories, which in this case are Male and Female. Please note that I have provided numeric labels, not character labels.
It seems that the labels assign some sort of numeric weight to the categories, which changes the Euclidean distance for a data point. Below are the two versions of the code with their corresponding results.
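A minimal sketch of that understanding, using a hypothetical toy vector `g` rather than the real dataset: the two sets of labels give the same underlying integer codes and differ only in the level names, which `factor()` stores as character strings. The names `g`, `f_small` and `f_large` are made up for illustration.

```r
# Hypothetical toy vector, not taken from Social_Network_Ads.csv
g <- c("Male", "Female", "Female", "Male")

f_small <- factor(g, levels = c("Female", "Male"), labels = c(0, 1))
f_large <- factor(g, levels = c("Female", "Male"), labels = c(0, 10^6))

# Numeric labels are converted to character level names
print(levels(f_small))
print(levels(f_large))

# The underlying integer codes are identical for both factors
print(as.integer(f_small))
print(as.integer(f_large))

# Only when the level names are parsed back into numbers do the two differ
print(as.numeric(as.character(f_small)))
print(as.numeric(as.character(f_large)))
```

So the labels alone do not change the internal codes; any difference has to come from code that reads the level names as numbers.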
Dataset
> head(dataset)
   User.ID Gender Age EstimatedSalary Purchased
1 15624510      1  19           19000         0
2 15810944      1  35           20000         0
3 15668575      0  26           43000         0
4 15603246      0  27           57000         0
5 15804002      1  19           76000         0
6 15728773      1  27           58000         0
R code with labels = c(0 , 1)
dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")
dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0 , 1))
library(caTools)
set.seed(1231)
sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)
training_dataset <- subset(dataset , sample_split == TRUE)
testing_dataset <- subset(dataset , sample_split == FALSE)
library(class)
model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )
library(caret)
confusionMatrix(table(model_classifier , testing_dataset$Purchased))
Result
Confusion Matrix and Statistics
model_classifier  0  1
               0 47 18
               1  4 11
Accuracy : 0.725  
R code with labels = c(0 , 10^6)
dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")
dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0 , 10^6))
library(caTools)
set.seed(1231)
sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)
training_dataset <- subset(dataset , sample_split == TRUE)
testing_dataset <- subset(dataset , sample_split == FALSE)
library(class)
model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )
library(caret)
confusionMatrix(table(model_classifier , testing_dataset$Purchased))
Result
Confusion Matrix and Statistics
model_classifier  0  1
               0 50 23
               1  1  6
Accuracy : 0.7   
What exactly are the labels doing? If we provide numeric labels, do they carry the same mathematical significance as numbers?
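To make the question concrete, here is a small sketch of the arithmetic that would explain the difference, under the assumption (not verified against the `knn` source) that the label values end up being used as numbers in the distance calculation. The two rows and the helper `euclid` are hypothetical, made up for illustration.

```r
# Hypothetical rows: (Gender, Age, EstimatedSalary); values are illustrative only
a_small <- c(0, 19, 19000)
b_small <- c(1, 19, 19000)      # same person, other gender, labels 0 / 1

a_large <- c(0, 19, 19000)
b_large <- c(10^6, 19, 19000)   # same person, other gender, labels 0 / 10^6

# Plain Euclidean distance between two numeric vectors
euclid <- function(x, y) sqrt(sum((x - y)^2))

print(euclid(a_small, b_small))  # gender contributes a distance of 1
print(euclid(a_large, b_large))  # gender contributes a distance of 1e6
```

If this is what happens, a gender difference would be negligible next to salary with labels 0 and 1, but would dominate every other column with labels 0 and 10^6.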
