I am developing a spam filter using scikit-learn.
Here are the steps I follow:
Xdata = ["This is spam" , "This is Ham" , "This is spam again"]
Matrix=Countvectorizer (XData). Matrix will contains count of each word in all documents. So Matrix[i][j] will give me counts of wordjin documentiMatrix_idfX=TFIDFVectorizer(Matrix). It will normalize score.Matrix_idfX_Select=SelectKBest( Matrix_IdfX , 500). It will reduce matrix to 500 best score columnsMultinomial.train( Matrix_Idfx_Select)
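In actual code, this is roughly what I mean (a minimal runnable sketch with scikit-learn; the labels y and the chi2 scoring function for SelectKBest are just placeholders I picked for illustration, and I cap k so it does not exceed the number of features in this toy example):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

Xdata = ["This is spam", "This is Ham", "This is spam again"]
y = [1, 0, 1]  # placeholder labels: 1 = spam, 0 = ham

# Step 1: word counts per document
counts = CountVectorizer().fit_transform(Xdata)

# Step 2: tf-idf weighting of the count matrix
tfidf = TfidfTransformer().fit_transform(counts)

# Step 3: keep the best-scoring columns (chi2 needs the labels;
# k is capped because this toy corpus has very few features)
selected = SelectKBest(chi2, k=min(500, tfidf.shape[1])).fit_transform(tfidf, y)

# Step 4: train the Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(selected, y)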
Now my question: do I need to perform normalization or standardization in any of the above four steps? If yes, after which step, and why?
Thanks