LibSVM turns all my training vectors into support vectors, why?

Question

I am trying to use SVM for News article classification.

I created a table that contains the features (unique words found in the documents) as rows, and built weight vectors that map onto these features: if an article contains a word from the feature table, that position is marked 1, otherwise 0.

Ex:- Training sample generated...

1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:1 15:1 16:1 17:1 18:1 19:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1 27:1 28:1 29:1 30:1

As this is the first document, all the features are present.

I am using 1, 0 as class labels.

I am using svm.Net for classification.

I supplied 300 manually classified weight vectors as training data, and the generated model uses all of them as support vectors, which is surely overfitting.

My total features (unique words/row count in feature vector DB table) is 7610.

What could be the reason?

Because of this overfitting, my project is now in pretty bad shape: it classifies every available article as positive.

In LibSVM binary classification, is there any restriction on the class labels?

I am using 0, 1 instead of -1 and +1. Is that a problem?

score 3 · Answer 1 · answered Apr 20 '11 at 18:18

You need to do some kind of parameter search. Also, if the classes are unbalanced, the classifier might reach artificially high accuracy without learning much. This guide is good at teaching basic, practical things; you should probably read it.
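As a rough illustration of what a parameter search can look like, here is a minimal sketch of a grid search over C and gamma on an exponentially spaced grid. The exponent ranges are a commonly suggested starting point for RBF-kernel SVMs, not something stated in this thread, and `toy_score` is a hypothetical stand-in: in practice `score_fn` would train a model and return its cross-validation accuracy.

```python
import math
from itertools import product


def grid_search(score_fn, c_exponents, gamma_exponents):
    """Return (best_score, C, gamma) over a grid of powers of two.

    `score_fn(C, gamma)` is a placeholder supplied by the caller; it
    should return a cross-validation accuracy for that parameter pair.
    """
    best = None
    for c_exp, g_exp in product(c_exponents, gamma_exponents):
        C, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        score = score_fn(C, gamma)
        if best is None or score > best[0]:
            best = (score, C, gamma)
    return best


def toy_score(C, gamma):
    # Hypothetical stand-in for real cross-validation.
    # It peaks at C = 2^3 and gamma = 2^-3.
    return -abs(math.log2(C) - 3) - abs(math.log2(gamma) + 3)


# Assumed ranges: C in 2^-5 .. 2^15, gamma in 2^-15 .. 2^3, step 2.
best = grid_search(toy_score, range(-5, 16, 2), range(-15, 4, 2))
print(best)
```

Scoring with cross-validation rather than training accuracy matters here, since a model that memorizes all 300 vectors would look perfect on the training set.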

carlosdc

score 1 · Answer 2 · answered Apr 22 '11 at 03:23

I would definitely try using -1 and +1 for your labels; that's the standard way to do it.
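Remapping the labels is cheap to try. A minimal sketch, assuming the labels are held in a plain Python list (`to_signed_labels` is an illustrative name, not part of LibSVM or SVM.NET):

```python
def to_signed_labels(labels):
    """Map 0/1 class labels to -1/+1, rejecting anything else."""
    mapping = {0: -1, 1: 1}
    try:
        return [mapping[label] for label in labels]
    except KeyError as exc:
        raise ValueError(f"unexpected label: {exc.args[0]!r}") from None


signed = to_signed_labels([1, 0, 0, 1])
print(signed)
```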

Also, how much data do you have? Since you're working in 7610-dimensional space, you could potentially have that many support vectors, where a different vector is "supporting" the hyperplane in each dimension.

With that many features, you might want to try some kind of feature selection or dimensionality reduction, such as principal component analysis.

Colin

Found the reason: this is happening because SVM.net does not check the validity of the training data. In my training data the feature numbers were not sorted, and as a result it generated weird results. After sorting the weight vector by feature number and then regenerating the model, things are far better: 74% accuracy. Thank you. – Krishna Chaitanya M Apr 23 '11 at 07:12
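The fix described in this comment can be sketched as follows: write each sample in LibSVM's sparse format with the feature indices in ascending order. This is a minimal sketch, assuming features are held as a dict from 1-based index to value; `to_libsvm_line` is an illustrative helper, not an SVM.NET function.

```python
def to_libsvm_line(label, features):
    """Format one sample as a LibSVM sparse line.

    `features` maps a 1-based feature index to its value (assumed
    layout). Indices are emitted in ascending order and zero-valued
    entries are omitted, since the format is sparse.
    """
    if any(index < 1 for index in features):
        raise ValueError("feature indices must be >= 1")
    parts = [str(label)]
    for index in sorted(features):
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value}")
    return " ".join(parts)


# Indices deliberately given out of order, as in the faulty training file.
sample = {10: 1, 2: 1, 7: 0, 1: 1}
line = to_libsvm_line(1, sample)
print(line)
```

Validating or sorting at write time guards against the silent failure described here, where the library accepts a malformed file and trains a meaningless model.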

score 1 · Accepted Answer · answered Apr 22 '11 at 15:50

As pointed out, a parameter search is probably a good idea before doing anything else.

I would also investigate the different kernels available to you. The fact that your input data is binary might be problematic for the RBF kernel (or might make it sub-optimal compared to another kernel). I have no idea which kernel would be better suited, though. Try a linear kernel, and look around for more suggestions/ideas :)
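To make the kernel comparison concrete, here is a small sketch of what the two kernels compute on 0/1 vectors. On binary input the linear kernel counts the words two articles share, while the RBF kernel depends only on how many positions differ, because the squared Euclidean distance between two 0/1 vectors equals their Hamming distance. The vectors and the `gamma` value are made up for illustration; this does not settle which kernel works better on this data.

```python
import math


def linear_kernel(x, y):
    """Dot product; on 0/1 vectors this counts shared words."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """exp(-gamma * ||x - y||^2); on 0/1 vectors the squared
    distance is the number of differing positions."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]

shared = linear_kernel(x, y)
differing = sum(a != b for a, b in zip(x, y))
rbf = rbf_kernel(x, y, gamma=0.5)
print(shared, differing, rbf)
```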

For more information and perhaps better answers, look on stats.stackexchange.com.
