I'm working with a small data set of 5 variables and ~90k observations. I tried fitting a random forest classifier, following the iris example from http://blog.yhathq.com/posts/random-forests-in-python.html. The problem is that every predicted value is the same: 0. I'm new to Python but familiar with R, so I'm not sure whether this is a coding mistake or a sign that my data is trash.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
data = train_df[cols_to_keep]
data = data.join(dummySubTypes.iloc[:, 1:])      # skip the first dummy column (reference level)
data = data.join(dummyLicenseTypes.iloc[:, 1:])  # .iloc replaces the removed .ix indexer
data['is_train'] = np.random.uniform(0, 1, len(data)) <= .75
#data['type'] = pd.Categorical.from_codes(data['type'],["Type1","Type2"])
data.head()
Mytrain, Mytest = data[data['is_train']], data[~data['is_train']]
Myfeatures = data.columns[1:5]  # feature names: the subtype dummy variables
rf = RandomForestClassifier(n_jobs=2)
y, target_names = pd.factorize(Mytrain['type'])  # integer codes plus the labels in code order
rf.fit(Mytrain[Myfeatures], y)
preds = np.asarray(target_names)[rf.predict(Mytest[Myfeatures])]  # map codes back to labels
The predictions on the test set are all one class, Type1:
In[583]: pd.crosstab(Mytest['type'], preds, rownames=['actual'], colnames=['preds'])
Out[582]: 
preds          Type1
actual                   
Type1          17818
Type2          7247
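For scale, the crosstab counts imply that a model predicting Type1 for every row is already right about 71% of the time on this test set. A minimal sketch of that arithmetic, using only the two counts shown above (the variable names are illustrative, not part of the original code):

```python
# Counts copied from the crosstab above (actual classes in the test set).
actual_counts = {"Type1": 17818, "Type2": 7247}

total = sum(actual_counts.values())
majority_class = max(actual_counts, key=actual_counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = actual_counts[majority_class] / total

print(majority_class, total, round(baseline_accuracy, 3))
```

Any fitted model would need to beat this baseline to show that the features carry usable signal.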
Update: First few rows of data:
In[670]: Mytrain[Myfeatures].head()
Out[669]: 
subtype_INDUSTRIAL  subtype_INSTITUTIONAL  subtype_MULTIFAMILY  \
0                   0                      0                    0   
1                   0                      0                    0   
2                   0                      0                    0   
3                   0                      0                    0   
4                   0                      0                    0   
subtype_SINGLE FAMILY / DUPLEX  
0                               0  
1                               0  
2                               0  
3                               1  
4                               1 
When I predict on the training inputs, I get predictions of only one class:
In[675]: np.bincount(rf.predict(Mytrain[Myfeatures]))
Out[674]: array([    0, 75091])
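One possible check, since the features are a handful of 0/1 dummies, is to count the classes within each distinct feature pattern. If the same class holds the majority in every pattern, a classifier that sees only these features would generally predict that class everywhere. The sketch below uses a small hypothetical list of rows, not the real data, and the names `rows`, `counts_by_pattern`, and `majority_by_pattern` are made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical rows: (subtype dummy values, class label). Not the real data.
rows = [
    ((0, 0, 0, 0), "Type1"), ((0, 0, 0, 0), "Type1"), ((0, 0, 0, 0), "Type2"),
    ((0, 0, 0, 1), "Type1"), ((0, 0, 0, 1), "Type1"), ((0, 0, 0, 1), "Type1"),
    ((0, 0, 0, 1), "Type2"), ((0, 0, 0, 1), "Type2"),
    ((0, 0, 1, 0), "Type1"), ((0, 0, 1, 0), "Type2"), ((0, 0, 1, 0), "Type1"),
]

# Count class labels separately for every distinct feature pattern.
counts_by_pattern = defaultdict(Counter)
for pattern, label in rows:
    counts_by_pattern[pattern][label] += 1

# Majority class per pattern: the best guess available for any row
# sharing that pattern when no other features are used.
majority_by_pattern = {
    pattern: counts.most_common(1)[0][0]
    for pattern, counts in counts_by_pattern.items()
}

for pattern, counts in sorted(counts_by_pattern.items()):
    print(pattern, dict(counts), "->", majority_by_pattern[pattern])
```

On the actual data frame, a similar table could probably be produced with something like `Mytrain.groupby(list(Myfeatures))['type'].value_counts(normalize=True)`, assuming the columns are named as in the code above.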
 
    