I have a concrete problem with extending the xgb.XGBClassifier class, but it could be framed as a general OOP question.
My implementation is based on: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py
Basically, I want to add feature-name handling when the provided data is a pandas DataFrame.
A few remarks:
- XGBClassifierN has the same parameters in __init__ as the base class xgb.XGBClassifier,
- there is an additional attribute self.feature_names that is set later by the fit method,
- the rest could be done by mix-ins.
It works.
What bothers me is the wall of code in __init__. It was produced by copy-pasting the defaults, and every time xgb.XGBClassifier changes it has to be updated.
Is there any way to concisely express the idea that the child class XGBClassifierN has the same parameters and defaults as the parent class xgb.XGBClassifier, and still be able to do things like clf = XGBClassifierN(n_jobs=-1) later?
I've tried to use only **kwargs, but it doesn't work out: the interpreter starts to complain that there is no missing parameter (no pun intended), and to make it work you basically need to set a few more parameters explicitly.
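For reference, the attempt looked roughly like this (a minimal sketch, not my exact code):

import xgboost as xgb

class XGBClassifierN(xgb.XGBClassifier):
    def __init__(self, **kwargs):
        # Forward everything to the parent and rely on its defaults.
        super().__init__(**kwargs)
        self.feature_names = None

As far as I understand, scikit-learn discovers estimator parameters by introspecting the __init__ signature, so a **kwargs-only signature exposes none of them; that seems to be where the complaint about the missing parameter comes from.

Here is the full working version: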
import xgboost as xgb

class XGBClassifierN(xgb.XGBClassifier):
    def __init__(self, base_score=0.5, booster='gbtree', colsample_bylevel=1,
                 colsample_bynode=1, colsample_bytree=1, gamma=0,
                 learning_rate=0.1, max_delta_step=0, max_depth=3,
                 min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
                 nthread=None, objective='binary:logistic', random_state=0,
                 reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
                 silent=None, subsample=1, verbosity=1, **kwargs):
        super().__init__(base_score=base_score, booster=booster,
                         colsample_bylevel=colsample_bylevel,
                         colsample_bynode=colsample_bynode,
                         colsample_bytree=colsample_bytree, gamma=gamma,
                         learning_rate=learning_rate, max_delta_step=max_delta_step,
                         max_depth=max_depth, min_child_weight=min_child_weight,
                         missing=missing, n_estimators=n_estimators, n_jobs=n_jobs,
                         nthread=nthread, objective=objective,
                         random_state=random_state, reg_alpha=reg_alpha,
                         reg_lambda=reg_lambda, scale_pos_weight=scale_pos_weight,
                         seed=seed, silent=silent, subsample=subsample,
                         verbosity=verbosity, **kwargs)
        self.feature_names = None

    def fit(self, X, y=None):
        # Remember the column names of the DataFrame used for training.
        self.feature_names = list(X.columns)
        return super().fit(X, y)

    def get_feature_names(self):
        if not isinstance(self.feature_names, list):
            raise ValueError('Must fit data first!')
        return self.feature_names

    def get_feature_importances(self):
        return dict(zip(self.get_feature_names(), self.feature_importances_))
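For context, the intended usage is along these lines (toy data from scikit-learn, just to make the example self-contained):

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Any DataFrame with named columns works the same way.
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

clf = XGBClassifierN(n_jobs=-1)
clf.fit(X, y)
print(clf.get_feature_names()[:3])
print(clf.get_feature_importances()['mean radius'])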