I am not comfortable with Python - much less intimidated and at ease with R. So indulge me on a silly question that is taking me a ton of searches without success.
I want to fit in a regression model with sklearn both with OLS and lasso. In particular, I like the mtcars dataset that is so easy to call in R, and, as it turns out, also very accessible in Python:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
It looks like this:
                      mpg  cyl   disp   hp  drat  ...   qsec  vs  am  gear  carb
Mazda RX4            21.0    6  160.0  110  3.90  ...  16.46   0   1     4     4
Mazda RX4 Wag        21.0    6  160.0  110  3.90  ...  17.02   0   1     4     4
Datsun 710           22.8    4  108.0   93  3.85  ...  18.61   1   1     4     1
Hornet 4 Drive       21.4    6  258.0  110  3.08  ...  19.44   1   0     3     1
In trying to use LinearRegression() the usual structure found is
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x, y)
but to do so, I need to select several columns of df to fit into the regressors x, and a column to be the independent variable y. For example, I'd like to get an x matrix that includes a column of 1's (for the intercept) as well as the disp and qsec (numerical variables), as well as cyl (categorical variable). On the side of the independent variable, I'd like to use mpg.
It would look if it were possible to word this way as
model = LinearRegression().fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
But how do I go about the syntax for it?
Similarly, how can I do the same with lasso:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
but again this is not the right syntax.
I did find that you can get the actual regression (OLS or lasso) by turning the dataframe into a matrix. However, the names of the columns are gone, and it is hard to read the variable corresponding to each coefficients. And I still haven't found a simple method to run diagnostic values, like p-values, or the r-square to begin with.
 
    