Fitting regression multiple times and gather summary statistics

Question

I have a dataframe that looks like this:

W01           0.750000     0.916667     0.642857      1.000000      0.619565   
W02           0.880000     0.944444     0.500000      0.991228      0.675439   
W03           0.729167     0.900000     0.444444      1.000000      0.611111   
W04           0.809524     0.869565     0.500000      1.000000      0.709091   
W05           0.625000     0.925926     0.653846      1.000000      0.589286   

Variation  1_941119_A/G  1_942335_C/G  1_942451_T/C  1_942934_G/C  \
W01            0.967391      0.965909             1      0.130435   
W02            0.929825      0.937500             1      0.184211   
W03            0.925926      0.880000             1      0.138889   
W04            0.918182      0.907407             1      0.200000   
W05            0.901786      0.858491             1      0.178571   

Variation  1_944296_G/A    ...     X_155545046_C/T  X_155774775_G/T  \
W01            0.978261    ...            0.652174         0.641304   
W02            0.938596    ...            0.728070         0.736842   
W03            0.944444    ...            0.675926         0.685185   
W04            0.927273    ...            0.800000         0.690909   
W05            0.901786    ...            0.794643         0.705357   

Variation  Y_5100327_G/T  Y_5100614_T/G  Y_12786160_G/A  Y_12914512_C/A  \
W01             0.807692       0.800000        0.730769        0.807692   
W02             0.655172       0.653846        0.551724        0.666667   
W03             0.880000       0.909091        0.833333        0.916667   
W04             0.666667       0.642857        0.580645        0.678571   
W05             0.730769       0.720000        0.692308        0.720000   

Variation  Y_13470103_G/A  Y_19705901_A/G  Y_20587967_A/C  mean_age  
W01              0.807692        0.666667        0.333333      56.3  
W02              0.678571        0.520000        0.250000      66.3  
W03              0.916667        0.764706        0.291667      69.7  
W04              0.666667        0.560000        0.322581      71.6  
W05              0.703704        0.600000        0.346154      72.5  

[5 rows x 67000 columns]

I would like to fit a simple least-squares linear regression and a Theil-Sen linear regression for each column, with the column as the independent variable and mean_age as the response variable. For each fit I want to gather summary statistics (slope, intercept, r value, p value and standard error), preferably collected into a single dataframe.

So far, I have been slicing my 'df' and carrying out regression analysis for each column separately:

from scipy import stats
import matplotlib.pyplot as plt
import time

# Start timer
start_time = time.time()

# Select only 'Variation of interest' and 'mean_age' columns
r1 = tdf [['1_944296_G/A', 'mean_age']]

# Use scipy's linregress function to perform linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(tdf['mean_age'], \
    tdf['1_69270_A/G'])

print('The p-value between the 2 variables is measured as ' + str(p_value) + '\n')
print('Least squares linear model coefficients, intercept = ' + str(intercept) + \
  '. Slope = ' + str(slope)+'\n')

# Create regression line
regressLine = intercept + tdf['mean_age']*slope

# Regression using Theil-Sen with 95% confidence intervals 
res = stats.theilslopes(tdf['1_69270_A/G'], tdf['mean_age'], 0.95)

print('Theil-Sen linear model coefficients, intercept = ' + str(res[1]) + '. Slope = ' + \
  str(res[0]) +'\n')

# Scatter plot of allele frequency against mean age
plt.clf()
plt.scatter(tdf['mean_age'], tdf['1_69270_A/G'], s = 3, label = 'Allele frequency')

# Add least squares regression line
plt.plot(tdf['mean_age'], regressLine, label = 'Least squares regression line'); 

# Add Theil-Sen regression line
plt.plot(tdf['mean_age'], res[1] + res[0] * tdf['mean_age'], 'r-', label = 'Theil-Sen regression line')

# Add Theil-Sen confidence intervals
plt.plot(tdf['mean_age'], res[1] + res[2] * tdf['mean_age'], 'r--', label = 'Theil-Sen 95% confidence interval')
plt.plot(tdf['mean_age'], res[1] + res[3] * tdf['mean_age'], 'r--')

# Add legend, axis limits and save to png
plt.legend(loc = 'upper left')
#plt.ylim(7,14); plt.xlim(1755, 2016)
plt.xlabel('Mean age'); plt.ylabel('Allele frequency')
plt.savefig('pythonRegress.png')

# End timer
end_time = time.time()
print('Elapsed time = ' + str(end_time - start_time) + ' seconds')

I was wondering how I could carry out this analysis in an iterative loop for each column and gather the final results in a comprehensive dataframe.

I have seen "Looping regression and obtaining summary statistics in matrix form", but it does not give quite the output I expect. Any solution in Python or R is appreciated!

Please set up a [minimal](https://stackoverflow.com/help/minimal-reproducible-example), runnable, [reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). If you are asking about summary stats, why are you including plotting lines? Please focus the question on a *minimal* example. — Parfait, Jun 06 '19 at 01:51

score 2 · Accepted Answer · answered Jun 06 '19 at 01:54

I think you will find this guide useful: Running a model on separate groups.

Let's generate some example data similar to yours, with values for two variants and mean age. We also need a few packages:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1001)
data1 <- data.frame(mean_age = sample(40:80, 50, replace = TRUE), 
                    snp01 = rnorm(50), 
                    snp02 = rnorm(50))

The first step is to transform from "wide" to "long" format using gather, so that variant names are in one column and values in another. Then we can nest by variant name.

data1 %>% 
  gather(snp, value, -mean_age) %>% 
  nest(-snp)

This creates a tibble (a special data frame) where the second column, data, is a "list column": each entry holds the mean ages and the values for the variant in that row:

# A tibble: 2 x 2
  snp   data             
  <chr> <list>           
1 snp01 <tibble [50 x 2]>
2 snp02 <tibble [50 x 2]>
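
For readers working in Python (the question accepts either language), the same wide-to-long reshaping can be sketched with pandas. This is only an illustration on synthetic data: the frame `wide`, the seed, and the column names `snp01` and `snp02` are assumptions that mirror the R example, and `melt` followed by `groupby` is a rough analogue of `gather` followed by `nest`, not an exact equivalent.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumed names and sizes).
rng = np.random.default_rng(1001)
wide = pd.DataFrame({
    "mean_age": rng.integers(40, 81, size=50),
    "snp01": rng.normal(size=50),
    "snp02": rng.normal(size=50),
})

# Wide -> long: one row per (sample, variant) pair, similar to tidyr::gather.
long = wide.melt(id_vars="mean_age", var_name="snp", value_name="value")

# Grouping by variant plays the role of nest(): each group holds the
# mean ages and values belonging to one variant.
group_sizes = long.groupby("snp").size()
print(group_sizes)
```

Each group here contains 50 rows, matching the `<tibble [50 x 2]>` entries in the nested tibble above.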

Now we use purrr::map to create a third column with the linear model for each row:

data1 %>% 
  gather(snp, value, -mean_age) %>% 
  nest(-snp) %>% 
  mutate(model = map(data, ~lm(mean_age ~ value, data = .)))

Result:

# A tibble: 2 x 3
  snp   data              model 
  <chr> <list>            <list>
1 snp01 <tibble [50 x 2]> <lm>  
2 snp02 <tibble [50 x 2]> <lm>

The last step is to summarise the models as desired, then unnest the data structure. I'm using broom::glance(). The full procedure:

data1 %>% 
  gather(snp, value, -mean_age) %>% 
  nest(-snp) %>% 
  mutate(model = map(data, ~lm(mean_age ~ value, data = .)), 
         summary = map(model, glance)) %>% 
  select(-data, -model) %>% 
  unnest(summary)

Result:

# A tibble: 2 x 12
  snp   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
  <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
1 snp01   0.00732      -0.0134   12.0     0.354   0.555     2  -194.  394.  400.    6901.          48
2 snp02   0.0108       -0.00981  12.0     0.524   0.473     2  -194.  394.  400.    6877.          48
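
A comparable per-variant summary can be sketched in Python with `scipy.stats.linregress`, which reports the slope, intercept, r value, p value and standard error requested in the question. The synthetic frame `wide`, the helper name `fit_ols`, and the output column names are assumptions made for illustration. The orientation follows `lm(mean_age ~ value)`, so the variant value is x and mean age is y.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the real data (assumed names and sizes).
rng = np.random.default_rng(1001)
wide = pd.DataFrame({
    "mean_age": rng.integers(40, 81, size=50),
    "snp01": rng.normal(size=50),
    "snp02": rng.normal(size=50),
})

def fit_ols(group):
    """Fit mean_age ~ value for one variant and return its summary statistics."""
    fit = stats.linregress(group["value"], group["mean_age"])
    return pd.Series({
        "slope": fit.slope,
        "intercept": fit.intercept,
        "r_value": fit.rvalue,
        "p_value": fit.pvalue,
        "std_err": fit.stderr,
    })

# Reshape to long format, then fit one model per variant.
long = wide.melt(id_vars="mean_age", var_name="snp", value_name="value")
summary = (
    long.groupby("snp")[["value", "mean_age"]]
    .apply(fit_ols)
    .reset_index()
)
print(summary)
```

The result is one row per variant, which is the shape the question asks for. With tens of thousands of columns this per-group approach may be slow, and a vectorised formulation could be worth considering.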

You can use `broom::tidy()` instead of glance to get the intercept values. — neilfws, Jun 06 '19 at 22:08
Would it be possible to also include the regression slope in the summary output? — RJF, Jun 18 '19 at 19:10
`broom::tidy` returns the estimate, which is the slope for predictor variables. — neilfws, Jun 18 '19 at 22:09
Neil, I was wondering if you could help me with another question. I'd like to use `robust::lmrob()` with an MM-estimator, but it is not supported by glance or tidy. Do you have any suggestions for how I could gather the summary statistics as neatly as in the snippet above? — RJF, Jul 11 '19 at 15:23

score 1 · Answer 2 · answered Jun 06 '19 at 01:59

I do not know the exact detail and complexity of your data and analysis, but this is the approach I would take.

data <- data.frame(mean_age=rnorm(5),
                   Column_1=rnorm(5),
                   Column_2=rnorm(5),
                   Column_3=rnorm(5),
                   Column_4=rnorm(5),
                   Column_5=rnorm(5)
                   )
data


looped <- list()

for(each_col in names(data)[-1]){
    looped[[each_col]] <- lm(get(each_col) ~ mean_age, data)

}

looped
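
The loop above stores the fitted models but does not yet collect their statistics. A minimal Python sketch of the same column-by-column loop, which also adds the Theil-Sen fit mentioned in the question and gathers everything into one dataframe, might look like the following. The synthetic frame `df`, the column names, and the result column names are assumptions. The orientation matches `lm(get(each_col) ~ mean_age)`, so `mean_age` is x. Note that `theilslopes` returns a slope, an intercept and slope bounds, but no r value or p value.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the real data (assumed names and sizes).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(20, 4)),
    columns=["mean_age", "Column_1", "Column_2", "Column_3"],
)

rows = []
for col in df.columns.drop("mean_age"):
    # Ordinary least squares: linregress takes x first, then y.
    ols = stats.linregress(df["mean_age"], df[col])
    # Theil-Sen: theilslopes takes y first, then x; alpha=0.95 requests
    # a 95% confidence interval on the slope.
    ts = stats.theilslopes(df[col], df["mean_age"], alpha=0.95)
    rows.append({
        "variable": col,
        "ols_slope": ols.slope,
        "ols_intercept": ols.intercept,
        "ols_r_value": ols.rvalue,
        "ols_p_value": ols.pvalue,
        "ols_std_err": ols.stderr,
        "ts_slope": ts[0],
        "ts_intercept": ts[1],
        "ts_slope_lo": ts[2],
        "ts_slope_hi": ts[3],
    })

# One row per column, one field per statistic.
results = pd.DataFrame(rows).set_index("variable")
print(results)
```

Swapping the argument order in both calls would instead treat each column as the predictor and `mean_age` as the response, as described in the question text.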
