
Linear Models in R

Linear regression models describe the relationship between two variables. You take paired observations of the two variables and fit a mathematical model to those data points. If the model fits well, you can later estimate the value of one variable (the dependent variable) from the value of the other (the independent variable).

This example from DataCamp shows how to use the lm (for linear model) function to build a linear model that can predict weight based on height. The data used is the bdims dataset (available in the openintro package), which is summarized below:

> str(bdims)

'data.frame':	507 obs. of  25 variables:
 $ bia.di: num  42.9 43.7 40.1 44.3 42.5 43.3 43.5 44.4 43.5 42 ...
 $ bii.di: num  26 28.5 28.2 29.9 29.9 27 30 29.8 26.5 28 ...
 $ bit.di: num  31.5 33.5 33.3 34 34 31.5 34 33.2 32.1 34 ...
 $ che.de: num  17.7 16.9 20.9 18.4 21.5 19.6 21.9 21.8 15.5 22.5 ...
 $ che.di: num  28 30.8 31.7 28.2 29.4 31.3 31.7 28.8 27.5 28 ...
 $ elb.di: num  13.1 14 13.9 13.9 15.2 14 16.1 15.1 14.1 15.6 ...
 $ wri.di: num  10.4 11.8 10.9 11.2 11.6 11.5 12.5 11.9 11.2 12 ...
 $ kne.di: num  18.8 20.6 19.7 20.9 20.7 18.8 20.8 21 18.9 21.1 ...
 $ ank.di: num  14.1 15.1 14.1 15 14.9 13.9 15.6 14.6 13.2 15 ...
 $ sho.gi: num  106 110 115 104 108 ...
 $ che.gi: num  89.5 97 97.5 97 97.5 ...
 $ wai.gi: num  71.5 79 83.2 77.8 80 82.5 82 76.8 68.5 77.5 ...
 $ nav.gi: num  74.5 86.5 82.9 78.8 82.5 80.1 84 80.5 69 81.5 ...
 $ hip.gi: num  93.5 94.8 95 94 98.5 95.3 101 98 89.5 99.8 ...
 $ thi.gi: num  51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
 $ bic.gi: num  32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
 $ for.gi: num  26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
 $ kne.gi: num  34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...
 $ cal.gi: num  36.5 37.5 37.3 34.8 38.6 36.1 40.3 36.7 35 38.6 ...
 $ ank.gi: num  23.5 24.5 21.9 23 24.4 23.5 23.6 22.5 22 22.2 ...
 $ wri.gi: num  16.5 17 16.9 16.6 18 16.9 18.8 18 16.5 16.9 ...
 $ age   : int  21 23 28 23 22 21 26 27 23 21 ...
 $ wgt   : num  65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
 $ hgt   : num  174 175 194 186 187 ...
 $ sex   : int  1 1 1 1 1 1 1 1 1 1 ...

Fit Model

To create the linear model, use the lm function. Its first argument is a formula: two variables separated by a ~. The variable on the left is the dependent variable and the variable on the right is the independent variable, so wgt ~ hgt models weight as a function of height. The idea is that you can estimate the first variable from the value of the second.

> # Linear model for weight as a function of height
> lm(wgt ~ hgt, data = bdims)

Call:
lm(formula = wgt ~ hgt, data = bdims)

Coefficients:
(Intercept)          hgt  
   -105.011        1.018
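
To get a feel for what lm estimates, it can help to fit data where the true line is already known. The sketch below uses simulated data rather than bdims; the names x, y, sim, and sim_model, along with the true intercept (2) and slope (3), are made up for illustration.

```r
# Simulated data (an assumption for illustration, not bdims):
# y depends on x through a known line plus random noise.
set.seed(42)
x <- runif(200, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(200, sd = 1)
sim <- data.frame(x = x, y = y)

# Same formula pattern as above: dependent ~ independent
sim_model <- lm(y ~ x, data = sim)

# The estimates should land close to the true intercept (2) and slope (3)
print(coef(sim_model))
```

The estimates will not match the true values exactly because of the added noise, which is the same reason a model fit to real data is only an approximation.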

Inspect Model

To inspect the details of a linear model, use the coef and summary functions. coef returns the fitted coefficients (the intercept and the slope), and summary adds the statistics that you can use to assess the model.

> coef(lm(wgt ~ hgt, data = bdims))

(Intercept)         hgt 
-105.011254    1.017617

These are the numbers that you could use to plot the line or make predictions (which you will not do by hand btw). To see more details about the model you can use the summary function.

> summary(lm(wgt ~ hgt, data = bdims))

Call:
lm(formula = wgt ~ hgt, data = bdims)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.743  -6.402  -1.231   5.059  41.103 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
hgt            1.01762    0.04399   23.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.308 on 505 degrees of freedom
Multiple R-squared:  0.5145,	Adjusted R-squared:  0.5136 
F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16
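
The pieces of this printout can also be pulled out of the summary object by name, which is handy when you want to use them in later code. Here is a minimal sketch that uses R's built-in women dataset (height and weight for 15 women) as a stand-in so that it runs without bdims; the object names are illustrative.

```r
# Built-in stand-in dataset (an assumption: used here only so the
# example runs without bdims). Columns: height and weight.
women_model <- lm(weight ~ height, data = women)
model_summary <- summary(women_model)

# Coefficient table: estimate, standard error, t value, p-value
print(model_summary$coefficients)

# Multiple R-squared and residual standard error
r_squared <- model_summary$r.squared
resid_se <- model_summary$sigma
print(c(r_squared = r_squared, resid_se = resid_se))

# With a single predictor, R-squared equals the squared correlation
print(cor(women$height, women$weight)^2)
```

In the bdims output above, the same fields correspond to the Multiple R-squared of 0.5145 (height accounts for roughly half of the variation in weight) and the residual standard error of 9.308.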

Use Model to Predict

Use the predict function to apply your model to new data and guess what the values of the dependent variable will be:

> # FIT MODEL AND ASSIGN TO OBJECT
> wgt_predictor <- lm(formula = wgt ~ hgt, data = bdims)
> 
> # WE KNOW RICK'S HEIGHT BUT NOT WEIGHT
> rick <- data.frame(hgt = 190)
> 
> # USE MODEL TO GUESS RICK'S WEIGHT
> predict(wgt_predictor, newdata = rick)

       1 
88.33593
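
As a sanity check, the same prediction can be reproduced from the coefficients printed by coef earlier: the predicted weight is the intercept plus the slope times the height. The variable names below are illustrative, and the coefficients are typed in as printed (rounded), so the result may differ from predict in the last digits.

```r
# Coefficients copied from the coef() output (rounded as printed)
intercept <- -105.011254
slope <- 1.017617

# Rick's height, as in the predict() example
rick_hgt <- 190

# Predicted weight = intercept + slope * height
rick_wgt <- intercept + slope * rick_hgt
print(rick_wgt)
```

This is all predict does for a simple linear model, but predict saves you from copying numbers around and works on a whole data frame of new heights at once.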

Visualize Linear Model

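You can visualize the data points and the line that represents the linear model with the ggplot2 package. The coefs data frame holds the intercept and slope of the model fitted above:

library(ggplot2)
# One-row data frame with columns `(Intercept)` and hgt
coefs <- as.data.frame(t(coef(wgt_predictor)))
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
       geom_point() + 
       geom_abline(data = coefs, 
                   aes(intercept = `(Intercept)`, slope = hgt),  
                   color = "dodgerblue")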

This will create the following plot:

Linear Model Visualized

Matt has worked as a data analyst, writer, counselor, and business owner for a total of 20 years. Since the start of his career he's been fascinated by technology and passionate about helping people use modern technology to hack their work and their lives.
