Regression or linear models are used to describe the relationship between two variables. You take data points for the two variables and try to fit a mathematical model to them. If the model fits well, you can then estimate the value of one variable (the dependent variable) from the value of the other (the independent variable).
This example from DataCamp shows how to use the lm (for linear model) function to build a linear model that can predict weight based on height. The data used is the bdims data frame:
> str(bdims)
'data.frame':	507 obs. of  25 variables:
 $ bia.di: num  42.9 43.7 40.1 44.3 42.5 43.3 43.5 44.4 43.5 42 ...
 $ bii.di: num  26 28.5 28.2 29.9 29.9 27 30 29.8 26.5 28 ...
 $ bit.di: num  31.5 33.5 33.3 34 34 31.5 34 33.2 32.1 34 ...
 $ che.de: num  17.7 16.9 20.9 18.4 21.5 19.6 21.9 21.8 15.5 22.5 ...
 $ che.di: num  28 30.8 31.7 28.2 29.4 31.3 31.7 28.8 27.5 28 ...
 $ elb.di: num  13.1 14 13.9 13.9 15.2 14 16.1 15.1 14.1 15.6 ...
 $ wri.di: num  10.4 11.8 10.9 11.2 11.6 11.5 12.5 11.9 11.2 12 ...
 $ kne.di: num  18.8 20.6 19.7 20.9 20.7 18.8 20.8 21 18.9 21.1 ...
 $ ank.di: num  14.1 15.1 14.1 15 14.9 13.9 15.6 14.6 13.2 15 ...
 $ sho.gi: num  106 110 115 104 108 ...
 $ che.gi: num  89.5 97 97.5 97 97.5 ...
 $ wai.gi: num  71.5 79 83.2 77.8 80 82.5 82 76.8 68.5 77.5 ...
 $ nav.gi: num  74.5 86.5 82.9 78.8 82.5 80.1 84 80.5 69 81.5 ...
 $ hip.gi: num  93.5 94.8 95 94 98.5 95.3 101 98 89.5 99.8 ...
 $ thi.gi: num  51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
 $ bic.gi: num  32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
 $ for.gi: num  26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
 $ kne.gi: num  34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...
 $ cal.gi: num  36.5 37.5 37.3 34.8 38.6 36.1 40.3 36.7 35 38.6 ...
 $ ank.gi: num  23.5 24.5 21.9 23 24.4 23.5 23.6 22.5 22 22.2 ...
 $ wri.gi: num  16.5 17 16.9 16.6 18 16.9 18.8 18 16.5 16.9 ...
 $ age   : int  21 23 28 23 22 21 26 27 23 21 ...
 $ wgt   : num  65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
 $ hgt   : num  174 175 194 186 187 ...
 $ sex   : int  1 1 1 1 1 1 1 1 1 1 ...
Fit Model
To create the linear model, use the lm function. The lm function takes a formula made of two variables separated by a ~. The first variable is the dependent variable and the second is the independent variable. The idea is that the model estimates the first variable from the value of the second.
# Linear model for weight as a function of height
> lm(wgt ~ hgt, data = bdims)

Call:
lm(formula = wgt ~ hgt, data = bdims)

Coefficients:
(Intercept)          hgt
   -105.011        1.018
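To make the formula syntax a bit more concrete, here is a minimal sketch. It assumes the built-in `women` data set (columns `height` and `weight`) as a stand-in for bdims so that it runs without extra packages; the object names `weight_formula` and `women_model` are made up for illustration.

```r
# Assumption: the built-in `women` data set stands in for bdims here,
# so the example runs in a plain R session.
data(women)

# The formula reads "weight is modeled as a function of height":
# dependent variable on the left of ~, independent variable on the right.
weight_formula <- weight ~ height

# all.vars() lists the variables in the order they appear in the formula.
all.vars(weight_formula)

# Passing the formula and the data frame to lm() fits the model.
women_model <- lm(weight_formula, data = women)

# The fitted model has one coefficient for the intercept and one for height.
names(coef(women_model))
```

Swapping the two sides of the formula (`height ~ weight`) would fit a different model, one that estimates height from weight.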
Inspect Model
To inspect the details of a linear model, use the coef and summary functions. The coef function shows the coefficients (the intercept and slope that define the line), and the summary function shows additional information that you can use to assess the model.
> coef(lm(wgt ~ hgt, data = bdims))
(Intercept)         hgt
-105.011254    1.017617
These are the numbers that you could use to plot the line or make predictions (which you will not do by hand btw). To see more details about the model you can use the summary function.
> summary(lm(wgt ~ hgt, data = bdims))

Call:
lm(formula = wgt ~ hgt, data = bdims)

Residuals:
    Min      1Q  Median      3Q     Max
-18.743  -6.402  -1.231   5.059  41.103

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
hgt            1.01762    0.04399   23.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.308 on 505 degrees of freedom
Multiple R-squared:  0.5145,	Adjusted R-squared:  0.5136
F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16
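If you want individual numbers from the summary rather than the printed report, the summary object can be indexed like a list. The sketch below again assumes the built-in `women` data set as a stand-in for bdims, and the names `women_model` and `model_summary` are illustrative only.

```r
# Assumption: `women` stands in for bdims so the code runs without extra packages.
data(women)
women_model <- lm(weight ~ height, data = women)
model_summary <- summary(women_model)

# Coefficient table: Estimate, Std. Error, t value and Pr(>|t|) per term.
model_summary$coefficients

# Multiple R-squared and the residual standard error as plain numbers.
model_summary$r.squared
model_summary$sigma

# With a single predictor, R-squared equals the squared correlation
# between the two variables.
all.equal(model_summary$r.squared, cor(women$height, women$weight)^2)
```

Read this way, the Multiple R-squared of 0.5145 in the bdims output says that height accounts for roughly half of the variation in weight in that data.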
Use Model to Predict
Use the predict function to apply your model to new data and guess what the values of the dependent variable will be:
> # FIT MODEL AND ASSIGN TO OBJECT
> wgt_predictor <- lm(formula = wgt ~ hgt, data = bdims)
>
> # WE KNOW RICK'S HEIGHT BUT NOT WEIGHT
> rick <- data.frame(hgt = 190)
>
> # USE MODEL TO GUESS RICK'S WEIGHT
> predict(wgt_predictor, newdata = rick)
       1
88.33593
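As a sanity check, the same prediction can be reproduced with plain arithmetic, since a simple linear model is just intercept + slope * x. The sketch below uses the rounded coefficients printed by coef earlier in this post; the variable names and the extra heights are made up for illustration.

```r
# Coefficients copied from the coef() output earlier in this post.
intercept <- -105.011254
slope <- 1.017617

# Rick's height, as in the predict() call above.
rick_hgt <- 190

# A simple linear model predicts: intercept + slope * x
rick_wgt_by_hand <- intercept + slope * rick_hgt
rick_wgt_by_hand  # about 88.336; predict() reported 88.33593

# The arithmetic is vectorised, so several heights can be handled at once.
# These extra heights are hypothetical values for illustration.
other_hgts <- c(160, 175)
intercept + slope * other_hgts
```

The tiny difference from the predict result comes from using the rounded coefficients as printed rather than the full-precision values stored in the model object.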
Visualize Linear Model
You can visualize the data points and the line that represents the linear model with the ggplot2 package like this:
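library(ggplot2)

# Put the coefficients in a one-row data frame with columns `(Intercept)` and hgt
coefs <- as.data.frame(t(coef(lm(wgt ~ hgt, data = bdims))))

# Scatterplot of the data with the fitted line drawn on top
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  geom_abline(data = coefs,
              aes(intercept = `(Intercept)`, slope = hgt),
              color = "dodgerblue")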
This will create the following plot:
