A Linear Regression model is one of the most popular techniques in analytics and econometrics for understanding the relationship between a dependent variable, and other independent variables. It is often subsequently used for predicting and forecasting that dependent variable when it is not known.
It is used when your dependent variable is continuous.
A Linear Regression model builds the rule for a regression line - which is essentially a 'line of best fit' through the dependent variable and independent variable. For example, our dependent variable might be men's weights, and our independent variable is those same men's heights. We want to know what the relationship between height and weight is, and if we can predict a man's weight based only on his height.
The model builds the rule for this line in the format of y = a + bx where y is the dependent variable (weight), a is the 'intercept' i.e. how far up the y axis do we start our line of best fit, and bx is the slope of that line, determined by x (height) multiplied by b (a weighting or co-efficient provided by the model).
We can run through this exact example from above, in R!
We have a set of 100 men and their weights (kg)
weight <- c(88,80,82,78,91,98,82,80,93,89,90,91,90,97,82,92,86,88,85,94, 98,72,78,82,91,95,97,89,84,92,93,77,84,82,80,85,78,71,84,89, 97,77,74,92,97,90,93,80,74,87,94,83,90,91,85,84,88,75,90,88, 83,87,86,76,81,98,89,92,81,98,99,86,83,88,79,75,93,75,89,93, 94,81,86,97,84,96,92,95,96,89,88,80,80,85,90,90,89,88,75,81)
Let's have a quick look at the distribution of men's weights:
hist(weight, breaks = 5, col = "light grey")
We also have the heights of those same men (cm)...
height <- c(184,169,160,163,180,187,176,163,186,165,165,170,174,178,163,181,177,182,187,187,183,161,171,175,172,182,177,180,182,178,178,167,171,173,176,173,170,151,184,172, 200,167,162,171,185,170,170,173,161,163,179,183,170,175,165,168,187,167,181,176, 169,175,164,167,166,186,167,193,167,184,185,179,179,169,171,166,170,165,183,172,184,169,164,183,175,193,178,185,187,161,179,173,159,180,180,184,162,180,160,158)
...and we have wrist girth for these men as well (cm)
wrist <- c(20.2,19.8,21.7,18.0,23.0,23.3,19.0,21.0,24.8,21.7,23.2,21.1,22.0,23.3,21.8,22.8,22.7,23.2,22.4,23.3,25.5,18.5,20.3,18.5,22.7,22.6,23.8,22.7,20.4,21.4,22.2,19.9,20.7,21.7,18.4,21.0,21.4,19.3,22.5,21.0,22.8,19.9,17.9,23.6,23.5,21.3,24.4,18.4,18.4,21.8,23.9,20.1,22.5,24.6,21.2,22.6,23.7,19.2,22.1,20.6,22.5,21.0,19.7,20.8,21.1,23.1,22.4,24.8,20.6,24.1,25.3,20.8,20.0,20.9,19.2,20.7,21.9,17.1,20.8,24.0,24.0,21.8,22.2,25.2,21.1,24.6,24.3,24.9,25.9,22.0,21.2,19.6,18.0,20.0,23.9,21.4,21.2,20.3,17.7,21.2)
Let's compile these all together data into a data frame that we'll call men_stats
men_stats <- data.frame(cbind(weight,height,wrist))
Before we get onto building the model, let's quickly visualise any relationship between the two:
plot(men_stats$weight,men_stats$height, main = "Plot of Mens Height vs Weight", xlab = "Weight (kg)", ylab = "Height (cm)", pch = 19, col = "blue")
There does appear to be a positive relationship, and this is backed up with a reasonably high correlation of 0.67
cor(men_stats$weight,men_stats$height) #0.6718935
Right, let's build our model to understand how accurately we can predict weight using only height!
Firstly, we'll split our data into two sets, half will go into a set for training the model and we'll leave half aside to validate the model on. To do this, we can utilise some the code from our other post on randomly splitting data into training and validation sets
#specify what proportion of data we want to use to train the model - we'll use a 50/50 split here
training_size <- 0.5
#use the sample function to select random rows from our data to meet the proportion specified above set.seed(123) training_rows <- sample(seq_len(nrow(men_stats)), size = floor(training_size * nrow(men_stats)))
#assign the training set training <- men_stats[training_rows, ]
#assign the validation set validation <- men_stats[-training_rows, ]
Note: make sure you use seed 123 above - it'll ensure that you'll be able to reproduce exactly the same results as in this example
Now let's train the model!
#build the model and save it as an object fit.lm1 fit.lm1 <- lm(weight ~ height, data = training)
#get the stats for the model summary(fit.lm1)
Under Coefficients in the summary above, you'll see a value under Estimate for the Intercept (6.19000) and for height (0.46680)
Remembering the formula for a regression line of y = a + bx, we can plug these values in for a man who is 175cm tall to estimate his weight based on the regression line built by our model
y = a + bx
weight = intercept + (height * height coefficient)
y = 6.19 + 175 * 0.46680
y = 87.77kg
Our model predicts a man who is 175cm tall to be approximately 87.8kg - we can check the accuracy of the models predictions in a moment. Before that, let's look again to the Summary and find Multiple R-squared and Adjusted R-squared near the bottom right. These values are the most common way of understanding how 'good' the model is, specifically it tells you how much of the variance within the data our model is capturing. An r-squared value will always be between 0 and 1, the higher the better. Adjusted r-squared is generally more reliable as it penalises the model slightly for each extra variable you have in the model. Looking at our Adjusted r-squared, we have a value of 0.423. This suggests that the model works but it's not that good, it tells us that there is a lot of variance in men's weight that our model isn't understanding and therefore the regression rule it's built means many predictions may be too high or too low. These errors are technically called 'residuals' and we can plot them to see exactly how far out our weight values are from the regression line:
plot(fit.lm1$residuals, main = "Residuals for fit.lm1", ylab = "Residual", xlab = "", ylim = c(-12,12), pch = 19, col = "red") abline(h=0)
Some weight values are close to the regression line, but a lot are too high or too low!
Now the true test - let's try using our model to predict some men's weights based on data the model hasn't seen. We split out a validation set earlier for this very purpose. Remember, the model has never seen this particular data so will be relying completely on the rule it build when being trained above.
Using the predict functionality, we can create our predictions on the validation set using fit.lm1 - and we can add these as a column to our validation set making it easy to see how well our model performs.
validation$prediction.lm1 <- predict(fit.lm1, validation)
When you look through the data, you can see that again some predictions are close but many are too high or too low.
A good metric to assess how accurately your model is predicting is mean squared error (MSE) which is essentially finding what the average error is but ensuring that we square the errors so we're always dealing with positive numbers
#MSE for lm1 mean(abs(validation$prediction.lm1 - validation$weight)) A MSE of 4.6 or, on average our model is out by 4.6kg.
Is this good? Could we do better?
What if we had some other data that we thought could help predict weights, can we add that into the equation to try and improve our predictive accuracy? Yes we can! this is called multi-variate linear regression and is very similar with the formula essentially expanding to incorporate the extra variables.
y = a + bx + cx ... Conveniently, we have wrist girth in our data, so can add this into the mix and see if it helps predict men's weight. All we need to do in r is add it to the model formula using the + sign
#build the model and save it as an object fit.lm2 fit.lm2 <- lm(weight ~ height + wrist, data = training)
#get the stats for the model summary(fit.lm2)
The adjusted r-squared is much higher now, up to 0.8114! Wrist girth must be a good predictor of weight.
As we now have two variables in our equation (height and wrist girth) let's run through the formula for an example man who is 175cm tall and who has a 21cm wrist girth:
y = a + bx + cx
weight = intercept + (height * height coefficient) + (wrist girth * wrist girth coefficient)
y = 7.0192 + (0.1721 * 175) + (2.3138 * 21)
y = 85.7265kg
Let's check our residuals in this new model, and compare them to model 1
par(mfrow=c(2,1)) plot(fit.lm1$residuals, main = "Residuals for fit.lm1", ylab = "Residual", xlab = "", ylim = c(-12,12), pch = 19, col = "red") abline(h=0)
plot(fit.lm2$residuals, main = "Residuals for fit.lm2", ylab = "Residual", xlab = "", ylim = c(-12,12), pch = 19, col = "green") abline(h=0)
It looks as though there is still a lot of over and under prediction happening but these aren't so severe.
Another way to visualse this is using a histogram, and it might make it clearer to see the difference between model 1 and model 2
par(mfrow=c(2,1)) hist(fit.lm1$residuals, main = "Spread of Error: fit.lm1", xlim = c(-15,15), ylim = c(0,20), breaks = 7, col = "red") hist(fit.lm2$residuals, main = "Spread of Error: fit.lm2", xlim = c(-15,15), ylim = c(0,20), breaks = 7, col = "green")
Oour first model has quite a spread of errors, whereas with model two these are frequently closer to 0
Again, the true test - let's use our second model to predict the weights of the men in our validation set, we'll add another column to our validation set with these predictions
validation$prediction.lm2 <- predict(fit.lm2, validation)
Hard to tell if it's better at an overall level - calculated the MSE should tell us
#MSE for lm2 mean(abs(validation$prediction.lm2 - validation$weight))
A MSE of 3.58 for model two or, on average our model is out by approximately 3.6kg on average. There is still some error, but we've made an improvement of around 1kg in accuracy!
We hope you've enjoyed this introduction to Linear Regression modelling in R - please let us know your thoughts, and share using the social buttons below