Splitting data into Training & Validation sets for modelling in R

Andrew Jones
Nov 16, 2017

When building a predictive model, it's a good idea to test how well it predicts on a new or unseen set of data points. This gives a truer gauge of how accurately it will predict when you let it loose in a real-world scenario.

The purpose of a predictive model is to predict what we do not know. Assessing a model's predictive power on data points it has already learnt from is essentially cheating: it flatters the model, and can hide what is called over-fitting.

Over-fitting is where the rules learnt by the trained model are too complicated to hold for the entire population. In this situation, the model learns the very specific nuances of the training data, and when these precise rules are applied to another set of data (which has its own very specific nuances), it struggles to apply them accurately. Instead, we need the model to learn an approximate rule set for the entire population, so that it can predict accurately across any set of data points, even those which it has not yet seen.
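To make this concrete, here is a minimal sketch on simulated data, not part of the original walkthrough. The data frame `sim`, the helper `rmse`, the seed and the polynomial degree of 10 are all illustrative assumptions. It holds out a quarter of the rows (the same kind of split covered in the rest of this post) and compares a straight-line model with a much more flexible one.

```r
set.seed(123)  # arbitrary seed so this sketch is reproducible

# Simulated data (an assumption for illustration): a straight line plus noise
n <- 60
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(n, sd = 1)

# Hold out 25% of the rows for validation
train_rows <- sample(seq_len(n), size = floor(0.75 * n))
sim_train <- sim[train_rows, ]
sim_valid <- sim[-train_rows, ]

# A simple model and a deliberately over-flexible one
fit_simple  <- lm(y ~ x, data = sim_train)
fit_complex <- lm(y ~ poly(x, 10), data = sim_train)

# Hypothetical helper: root mean squared error of a model on a data set
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

results <- data.frame(
  model           = c("simple", "complex"),
  training_rmse   = c(rmse(fit_simple, sim_train), rmse(fit_complex, sim_train)),
  validation_rmse = c(rmse(fit_simple, sim_valid), rmse(fit_complex, sim_valid))
)
print(results)
```

The flexible model can never do worse than the straight line on the training rows, because it contains the straight line as a special case. Whether it does worse on the validation rows depends on the data, and that is exactly what the held-out set is there to tell you.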

Below, we run through some simple code to split our data into a training set and a validation set:

#specify what proportion of the data we want to use to train the model

training_size <- 0.75

#use the sample function to select random rows from our data to meet the proportion specified above

training_rows <- sample(seq_len(nrow(mydata)), size = floor(training_size * nrow(mydata)))

#training set

mydata_training <- mydata[training_rows, ]

#validation set

mydata_validation <- mydata[-training_rows, ]
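As a quick way to try this out, the sketch below runs the same steps on `mtcars`, a data set that ships with base R and stands in here for `mydata` (an assumption for illustration only), and then checks that the two sets are the expected size and share no rows.

```r
# mtcars (32 rows) stands in for `mydata` in this sketch
mydata <- mtcars

training_size <- 0.75

# Pick 75% of the row numbers at random
training_rows <- sample(seq_len(nrow(mydata)), size = floor(training_size * nrow(mydata)))

mydata_training   <- mydata[training_rows, ]
mydata_validation <- mydata[-training_rows, ]

# Sanity checks on the split
nrow(mydata_training)    # 24, i.e. floor(0.75 * 32)
nrow(mydata_validation)  # 8, the remaining rows

# No row should appear in both sets
length(intersect(rownames(mydata_training), rownames(mydata_validation)))  # 0
```

Which rows land in each set will differ from run to run, but the sizes will not.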

Optionally, you can set a seed before selecting the random rows to ensure reproducible training and validation sets. This helps when trying to isolate the improvements you've gained from tuning your model, because a change in performance can no longer be down to a different random split.

#set a seed, then use the sample function as before, so the same random rows are selected every time the code is run

set.seed(456)

training_rows <- sample(seq_len(nrow(mydata)), size = floor(training_size * nrow(mydata)))
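A small sketch of what the seed buys you, again using the built-in `mtcars` data as a stand-in for `mydata` (the names `n_train`, `first_draw` and `second_draw` are illustrative): resetting the same seed before each call to `sample` gives back the same rows.

```r
mydata <- mtcars  # built-in data set standing in for `mydata`
training_size <- 0.75
n_train <- floor(training_size * nrow(mydata))

# Same seed before each draw...
set.seed(456)
first_draw <- sample(seq_len(nrow(mydata)), size = n_train)

set.seed(456)
second_draw <- sample(seq_len(nrow(mydata)), size = n_train)

# ...gives exactly the same row numbers
identical(first_draw, second_draw)  # TRUE
```

One caveat: R 3.6.0 changed the default sampling algorithm, so the same seed can select different rows in older versions of R than in newer ones.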

Hopefully this helps you in your modelling endeavours. Please feel free to share!

#analyticslink #R #modelling #andrewjones
