Before we get started, let's create some basic data to support the example. Suppose we had some data for regression modelling purposes, where we wanted to predict height based on nationality.
id <- factor(1:10)
height <- round(175 + rnorm(10) * 10)
nationality <- c("AUS","UK","NZ","NZ","AUS","UK","NZ","UK","NZ","NZ")
# Build the data frame directly; wrapping the columns in cbind() first would coerce height to text
mydata <- data.frame(id, height, nationality)
What is a One Hot Encoding?
One hot encoding is a representation of categorical variables as binary vectors. What this means is that we want to transform a categorical variable or variables to a format that works better with classification and regression algorithms. We also sometimes call these dummy variables.
Why is it necessary, and when?
In their purest form, regression models treat all independent variables as numeric. If we have non-numeric data that we think may be important, we want to be able to use it in the model. In the data above, nationality is a categorical variable, so the regression algorithm can't use it as it stands. Often, software will translate each category into a numeric code; for example, it will assign AUS as 1, UK as 2, and NZ as 3. The algorithm will then try to predict height using these numerical values.
The problem we face here is that, despite this author's viewpoint, it's not fair to say that NZ > AUS. They are just different categories, without any order to them. We need an approach to counter this, one that allows us to fairly understand the relationship between the different nationalities and height.
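To make the ordering problem concrete, here is a minimal sketch of that kind of integer coding. The level order passed to factor() is an assumption, chosen only so the codes match the AUS = 1, UK = 2, NZ = 3 example above (by default R would order the levels alphabetically).

```r
# Illustrative only: level order chosen to match the example in the text
nationality <- c("AUS","UK","NZ","NZ","AUS","UK","NZ","UK","NZ","NZ")
codes <- as.integer(factor(nationality, levels = c("AUS", "UK", "NZ")))

# One integer per person: AUS becomes 1, UK becomes 2, NZ becomes 3
codes

# A model fed these codes would treat NZ as "three times" AUS, and the gap
# AUS -> UK as equal to the gap UK -> NZ; neither statement means anything here.
```

Used as a numeric predictor, this single column forces one slope across all three countries, which is exactly the assumption we want to avoid.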
How do we do it?
We want to create new columns, one for each nationality. Each new column will have a 1 or a 0 to show whether each person is from that country or not. While there are packages in R designed to do this ('dummies', for example), one way to do this in base R is to use a loop.
for (unique_value in unique(mydata$nationality)) {
  mydata[paste("nationality", unique_value, sep = ".")] <- ifelse(mydata$nationality == unique_value, 1, 0)
}
Above, we've created a loop that searches through the variable in question (nationality) and finds all the unique values. For each of those unique values, it then creates a new column and assigns a 1 or a 0 depending on whether that is the nationality of the person in that row.
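As an aside, base R can also build the same kind of indicator columns without a loop. The sketch below swaps in model.matrix() as an alternative to the loop above; the object name one_hot is made up for illustration, and the column names it produces (nationalityAUS and so on) differ from the nationality.AUS style used by the loop.

```r
nationality <- c("AUS","UK","NZ","NZ","AUS","UK","NZ","UK","NZ","NZ")
mydata <- data.frame(id = factor(1:10), nationality = factor(nationality))

# "- 1" drops the intercept so that every level gets its own 0/1 column
one_hot <- model.matrix(~ nationality - 1, data = mydata)

colnames(one_hot)
head(one_hot)
```

The loop is easier to read when you are learning what the encoding does; model.matrix() is handy once there are many categorical columns.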
Let's look at our data set now:
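One way to inspect the result is sketched below. It rebuilds the example so it can run on its own, leaves out the random height column so the printout is the same on every run, and adds a quick sanity check (an addition, not part of the original walkthrough) that every row has exactly one 1 across the new columns.

```r
nationality <- c("AUS","UK","NZ","NZ","AUS","UK","NZ","UK","NZ","NZ")
mydata <- data.frame(id = factor(1:10), nationality)

# Same loop as above: one 0/1 column per unique nationality
for (unique_value in unique(mydata$nationality)) {
  mydata[paste("nationality", unique_value, sep = ".")] <- ifelse(mydata$nationality == unique_value, 1, 0)
}

print(mydata)

# Each person belongs to exactly one nationality, so each row should sum to 1
dummy_cols <- grep("^nationality\\.", names(mydata), value = TRUE)
rowSums(mydata[dummy_cols])
```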
Perfect, we're now ready to run the model!
Note: It's worth mentioning that the model won't actually need all the dummy variables, and you'll often see one missing in the model summary. There is a very logical reason for this: the model doesn't need the final dummy variable, because its value can already be deduced from the combination of all the other dummy variables. If a person is neither AUS nor UK, they must be NZ.
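Here is a small sketch of what that looks like in practice, assuming we fit the model with lm() and pass in all three dummy columns. The set.seed() call and the object name fit are additions for reproducibility and illustration only.

```r
set.seed(1)  # assumption: fixed seed so the random heights are repeatable
height <- round(175 + rnorm(10) * 10)
nationality <- c("AUS","UK","NZ","NZ","AUS","UK","NZ","UK","NZ","NZ")
mydata <- data.frame(height, nationality)

for (unique_value in unique(mydata$nationality)) {
  mydata[paste("nationality", unique_value, sep = ".")] <- ifelse(mydata$nationality == unique_value, 1, 0)
}

# All three dummies plus an intercept are linearly dependent,
# so lm() cannot estimate one of them and reports it as NA
fit <- lm(height ~ nationality.AUS + nationality.UK + nationality.NZ, data = mydata)
coef(fit)
```

The nationality left without an estimate becomes the baseline: the intercept is its average height, and the other coefficients are differences from it.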