First, what is a dummy variable and why do we need them?
What does machine learning do with labels like “United States” when trying to figure out how to process data? These models cannot use these labels in mathematical operations. “1 + United States” does not have a result. So, these labels (commonly referred to as “categorical” variables) need to be converted to something upon which operations can occur.
Let’s take a very simple example. You are trying to use (multiple) linear regression to estimate how the following variables affect workers’ salaries:
- the countries in which the workers are employed
- the age of the workers
- the number of years the workers have been on the job
You have a list of salaries, and you want to plot them and use machine learning to estimate future salaries from country, age and years on the job. Salary is your dependent variable (the one you want to watch change in response to changes in the other variables). The other variables are your independent variables.
With dummy variables:
As already noted, categorical variables need to be converted to numerical values. We do not want to do this in a single column, because our machine learning model might treat the numbers as having an order or magnitude. If “United States” is given a value of “1” and “Canada” is given a value of “2”, “United States” might be considered numerically more (or less, depending on our logic) significant. To resolve this issue, we create “dummy variables”, giving each category its own column that holds a 0 or a 1 (0 if “no” and 1 if “yes”). The dataset with the dummy variables might look like the following:
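As a rough sketch of that conversion, the snippet below builds the dummy columns by hand in plain Python. The rows and the helper name `add_dummies` are made up for illustration and are not from the original dataset. In practice a library routine such as pandas’ `get_dummies` would usually do this work.

```python
# Hypothetical rows, invented purely for illustration
workers = [
    {"country": "United States", "age": 34, "years_on_job": 5},
    {"country": "Canada", "age": 45, "years_on_job": 12},
    {"country": "Canada", "age": 29, "years_on_job": 2},
]


def add_dummies(rows, column):
    """Replace `column` with one 0/1 column per distinct category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one
        new_row = {key: value for key, value in row.items() if key != column}
        # Add a 0/1 flag for each category
        for category in categories:
            new_row[category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


encoded_workers = add_dummies(workers, "country")
for row in encoded_workers:
    print(row)
```

Each row ends up with exactly one “1” among the country columns, which is the property that matters in the next section.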
Second, what is the trap?
Imagine that you have a dataset with the constant “1” and dummy columns for “male” and “female”. In every row, the male and female columns add up to 1, which is exactly the value in the constant column. The dummy columns therefore repeat information the constant already carries (perfect multicollinearity), and the regression no longer has a unique solution. The solution? Remove either the constant or one of the dummy variables. Back to our example: like the male versus female example, the country in our dataset must be either “United States” or “Canada”, so we can remove one of these to avoid the Dummy Variable Trap.
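To make “no unique solution” concrete, here is a small plain-Python sketch. The rows and coefficient values are arbitrary numbers chosen for illustration. With the constant and both country dummies present, two different sets of coefficients give exactly the same predictions, so the data cannot tell the regression which set is correct.

```python
# Columns: constant, United States dummy, Canada dummy (hypothetical rows)
rows = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
]


def predict(row, coefficients):
    """Linear prediction: sum of each value times its coefficient."""
    return sum(x * b for x, b in zip(row, coefficients))


# The two dummies always add up to the constant column
dummies_match_constant = all(row[1] + row[2] == row[0] for row in rows)

# Two different coefficient sets (arbitrary illustrative numbers):
# the second moves 10 out of the constant and into both dummies
coefficients_a = [50, 10, 0]
coefficients_b = [40, 20, 10]

predictions_a = [predict(row, coefficients_a) for row in rows]
predictions_b = [predict(row, coefficients_b) for row in rows]
same_predictions = predictions_a == predictions_b

# Dropping the United States dummy removes the redundancy
reduced_rows = [[row[0], row[2]] for row in rows]

print(dummies_match_constant, same_predictions)
print(reduced_rows)
```

Once the United States column is dropped, the constant stands for the United States baseline and the Canada coefficient measures the difference from it, so the same shifting trick is no longer possible.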
With constant and both dummy variables:
With constant and one dummy variable (United States dummy variable removed):
We have now avoided the Dummy Variable Trap in this dataset!