Pandas Get Dummies (One-Hot Encoding) – pd.get_dummies()

Pandas Get Dummies: This tutorial explains how to use the pd.get_dummies() function, which lets you one-hot encode your categorical data.
What does One-Hot encoding mean?
As a data scientist or data analyst, you often work with algorithms that cannot handle categorical data directly (many library implementations, such as random forest in scikit-learn, expect numeric input). Categorical variables are variables that take names or labels as values. Here are some examples:
- color (“blue”, “red”, “green”)
- country (“USA”, “France”, “India”)
- age range (“<18”, “>18 and <35”, “>=35”)
One-hot encoding is therefore an important step in preparing your data for machine learning. It turns a categorical column into separate columns of 0s and 1s (a binary vectorization), one per category, where each value indicates whether the row matches the column header. The resulting columns are called dummy (or indicator) columns.
In Python, the pandas library provides a function for this called pd.get_dummies().
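To make the idea concrete before turning to pd.get_dummies(), here is a minimal sketch that builds the indicator columns by hand. The colors Series is a hypothetical example, not part of this tutorial's dataset.

```python
import pandas as pd

# Hypothetical example column, used only to illustrate the concept
colors = pd.Series(['blue', 'red', 'green', 'red'], name='color')

# Build one 0/1 indicator column per distinct category by hand:
# each column is 1 where the row's value equals the column header
manual = pd.DataFrame(
    {category: (colors == category).astype(int)
     for category in sorted(colors.unique())}
)
print(manual)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.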
Loading our dataset
The first step is to import the libraries required to use this method and create a dataset to illustrate our examples in this tutorial:
import pandas as pd
df = pd.DataFrame.from_dict(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)
print(df)
Output:
heroes publisher Power
0 Batman DC COMICS Medium
1 Thor Marvel Strong
2 Hulk Marvel Strong
3 Spiderman Marvel Medium
4 Flash DC COMICS Low
Pandas Get Dummies
Syntax
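The pandas function pd.get_dummies() transforms your categorical variables into dummy indicator columns. The outputs in this tutorial show these columns as 0 and 1, which is the default in pandas versions before 2.0; from pandas 2.0 onward the default is True and False, and you can pass dtype=int to get 0 and 1. Here is the full syntax of the function: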
pandas.get_dummies(data,
                   prefix=None,
                   prefix_sep='_',
                   dummy_na=False,
                   columns=None,
                   sparse=False,
                   drop_first=False,
                   dtype=None)
Parameters
Name | Description | Type | Default Value | Required |
---|---|---|---|---|
data | Data of which to get dummy indicators. | array-like, Series, or DataFrame | – | Yes |
prefix | String to prepend to the dummy column names. | str, list of str, or dict of str | None | No |
prefix_sep | Specify what you want between your prefix and column names. | str | ‘_’ | No |
dummy_na | Whether to create an extra dummy column for NA values. | bool | False | No |
columns | Column names in the DataFrame that need to be encoded. If columns is None, all columns with object or category dtype are converted. | list-like | None | No |
sparse | Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False). | bool | False | No |
drop_first | Remove first level to get k-1 dummies out of k categorical levels. | bool | False | No |
dtype | Data type for new columns. Only a single dtype is allowed. | dtype | None (resolves to bool since pandas 2.0, np.uint8 in earlier versions) | No |
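Some of these parameters are easier to grasp from a small example. The sketch below uses a hypothetical Series containing a missing value (not the tutorial's dataset) to show the effect of dtype, dummy_na and drop_first.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a missing value, used only for illustration
power = pd.Series(['Medium', 'Strong', np.nan, 'Low'], name='Power')

# dtype=int requests 0/1 integers instead of the version-dependent default
basic = pd.get_dummies(power, dtype=int)

# dummy_na=True adds an extra indicator column for missing values
with_na = pd.get_dummies(power, dummy_na=True, dtype=int)

# drop_first=True drops the first level (here 'Low'), leaving k-1 columns
reduced = pd.get_dummies(power, drop_first=True, dtype=int)

print(basic)
print(with_na)
print(reduced)
```

Without dummy_na=True, the row holding the missing value is encoded as all zeros.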
Pandas Get Dummies – Create Dummy Indicator columns
To create dummy columns with pd.get_dummies(), we pass it the pandas object to encode, which can be a single column (a Series) or a whole DataFrame.
Let’s try to transform our “publisher” column:
# Use the previous code to get the dataframe
dummy = pd.get_dummies(df['publisher'])
print(dummy)
Output:
DC COMICS Marvel
0 1 0
1 0 1
2 0 1
3 0 1
4 1 0
The output contains only the dummy columns, not the other columns of the dataframe. To add these new columns to the existing dataframe, you can use the pandas concat function:
# Use concat function
df_dummy = pd.concat([df,dummy], axis=1)
print(df_dummy)
Output:
heroes publisher Power DC COMICS Marvel
0 Batman DC COMICS Medium 1 0
1 Thor Marvel Strong 0 1
2 Hulk Marvel Strong 0 1
3 Spiderman Marvel Medium 0 1
4 Flash DC COMICS Low 1 0
We see that the dummy columns have been added to our initial dataframe.
You can also pass the whole dataframe and name the columns to encode with the columns parameter. The result keeps the other columns and replaces each encoded column with its dummy columns:
dummy = pd.get_dummies(df, columns= ['publisher'])
print(dummy)
Output:
heroes Power publisher_DC COMICS publisher_Marvel
0 Batman Medium 1 0
1 Thor Strong 0 1
2 Hulk Strong 0 1
3 Spiderman Medium 0 1
4 Flash Low 1 0
You can see that the publisher column has been removed from the final dataframe and that two new dummy columns have been added.
Pandas Get Dummies – Adding a Prefix to Columns Names
In our example above, we see that the new dummy variables start with ‘publisher_’. It is possible to change this prefix and separator with the prefix and prefix_sep parameters:
dummy = pd.get_dummies(df, columns=['publisher'], prefix="amiradata", prefix_sep="/")
print(dummy)
Output:
heroes Power amiradata/DC COMICS amiradata/Marvel
0 Batman Medium 1 0
1 Thor Strong 0 1
2 Hulk Strong 0 1
3 Spiderman Medium 0 1
4 Flash Low 1 0
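The prefix parameter also accepts a dict that maps column names to prefixes, which can be handy when encoding several columns at once. A minimal sketch follows; the prefixes 'pub' and 'pow' are arbitrary names chosen for illustration, and dtype=int is added so the values are 0 and 1 on recent pandas versions too.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)

# Map each encoded column to its own (illustrative) prefix
dummy = pd.get_dummies(
    df,
    columns=['publisher', 'Power'],
    prefix={'publisher': 'pub', 'Power': 'pow'},
    dtype=int,
)
print(dummy.columns.tolist())
```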
How to One-Hot Encode a Pandas Dataframe
In our previous examples, we saw how to encode a single categorical column in our dataframe. If your dataframe contains many categorical columns, encoding them one by one quickly becomes tedious.
The code below loops over the categorical columns, creates the dummy columns for each one, and concatenates them to the original dataframe. The last step in each iteration removes the original categorical column, so that the information is not duplicated and the dataframe can be fed to a machine learning algorithm.
Here is the code:
import pandas as pd
pd.options.display.max_columns = None # To print all the columns
pd.options.display.max_rows = None # To print all the rows
df = pd.DataFrame.from_dict(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)
categorical_columns = ['publisher', 'Power']
for column in categorical_columns:
    dummies = pd.get_dummies(df[column], prefix=column)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(columns=column)
print(df)
Output:
heroes publisher_DC COMICS publisher_Marvel Power_Low Power_Medium \
0 Batman 1 0 0 1
1 Thor 0 1 0 0
2 Hulk 0 1 0 0
3 Spiderman 0 1 0 1
4 Flash 1 0 1 0
Power_Strong
0 0
1 1
2 1
3 0
4 0
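For this particular case, the loop may not be needed: passing several names to the columns parameter should produce the same columns in a single call. A minimal sketch, assuming the same dataset as above (dtype=int is added so the values are 0 and 1 on recent pandas versions too):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)
categorical_columns = ['publisher', 'Power']

# Encode every listed column in one call; the originals are dropped
# and each dummy column is prefixed with its source column name
encoded = pd.get_dummies(df, columns=categorical_columns, dtype=int)
print(encoded)
```

The explicit loop remains useful when each column needs different handling, for example a different prefix or a filtering step.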
What Are The Potential Drawbacks of Using The pd.get_dummies() Function?
Encoding categorical columns is very useful when done with care. Its main drawback is that it can create a very large number of columns when the input columns contain many distinct categories (high cardinality).
Before encoding, you can call the nunique() method on a column (pd.Series.nunique()) to find out how many distinct values it holds, and therefore how many new columns will be created.
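As a quick check on the tutorial's dataset, the sketch below counts the distinct values per categorical column before encoding. DataFrame.nunique() applies the same count to each selected column.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)
categorical_columns = ['publisher', 'Power']

# Number of distinct values per column = number of dummy columns
# each one would produce (without drop_first or dummy_na)
n_categories = df[categorical_columns].nunique()
total_new_columns = int(n_categories.sum())

print(n_categories)
print(total_new_columns)
```

Here publisher has 2 distinct values and Power has 3, so encoding both creates 5 new columns.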
Conclusion
In this post, we learned what one-hot encoding is and how to generate dummy variables from a categorical column.
I hope you enjoyed this article. If you have any questions about how to use this function, please feel free to share them in the comments.