Pandas Get Dummies (One-Hot Encoding) – pd.get_dummies()

Pandas Get Dummies: This tutorial explains how to use the pd.get_dummies() function, which lets you easily one-hot encode your categorical data.

What does One-Hot encoding mean?

As a data scientist or data analyst, you often work with algorithms that cannot handle categorical data directly (many random forest implementations, for example, expect numeric input). Categorical variables are variables that take names or labels as values. Here are some examples:

  • color (“blue”, “red”, “green”)
  • country (“USA”, “France”, “India”)
  • age range (“<18”, “>18 and <35”, “>=35”)

One-hot encoding is therefore an important step in preparing your data for machine learning. It involves turning a categorical column into separate columns of 0s and 1s (a binary vectorization), one per category, where each value indicates whether the row matches the column header. These new columns are called dummy (or indicator) columns.

In Python, the pandas library provides a function for this: pd.get_dummies().
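To make the idea concrete, here is a minimal sketch that builds the 0/1 columns by hand, without pd.get_dummies(). The colors Series is a made-up example and is not part of the dataset used in the rest of this tutorial.

```python
import pandas as pd

# Hypothetical example data: one categorical column with three distinct values.
colors = pd.Series(["blue", "red", "green", "blue"], name="color")

# Build one indicator column per distinct value by comparing the
# Series against that value and converting True/False to 1/0.
manual = pd.DataFrame(
    {value: (colors == value).astype(int) for value in sorted(set(colors))}
)

print(manual)
```

Each row of the result contains exactly one 1, in the column whose header matches the original value. This is what pd.get_dummies() automates.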

Loading our dataset

The first step is to import the libraries required to use this method and create a dataset to illustrate our examples in this tutorial:

import pandas as pd

df = pd.DataFrame.from_dict(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)

print(df)

Output:

      heroes  publisher   Power
0     Batman  DC COMICS  Medium
1       Thor     Marvel  Strong
2       Hulk     Marvel  Strong
3  Spiderman     Marvel  Medium
4      Flash  DC COMICS     Low

Pandas Get Dummies

Syntax

The pd.get_dummies() function transforms categorical data into dummy indicator columns (columns of 0s and 1s). Here is the full syntax of the function:

pandas.get_dummies(data,
prefix=None,
prefix_sep='_',
dummy_na=False,
columns=None,
sparse=False,
drop_first=False,
dtype=None)

Parameters

Each parameter is listed with its type, its default value, and whether it is required:

  • data (array-like, Series, or DataFrame; required): Data from which to get dummy indicators.
  • prefix (str, list of str, or dict of str; default None; optional): String to prepend to the dummy column names.
  • prefix_sep (str; default '_'; optional): Separator placed between the prefix and the category value.
  • dummy_na (bool; default False; optional): Whether to create a dummy column for NA values.
  • columns (list-like; default None; optional): Column names in the DataFrame to encode. If columns is None, all columns with object or category dtype are converted.
  • sparse (bool; default False; optional): Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
  • drop_first (bool; default False; optional): Remove the first level to get k-1 dummies out of k categorical levels.
  • dtype (dtype; default np.uint8 before pandas 2.0, bool from pandas 2.0 onward; optional): Data type for the new columns. Only a single dtype is allowed.
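As a rough illustration of three of these parameters, the sketch below applies dtype, drop_first and dummy_na to a small Series. The power Series, including its missing value, is invented for this example. Note that the outputs in this tutorial show 0 and 1, and from pandas 2.0 onward you would see True and False unless you pass dtype=int as done here.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value.
power = pd.Series(["Medium", "Strong", np.nan, "Low"], name="Power")

# dtype=int asks for 0/1 integers instead of the version-dependent default.
as_int = pd.get_dummies(power, dtype=int)

# drop_first=True removes the first level ("Low" here, because levels
# are sorted), leaving k-1 columns for k levels.
reduced = pd.get_dummies(power, dtype=int, drop_first=True)

# dummy_na=True adds one extra indicator column for missing values.
with_na = pd.get_dummies(power, dtype=int, dummy_na=True)

print(as_int.columns.tolist())
print(reduced.columns.tolist())
print(with_na.shape)
```

Without dummy_na=True, the row holding the missing value is encoded as all zeros, so it cannot be told apart from a dropped first level when drop_first=True is also used.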

Pandas Get Dummies – Create Dummy Indicator columns

To create dummy columns with pd.get_dummies(), we pass it the pandas Series or DataFrame that we want to encode.

Let’s try to transform our “publisher” column:

# Use the previous code to get the dataframe

dummy = pd.get_dummies(df['publisher'])
print(dummy)
 

Output:

   DC COMICS  Marvel
0          1       0
1          0       1
2          0       1
3          0       1
4          1       0

The output contains only the dummy columns, not the other columns of our dataframe. To add these new columns to the existing dataframe, you can use the pandas concat function:

# Use concat function

df_dummy = pd.concat([df, dummy], axis=1)
print(df_dummy)
 

Output:

      heroes  publisher   Power  DC COMICS  Marvel
0     Batman  DC COMICS  Medium          1       0
1       Thor     Marvel  Strong          0       1
2       Hulk     Marvel  Strong          0       1
3  Spiderman     Marvel  Medium          0       1
4      Flash  DC COMICS     Low          1       0

We see that the dummy columns have been added to our initial dataframe.

You can also pass the whole dataframe and name the columns to encode with the columns parameter. The result keeps the other columns and replaces each encoded column with its dummy columns:

dummy = pd.get_dummies(df, columns=['publisher'])
print(dummy)
 

Output:

      heroes   Power  publisher_DC COMICS  publisher_Marvel
0     Batman  Medium                    1                 0
1       Thor  Strong                    0                 1
2       Hulk  Strong                    0                 1
3  Spiderman  Medium                    0                 1
4      Flash     Low                    1                 0

You can see that the publisher column has been removed from the final dataframe and that two new dummy columns have been added.
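The columns parameter matters because, when it is left at None, every text column is encoded. The sketch below, which rebuilds the tutorial's dataframe, compares the two cases. Treating 'heroes' as an identifier that should not be encoded is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)

# columns=None: all three text columns are encoded, including 'heroes',
# giving 5 hero columns + 2 publisher columns + 3 power columns.
everything = pd.get_dummies(df)

# columns given explicitly: 'heroes' is left untouched.
selected = pd.get_dummies(df, columns=['publisher', 'Power'])

print(everything.shape)
print(selected.shape)
```

Listing the columns explicitly is usually the safer choice when the dataframe contains identifiers or free text.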

Pandas Get Dummies – Adding a Prefix to Columns Names

In our example above, we see that the new dummy column names start with ‘publisher_’. It is possible to change this prefix and separator with the prefix and prefix_sep parameters:

dummy = pd.get_dummies(df, columns=['publisher'], prefix="amiradata", prefix_sep="/")

print(dummy)
 

Output:

      heroes   Power  amiradata/DC COMICS  amiradata/Marvel
0     Batman  Medium                    1                 0
1       Thor  Strong                    0                 1
2       Hulk  Strong                    0                 1
3  Spiderman  Medium                    0                 1
4      Flash     Low                    1                 0
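When several columns are encoded at once, prefix also accepts a dict that maps each column name to its own prefix. The following sketch shows this on the tutorial's dataframe. The short prefixes "pub" and "pow" and the "-" separator are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)

# One prefix per encoded column, given as {column name: prefix}.
named = pd.get_dummies(
    df,
    columns=['publisher', 'Power'],
    prefix={'publisher': 'pub', 'Power': 'pow'},
    prefix_sep='-',
)

print(named.columns.tolist())
```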

How to One-Hot Encode a Pandas Dataframe

In our previous examples, we saw how to encode a single categorical column in our dataframe. If your dataframe contains many categorical columns, encoding them one by one can be tedious.

Using the code below, we loop over the categorical columns. For each one, we build its dummy columns, merge them into the dataframe, and then drop the original categorical column. Dropping it avoids redundant information and leaves the encoded columns ready to be used in a machine learning algorithm.

Here is the code:

import pandas as pd

pd.options.display.max_columns = None # To print all the columns
pd.options.display.max_rows = None # To print all the rows

df = pd.DataFrame.from_dict(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)

categorical_columns = ['publisher', 'Power']

for column in categorical_columns:
    dummies = pd.get_dummies(df[column], prefix=column)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(columns=column)

print(df)

Output:

      heroes  publisher_DC COMICS  publisher_Marvel  Power_Low  Power_Medium  \
0     Batman                    1                 0          0             1   
1       Thor                    0                 1          0             0   
2       Hulk                    0                 1          0             0   
3  Spiderman                    0                 1          0             1   
4      Flash                    1                 0          1             0   

   Power_Strong  
0             0  
1             1  
2             1  
3             0  
4             0  
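The loop above can also be written as a single call, since the columns parameter accepts a list. This is a sketch of that alternative, not a replacement prescribed by the tutorial. It passes dtype=int so the result shows 0 and 1 on recent pandas versions too.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)

categorical_columns = ['publisher', 'Power']

# Encode every listed column in one call; each original column is
# dropped and its name is used as the prefix of the dummy columns.
encoded = pd.get_dummies(df, columns=categorical_columns, dtype=int)

print(encoded.columns.tolist())
```

The single call produces the same column names, in the same order, as the loop.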

What Are The Potential Drawbacks of Using The pd.get_dummies() Function?

Encoding categorical columns can be very useful if it is used wisely. Its major drawback is that it can create a very large number of columns if your input columns contain many distinct categories.

I advise you to call the nunique() method (pd.Series.nunique()) on each categorical column beforehand to know how many new columns you will create.
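As a quick sketch of that check on the tutorial's dataframe, nunique() can be called on several columns at once, and the sum of the counts gives the number of dummy columns to expect when drop_first is left at False:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'heroes': ['Batman', 'Thor', 'Hulk', 'Spiderman', 'Flash'],
        'publisher': ['DC COMICS', 'Marvel', 'Marvel', 'Marvel', 'DC COMICS'],
        'Power': ['Medium', 'Strong', 'Strong', 'Medium', 'Low']
    }
)

# Number of distinct values per categorical column.
counts = df[['publisher', 'Power']].nunique()
print(counts)

# Total number of dummy columns that encoding both columns would create.
print(counts.sum())
```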

Conclusion

In this post, we learned what one-hot encoding is and how to generate dummy variables from a categorical column.

I hope you enjoyed this article. If you have any questions about how to use this function, please feel free to share them in the comments.


By ayed_amira

I'm a data scientist. Passionate about new technologies and programming, I created this website mainly for people who want to learn more about data science and programming :)
