Pandas drop duplicates – Remove Duplicate Rows

By ayed_amira , on 12/08/2020 , updated on 12/08/2020 - 5 minutes to read

Pandas drop duplicates: in this article we will see how to remove duplicate rows from a pandas dataframe so that only the unique rows remain.

Introduction

A pandas dataframe is a two-dimensional, size-mutable tabular data structure with labeled axes (rows and columns). Arithmetic operations between dataframes are aligned on these row and column labels.

Datasets very often contain duplicate rows. This can be a real problem, for example when performing arithmetic operations such as sums or averages, because every duplicated row is counted more than once.
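As a quick illustration of why duplicates matter, here is a minimal sketch using a small, hypothetical sales table (the column names order_id and amount are invented for this example and are not part of the tutorial's dataset). The same order appears twice, which inflates the total:

```python
import pandas as pd

# Hypothetical data: order 2 was recorded twice by mistake.
sales = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [10.0, 20.0, 20.0, 5.0],
})

# Summing the raw column counts the duplicated order twice.
total_with_duplicates = sales["amount"].sum()

# Dropping fully identical rows first gives the intended total.
total_without_duplicates = sales.drop_duplicates()["amount"].sum()

print(total_with_duplicates)     # 55.0
print(total_without_duplicates)  # 35.0
```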

In this tutorial, we will cover the following points:

  • Remove duplicate rows keeping the first row.
  • Remove duplicate rows keeping the last row.
  • Remove all duplicate rows from our dataframe.
  • Remove duplicate rows in relation to specific columns.

To illustrate these different points, we will use the following pandas dataframe:

import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
         'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']
        }
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

print(df)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
2             Thor  Avengers Endgame
3         Iron man          Avengers
4  Captain America          Avengers
5         Iron man          Avengers
6             Hulk          Avengers
7         Iron man  Avengers Endgame

Pandas drop duplicates

Pandas drop_duplicates() Syntax

The drop_duplicates() method is used to remove duplicate rows from a pandas dataframe. Its syntax is as follows:

# drop_duplicates() syntax

drop_duplicates(subset=None, keep="first", inplace=False)

The three parameters shown above are all optional:

  • subset: column label or list of column labels used to identify duplicate rows. By default, all columns are compared.
  • keep: the accepted values are "first", "last" and False (the boolean, not a string). With "first", every duplicate except the first occurrence is dropped. With "last", every duplicate except the last occurrence is dropped. With False, all duplicated rows are dropped, including the first occurrence.
  • inplace: if True, the original dataframe is modified and None is returned. By default, the original dataframe is left unchanged and a new dataframe is returned.
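The difference between the default behaviour and inplace=True is easy to miss, so here is a small sketch on a three-row dataframe made up for this purpose (it is not the marvel dataframe used in the rest of the tutorial):

```python
import pandas as pd

# Small illustrative dataframe: the first two rows are identical.
df = pd.DataFrame({
    "Heros": ["Thor", "Thor", "Hulk"],
    "Movies": ["Avengers", "Avengers", "Avengers"],
})

# Default (inplace=False): a new dataframe is returned, df is untouched.
deduplicated = df.drop_duplicates()
rows_before = len(df)
print(rows_before, len(deduplicated))  # 3 2

# inplace=True: df itself is modified and the call returns None.
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2
```

A common mistake is to write df = df.drop_duplicates(inplace=True), which replaces the dataframe with None.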

Remove duplicate rows keeping the first row

By default, the keep parameter is set to "first", so there is no need to specify it in order to keep the first occurrence of each duplicated row:

import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
         'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']
        }
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

df1 = df.drop_duplicates()

print(df1)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
2             Thor  Avengers Endgame
3         Iron man          Avengers
4  Captain America          Avengers
6             Hulk          Avengers
7         Iron man  Avengers Endgame

The row with index 5 has been removed from our dataframe because it is identical to the row with index 3.
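To check which rows would be dropped before actually dropping them, one option is the companion method DataFrame.duplicated(), which returns a boolean Series. This is a sketch on the same marvel data; the variable names mask and dropped_index are only illustrative:

```python
import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

# duplicated() marks each row that repeats an earlier row
# (it uses keep="first" by default, like drop_duplicates()).
mask = df.duplicated()

# Index labels of the rows that drop_duplicates() would remove.
dropped_index = df.index[mask].tolist()
print(dropped_index)  # [5]
```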

Remove duplicate rows keeping the last row

This time, to keep the last occurrence of each duplicated row, we need to set the keep parameter to "last":

import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
         'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']
        }
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

df1 = df.drop_duplicates(keep="last")

print(df1)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
2             Thor  Avengers Endgame
4  Captain America          Avengers
5         Iron man          Avengers
6             Hulk          Avengers
7         Iron man  Avengers Endgame

With keep="last", the row with index 3 has been removed from our dataframe instead of the row with index 5.
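Note that the original index labels are preserved, so the result has a gap where a row was removed. If a clean 0 to n-1 index is preferred, one possibility is the ignore_index parameter. This sketch assumes pandas 1.0 or later, where that parameter is available; it is not part of the syntax shown earlier in this tutorial:

```python
import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

# Assumption: pandas >= 1.0, which accepts ignore_index.
# The remaining rows are relabeled 0, 1, ..., n-1.
df1 = df.drop_duplicates(keep="last", ignore_index=True)

print(df1.index.tolist())  # [0, 1, 2, 3, 4, 5, 6]
```

On older versions, chaining .reset_index(drop=True) after drop_duplicates() gives the same relabeling.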

Remove all duplicate rows from our dataframe

If you want to remove every row that has a duplicate, including its first occurrence, you can pass keep=False:

import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
         'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']
        }
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

df1 = df.drop_duplicates(keep=False)

print(df1)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
2             Thor  Avengers Endgame
4  Captain America          Avengers
6             Hulk          Avengers
7         Iron man  Avengers Endgame

The rows with index 3 and 5 have both been removed from the dataframe.
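The same keep=False option exists on DataFrame.duplicated(), which can be handy for inspecting every copy of a duplicated row before deleting anything. A short sketch on the marvel data (the names all_copies and unique_only are illustrative):

```python
import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

# keep=False flags every member of a duplicate group, not just the repeats.
all_copies = df[df.duplicated(keep=False)]
print(all_copies.index.tolist())  # [3, 5]

# Inverting the mask gives the same rows as drop_duplicates(keep=False).
unique_only = df[~df.duplicated(keep=False)]
print(unique_only.index.tolist())  # [0, 1, 2, 4, 6, 7]
```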

Remove duplicate rows in relation to specific columns

By default, the method compares all the columns of the dataframe to decide whether a row is a duplicate. To compare only one or more specific columns, we can use the subset parameter:

import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
         'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']
        }
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

df1 = df.drop_duplicates(subset="Heros")

print(df1)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
3         Iron man          Avengers
4  Captain America          Avengers
6             Hulk          Avengers
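In the output above, the rows with index 2, 5 and 7 are gone because Thor and Iron man already appear at index 0 and 3, and only the Heros column is compared. The subset parameter can also be combined with keep. The following sketch, on the same marvel data, keeps the last row for each hero and then deduplicates on the Movies column instead:

```python
import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

# Keep the last row seen for each hero (subset also accepts a list of labels).
last_per_hero = df.drop_duplicates(subset=["Heros"], keep="last")
print(last_per_hero.index.tolist())  # [1, 2, 4, 6, 7]

# Keep the first row seen for each movie.
first_per_movie = df.drop_duplicates(subset="Movies")
print(first_per_movie.index.tolist())  # [0, 1]
```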

Conclusion

In this tutorial we learned how to use the drop_duplicates() method provided by pandas to remove duplicate rows from a dataframe. This method is used a lot in data cleaning. If you have any questions about its use, don't hesitate to leave me a comment. I will be happy to answer them.

See you soon for new tutorials.
