Pandas drop duplicates – Remove Duplicate Rows

Pandas drop duplicates: in this article we will see how to remove duplicate rows from a pandas dataframe and keep only the unique rows.
Introduction
A pandas dataframe is a two-dimensional, size-mutable tabular data structure with labeled axes (rows and columns). Arithmetic operations between dataframes are aligned on both the row and the column labels.
Datasets very often contain duplicate rows. This can be a real problem, for example because duplicates skew sums, counts and averages.
In this tutorial, we will cover the following points:
- Remove duplicate rows keeping the first row.
- Remove duplicate rows keeping the last row.
- Remove all duplicate rows from our dataframe.
- Remove duplicate rows in relation to specific columns.
To illustrate these different points, we will use the following pandas dataframe:
import pandas as pd
marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])
print(df)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
2             Thor  Avengers Endgame
3         Iron man          Avengers
4  Captain America          Avengers
5         Iron man          Avengers
6             Hulk          Avengers
7         Iron man  Avengers Endgame
Pandas drop duplicates
Pandas drop_duplicates() Syntax
The drop_duplicates() method is used to remove duplicate rows from a pandas dataframe. Its syntax is as follows:
# drop_duplicates() syntax
drop_duplicates(subset=None, keep="first", inplace=False)
The parameters covered in this tutorial are all optional:
- subset: column label or list of column labels used to identify duplicate rows. By default, all columns are used.
- keep: the accepted values are "first", "last" and False. With "first" (the default), every duplicate except the first occurrence is dropped. With "last", every duplicate except the last occurrence is dropped. With False (the boolean, not a string), all rows that have a duplicate are dropped.
- inplace: if True, the original DataFrame is modified and None is returned. By default (False), the original DataFrame is left unchanged and a new DataFrame is returned.
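To make the difference between the default behaviour and inplace=True concrete, here is a minimal sketch. The three-row dataframe and the name `deduped` are made up for this illustration only:

```python
import pandas as pd

# Tiny illustrative dataframe (hypothetical data): rows 0 and 1 are identical.
df = pd.DataFrame({"Heros": ["Thor", "Thor", "Hulk"],
                   "Movies": ["Avengers", "Avengers", "Avengers"]})

# Default (inplace=False): a new dataframe is returned, df is left untouched.
deduped = df.drop_duplicates()
before = len(df)
print(before, len(deduped))  # 3 2

# inplace=True: df itself is modified, so the return value is not used here.
df.drop_duplicates(inplace=True)
print(len(df))  # 2
```

Assigning the result to a new variable, as in the rest of this tutorial, is usually easier to follow because the original data stays available for comparison.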
Remove duplicate rows keeping the first row
By default, the keep parameter is set to "first", so there is no need to specify it when you want to keep the first occurrence:
import pandas as pd
marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])
df1 = df.drop_duplicates()
print(df1)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
2             Thor  Avengers Endgame
3         Iron man          Avengers
4  Captain America          Avengers
6             Hulk          Avengers
7         Iron man  Avengers Endgame
The row with index 5 has been removed from our dataframe because it is identical to the row with index 3.
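If you would rather not spot the missing row by eye, one possible way to list the dropped index labels is to compare the two indexes. This is only a sketch; the name `dropped` is introduced here for illustration:

```python
import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])
df1 = df.drop_duplicates()

# Index labels present in the original dataframe but absent from the result.
dropped = df.index.difference(df1.index)
print(list(dropped))  # [5]
```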
Remove duplicate rows keeping the last row
This time, to keep the last occurrence of each duplicated row, we have to set the keep parameter to "last":
import pandas as pd
marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])
df1 = df.drop_duplicates(keep="last")
print(df1)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
2             Thor  Avengers Endgame
4  Captain America          Avengers
5         Iron man          Avengers
6             Hulk          Avengers
7         Iron man  Avengers Endgame
With keep="last", the row with index 3 has been removed from our dataframe instead of the row with index 5.
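Note that the outputs above keep the original index labels, so every dropped row leaves a gap in the numbering. If a clean 0..n-1 index is preferred, a possible approach is the ignore_index parameter. This sketch assumes pandas 1.0 or later, where that parameter is available; it is not part of the syntax shown earlier in this article:

```python
import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

# Assumption: pandas >= 1.0. The remaining rows are relabeled 0, 1, ..., n-1.
df1 = df.drop_duplicates(keep="last", ignore_index=True)
print(list(df1.index))  # [0, 1, 2, 3, 4, 5, 6]
```

On older versions, calling reset_index(drop=True) on the result should give the same renumbering.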
Remove all duplicate rows from our dataframe
If you want to remove every row that has a duplicate, without keeping any occurrence, you can use the parameter keep=False:
import pandas as pd
marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])
df1 = df.drop_duplicates(keep=False)
print(df1)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
2             Thor  Avengers Endgame
4  Captain America          Avengers
6             Hulk          Avengers
7         Iron man  Avengers Endgame
The rows with index 3 and 5 have both been removed from the dataframe.
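Before deleting anything, it can be useful to look at the rows that keep=False would remove. A minimal sketch using the related duplicated() method, which returns a boolean mask instead of a filtered dataframe (the name `mask` is chosen here for illustration):

```python
import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

# True for every row that has at least one identical twin in the dataframe.
mask = df.duplicated(keep=False)
print(list(df[mask].index))  # [3, 5]

# Keeping the rows where the mask is False gives the same result as keep=False.
same = df[~mask].equals(df.drop_duplicates(keep=False))
print(same)  # True
```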
Remove duplicate rows in relation to specific columns
By default, the function compares all the columns of the dataframe to decide whether a row is a duplicate. To compare only one or more specific columns, we can use the subset parameter:
import pandas as pd
marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])
df1 = df.drop_duplicates(subset="Heros")
print(df1)
             Heros            Movies
0             Thor          Avengers
1        Spiderman  Avengers Endgame
3         Iron man          Avengers
4  Captain America          Avengers
6             Hulk          Avengers
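Only the first appearance of each hero is kept here, whatever the movie. The subset parameter can also be combined with keep, or receive a list of columns. The following sketch shows a few variations on the same data; the variable names are chosen for this illustration:

```python
import pandas as pd

marvel = {'Heros': ['Thor', 'Spiderman', 'Thor', 'Iron man', 'Captain America', 'Iron man', 'Hulk', 'Iron man'],
          'Movies': ['Avengers', 'Avengers Endgame', 'Avengers Endgame', 'Avengers', 'Avengers', 'Avengers', 'Avengers', 'Avengers Endgame']}
df = pd.DataFrame(marvel, columns=['Heros', 'Movies'])

# Keep the last appearance of each hero instead of the first one.
last_per_hero = df.drop_duplicates(subset="Heros", keep="last")
print(list(last_per_hero.index))  # [1, 2, 4, 6, 7]

# Keep the first row for each movie.
first_per_movie = df.drop_duplicates(subset="Movies")
print(list(first_per_movie.index))  # [0, 1]

# A list of columns: with both columns listed, this matches the default behaviour.
both = df.drop_duplicates(subset=["Heros", "Movies"])
print(both.equals(df.drop_duplicates()))  # True
```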
Conclusion
In this tutorial we learned how to use the drop_duplicates() function provided by pandas to remove duplicate rows from a dataframe. This function is used a lot for data cleaning. If you have any questions about its use, don't hesitate to leave me a comment; I will be happy to answer them.
See you soon for new tutorials.