PySpark Distinct Value of a Column

PySpark Distinct: in this tutorial we will see how to get the distinct values of a column in a PySpark DataFrame.
Introduction
Knowing the distinct values of a column can be useful to check, for example, that it does not contain any outliers, or simply to get an idea of what it contains. There are two ways to do this:
- distinct() function: returns the distinct values of one or more columns of our PySpark dataframe
- dropDuplicates() function: produces the same result as the distinct() function, and can also deduplicate on a subset of columns
In the rest of this tutorial, we will look in detail at how to use these two functions.
To do so, we will use the following dataframe:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
spark = SparkSession.builder.appName('pyspark - example distinct').getOrCreate()
datavengers = [
("Carol","Data Scientist","USA",70000,5),
("Bruce","Data Engineer","UK",80000,4),
("Xavier","Marketing","USA",100000,11),
("Peter","Data Scientist","USA",90000,7),
("Clark","Data Scientist","UK",111000,10),
("T'challa","CEO","USA",300000,20),
("Jean","Data Scientist","UK",220000,30),
("Thanos","Data Engineer","USA",115000,13),
("Scott","Data Engineer","UK",180000,15),
("Wade","Marketing","UK",60000,2)
]
schema = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=datavengers, schema=schema)
df.printSchema()
df.show(truncate=False)
root
|-- Name: string (nullable = true)
|-- Job: string (nullable = true)
|-- Country: string (nullable = true)
|-- salary: long (nullable = true)
|-- seniority: long (nullable = true)
+--------+--------------+-------+------+---------+
|Name |Job |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|Carol |Data Scientist|USA |70000 |5 |
|Bruce |Data Engineer |UK |80000 |4 |
|Xavier |Marketing |USA |100000|11 |
|Peter |Data Scientist|USA |90000 |7 |
|Clark |Data Scientist|UK |111000|10 |
|T'challa|CEO |USA |300000|20 |
|Jean |Data Scientist|UK |220000|30 |
|Thanos |Data Engineer |USA |115000|13 |
|Scott |Data Engineer |UK |180000|15 |
|Wade |Marketing |UK |60000 |2 |
+--------+--------------+-------+------+---------+
Distinct values of a column in PySpark using distinct()
The first method uses PySpark's distinct() function. Its syntax is as follows:
# distinct() single column
df.select("Job").distinct().show(truncate=False)
+--------------+
|Job |
+--------------+
|CEO |
|Data Scientist|
|Marketing |
|Data Engineer |
+--------------+
We can see that the function returns the distinct values of the Job column. In our example we have returned the distinct values of a single column, but it is also possible to do it for several columns at once; in that case, each row of the result is a distinct combination of values. Here is how to do it:
# distinct() multiple columns
df.select("Job","Country").distinct().show(truncate=False)
+--------------+-------+
|Job |Country|
+--------------+-------+
|Marketing |UK |
|Data Engineer |UK |
|Data Scientist|UK |
|Marketing |USA |
|Data Scientist|USA |
|CEO |USA |
|Data Engineer |USA |
+--------------+-------+
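If you need the number of distinct values rather than the values themselves, or if you want to manipulate them in plain Python, you can combine distinct() with count() or collect(). A minimal sketch (the alias nb_jobs and the variable jobs are just for illustration):
# Count the distinct values of the Job column
df.select("Job").distinct().count()  # 4
# Same result with the countDistinct() aggregate function
df.select(f.countDistinct("Job").alias("nb_jobs")).show()
# Bring the distinct values back to the driver as a Python list
jobs = [row["Job"] for row in df.select("Job").distinct().collect()]
print(jobs)  # ['CEO', 'Data Scientist', 'Marketing', 'Data Engineer'] (order not guaranteed)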
Distinct values of a column in PySpark using dropDuplicates()
The dropDuplicates() function also makes it possible to retrieve the distinct values of one or more columns of a PySpark dataframe.
To use it, do the following:
# dropDuplicates() single column
df.dropDuplicates(['Job']).select("Job").show(truncate=False)
+--------------+
|Job |
+--------------+
|CEO |
|Data Scientist|
|Marketing |
|Data Engineer |
+--------------+
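Unlike distinct(), dropDuplicates() also works on a subset of columns while keeping all the columns of the dataframe: it keeps one full row per duplicate group. Be aware that which row survives in each group is not guaranteed (it depends on partitioning). A quick sketch:
# Keep one complete row per Job (which row is kept is not deterministic)
# 4 rows, one per job, with Name, Country, salary and seniority preserved
df.dropDuplicates(['Job']).show(truncate=False)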
Back to distinct values: with multiple columns, this gives:
# dropDuplicates() multiple columns
df.dropDuplicates(['Job','Country']).select("Job","Country").show(truncate=False)
The output is the same as with distinct(): one row per distinct (Job, Country) combination. Note that dropDuplicates() always returns a new dataframe with the duplicate rows removed; called without any argument, it considers all the columns, exactly like distinct().
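To illustrate this, here is a small sketch on a hypothetical dataframe df_dup, built only for this example, that contains an exact duplicate row:
# df_dup is a toy dataframe with a duplicated row
df_dup = spark.createDataFrame(
    [("Wade", "Marketing"), ("Wade", "Marketing"), ("Carol", "Data Scientist")],
    ["Name", "Job"]
)
print(df_dup.count())                   # 3
print(df_dup.dropDuplicates().count())  # 2: the duplicated Wade row is removed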
Conclusion
These functions are very useful when you want to remove rows that contain exactly the same data. Before dropping anything, I still advise you to check what will be removed, to avoid deleting rows by mistake, for example as shown below.
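A simple check is to compare the row count before and after deduplication:
# Number of rows that would be removed by dropDuplicates()
nb_duplicates = df.count() - df.dropDuplicates().count()
print(nb_duplicates)  # 0: every row of our dataframe is unique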
I hope this tutorial has helped you better understand these two functions. Don't hesitate to share in the comments if anything is blocking you in the use of these methods.
See you soon for new tutorials! 🙂