
PySpark Distinct Value of a Column

By ayed_amira, on 09/10/2020, updated on 09/10/2020 - 4 minutes to read

PySpark Distinct: In this tutorial we will see how to get the distinct values of a column in a PySpark DataFrame.

Introduction

It can be useful to know the distinct values of a column, for example to verify that it does not contain any outliers, or simply to get an idea of what it contains. There are two methods to do this:

- The distinct() function
- The dropDuplicates() function

For the rest of this tutorial, we will go into detail on how to use these two functions.

To do so, we will use the following DataFrame:

from pyspark.sql import SparkSession

# Create the Spark session used throughout this tutorial
spark = SparkSession.builder.appName('pyspark - example distinct').getOrCreate()
 
datavengers = [
    ("Carol","Data Scientist","USA",70000,5),
    ("Bruce","Data Engineer","UK",80000,4),
    ("Xavier","Marketing","USA",100000,11),
    ("Peter","Data Scientist","USA",90000,7),
    ("Clark","Data Scientist","UK",111000,10),
    ("T'challa","CEO","USA",300000,20),
    ("Jean","Data Scientist","UK",220000,30),
    ("Thanos","Data Engineer","USA",115000,13),
    ("Scott","Data Engineer","UK",180000,15),
    ("Wade","Marketing","UK",60000,2)
]

schema = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=datavengers, schema=schema)
df.printSchema()
df.show(truncate=False)
root
 |-- Name: string (nullable = true)
 |-- Job: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- seniority: long (nullable = true)

+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|Carol   |Data Scientist|USA    |70000 |5        |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Xavier  |Marketing     |USA    |100000|11       |
|Peter   |Data Scientist|USA    |90000 |7        |
|Clark   |Data Scientist|UK     |111000|10       |
|T'challa|CEO           |USA    |300000|20       |
|Jean    |Data Scientist|UK     |220000|30       |
|Thanos  |Data Engineer |USA    |115000|13       |
|Scott   |Data Engineer |UK     |180000|15       |
|Wade    |Marketing     |UK     |60000 |2        |
+--------+--------------+-------+------+---------+

Distinct values of a column in PySpark using distinct()

The first method uses the distinct() function of PySpark. Its syntax is as follows:

# distinct() function

df.select("Job").distinct().show(truncate=False)
+--------------+
|Job           |
+--------------+
|CEO           |
|Data Scientist|
|Marketing     |
|Data Engineer |
+--------------+

We can see that the function returned the distinct values of the Job column. In this example we returned the distinct values of a single column, but it is also possible to do it for several columns at once. Here is how:

# distinct() multiple column

df.select("Job","Country").distinct().show(truncate=False)
+--------------+-------+
|Job           |Country|
+--------------+-------+
|Marketing     |UK     |
|Data Engineer |UK     |
|Data Scientist|UK     |
|Marketing     |USA    |
|Data Scientist|USA    |
|CEO           |USA    |
|Data Engineer |USA    |
+--------------+-------+
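
If you just need the number of distinct values rather than the values themselves, you can chain count() after distinct(), or use the countDistinct() aggregate function. A minimal sketch, reusing the df defined above:

# Count distinct jobs by chaining count() after distinct()
print(df.select("Job").distinct().count())  # 4

# countDistinct() gives the same figure as a single aggregation
from pyspark.sql import functions as f
df.agg(f.countDistinct("Job").alias("distinct_jobs")).show()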

Distinct values of a column in PySpark using dropDuplicates()

The dropDuplicates() function also makes it possible to retrieve the distinct values of one or more columns of a PySpark DataFrame.

To use this function, you need to do the following:

# dropDuplicates() single column 
df.dropDuplicates(['Job']).select("Job").show(truncate=False)
+--------------+
|Job           |
+--------------+
|CEO           |
|Data Scientist|
|Marketing     |
|Data Engineer |
+--------------+
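
Note that calling dropDuplicates() with no argument deduplicates on all columns, which makes it equivalent to distinct(). A quick sketch:

# With no subset given, dropDuplicates() considers every column,
# so it behaves exactly like distinct()
print(df.dropDuplicates().count())  # 10 - every row in our data is unique
print(df.distinct().count())        # 10 as well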

With multiple columns, this gives:

# dropDuplicates() multiple column 

df.dropDuplicates(['Job','Country']).select("Job","Country").show(truncate=False)

The function returns a new DataFrame with the duplicate rows removed; here it contains the same seven (Job, Country) pairs as the distinct() example above.
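
In practice, it is often handy to bring those distinct values back to the driver as a plain Python list, for example to check them against a set of expected values. A small sketch; note that collect() pulls every row to the driver, so only do this when the number of distinct values is small:

# Collect the distinct jobs into a Python list of strings
jobs = [row["Job"] for row in df.select("Job").distinct().collect()]
print(jobs)  # e.g. ['CEO', 'Data Scientist', 'Marketing', 'Data Engineer'] (order not guaranteed)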

Conclusion

These functions can be very useful when you want to remove rows that contain exactly the same data. I still advise you to inspect your data before dropping rows, to avoid deleting more than you intended.

I hope this tutorial has helped you better understand these two functions. Don't hesitate to share in the comments if anything is blocking you when using these methods.


See you soon for new tutorials! 🙂


