Pyspark orderBy() and sort() Function

PySpark orderBy : In this tutorial we will see how to sort a Pyspark dataframe in ascending or descending order.
Introduction
To sort a dataframe in pyspark, we can use 3 methods: orderby(), sort() or with a SQL query.
This tutorial is divided into several parts:
- Sort the dataframe in pyspark by single column (by ascending or descending order) using the orderBy() function.
- Sort the dataframe in pyspark by mutiple columns (by ascending or descending order) using the orderBy() function.
- Sort the dataframe in pyspark by single column (by ascending or descending order) using the sort() function.
- Sort the dataframe in pyspark by mutiple columns (by ascending or descending order) using the sort() function.
- Sorting the dataframe in pyspark using Raw SQL syntax
To illustrate the examples, we will use the dataframe we discussed in a previous article on the groupby() function in pyspark :
Without further ado, let’s get started!
Sort the Pyspark dataframe by single column using orderBy()
By taking again the dataframe evoked previously, we see that the columns are not sorted by increasing or decreasing order.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
sc = spark.sparkContext
datavengers = [
("Carol","Data Scientist","USA",70000,5),
("Bruce","Data Engineer","UK",80000,4),
("Xavier","Marketing","USA",100000,11),
("Peter","Data Scientist","USA",90000,7),
("Clark","Data Scientist","UK",111000,10),
("T'challa","CEO","USA",300000,20),
("Jean","Data Scientist","UK",220000,30),
("Thanos","Data Engineer","USA",115000,13),
("Scott","Data Engineer","UK",180000,15),
("Wade","Marketing","UK",60000,2)
]
schema = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=datavengers, schema = schema)
df.show(truncate=False)
+--------+--------------+-------+------+---------+
|Name |Job |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|Carol |Data Scientist|USA |70000 |5 |
|Bruce |Data Engineer |UK |80000 |4 |
|Xavier |Marketing |USA |100000|11 |
|Peter |Data Scientist|USA |90000 |7 |
|Clark |Data Scientist|UK |111000|10 |
|T'challa|CEO |USA |300000|20 |
|Jean |Data Scientist|UK |220000|30 |
|Thanos |Data Engineer |USA |115000|13 |
|Scott |Data Engineer |UK |180000|15 |
|Wade |Marketing |UK |60000 |2 |
+--------+--------------+-------+------+---------+
Sort by Ascending
Here we would like to sort the different jobs in ascending order.
# By Ascending order
df.orderBy(f.col("Job").asc()).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name |Job |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO |USA |300000|20 |
|Bruce |Data Engineer |UK |80000 |4 |
|Scott |Data Engineer |UK |180000|15 |
|Thanos |Data Engineer |USA |115000|13 |
|Carol |Data Scientist|USA |70000 |5 |
|Clark |Data Scientist|UK |111000|10 |
|Peter |Data Scientist|USA |90000 |7 |
|Jean |Data Scientist|UK |220000|30 |
|Xavier |Marketing |USA |100000|11 |
|Wade |Marketing |UK |60000 |2 |
+--------+--------------+-------+------+---------+
By default, the orderBy() function sorts in ascending order, we as not obliged to use the asc() function to do so.
Sort by Descending
To sort the Job column in descending order, we can use the desc() function:
# By Descending order
df.orderBy(f.col("Job").desc()).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name |Job |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|Xavier |Marketing |USA |100000|11 |
|Wade |Marketing |UK |60000 |2 |
|Jean |Data Scientist|UK |220000|30 |
|Carol |Data Scientist|USA |70000 |5 |
|Peter |Data Scientist|USA |90000 |7 |
|Clark |Data Scientist|UK |111000|10 |
|Scott |Data Engineer |UK |180000|15 |
|Bruce |Data Engineer |UK |80000 |4 |
|Thanos |Data Engineer |USA |115000|13 |
|T'challa|CEO |USA |300000|20 |
+--------+--------------+-------+------+---------+
Sort the Pyspark dataframe by multiple column using orderBy()
It is possible with the function to sort several columns.
# Sort multiple column
df.orderBy(f.col("Job").asc(),f.col("Country").asc()).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name |Job |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO |USA |300000|20 |
|Bruce |Data Engineer |UK |80000 |4 |
|Scott |Data Engineer |UK |180000|15 |
|Thanos |Data Engineer |USA |115000|13 |
|Clark |Data Scientist|UK |111000|10 |
|Jean |Data Scientist|UK |220000|30 |
|Carol |Data Scientist|USA |70000 |5 |
|Peter |Data Scientist|USA |90000 |7 |
|Wade |Marketing |UK |60000 |2 |
|Xavier |Marketing |USA |100000|11 |
+--------+--------------+-------+------+---------+
You can also sort by descending order by replacing the asc() function with desc().
Sort the Pyspark dataframe by single column using sort()
the sort() function does the same thing as the orderBy() function. To sort on a single column you have to use the following syntax:
# Sort by ascending order
df.sort(f.col("Job")).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name |Job |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO |USA |300000|20 |
|Scott |Data Engineer |UK |180000|15 |
|Bruce |Data Engineer |UK |80000 |4 |
|Thanos |Data Engineer |USA |115000|13 |
|Jean |Data Scientist|UK |220000|30 |
|Peter |Data Scientist|USA |90000 |7 |
|Clark |Data Scientist|UK |111000|10 |
|Carol |Data Scientist|USA |70000 |5 |
|Wade |Marketing |UK |60000 |2 |
|Xavier |Marketing |USA |100000|11 |
+--------+--------------+-------+------+---------+
You can also use the desc() function to sort in descending order
Sort the Pyspark dataframe by multiple column using sort()
As for the orderBy() function, it is possible to sort on multiple columns :
# Sort multiple using sort() function
df.sort(f.col("Job"),f.col("Country")).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name |Job |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO |USA |300000|20 |
|Bruce |Data Engineer |UK |80000 |4 |
|Scott |Data Engineer |UK |180000|15 |
|Thanos |Data Engineer |USA |115000|13 |
|Clark |Data Scientist|UK |111000|10 |
|Jean |Data Scientist|UK |220000|30 |
|Carol |Data Scientist|USA |70000 |5 |
|Peter |Data Scientist|USA |90000 |7 |
|Wade |Marketing |UK |60000 |2 |
|Xavier |Marketing |USA |100000|11 |
You can also use the desc() function to sort in descending order
Sort the Pyspark Dataframe using Using Raw SQL
Spark SQL also gives us the ability to use SQL syntax to sort our dataframe. To do this we need to create a temporary table so that we can perform our SQL query:
# Raw SQL
df.createOrReplaceTempView("df")
spark.sql("select Name,Job,Country,salary,seniority from df ORDER BY Job asc").show(truncate=False)
+--------+--------------+-------+------+---------+
|Name |Job |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO |USA |300000|20 |
|Scott |Data Engineer |UK |180000|15 |
|Bruce |Data Engineer |UK |80000 |4 |
|Thanos |Data Engineer |USA |115000|13 |
|Jean |Data Scientist|UK |220000|30 |
|Peter |Data Scientist|USA |90000 |7 |
|Clark |Data Scientist|UK |111000|10 |
|Carol |Data Scientist|USA |70000 |5 |
|Wade |Marketing |UK |60000 |2 |
|Xavier |Marketing |USA |100000|11 |
+--------+--------------+-------+------+---------+
Conclusion
We have seen in this article that there are several methods to sort a Pyspark Dataframe. I let you choose which one you prefer to use :).
If you want to learn more about spark, you can read one of those books : (As an Amazon Partner, I make a profit on qualifying purchases) :
Comments
Leave a comment