Python

Pyspark orderBy() and sort() Function

By ayed_amira , on 09/09/2020 , updated on 09/10/2020 - 6 minutes to read
pyspark orderby and sort function

PySpark orderBy : In this tutorial we will see how to sort a Pyspark dataframe in ascending or descending order.

Introduction

To sort a dataframe in pyspark, we can use 3 methods: orderby(), sort() or with a SQL query.

This tutorial is divided into several parts:

  • Sort the dataframe in pyspark by single column (by ascending or descending order) using the orderBy() function.
  • Sort the dataframe in pyspark by mutiple columns (by ascending or descending order) using the orderBy() function.
  • Sort the dataframe in pyspark by single column (by ascending or descending order) using the sort() function.
  • Sort the dataframe in pyspark by mutiple columns (by ascending or descending order) using the sort() function.
  • Sorting the dataframe in pyspark using Raw SQL syntax

To illustrate the examples, we will use the dataframe we discussed in a previous article on the groupby() function in pyspark :

Without further ado, let’s get started!

Sort the Pyspark dataframe by single column using orderBy()

By taking again the dataframe evoked previously, we see that the columns are not sorted by increasing or decreasing order.

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
sc = spark.sparkContext
 
datavengers = [
    ("Carol","Data Scientist","USA",70000,5),
    ("Bruce","Data Engineer","UK",80000,4),
    ("Xavier","Marketing","USA",100000,11),
    ("Peter","Data Scientist","USA",90000,7),
    ("Clark","Data Scientist","UK",111000,10),
    ("T'challa","CEO","USA",300000,20),
    ("Jean","Data Scientist","UK",220000,30),
    ("Thanos","Data Engineer","USA",115000,13),
    ("Scott","Data Engineer","UK",180000,15),
    ("Wade","Marketing","UK",60000,2)
]

schema = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=datavengers, schema = schema)
df.show(truncate=False)
+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|Carol   |Data Scientist|USA    |70000 |5        |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Xavier  |Marketing     |USA    |100000|11       |
|Peter   |Data Scientist|USA    |90000 |7        |
|Clark   |Data Scientist|UK     |111000|10       |
|T'challa|CEO           |USA    |300000|20       |
|Jean    |Data Scientist|UK     |220000|30       |
|Thanos  |Data Engineer |USA    |115000|13       |
|Scott   |Data Engineer |UK     |180000|15       |
|Wade    |Marketing     |UK     |60000 |2        |
+--------+--------------+-------+------+---------+

Sort by Ascending

Here we would like to sort the different jobs in ascending order.

# By Ascending order

df.orderBy(f.col("Job").asc()).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO           |USA    |300000|20       |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Scott   |Data Engineer |UK     |180000|15       |
|Thanos  |Data Engineer |USA    |115000|13       |
|Carol   |Data Scientist|USA    |70000 |5        |
|Clark   |Data Scientist|UK     |111000|10       |
|Peter   |Data Scientist|USA    |90000 |7        |
|Jean    |Data Scientist|UK     |220000|30       |
|Xavier  |Marketing     |USA    |100000|11       |
|Wade    |Marketing     |UK     |60000 |2        |
+--------+--------------+-------+------+---------+

By default, the orderBy() function sorts in ascending order, we as not obliged to use the asc() function to do so.

Sort by Descending

To sort the Job column in descending order, we can use the desc() function:

# By Descending order 

df.orderBy(f.col("Job").desc()).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|Xavier  |Marketing     |USA    |100000|11       |
|Wade    |Marketing     |UK     |60000 |2        |
|Jean    |Data Scientist|UK     |220000|30       |
|Carol   |Data Scientist|USA    |70000 |5        |
|Peter   |Data Scientist|USA    |90000 |7        |
|Clark   |Data Scientist|UK     |111000|10       |
|Scott   |Data Engineer |UK     |180000|15       |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Thanos  |Data Engineer |USA    |115000|13       |
|T'challa|CEO           |USA    |300000|20       |
+--------+--------------+-------+------+---------+

Sort the Pyspark dataframe by multiple column using orderBy()

It is possible with the function to sort several columns.

# Sort multiple column

df.orderBy(f.col("Job").asc(),f.col("Country").asc()).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO           |USA    |300000|20       |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Scott   |Data Engineer |UK     |180000|15       |
|Thanos  |Data Engineer |USA    |115000|13       |
|Clark   |Data Scientist|UK     |111000|10       |
|Jean    |Data Scientist|UK     |220000|30       |
|Carol   |Data Scientist|USA    |70000 |5        |
|Peter   |Data Scientist|USA    |90000 |7        |
|Wade    |Marketing     |UK     |60000 |2        |
|Xavier  |Marketing     |USA    |100000|11       |
+--------+--------------+-------+------+---------+

You can also sort by descending order by replacing the asc() function with desc().

Sort the Pyspark dataframe by single column using sort()

the sort() function does the same thing as the orderBy() function. To sort on a single column you have to use the following syntax:

# Sort by ascending order 

df.sort(f.col("Job")).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO           |USA    |300000|20       |
|Scott   |Data Engineer |UK     |180000|15       |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Thanos  |Data Engineer |USA    |115000|13       |
|Jean    |Data Scientist|UK     |220000|30       |
|Peter   |Data Scientist|USA    |90000 |7        |
|Clark   |Data Scientist|UK     |111000|10       |
|Carol   |Data Scientist|USA    |70000 |5        |
|Wade    |Marketing     |UK     |60000 |2        |
|Xavier  |Marketing     |USA    |100000|11       |
+--------+--------------+-------+------+---------+

You can also use the desc() function to sort in descending order

Sort the Pyspark dataframe by multiple column using sort()

As for the orderBy() function, it is possible to sort on multiple columns :

# Sort multiple using sort() function

df.sort(f.col("Job"),f.col("Country")).show(truncate=False)
+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO           |USA    |300000|20       |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Scott   |Data Engineer |UK     |180000|15       |
|Thanos  |Data Engineer |USA    |115000|13       |
|Clark   |Data Scientist|UK     |111000|10       |
|Jean    |Data Scientist|UK     |220000|30       |
|Carol   |Data Scientist|USA    |70000 |5        |
|Peter   |Data Scientist|USA    |90000 |7        |
|Wade    |Marketing     |UK     |60000 |2        |
|Xavier  |Marketing     |USA    |100000|11       |

You can also use the desc() function to sort in descending order

Sort the Pyspark Dataframe using Using Raw SQL

Spark SQL also gives us the ability to use SQL syntax to sort our dataframe. To do this we need to create a temporary table so that we can perform our SQL query:

#  Raw SQL 

df.createOrReplaceTempView("df")
spark.sql("select Name,Job,Country,salary,seniority from df ORDER BY Job asc").show(truncate=False)
+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|T'challa|CEO           |USA    |300000|20       |
|Scott   |Data Engineer |UK     |180000|15       |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Thanos  |Data Engineer |USA    |115000|13       |
|Jean    |Data Scientist|UK     |220000|30       |
|Peter   |Data Scientist|USA    |90000 |7        |
|Clark   |Data Scientist|UK     |111000|10       |
|Carol   |Data Scientist|USA    |70000 |5        |
|Wade    |Marketing     |UK     |60000 |2        |
|Xavier  |Marketing     |USA    |100000|11       |
+--------+--------------+-------+------+---------+

Conclusion

We have seen in this article that there are several methods to sort a Pyspark Dataframe. I let you choose which one you prefer to use :).

If you want to learn more about spark, you can read one of those books : (As an Amazon Partner, I make a profit on qualifying purchases) :

Back to the python section

ayed_amira

I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)

Comments

Leave a comment

Your comment will be revised by the site if needed.