PySpark Groupby : Use the Groupby() to Aggregate data

pyspark groupby

PySpark Groupby : We will see in this tutorial how to aggregate data with the Groupby function present in Spark.

Introduction

PySpark’s groupBy() function is used to aggregate identical data from a dataframe and then combine with aggregation functions.

There are a multitude of aggregation functions that can be combined with a group by :

  • count(): It returns the number of rows for each of the groups from group by.
  • sum() : It returns the total number of values of each group.
  • max() – Returns the maximum number of values for each group.
  • min() – Returns the minimum value of values for each group.
  • mean(): Returns the average of the values for each group.
  • avg() – Returns the average of the values for each group.
  • agg() – Allows you to calculate several aggregates at once (which can be very useful).
  • pivot(): The pivot() function is used to rotate the data of a DataFrame column into several columns (which is used to transform rows into columns).

In this article we will review the different ways to make a group by on a Pyspark dataframe using the different aggregation functions.

If you have to install spark, I invite you to have a look at one of my previous articles which explains the installation step by step:

If you want to learn more about spark, you can read this book : (As an Amazon Partner, I make a profit on qualifying purchases) :

Creating Dataframe

To illustrate the various examples of aggregation functions, we will first create a Dataframe Pyspark :

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
sc = spark.sparkContext
 
datavengers = [
    ("Carol","Data Scientist","USA",70000,5),
    ("Peter","Data Scientist","USA",90000,7),
    ("Clark","Data Scientist","UK",111000,10),
    ("Jean","Data Scientist","UK",220000,30),
    ("Bruce","Data Engineer","UK",80000,4),
    ("Thanos","Data Engineer","USA",115000,13),
    ("Scott","Data Engineer","UK",180000,15),
    ("T'challa","CEO","USA",300000,20),
    ("Xavier","Marketing","USA",100000,11),
    ("Wade","Marketing","UK",60000,2)
]

schema = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=datavengers, schema = schema)
df.printSchema()
root
 |-- Name: string (nullable = true)
 |-- Job: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- seniority: long (nullable = true)

Thanks to the printschema(), we can see that our dataframe contains 5 columns:

  • Name: Name of the employee
  • Job: The work the person does
  • Country: The country where the employee works
  • salary: annual salary
  • seniority: seniority in the company’s workforce

To display what the dataframe contains, you can use the show() function.

df.show(truncate=False)
+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|Carol   |Data Scientist|USA    |70000 |5        |
|Peter   |Data Scientist|USA    |90000 |7        |
|Clark   |Data Scientist|UK     |111000|10       |
|Jean    |Data Scientist|UK     |220000|30       |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Thanos  |Data Engineer |USA    |115000|13       |
|Scott   |Data Engineer |UK     |180000|15       |
|T'challa|CEO           |USA    |300000|20       |
|Xavier  |Marketing     |USA    |100000|11       |
|Wade    |Marketing     |UK     |60000 |2        |
+--------+--------------+-------+------+---------+

PySpark groupBy and aggregation functions on DataFrame columns

We will use the groupby() function on the “Job” column of our previously created dataframe and test the different aggregations.

Pyspark groupBy using count() function

To count the number of employees per job type, you can proceed like this:

# count() function

df.groupBy("Job").count().show(truncate=False)
+--------------+-----+
|Job           |count|
+--------------+-----+
|CEO           |1    |
|Data Scientist|4    |
|Marketing     |2    |
|Data Engineer |3    |
+--------------+-----+

Pyspark groupBy using sum() function

Here is the syntax for summing salaries by job type :

# sum() function

df.groupBy("Job").sum("salary").show(truncate=False)
+--------------+-----------+
|Job           |sum(salary)|
+--------------+-----------+
|CEO           |300000     |
|Data Scientist|491000     |
|Marketing     |160000     |
|Data Engineer |375000     |
+--------------+-----------+

Pyspark groupBy using max() function

To retrieve the maximum salary for each type of job :

# sum() function

df.groupBy("Job").max("salary").show(truncate=False)
+--------------+-----------+
|Job           |max(salary)|
+--------------+-----------+
|CEO           |300000     |
|Data Scientist|220000     |
|Marketing     |100000     |
|Data Engineer |180000     |
+--------------+-----------+

Pyspark groupBy using min() function

To retrieve the minimum salary for each type of job :

# sum() function

df.groupBy("Job").min("salary").show(truncate=False)
+--------------+-----------+
|Job           |min(salary)|
+--------------+-----------+
|CEO           |300000     |
|Data Scientist|70000      |
|Marketing     |60000      |
|Data Engineer |80000      |
+--------------+-----------+

Pyspark groupBy using avg() function

To calculate the average salary for each type of job :

# sum() function

df.groupBy("Job").avg("salary").show(truncate=False)
+--------------+-----------+
|Job           |avg(salary)|
+--------------+-----------+
|CEO           |300000.0   |
|Data Scientist|122750.0   |
|Marketing     |80000.0    |
|Data Engineer |125000.0   |
+--------------+-----------+

PySpark groupBy and aggregation functions on DataFrame multiple columns

For some calculations, you will need to aggregate your data on several columns of your dataframe. The example below shows you how to aggregate on more than one column:

# groupBy on multiple columns

df.groupBy("Job","Country") \
    .avg("salary","seniority") \
    .show(truncate=False)
+--------------+-------+-----------+--------------+
|Job           |Country|avg(salary)|avg(seniority)|
+--------------+-------+-----------+--------------+
|Marketing     |UK     |60000.0    |2.0           |
|Data Engineer |UK     |130000.0   |9.5           |
|Data Scientist|UK     |165500.0   |20.0          |
|Marketing     |USA    |100000.0   |11.0          |
|Data Scientist|USA    |80000.0    |6.0           |
|CEO           |USA    |300000.0   |20.0          |
|Data Engineer |USA    |115000.0   |13.0          |
+--------------+-------+-----------+--------------+

As you can see, we have filled in two columns in the groupBy function (Job and Country).

Execute several aggregation types simultaneously

Thanks to the agg() function, we can calculate several aggregates directly in a groupBy :

# agg() function

df.groupBy("Job") \
    .agg(f.sum("salary").alias("sum_salary"), \
         f.avg("salary").alias("avg_salary"), \
         f.min("salary").alias("min_salary"), \
         f.max("salary").alias("max_salary"), \
         f.mean("salary").alias("mean_salary") \
     ) \
    .show(truncate=False)
+--------------+----------+----------+----------+----------+-----------+
|Job           |sum_salary|avg_salary|min_salary|max_salary|mean_salary|
+--------------+----------+----------+----------+----------+-----------+
|CEO           |300000    |300000.0  |300000    |300000    |300000.0   |
|Data Scientist|491000    |122750.0  |70000     |220000    |122750.0   |
|Marketing     |160000    |80000.0   |60000     |100000    |80000.0    |
|Data Engineer |375000    |125000.0  |80000     |180000    |125000.0   |
+--------------+----------+----------+----------+----------+-----------+

Conclusion

In this tutorial we have learned how to use the groupBy() function and the aggregation functions on a Pyspark Dataframe. I hope this has helped you to understand how these aggregations work. Feel free to leave me a comment, I would be happy to answer it 🙂

Back to the python section

Published
Categorized as Python

By ayed_amira

I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)

1 comment

Leave a comment

Your email address will not be published.