PySpark Join: Multiple DataFrames

Overview

PySpark is a Python library for performing large-scale exploratory data analysis, building machine learning pipelines, and creating ETLs for a data platform. If you already have an intermediate level in Python and libraries such as Pandas, PySpark is an excellent tool for building more scalable analyses and pipelines. In this article, we will see how PySpark’s join function works like a SQL join, combining two or more tables or dataframes based on a condition.
How to install Spark locally in Python?
- Install Python
If you don’t have Python installed on your machine, it is preferable to install it via Anaconda. It is available at this address:
https://www.anaconda.com/distribution/
- Download Spark
Spark is an open source project under the Apache Software Foundation. You can download it directly from the official Apache website:
https://spark.apache.org/downloads.html
- Install PySpark
Then, to install PySpark, we first need pip. Pip is a package management system used to install and manage Python packages. After you have successfully installed Python, go to the link below and install pip.
https://pip.pypa.io/en/stable/installing/
You will then have to execute the following command to be able to install spark on your machine:
$ pip install pyspark
- Change the execution path for PySpark
The last step is to modify your execution path so that your machine can find where Spark is installed (SPARK_HOME should point at the root of the Spark installation, since the launch scripts live in its bin directory):
export SPARK_HOME="/your/home/directory/spark"
export PATH="$SPARK_HOME/bin:$PATH"
You can now run Spark by typing:
pyspark
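If you prefer to work from a regular Python script or notebook rather than the pyspark shell, a minimal sketch to check the installation is to create a SparkSession yourself (the application name here is arbitrary):
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession to confirm the installation works
spark = SparkSession.builder.master('local[*]').appName('installation-check').getOrCreate()
print(spark.version)  # prints the installed Spark version
spark.stop()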
PySpark joins
There are a multitude of joins available in PySpark. To test them, we will create two dataframes to illustrate our examples:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
sc = spark.sparkContext
dataset1 = [
    {
        'id': '1',
        'name': 'SMITH',
        'firstname': 'LEA'
    },
    {
        'id': '2',
        'name': 'JOHN',
        'firstname': 'PAUL'
    },
    {
        'id': '4',
        'name': 'CRUISE',
        'firstname': 'TOM'
    }
]
dataset2 = [
    {
        'id': '1',
        'Profession': 'Lawyer',
        'Salary': 60000
    },
    {
        'id': '3',
        'Profession': 'Teacher',
        'Salary': 30000
    },
    {
        'id': '4',
        'Profession': 'Actor',
        'Salary': 20000000
    }
]
rdd1 = sc.parallelize(dataset1)
df1 = spark.createDataFrame(rdd1)
print('df1')
df1.show()
df1
+---------+---+------+
|firstname| id| name|
+---------+---+------+
| LEA| 1| SMITH|
| PAUL| 2| JOHN|
| TOM| 4|CRUISE|
+---------+---+------+
rdd2 = sc.parallelize(dataset2)
df2 = spark.createDataFrame(rdd2)
print('df2')
df2.show()
df2
+----------+--------+---+
|Profession| Salary| id|
+----------+--------+---+
| Lawyer| 60000| 1|
| Teacher| 30000| 3|
| Actor|20000000| 4|
+----------+--------+---+
The following kinds of joins are explained in this article.
- Inner Join.
- Outer Join.
- Left Join.
- Right Join.
- Left Semi Join.
- Left Anti Join.
- Cross Join.
Spark Inner Join

In PySpark, the INNER JOIN is a very common type of join used to link several tables or dataframes together. It returns records only when there is at least one row in both dataframes that matches the join condition.
The syntax below states that records from dataframes df1 and df2 are selected when the value in the “id” column of df1 is equal to the value in the “id” column of df2.
df = df1.join(df2, on=['id'], how='inner')
df.show()
+---+---------+------+----------+--------+
| id|firstname| name|Profession| Salary|
+---+---------+------+----------+--------+
| 1| LEA| SMITH| Lawyer| 60000|
| 4| TOM|CRUISE| Actor|20000000|
+---+---------+------+----------+--------+
In addition, PySpark accepts an explicit join condition in place of the “on” parameter. Note that, unlike the column-name form, this form keeps both id columns in the result. Here are some examples without using the “on” parameter:
df = df1.join(df2, df1.id == df2.id, how='inner')
df.show()
+---------+---+------+----------+--------+---+
|firstname| id| name|Profession| Salary| id|
+---------+---+------+----------+--------+---+
| LEA| 1| SMITH| Lawyer| 60000| 1|
| TOM| 4|CRUISE| Actor|20000000| 4|
+---------+---+------+----------+--------+---+
df = df1.join(df2, df1.id > df2.id, how='inner')
df.show()
+---------+---+------+----------+------+---+
|firstname| id| name|Profession|Salary| id|
+---------+---+------+----------+------+---+
| PAUL| 2| JOHN| Lawyer| 60000| 1|
| TOM| 4|CRUISE| Lawyer| 60000| 1|
| TOM| 4|CRUISE| Teacher| 30000| 3|
+---------+---+------+----------+------+---+
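Join conditions can also be combined with boolean operators. As a sketch using the sample dataframes above, the following joins on matching ids and additionally requires a salary above 100000 (so only the actor row survives), then drops the duplicate id column coming from df2:
# Join on matching ids, keeping only rows whose salary exceeds 100000
df = df1.join(df2, (df1.id == df2.id) & (df2.Salary > 100000), how='inner')
# The expression form keeps both id columns; drop the one from df2
df = df.drop(df2.id)
df.show()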
Spark Outer Join

The outer join (also called a full join) combines data from both dataframes, whether or not the “on” column matches. When a row has no match in the other dataframe, the missing columns for that row are filled with null.
df = df1.join(df2, on=['id'], how='outer')
df.show()
+---+---------+------+----------+--------+
| id|firstname| name|Profession| Salary|
+---+---------+------+----------+--------+
| 3| null| null| Teacher| 30000|
| 1| LEA| SMITH| Lawyer| 60000|
| 4| TOM|CRUISE| Actor|20000000|
| 2| PAUL| JOHN| null| null|
+---+---------+------+----------+--------+
Spark Left Join

LEFT JOIN is a type of join between two tables that lists all the rows of the left table (here df1), even when there is no match in the second table. This join is particularly useful for keeping all the information from df1 while attaching the associated data from df2 when it exists. When there is no match, the columns of df2 are all null.
df = df1.join(df2, on=['id'], how='left')
df.show()
+---+---------+------+----------+--------+
| id|firstname| name|Profession| Salary|
+---+---------+------+----------+--------+
| 1| LEA| SMITH| Lawyer| 60000|
| 4| TOM|CRUISE| Actor|20000000|
| 2| PAUL| JOHN| null| null|
+---+---------+------+----------+--------+
Spark Right Join

This is the mirror image of the left join: all rows of the right-side dataframe, i.e. df2 in this example, are kept, with nulls where df1 has no match (see the equivalence sketch after the output below).
df = df1.join(df2, on=['id'], how='right')
df.show()
+---+---------+------+----------+--------+
| id|firstname| name|Profession| Salary|
+---+---------+------+----------+--------+
| 3| null| null| Teacher| 30000|
| 1| LEA| SMITH| Lawyer| 60000|
| 4| TOM|CRUISE| Actor|20000000|
+---+---------+------+----------+--------+
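As a quick sketch of that mirror-image equivalence, swapping the two dataframes and using a left join returns the same rows; only the column order of the result differs:
# A right join is equivalent to a left join with the dataframes swapped
df_swapped = df2.join(df1, on=['id'], how='left')
df_swapped.show()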
Spark Left Semi Join
When a left semi join is used, all rows in the left dataset that have a match in the right dataset are returned. However, unlike an inner join, the result does not contain merged data from the two datasets: it contains only the columns of the left dataset.
df = df1.join(df2, on=['id'], how='left_semi')
df.show()
+---+---------+------+
| id|firstname| name|
+---+---------+------+
| 1| LEA| SMITH|
| 4| TOM|CRUISE|
+---+---------+------+
Spark Left Anti Join
This join behaves like the set difference df1 − df2 on the join key: it selects all rows from df1 whose id is not present in df2 (a short illustration follows the output below).
df = df1.join(df2, on=['id'], how='left_anti')
df.show()
+---+---------+----+
| id|firstname|name|
+---+---------+----+
| 2| PAUL|JOHN|
+---+---------+----+
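To illustrate the set-difference analogy, the same ids can be obtained with DataFrame.subtract on the key column (a sketch that returns only the id column rather than the full rows):
# subtract() keeps the rows of the first dataframe that are absent from
# the second; on the id column it reproduces the anti join's result
df1.select('id').subtract(df2.select('id')).show()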
Spark Cross Join
The last type of join we can execute is the cross join, also known as the Cartesian product. Cross joins are a bit different from the other join types, so they get their very own DataFrame method:
df = df1.crossJoin(df2)
df.show()
+---------+---+------+----------+--------+---+
|firstname| id| name|Profession| Salary| id|
+---------+---+------+----------+--------+---+
| LEA| 1| SMITH| Lawyer| 60000| 1|
| LEA| 1| SMITH| Teacher| 30000| 3|
| LEA| 1| SMITH| Actor|20000000| 4|
| PAUL| 2| JOHN| Lawyer| 60000| 1|
| PAUL| 2| JOHN| Teacher| 30000| 3|
| PAUL| 2| JOHN| Actor|20000000| 4|
| TOM| 4|CRUISE| Lawyer| 60000| 1|
| TOM| 4|CRUISE| Teacher| 30000| 3|
| TOM| 4|CRUISE| Actor|20000000| 4|
+---------+---+------+----------+--------+---+
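Since a join returns a dataframe, joins can also be chained to combine more than two dataframes in a single expression. Here is a minimal sketch, assuming a hypothetical third dataframe df3 built from a dataset3 with an id column:
# Hypothetical third dataframe with an 'id' column, built like df1 and df2
dataset3 = [
    {'id': '1', 'city': 'PARIS'},
    {'id': '4', 'city': 'LONDON'}
]
rdd3 = sc.parallelize(dataset3)
df3 = spark.createDataFrame(rdd3)
# Chain two inner joins to combine the three dataframes on 'id'
df = df1.join(df2, on=['id'], how='inner').join(df3, on=['id'], how='inner')
df.show()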
For more detailed information about PySpark, I invite you to visit the official documentation:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=join
I hope this article gives you a better understanding of the different PySpark joins. Feel free to leave a comment if you liked the content! 🙂