PySpark Filter : Filter data with single or multiple conditions

In this tutorial, we will see how to use the filter() function in PySpark.
Introduction
The filter() function is widely used when you want to filter a Spark DataFrame. I will show you the different ways to use this function:
- Filter data with a single condition
- Filter data with multiple conditions
- Filter data with conditions using Spark SQL functions
If you want to install Spark on your computer, I advise you to check my previous article, which explains how to do it simply.
PySpark Filter data with single condition
To begin, we will create a Spark DataFrame that will allow us to illustrate our examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pyspark - example filter').getOrCreate()
sc = spark.sparkContext
dataset1 = [
    {
        'id': '1',
        'name': 'Bulbasaur',
        'primary_type': 'Grass',
        'secondary_type': 'Poison',
        'evolve': 'Ivysaur'
    },
    {
        'id': '2',
        'name': 'Ivysaur',
        'primary_type': 'Grass',
        'secondary_type': 'Poison',
        'evolve': 'Venusaur'
    },
    {
        'id': '3',
        'name': 'Venusaur',
        'primary_type': 'Grass',
        'secondary_type': 'Poison',
        'evolve': ''
    },
    {
        'id': '4',
        'name': 'Charmander',
        'primary_type': 'Fire',
        'secondary_type': 'Fire',
        'evolve': 'Charmeleon'
    },
    {
        'id': '5',
        'name': 'Charmeleon',
        'primary_type': 'Fire',
        'secondary_type': 'Fire',
        'evolve': 'Charizard'
    },
    {
        'id': '6',
        'name': 'Charizard',
        'primary_type': 'Fire',
        'secondary_type': 'Flying',
        'evolve': ''
    }
]
## Create the DataFrame from an RDD of dictionaries
## (spark.createDataFrame(dataset1) would also work directly)
rdd1 = sc.parallelize(dataset1)
df1 = spark.createDataFrame(rdd1)
print('df1')
print(type(df1))
df1.show()
Output :
df1
<class 'pyspark.sql.dataframe.DataFrame'>
+----------+---+----------+------------+--------------+
|    evolve| id|      name|primary_type|secondary_type|
+----------+---+----------+------------+--------------+
|   Ivysaur|  1| Bulbasaur|       Grass|        Poison|
|  Venusaur|  2|   Ivysaur|       Grass|        Poison|
|          |  3|  Venusaur|       Grass|        Poison|
|Charmeleon|  4|Charmander|        Fire|          Fire|
| Charizard|  5|Charmeleon|        Fire|          Fire|
|          |  6| Charizard|        Fire|        Flying|
+----------+---+----------+------------+--------------+
This Spark DataFrame contains 5 columns, which are as follows:
- id
- name
- primary_type
- secondary_type
- evolve
We can use the filter() function on any of these 5 columns.
To filter on a single column, we pass a condition to the filter() function:
df1.filter(df1.primary_type == "Fire").show()
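Output (with the sample data above):
+----------+---+----------+------------+--------------+
|    evolve| id|      name|primary_type|secondary_type|
+----------+---+----------+------------+--------------+
|Charmeleon|  4|Charmander|        Fire|          Fire|
| Charizard|  5|Charmeleon|        Fire|          Fire|
|          |  6| Charizard|        Fire|        Flying|
+----------+---+----------+------------+--------------+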
In this example, we have filtered on the pokemons whose primary type is Fire.
df1.filter(df1.id < 4).show()
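Output (with the sample data above):
+--------+---+---------+------------+--------------+
|  evolve| id|     name|primary_type|secondary_type|
+--------+---+---------+------------+--------------+
| Ivysaur|  1|Bulbasaur|       Grass|        Poison|
|Venusaur|  2|  Ivysaur|       Grass|        Poison|
|        |  3| Venusaur|       Grass|        Poison|
+--------+---+---------+------------+--------------+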
In this example, we have filtered on the pokemons whose ID is smaller than 4 (the id values are strings here, but Spark casts them implicitly for the comparison).
PySpark Filter data with multiple conditions
Multiple conditions using the OR operator
It is also possible to filter on several columns by using the filter() function in combination with the OR and AND operators.
df1.filter("primary_type == 'Grass' or secondary_type == 'Flying'").show()
Output:
+--------+---+---------+------------+--------------+
|  evolve| id|     name|primary_type|secondary_type|
+--------+---+---------+------------+--------------+
| Ivysaur|  1|Bulbasaur|       Grass|        Poison|
|Venusaur|  2|  Ivysaur|       Grass|        Poison|
|        |  3| Venusaur|       Grass|        Poison|
|        |  6|Charizard|        Fire|        Flying|
+--------+---+---------+------------+--------------+
This filter returns all the pokemons whose primary type is Grass OR whose secondary type is Flying.
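The same condition can also be written with column expressions, using the | operator for OR (each condition must be wrapped in parentheses):
df1.filter((df1.primary_type == 'Grass') | (df1.secondary_type == 'Flying')).show()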
Multiple conditions using the AND operator
Filters with the AND operator work on the same principle as with the OR operator.
df1.filter("primary_type == 'Fire' and secondary_type == 'Flying'").show()
Output:
+------+---+---------+------------+--------------+
|evolve| id|     name|primary_type|secondary_type|
+------+---+---------+------------+--------------+
|      |  6|Charizard|        Fire|        Flying|
+------+---+---------+------------+--------------+
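As with OR, the AND condition can also be written with column expressions, using the & operator:
df1.filter((df1.primary_type == 'Fire') & (df1.secondary_type == 'Flying')).show()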
PySpark Filter data with multiple conditions using Spark SQL functions
To filter the data, we can also use the col() function from the pyspark.sql.functions module:
## filter with multiple conditions using pyspark.sql.functions
from pyspark.sql import functions as f
df1.filter((f.col('primary_type') == 'Fire') & (f.col('secondary_type') == 'Fire')).show()
Output :
+----------+---+----------+------------+--------------+
|    evolve| id|      name|primary_type|secondary_type|
+----------+---+----------+------------+--------------+
|Charmeleon|  4|Charmander|        Fire|          Fire|
| Charizard|  5|Charmeleon|        Fire|          Fire|
+----------+---+----------+------------+--------------+
This filter allows you to get all the pokemons whose primary and secondary types are both Fire.
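Note that where() is an alias of filter(), so the same filter can be written either way:
## where() is an alias of filter(); both calls are equivalent
df1.where((f.col('primary_type') == 'Fire') & (f.col('secondary_type') == 'Fire')).show()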
Summary
As you can see, the filter() function is very easy to use and allows you to quickly filter your Spark DataFrame. In particular, it allows you to filter:
- By using one or more conditions
- Using the AND and OR operators
- With regular expressions
- By using other column functions such as lower(), isin(), etc. (see the sketch below)
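Here is a minimal sketch of those last two variations, reusing the df1 DataFrame from above (rlike(), isin() and lower() are standard PySpark column functions):
## Regular expression: pokemons whose name starts with 'Char'
df1.filter(df1.name.rlike('^Char')).show()
## isin(): keep only the listed ids
df1.filter(df1.id.isin('1', '2')).show()
## lower(): case-insensitive match on the primary type
df1.filter(f.lower(f.col('primary_type')) == 'fire').show()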
I hope this article has given you a better understanding of the filter() function.