PySpark Filter : Filter data with single or multiple conditions


PySpark Filter : in this tutorial we will see how to use the filter() function in PySpark

Introduction

The filter() function is widely used when you want to filter a Spark DataFrame. I will show you the different ways to use this function:

  • Filter data with single condition
  • Filter data with multiple conditions
  • Filter data with conditions using Spark SQL functions

If you want to install Spark on your computer, I advise you to check my previous article, which explains how to do it simply.

PySpark Filter data with a single condition

To begin, we will create a Spark DataFrame that will allow us to illustrate our examples.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pyspark - example filter').getOrCreate()
sc = spark.sparkContext
 
dataset1 = [
  {
  'id' : '1',
  'name' : 'Bulbasaur',
  'primary_type' : 'Grass',
  'secondary_type' : 'Poison',
  'evolve':'Ivysaur'
  },
  {
  'id' : '2',
  'name' : 'Ivysaur',
  'primary_type' : 'Grass',
  'secondary_type' : 'Poison',
  'evolve':'Venusaur'
  },
  {
  'id' : '3',
  'name' : 'Venusaur',
  'primary_type' : 'Grass',
  'secondary_type' : 'Poison',
  'evolve':''
  },
  {
  'id' : '4',
  'name' : 'Charmander',
  'primary_type' : 'Fire',
  'secondary_type' : 'Fire',
  'evolve':'Charmeleon'
  },
  {
  'id' : '5',
  'name' : 'Charmeleon',
  'primary_type' : 'Fire',
  'secondary_type' : 'Fire',
  'evolve':'Charizard'
  },
  {
  'id' : '6',
  'name' : 'Charizard',
  'primary_type' : 'Fire',
  'secondary_type' : 'Flying',
  'evolve':''
  }
     
]

rdd1 = sc.parallelize(dataset1)
df1 = spark.createDataFrame(rdd1)
print('df1')
print(type(df1))
df1.show()
 

Output :

df1
<class 'pyspark.sql.dataframe.DataFrame'>
+----------+---+----------+------------+--------------+
|    evolve| id|      name|primary_type|secondary_type|
+----------+---+----------+------------+--------------+
|   Ivysaur|  1| Bulbasaur|       Grass|        Poison|
|  Venusaur|  2|   Ivysaur|       Grass|        Poison|
|          |  3|  Venusaur|       Grass|        Poison|
|Charmeleon|  4|Charmander|        Fire|          Fire|
| Charizard|  5|Charmeleon|        Fire|          Fire|
|          |  6| Charizard|        Fire|        Flying|
+----------+---+----------+------------+--------------+
 

This Spark DataFrame contains 5 columns, which are as follows:

  • id
  • name
  • primary_type
  • secondary_type
  • evolve

We can use the filter() function on any of these 5 columns.

To filter on a single column, we can use the filter() function with a condition inside it:

df1.filter(df1.primary_type == "Fire").show()
 

In this example, we have filtered on the Pokémon whose primary type is Fire.

 
df1.filter(df1.id < 4).show()
 

In this example, we have filtered on the Pokémon whose ID is smaller than 4. Note that the id column was created as a string; Spark casts it implicitly when comparing it to the integer literal 4.

PySpark Filter data with multiple conditions

Multiple conditions using the OR operator

It is also possible to filter on several columns by using the filter() function in combination with the OR and AND operators.

df1.filter("primary_type == 'Grass' or secondary_type == 'Flying'").show()
 

Output:

+--------+---+---------+------------+--------------+
|  evolve| id|     name|primary_type|secondary_type|
+--------+---+---------+------------+--------------+
| Ivysaur|  1|Bulbasaur|       Grass|        Poison|
|Venusaur|  2|  Ivysaur|       Grass|        Poison|
|        |  3| Venusaur|       Grass|        Poison|
|        |  6|Charizard|        Fire|        Flying|
+--------+---+---------+------------+--------------+
 

This filter retrieves all the Pokémon whose primary type is Grass OR whose secondary type is Flying.

Multiple conditions using the AND operator

Filters with the AND operator work on the same principle as with the OR operator.

df1.filter("primary_type == 'Fire' and secondary_type == 'Flying'").show()
 

Output:

+------+---+---------+------------+--------------+
|evolve| id|     name|primary_type|secondary_type|
+------+---+---------+------------+--------------+
|      |  6|Charizard|        Fire|        Flying|
+------+---+---------+------------+--------------+
 

PySpark Filter data with multiple conditions using Spark SQL functions

To filter the data, we can also use the col() function from the pyspark.sql.functions module:

# filter with multiple conditions using pyspark.sql.functions
 
from pyspark.sql import functions as f

df1.filter((f.col('primary_type') == 'Fire') & (f.col('secondary_type') == 'Fire')).show()
 

Output :

+----------+---+----------+------------+--------------+
|    evolve| id|      name|primary_type|secondary_type|
+----------+---+----------+------------+--------------+
|Charmeleon|  4|Charmander|        Fire|          Fire|
| Charizard|  5|Charmeleon|        Fire|          Fire|
+----------+---+----------+------------+--------------+
 

This filter retrieves all the Pokémon whose primary and secondary types are both Fire.

Summary

As you can see, the filter() function is very easy to use and allows you to quickly filter your Spark DataFrame. In particular, it allows you to filter:

  • By using one or more conditions
  • Using the AND and OR operators
  • With regular expressions
  • By using other functions such as lower(), isin(), etc.

I hope this article has given you a better understanding of the filter() function.

If you want to learn more about Spark, you can read this book: (As an Amazon Partner, I make a profit on qualifying purchases)


Categorized as Python

By ayed_amira

I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)
