Python

Pyspark Drop Column – Delete a Column from Dataframe

By ayed_amira , on 11/13/2020 , updated on 11/13/2020 - 5 minutes to read
pyspark drop column

PySpark Drop Column : In this tutorial, I will learn how to delete a column or several columns from a Dataframe pyspark.

Introduction

When working on raw data, it is often necessary to delete columns that are not useful for our analysis or our learning machine model. To delete a column, Pyspark provides a method called drop(). It allows you to delete one or more columns from your Pyspark Dataframe.

We will see the following points in the rest of the tutorial :

  • Drop single column
  • Drop multiple column
  • Drop a column that contains a specific string in its name.
  • Drop a column that contains NA/Nan/Null values

To illustrate these different points, we will use the following pyspark dataframe:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
spark = SparkSession.builder.appName('pyspark - drop column example').getOrCreate()
sc = spark.sparkContext
  
pokemon = [
    ("Bulbasaur","Grass","Poison",1),
    ("Ivysaur","Grass","Poison",2),
    ("Venusaur","Grass","Poison",3),
    ("Charmeleon","Fire","Fire",5),
    ("Charizard","Fire","Flying",6),
    ("Wartortle","Water","Water",8),
    ("Blastoise","Water",None,9)
]

schema = StructType([
        
         StructField('Name', StringType(), True),
         StructField('Primary_Type', StringType(), True),
         StructField('Secondary_Type', StringType(), True),
         StructField('Index', StringType(), True)
         ])
 
df = spark.createDataFrame(data=pokemon, schema = schema)
df.printSchema()
df.show(truncate=False)
root
 |-- Name: string (nullable = true)
 |-- Primary_Type: string (nullable = true)
 |-- Secondary_Type: string (nullable = true)
 |-- Index: string (nullable = true)

+----------+------------+--------------+-----+
|Name      |Primary_Type|Secondary_Type|Index|
+----------+------------+--------------+-----+
|Bulbasaur |Grass       |Poison        |1    |
|Ivysaur   |Grass       |Poison        |2    |
|Venusaur  |Grass       |Poison        |3    |
|Charmeleon|Fire        |Fire          |5    |
|Charizard |Fire        |Flying        |6    |
|Wartortle |Water       |Water         |8    |
|Blastoise |Water       |null          |9    |
+----------+------------+--------------+-----+

The Pyspark dataframe contains 4 columns :

  • Name: The name of the pokemon
  • Primary_type : The primary type of the pokemon
  • Secondary_Type : The secondary type of the pokemon
  • Index : The position in the pokedex

Drop single column in pyspark

To drop a single column from dataframe we can use the drop() function. It takes an argument that corresponds to the name of the column to be deleted:

Drop a single column 

df.drop(df.Primary_Type).show()

It is also possible to specify only the name of the column as argument :

Drop a single column 

df.drop("Primary_Type").show()

With these two methods, we get the following dataframe:

+----------+--------------+-----+
|      Name|Secondary_Type|Index|
+----------+--------------+-----+
| Bulbasaur|        Poison|    1|
|   Ivysaur|        Poison|    2|
|  Venusaur|        Poison|    3|
|Charmeleon|          Fire|    5|
| Charizard|        Flying|    6|
| Wartortle|         Water|    8|
| Blastoise|          null|    9|
+----------+--------------+-----+

Drop multiple column in pyspark

The drop() function allows you to delete several columns directly. There are two ways to do this.

The first method consists in listing the columns directly as argument of the function :

Drop multiple columns

df.drop("Primary_Type","Secondary_Type").show()

The second method consists of using a list that contains the columns you want to delete and passing it as an argument to the drop() function:

Drop multiple columns

columnsDrop = ["Primary_Type","Secondary_Type"]

df.drop(*columnsDrop).show()

The result is as follows:

+----------+-----+
|      Name|Index|
+----------+-----+
| Bulbasaur|    1|
|   Ivysaur|    2|
|  Venusaur|    3|
|Charmeleon|    5|
| Charizard|    6|
| Wartortle|    8|
| Blastoise|    9|
+----------+-----+

Drop a column that contains a specific string

In some cases it can be interesting to delete columns according to a sub-string contained in the column name. We will use the function df.columns ( which allows to list all the column names of our dataframe) and the function __contains__() which allows to check if the column names contain the substring specified in the argument of the function.

Here is how to proceed:

Drop mutiple column with a specific regex

columnsDrop = [i for i in df.columns if i.__contains__('Type')]
df.drop(*columnsDrop).show()
+----------+-----+
|      Name|Index|
+----------+-----+
| Bulbasaur|    1|
|   Ivysaur|    2|
|  Venusaur|    3|
|Charmeleon|    5|
| Charizard|    6|
| Wartortle|    8|
| Blastoise|    9|
+----------+-----+

Drop the columns that contains Null values

To remove columns that contain null values, we can count the number of null values for each column using the isNull() function. If this value is greater than 0 then we need to delete the column. Here is the code to handle this case:

Drop the columns that contains Null values

countNullValues = df.select([f.count(f.when(f.col(a).isNull(), a)).alias(a) for a in df.columns]).collect()[0].asDict()
columnDrop = [d for d, n in countNullValues.items() if n > 0]
df.drop(*columnDrop).show()

Here is the result of the previous lines of code:

+----------+------------+-----+
|      Name|Primary_Type|Index|
+----------+------------+-----+
| Bulbasaur|       Grass|    1|
|   Ivysaur|       Grass|    2|
|  Venusaur|       Grass|    3|
|Charmeleon|        Fire|    5|
| Charizard|        Fire|    6|
| Wartortle|       Water|    8|
| Blastoise|       Water|    9|
+----------+------------+-----+

We can see that the Secondary_Type column has been deleted because it contained a Null value.

Conclusion

In this tutorial we learned how to delete a column in a dataframe pyspark. This is very useful when you want to clean your dataframe with useless columns. I hope you enjoyed this content! Feel free to share it with me as a comment. If you have any questions about how the drop() function works, I will be happy to answer them.

Thanks for reading and see you soon!

Back to the python section

ayed_amira

I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)

Comments

Leave a comment

Your comment will be revised by the site if needed.