PySpark Drop Column – Delete a Column from a DataFrame

PySpark Drop Column: In this tutorial, you will learn how to delete one or more columns from a PySpark DataFrame.
Introduction
When working with raw data, it is often necessary to delete columns that are not useful for our analysis or our machine learning model. For this, PySpark provides the drop() method, which removes one or more columns from a PySpark DataFrame.
We will cover the following points in the rest of the tutorial:
- Drop a single column
- Drop multiple columns
- Drop a column whose name contains a specific string
- Drop columns that contain NA/NaN/Null values
To illustrate these points, we will use the following PySpark DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('pyspark - drop column example').getOrCreate()
sc = spark.sparkContext

pokemon = [
    ("Bulbasaur", "Grass", "Poison", 1),
    ("Ivysaur", "Grass", "Poison", 2),
    ("Venusaur", "Grass", "Poison", 3),
    ("Charmeleon", "Fire", "Fire", 5),
    ("Charizard", "Fire", "Flying", 6),
    ("Wartortle", "Water", "Water", 8),
    ("Blastoise", "Water", None, 9)
]

# The Index values are integers, so the matching field must be IntegerType
schema = StructType([
    StructField('Name', StringType(), True),
    StructField('Primary_Type', StringType(), True),
    StructField('Secondary_Type', StringType(), True),
    StructField('Index', IntegerType(), True)
])

df = spark.createDataFrame(data=pokemon, schema=schema)
df.printSchema()
df.show(truncate=False)
root
|-- Name: string (nullable = true)
|-- Primary_Type: string (nullable = true)
|-- Secondary_Type: string (nullable = true)
|-- Index: integer (nullable = true)
+----------+------------+--------------+-----+
|Name |Primary_Type|Secondary_Type|Index|
+----------+------------+--------------+-----+
|Bulbasaur |Grass |Poison |1 |
|Ivysaur |Grass |Poison |2 |
|Venusaur |Grass |Poison |3 |
|Charmeleon|Fire |Fire |5 |
|Charizard |Fire |Flying |6 |
|Wartortle |Water |Water |8 |
|Blastoise |Water |null |9 |
+----------+------------+--------------+-----+
The PySpark DataFrame contains 4 columns:
- Name: The name of the pokemon
- Primary_Type: The primary type of the pokemon
- Secondary_Type: The secondary type of the pokemon
- Index: The position in the pokedex
Drop a single column in PySpark
To drop a single column from a DataFrame, we can use the drop() method. It takes an argument corresponding to the column to be deleted:
Drop a single column
df.drop(df.Primary_Type).show()
It is also possible to pass just the name of the column as a string:
Drop a single column
df.drop("Primary_Type").show()
Both methods produce the following DataFrame:
+----------+--------------+-----+
| Name|Secondary_Type|Index|
+----------+--------------+-----+
| Bulbasaur| Poison| 1|
| Ivysaur| Poison| 2|
| Venusaur| Poison| 3|
|Charmeleon| Fire| 5|
| Charizard| Flying| 6|
| Wartortle| Water| 8|
| Blastoise| null| 9|
+----------+--------------+-----+
Drop multiple columns in PySpark
The drop() method can also delete several columns at once. There are two ways to do this.
The first is to list the column names directly as arguments:
Drop multiple columns
df.drop("Primary_Type","Secondary_Type").show()
The second is to build a list of the columns you want to delete and unpack it into drop() with the * operator:
Drop multiple columns
columnsDrop = ["Primary_Type","Secondary_Type"]
df.drop(*columnsDrop).show()
The result is as follows:
+----------+-----+
| Name|Index|
+----------+-----+
| Bulbasaur| 1|
| Ivysaur| 2|
| Venusaur| 3|
|Charmeleon| 5|
| Charizard| 6|
| Wartortle| 8|
| Blastoise| 9|
+----------+-----+
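A sketch of the inverse approach: instead of listing the columns to drop, build the list of columns to keep and pass it to select(), which yields the same result. The logic over the column names is plain Python (here `columns` is a hardcoded stand-in for df.columns):

```python
# Hardcoded stand-in for df.columns on the example DataFrame:
columns = ["Name", "Primary_Type", "Secondary_Type", "Index"]
to_drop = {"Primary_Type", "Secondary_Type"}

# Keep everything that is not marked for deletion:
to_keep = [c for c in columns if c not in to_drop]
assert to_keep == ["Name", "Index"]
# df.select(*to_keep) then gives the same result as df.drop(*to_drop).
```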
Drop a column whose name contains a specific string
In some cases it can be useful to delete columns based on a substring contained in the column name. We can combine df.columns (which lists all the column names of the DataFrame) with Python's in operator (which calls the __contains__() method under the hood) to check whether each name contains the given substring.
Here is how to proceed:
Drop multiple columns whose names contain a specific substring
columnsDrop = [i for i in df.columns if 'Type' in i]
df.drop(*columnsDrop).show()
+----------+-----+
| Name|Index|
+----------+-----+
| Bulbasaur| 1|
| Ivysaur| 2|
| Venusaur| 3|
|Charmeleon| 5|
| Charizard| 6|
| Wartortle| 8|
| Blastoise| 9|
+----------+-----+
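The substring test generalises to a real regular expression by filtering the column names with Python's re module. A sketch, with a hardcoded list standing in for df.columns and a hypothetical pattern:

```python
import re

# Hardcoded stand-in for df.columns on the example DataFrame:
columns = ["Name", "Primary_Type", "Secondary_Type", "Index"]

# Hypothetical pattern: drop every column whose name ends in "_Type"
pattern = re.compile(r"_Type$")
columnsDrop = [name for name in columns if pattern.search(name)]
assert columnsDrop == ["Primary_Type", "Secondary_Type"]
# df.drop(*columnsDrop) would then remove both type columns, as above.
```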
Drop the columns that contain Null values
To remove columns that contain null values, we can count the nulls in each column by combining count(), when() and isNull(). If the count for a column is greater than 0, we drop it. Here is the code to handle this case:
Drop the columns that contain Null values
countNullValues = df.select(
    [f.count(f.when(f.col(a).isNull(), a)).alias(a) for a in df.columns]
).collect()[0].asDict()
columnDrop = [d for d, n in countNullValues.items() if n > 0]
df.drop(*columnDrop).show()
Here is the result of the previous lines of code:
+----------+------------+-----+
| Name|Primary_Type|Index|
+----------+------------+-----+
| Bulbasaur| Grass| 1|
| Ivysaur| Grass| 2|
| Venusaur| Grass| 3|
|Charmeleon| Fire| 5|
| Charizard| Fire| 6|
| Wartortle| Water| 8|
| Blastoise| Water| 9|
+----------+------------+-----+
We can see that the Secondary_Type column has been deleted because it contained a null value.
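The null-count dictionary also supports more nuanced policies than "drop on any null". A sketch, with the counts hardcoded to mirror the collect() result above and a hypothetical 20% tolerance, that drops only the columns whose null fraction exceeds the threshold:

```python
# Null counts per column, mirroring countNullValues from the collect() above:
countNullValues = {"Name": 0, "Primary_Type": 0, "Secondary_Type": 1, "Index": 0}
totalRows = 7       # df.count() on the example DataFrame

threshold = 0.2     # hypothetical: tolerate up to 20% nulls per column
columnDrop = [c for c, n in countNullValues.items() if n / totalRows > threshold]

# Secondary_Type is only 1/7 ≈ 14% null, below the threshold, so it survives:
assert columnDrop == []
```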
Conclusion
In this tutorial we learned how to delete one or more columns from a PySpark DataFrame with the drop() method. This is very useful when you want to clean your DataFrame of columns that are of no use to you. I hope you enjoyed this content! Feel free to share your thoughts with me in a comment. If you have any questions about how the drop() method works, I will be happy to answer them.
Thanks for reading and see you soon!