PySpark Rename Column on PySpark Dataframe (Single or Multiple Column)


PySpark Rename Column: In this tutorial we will see how to rename one or more columns in a PySpark dataframe and the different ways to do it.

Introduction

There are many situations where you need to rename a PySpark dataframe column, for example when the headers of a file you have read do not match the names you want, or when you need to export a file in a specific format.


In PySpark, there are several ways to rename columns:

  • Using the withColumnRenamed() function, which renames one column per call and can be chained to rename several
  • Using the selectExpr() function
  • Using the select() and alias() functions
  • Using the toDF() function

We will see in this tutorial how to use these different functions with several examples based on this pyspark dataframe :

PySpark Dataframe Example

Here is the code to create the pyspark dataframe :

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

# Create (or reuse) the Spark session used throughout this tutorial
spark = SparkSession.builder.appName('pyspark - example rename column').getOrCreate()
  
pokedex = [
    ("Bulbasaur",("Grass","Poison"),1),
    ("Ivysaur",("Grass","Poison"),2),
    ("Venusaur",("Grass","Poison"),3),
    ("Charmeleon",("Fire","Fire"),5),
    ("Charizard",("Fire","Flying"),6),
    ("Wartortle",("Water","Water"),8),
    ("Blastoise",("Water","Water"),9)
]

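schema = StructType([
    StructField('Name', StringType(), True),
    # 'Type' is a nested struct with two string fields
    StructField('Type', StructType([
        StructField('Primary', StringType(), True),
        StructField('Secondary', StringType(), True)
    ])),
    # 'Index' is declared as a string, so it appears as string in the schemas below
    StructField('Index', StringType(), True)
])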
 
df = spark.createDataFrame(data=pokedex, schema = schema)
df.printSchema()
df.show(truncate=False)

PySpark withColumnRenamed

PySpark withColumnRenamed – To rename a single column

One of the simplest ways to rename a column is the withColumnRenamed() function. It takes two parameters:

existing: The current name of the column you want to rename.
new: The new column name.

Using our example dataframe, we will change the name of the “Name” column to “Pokemon_Name” :

# Rename single column using withColumnRenamed

df1 = df.withColumnRenamed("Name","Pokemon_Name")
df1.printSchema()

This gives us :

root
 |-- Pokemon_Name: string (nullable = true)
 |-- Type: struct (nullable = true)
 |    |-- Primary: string (nullable = true)
 |    |-- Secondary: string (nullable = true)
 |-- Index: string (nullable = true)
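
One behaviour worth knowing: withColumnRenamed() returns the dataframe unchanged when the schema does not contain the given column name, so a typo in the old name goes unnoticed. Below is a minimal sketch of a stricter wrapper. The name rename_strict is hypothetical and not part of PySpark, and the check is an exact, case-sensitive match on df.columns.

```python
def rename_strict(df, existing, new):
    """Rename a column, but fail loudly if `existing` is missing.

    `rename_strict` is a hypothetical helper, not a PySpark function.
    `df` is expected to be a pyspark.sql.DataFrame.
    """
    if existing not in df.columns:
        raise ValueError(
            f"Column {existing!r} not found; available columns: {df.columns}"
        )
    return df.withColumnRenamed(existing, new)


# Usage with the example dataframe:
# df1 = rename_strict(df, "Name", "Pokemon_Name")   # renamed as before
# rename_strict(df, "Nmae", "Pokemon_Name")         # raises ValueError
```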

PySpark withColumnRenamed – To rename multiple columns

We can also chain several withColumnRenamed() calls to rename several columns in one statement:

# Rename multiple columns using withColumnRenamed

df1 = df.withColumnRenamed("Name","Pokemon_Name").withColumnRenamed("Index","Number_id")
df1.printSchema()
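
Chaining becomes verbose when many columns are involved. Below is a minimal sketch that loops over a dictionary of old and new names instead. rename_columns is a hypothetical helper name, not a PySpark function. On Spark 3.4 and later, DataFrame.withColumnsRenamed() accepts such a dictionary directly.

```python
def rename_columns(df, mapping):
    """Apply withColumnRenamed once per (old name -> new name) pair.

    `rename_columns` is a hypothetical helper, not a PySpark function.
    `df` is expected to be a pyspark.sql.DataFrame.
    """
    for existing, new in mapping.items():
        df = df.withColumnRenamed(existing, new)
    return df


# Usage with the example dataframe:
# df1 = rename_columns(df, {"Name": "Pokemon_Name", "Index": "Number_id"})
# df1.printSchema()
```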

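PySpark withColumnRenamed – To rename nested columns

withColumnRenamed() only works on top-level columns, so it cannot rename the fields nested inside the “Type” struct. Instead, we can use withColumn() to copy each nested field into a new top-level column with the name we want, then drop the original struct. This flattens the struct into one column per field, which can be useful in some situations.

Here’s how to do it:

# Rename nested fields by extracting them with withColumn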

df1 = df.withColumn("Primary_Type",f.col("Type.Primary")) \
      .withColumn("Secondary_Type",f.col("Type.Secondary")) \
      .drop("Type")
df1.printSchema()
df1.show()

This gives us :

root
 |-- Name: string (nullable = true)
 |-- Index: string (nullable = true)
 |-- Primary_Type: string (nullable = true)
 |-- Secondary_Type: string (nullable = true)

+----------+-----+------------+--------------+
|      Name|Index|Primary_Type|Secondary_Type|
+----------+-----+------------+--------------+
| Bulbasaur|    1|       Grass|        Poison|
|   Ivysaur|    2|       Grass|        Poison|
|  Venusaur|    3|       Grass|        Poison|
|Charmeleon|    5|        Fire|          Fire|
| Charizard|    6|        Fire|        Flying|
| Wartortle|    8|       Water|         Water|
| Blastoise|    9|       Water|         Water|
+----------+-----+------------+--------------+
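
The approach above flattens the struct. If the nested fields should be renamed while staying inside “Type”, one option is to rebuild the struct with the Spark SQL function named_struct. Below is a minimal sketch that only builds the expression string. The helper name renamed_struct_expr and the new field names “Main” and “Alt” are assumptions chosen for illustration.

```python
def renamed_struct_expr(struct_col, field_map):
    """Build a Spark SQL expression that rebuilds `struct_col` with renamed fields.

    `field_map` maps each existing field name to its new name. Fields that
    are not listed are left out of the rebuilt struct.
    `renamed_struct_expr` is a hypothetical helper, not a PySpark function.
    """
    parts = []
    for old_field, new_field in field_map.items():
        # named_struct takes alternating (name, value) arguments
        parts.append(f"'{new_field}', {struct_col}.{old_field}")
    return f"named_struct({', '.join(parts)})"


expr = renamed_struct_expr("Type", {"Primary": "Main", "Secondary": "Alt"})
# expr == "named_struct('Main', Type.Primary, 'Alt', Type.Secondary)"

# Usage with the example dataframe (f is pyspark.sql.functions):
# df1 = df.withColumn("Type", f.expr(expr))
# df1.printSchema()
```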

Pyspark Rename Column Using selectExpr() function

Using the selectExpr() function in PySpark, we can also rename one or more columns of our dataframe. We will use this function to rename the “Name” and “Index” columns to “Pokemon_Name” and “Number_id” respectively:

# Rename single or multiple columns using selectExpr()

df1 = df.selectExpr("Name as Pokemon_Name", "Index as Number_id","Type")
df1.printSchema()
df1.show()

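We use the “AS” keyword to give each column its new name (an alias). Columns that are not listed in selectExpr() are dropped from the result, which is why “Type” is passed as well.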
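
Writing every column out by hand is error-prone on wide dataframes. Below is a minimal sketch that generates the expression list from the existing column names and a rename mapping. rename_exprs is a hypothetical helper name, and the backticks are there so names containing spaces or other special characters still parse.

```python
def rename_exprs(columns, mapping):
    """Build selectExpr() expressions that keep every column, renaming some.

    `rename_exprs` is a hypothetical helper, not a PySpark function.
    """
    exprs = []
    for name in columns:
        if name in mapping:
            exprs.append(f"`{name}` as `{mapping[name]}`")
        else:
            # Columns without a new name are passed through unchanged
            exprs.append(f"`{name}`")
    return exprs


exprs = rename_exprs(
    ["Name", "Type", "Index"],
    {"Name": "Pokemon_Name", "Index": "Number_id"},
)
# exprs == ["`Name` as `Pokemon_Name`", "`Type`", "`Index` as `Number_id`"]

# Usage with the example dataframe:
# df1 = df.selectExpr(*rename_exprs(df.columns, {"Name": "Pokemon_Name", "Index": "Number_id"}))
```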

Pyspark Rename Column Using alias() function

The alias() function renames a column inside a select() call, so one or more columns can be renamed in a single statement:

# Rename column using alias() function

df1 = df.select(f.col("Name").alias("Pokemon_Name"), f.col("Index").alias("Number_id"),"Type")
df1.printSchema()

This gives us :

root
 |-- Pokemon_Name: string (nullable = true)
 |-- Number_id: string (nullable = true)
 |-- Type: struct (nullable = true)
 |    |-- Primary: string (nullable = true)
 |    |-- Secondary: string (nullable = true)
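
Note that select() places the columns in the order they are listed, which is why “Type” now comes last. To keep the original order and rename only some columns, the select list can be built from df.columns. Below is a minimal sketch, where select_renamed is a hypothetical helper name rather than a PySpark function.

```python
def select_renamed(df, mapping):
    """Select every column in its original order, aliasing those in `mapping`.

    `select_renamed` is a hypothetical helper, not a PySpark function.
    `df` is expected to be a pyspark.sql.DataFrame.
    """
    return df.select(
        [df[name].alias(mapping.get(name, name)) for name in df.columns]
    )


# Usage with the example dataframe:
# df1 = select_renamed(df, {"Name": "Pokemon_Name", "Index": "Number_id"})
# df1.printSchema()
```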

Pyspark Rename Column Using toDF() function

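The toDF() function returns a new dataframe with the column names you pass to it. You must provide one name per column, in the same order as the existing columns, including the columns you do not want to change. We can therefore use this function to rename the columns of our PySpark dataframe :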

# Rename column using toDF() function 

df1 = df.toDF("Pokemon_Name","Type","Number_id")
df1.printSchema()
df1.show()

This gives us :

root
 |-- Pokemon_Name: string (nullable = true)
 |-- Type: struct (nullable = true)
 |    |-- Primary: string (nullable = true)
 |    |-- Secondary: string (nullable = true)
 |-- Number_id: string (nullable = true)

+------------+---------------+---------+
|Pokemon_Name|           Type|Number_id|
+------------+---------------+---------+
|   Bulbasaur|[Grass, Poison]|        1|
|     Ivysaur|[Grass, Poison]|        2|
|    Venusaur|[Grass, Poison]|        3|
|  Charmeleon|   [Fire, Fire]|        5|
|   Charizard| [Fire, Flying]|        6|
|   Wartortle| [Water, Water]|        8|
|   Blastoise| [Water, Water]|        9|
+------------+---------------+---------+
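
Because toDF() takes the full list of names, it suits bulk renames where every name is derived from the old one. Below is a minimal sketch that normalises names to lower-case with underscores. snake_case_names is a hypothetical helper name, and the sketch assumes the cleaned names do not collide with each other.

```python
import re


def snake_case_names(columns):
    """Return lower-case, underscore-separated versions of the given names.

    `snake_case_names` is a hypothetical helper, not a PySpark function.
    """
    cleaned = []
    for name in columns:
        # Collapse every run of non-alphanumeric characters into one underscore
        name = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
        cleaned.append(name.lower())
    return cleaned


new_names = snake_case_names(["Pokemon Name", "Type", "Number-ID"])
# new_names == ["pokemon_name", "type", "number_id"]

# Usage with the example dataframe:
# df1 = df.toDF(*snake_case_names(df.columns))
# df1.printSchema()
```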

Conclusion

In this article we learned the different ways to rename one or more columns in a PySpark dataframe. I hope it helps you use these functions. Feel free to send me comments, I would be happy to read them 🙂




By ayed_amira

I'm a data scientist. Passionate about new technologies and programming, I created this website mainly for people who want to learn more about data science and programming :)
