PySpark lit() Function to Add a Literal or Constant Column to Dataframe


PySpark lit(): In this tutorial, we will see how to use the pyspark.sql.functions.lit() function in Spark SQL.

Introduction

The lit() function in PySpark creates a column from a constant (literal) value. It is most often used to add a new column that holds the same value in every row of a PySpark DataFrame.

The syntax of the function is as follows:

# Lit function

from pyspark.sql.functions import lit
lit(col)

The function is available after importing it from pyspark.sql.functions. It takes a single parameter, col, which holds the constant or literal value.

The lit() function returns a Column object.

We will use the following PySpark DataFrame to show how to use the lit() function:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('pyspark - example lit').getOrCreate()
 
pokedex = [
    ("Bulbasaur","Grass",1),
    ("Ivysaur","Grass",2),
    ("Venusaur","Grass",3),
    ("Charmeleon","Fire",5),
    ("Charizard","Fire",6),
    ("Wartortle","Water",8),
    ("Blastoise","Water",9)
]

schema = ["Name","PrimaryType","Index"]
df = spark.createDataFrame(data=pokedex, schema = schema)
df.printSchema()
df.show(truncate=False)
root
 |-- Name: string (nullable = true)
 |-- PrimaryType: string (nullable = true)
 |-- Index: long (nullable = true)

+----------+-----------+-----+
|Name      |PrimaryType|Index|
+----------+-----------+-----+
|Bulbasaur |Grass      |1    |
|Ivysaur   |Grass      |2    |
|Venusaur  |Grass      |3    |
|Charmeleon|Fire       |5    |
|Charizard |Fire       |6    |
|Wartortle |Water      |8    |
|Blastoise |Water      |9    |
+----------+-----------+-----+

We are going to add two columns to our PySpark DataFrame using the lit() function.

PySpark lit() function using select()

The two columns we are going to add are:

  • A Generation column that contains the pokemon generation (the integer 1)
  • A Game column that contains only the string "Pokemon"

To add these two columns using the select() function, you must proceed as follows:

# Import col() and lit() from the SQL functions module

from pyspark.sql.functions import col, lit
poke = df.select(
    col("Name"), col("PrimaryType"), col("Index"),
    lit(1).alias("Generation"), lit("Pokemon").alias("Game"),
)
poke.show(truncate=False)
+----------+-----------+-----+----------+-------+
|Name      |PrimaryType|Index|Generation|Game   |
+----------+-----------+-----+----------+-------+
|Bulbasaur |Grass      |1    |1         |Pokemon|
|Ivysaur   |Grass      |2    |1         |Pokemon|
|Venusaur  |Grass      |3    |1         |Pokemon|
|Charmeleon|Fire       |5    |1         |Pokemon|
|Charizard |Fire       |6    |1         |Pokemon|
|Wartortle |Water      |8    |1         |Pokemon|
|Blastoise |Water      |9    |1         |Pokemon|
+----------+-----------+-----+----------+-------+

We can see that the two columns have been added. Note that select() does not modify df: it returns a new DataFrame, which we stored in poke.

lit() function using withColumn()

It is also possible to use the withColumn() function to create new columns with lit():

from pyspark.sql.functions import col,lit
poke = df.withColumn("Generation",lit(1)).withColumn("Game",lit("Pokemon"))
poke.show(truncate=False)

This produces the same result as the select() example above.

It is also possible to add conditions so that the new column contains multiple different values:

from pyspark.sql.functions import col, lit, when
poke = df.withColumn("Generation", lit(1)).withColumn(
    "Game",
    when(col("Index") >= 5, lit("Pokemon Index >=5")).otherwise(lit("Pokemon Index < 5")),
)
poke.show(truncate=False)
+----------+-----------+-----+----------+-----------------+
|Name      |PrimaryType|Index|Generation|Game             |
+----------+-----------+-----+----------+-----------------+
|Bulbasaur |Grass      |1    |1         |Pokemon Index < 5|
|Ivysaur   |Grass      |2    |1         |Pokemon Index < 5|
|Venusaur  |Grass      |3    |1         |Pokemon Index < 5|
|Charmeleon|Fire       |5    |1         |Pokemon Index >=5|
|Charizard |Fire       |6    |1         |Pokemon Index >=5|
|Wartortle |Water      |8    |1         |Pokemon Index >=5|
|Blastoise |Water      |9    |1         |Pokemon Index >=5|
+----------+-----------+-----+----------+-----------------+

As you can see, we have different values depending on the index of each pokemon.

Conclusion

In this tutorial, you have learned how to add a constant or literal value to your PySpark DataFrame using the Spark SQL lit() function. It is a simple but powerful way to insert new columns into a DataFrame. Combined with functions such as array(), create_map() and struct(), it can also be used to create columns that contain arrays, maps or structures.

I hope you found this tutorial useful. Feel free to post in the forum if you have any specific questions on the subject.

See you soon for new tutorials!

By ayed_amira

I'm a data scientist. Passionate about new technologies and programming, I created this website mainly for people who want to learn more about data science and programming :)
