Convert Pyspark String to Date Format

convert pyspark string to date format

In this tutorial we will see how to convert a string into a date format using the PySpark to_date() function.

Introduction

It is often useful to convert a string to date format and vice versa. To convert a string to a date, we can use the to_date() function in SPARK SQL. To do the opposite, we need to use the cast() function, taking as argument a StringType () structure.

Note that Spark Date Functions supports all Java date formats specified in DateTimeFormatter such as :

  • ‘2011-12-03’
  • 3 Jun 2008 11:05:30
  • ‘20111203’

We will take as an example this pyspark dataframe :

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
spark = SparkSession.builder.appName('pyspark - example conversion string to date').getOrCreate()
sc = spark.sparkContext
  
pokedex = [
    ("RIHANNA","1988-02-20"),
    ("BEYONCE","1981-09-04"),
    ("DUA LIPA","1995-08-22")
]

schema = StructType([
        
         StructField('Artist', StringType(), True),
         StructField('DATE_BIRTH', StringType(), True)
         ])
 
df = spark.createDataFrame(data=pokedex, schema = schema)
df.printSchema()
df.show(truncate=False)
root
 |-- Artist: string (nullable = true)
 |-- DATE_BIRTH: string (nullable = true)

+--------+----------+
|Artist  |DATE_BIRTH|
+--------+----------+
|RIHANNA |1988-02-20|
|BEYONCE |1981-09-04|
|DUA LIPA|1995-08-22|
+--------+----------+

As you can see with the printschema(), the date_birth column type is a string. Let’s see how to change this.

Cast string column to date column in pyspark

As we saw in the introduction, the date_birth column is a string :

df.select("DATE_BIRTH").dtypes
# Return
[('DATE_BIRTH', 'string')]

If we want to do operations on the date column (e.g. to calculate the age of the artists according to the current date) we need to convert the string into a date like this :

from pyspark.sql.functions import to_date

df1 = df.withColumn('birth_date',to_date(df.DATE_BIRTH, 'yyyy-MM-dd'))
df1.printSchema()
df1.select("birth_date").dtypes
root
 |-- Artist: string (nullable = true)
 |-- DATE_BIRTH: string (nullable = true)
 |-- birth_date: date (nullable = true)

[('birth_date', 'date')]

Cast date column to string column in pyspark

To convert a date into a string, just use the cast() function with a StringType() as argument.

# Cast date to string

from pyspark.sql.types import StringType

df2 = df1.withColumn("birthday",df1["birth_date"].cast(StringType()))
df2.printSchema()
df2.select("birthday").dtypes
root
 |-- Artist: string (nullable = true)
 |-- DATE_BIRTH: string (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- birthday: string (nullable = true)

[('birthday', 'string')]

Conclusion

We have seen in this tutorial that it is quite simple to convert a string into a date format. However, we will have to check that the string is in the desired date format, in which case it will return either an error or a null value.

Don’t hesitate to tell me in comments if you are blocked in the use of one of these functions.


If you want to learn more about spark, you can read one of those books : (As an Amazon Partner, I make a profit on qualifying purchases) :

 Logic will get you from A to Z ; Imagination will get you everywhere

Back to the python section

Published
Categorized as Python

By ayed_amira

I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)

Leave a comment

Your email address will not be published.