PySpark Substring From a Dataframe Column

pyspark substring

PySpark Substring : In this tutorial we will see how to get a substring of a column on PySpark dataframe.

Introduction

There are several methods to extract a substring from a DataFrame string column:

  • The substring() function: This function is available using SPARK SQL in the pyspark.sql.functions module.
  • The substr() function: The function is also available through SPARK SQL but in the pyspark.sql.Column module.

In this tutorial, I will show you how to get the substring of the column in pyspark using the substring() and substr() functions and also show you how to get a substring starting towards the end of the string.

Pyspark Substring Using SQL Function substring()

We have seen that the substring() function is available thanks to the pyspark.sql.functions module. The syntax of the function is as follows :

# Syntax substring()

substring(str, pos, len)

The function takes 3 parameters :

  • str : the string whose substring we want to extract
  • pos: the position at which the substring starts
  • len: the length of the substring to be extracted

The substring starts from the position specified in the parameter pos and is of length len when str is String type.

Note : It is important to note that the index position is not based on 0 but starts from 1.

Pyspark substring() function using withColumn()

Below is an example of the substring() function using withColumn() :

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
spark = SparkSession.builder.appName('pyspark - substring() and substr()').getOrCreate()
sc = spark.sparkContext
  
web = [
    ("AMIRADATA","BLOG"),
    ("FACEBOOK","SOCIAL"),
    ("GOOGLE","SEARCH ENGINE"),
    ("INSTAGRAM","SOCIAL"),
    ("TWITTER","SOCIAL"),
]

schema = StructType([
        
         StructField('Website', StringType(), True),
         StructField('Type', StringType(), True)
         ])
 
df = spark.createDataFrame(data=web , schema = schema)
df = df.withColumn('Website1', f.substring('Website', 1,4))
df.printSchema()
df.show(truncate=False)

root
 |-- Website: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Website1: string (nullable = true)

+---------+-------------+--------+
|Website  |Type         |Website1|
+---------+-------------+--------+
|AMIRADATA|BLOG         |AMIR    |
|FACEBOOK |SOCIAL       |FACE    |
|GOOGLE   |SEARCH ENGINE|GOOG    |
|INSTAGRAM|SOCIAL       |INST    |
|TWITTER  |SOCIAL       |TWIT    |
+---------+-------------+--------+

In this example we have extracted the first 4 characters of the string from the Website column.

Pyspark substring() function using select()

We can get the substring of a column by using the select() function. Here is an example of its use :

# We use the previous dataframe

df=df.select('Website', f.substring('Website', 1,4).alias('Website1'))
df.printSchema()
df.show(truncate=False)

Get Substring from end of the column

At times, it can be interesting to leave towards the end of the string:

df = spark.createDataFrame(data=web , schema = schema)
df=df.select('Website', f.substring('Website', -4,4).alias('Website1'))
df.printSchema()
df.show(truncate=False)
+---------+--------+
|Website  |Website1|
+---------+--------+
|AMIRADATA|DATA    |
|FACEBOOK |BOOK    |
|GOOGLE   |OGLE    |
|INSTAGRAM|GRAM    |
|TWITTER  |TTER    |
+---------+--------+

Using substr() from pyspark.sql.Column

We also saw that it was possible to use the substr() function available in the pyspark.sql.Column module:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.appName('pyspark - substring() and substr()').getOrCreate()
sc = spark.sparkContext
  
web = [
    ("AMIRADATA","BLOG"),
    ("FACEBOOK","SOCIAL"),
    ("GOOGLE","SEARCH ENGINE"),
    ("INSTAGRAM","SOCIAL"),
    ("TWITTER","SOCIAL"),
]

schema = StructType([
        
         StructField('Website', StringType(), True),
         StructField('Type', StringType(), True)
         ])
 
df = spark.createDataFrame(data=web , schema = schema)
df = df.withColumn('Website1', col('Website').substr(1, 4))
df.printSchema()
df.show(truncate=False)
root
 |-- Website: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Website1: string (nullable = true)

+---------+-------------+--------+
|Website  |Type         |Website1|
+---------+-------------+--------+
|AMIRADATA|BLOG         |AMIR    |
|FACEBOOK |SOCIAL       |FACE    |
|GOOGLE   |SEARCH ENGINE|GOOG    |
|INSTAGRAM|SOCIAL       |INST    |
|TWITTER  |SOCIAL       |TWIT    |
+---------+-------------+--------+

This produces the same result as the substring() function.

Conclusion

In this tutorial, we learned how to get a substring of a column in a DataFrame. I hope this tutorial interested you and don’t hesitate to leave me a comment if you have any questions about either of these 2 methods!


If you want to learn more about spark, you can read one of those books : (As an Amazon Partner, I make a profit on qualifying purchases) :

The future belongs to those who believe in the beauty of their dreams.

Back to the python section

Published
Categorized as Python

By ayed_amira

I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)

Leave a comment

Your email address will not be published.