Pyspark Concat – Concatenate two columns in pyspark

Pyspark concat: In this tutorial we will learn how to concatenate two or more columns of a pyspark dataframe.
Introduction
To concatenate several columns from a dataframe, pyspark.sql.functions provides two functions: concat() and concat_ws().
The Pyspark SQL concat() function is mainly used to concatenate several DataFrame columns into one column. It is possible to concatenate string, binary and array columns.
pyspark.sql.functions.concat(*cols)
The Pyspark SQL concat_ws() function concatenates several string columns into one column with a given separator or delimiter. Unlike the concat() function, the concat_ws() function allows you to specify a separator without using the lit() function.
pyspark.sql.functions.concat_ws(sep, *cols)
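As a quick illustration of the two call shapes (a minimal sketch, assuming the imports and the dataframe built in the next section), the following two expressions produce the same concatenated value:
#concat() has no separator parameter, so the separator is passed as a literal column
df.select(concat(col("FirstName"), lit(" "), col("LastName")).alias("Player"))
#concat_ws() takes the separator as its first argument
df.select(concat_ws(" ", col("FirstName"), col("LastName")).alias("Player"))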
In the rest of this tutorial, we will see different examples of the use of these two functions:
- Concatenate two columns in pyspark without a separator.
- Concatenate columns in pyspark with a single space.
- Concatenate columns with a comma as separator in pyspark.
- Concatenate columns by removing spaces at the beginning and end of strings.
- Concatenate two columns of different types (string and integer).
To illustrate these different points, we will use the following pyspark dataframe:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, col, trim
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('pyspark - concat and concat_ws').getOrCreate()
sc = spark.sparkContext

# Sample data: first name, last name, ATP points, ATP rank
tennis = [
    ("Roger", "Federer", 6630, 5),
    ("Rafael", "Nadal", 9850, 2),
    ("Novak", "Djokovic", 11830, 1),
    ("Dominic", "Thiem", 9125, 3),
    ("Daniil", "Medvedev", 6970, 4),
    ("Alexander", "Zverev", 5525, 6),
    ("Andrey", "Rublev", 3919, 7)
]

schema = StructType([
    StructField('FirstName', StringType(), True),
    StructField('LastName', StringType(), True),
    StructField('Points', IntegerType(), True),
    StructField('Rank', IntegerType(), True)
])

df = spark.createDataFrame(data=tennis, schema=schema)
df.printSchema()
df.show(truncate=False)
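For reference, the printSchema() and show() calls above should produce output similar to this:
root
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Points: integer (nullable = true)
 |-- Rank: integer (nullable = true)
+---------+--------+------+----+
|FirstName|LastName|Points|Rank|
+---------+--------+------+----+
|Roger    |Federer |6630  |5   |
|Rafael   |Nadal   |9850  |2   |
|Novak    |Djokovic|11830 |1   |
|Dominic  |Thiem   |9125  |3   |
|Daniil   |Medvedev|6970  |4   |
|Alexander|Zverev  |5525  |6   |
|Andrey   |Rublev  |3919  |7   |
+---------+--------+------+----+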
This dataframe has 4 columns:
- The tennis player’s first name
- The tennis player’s last name
- The player’s number of points in the ATP rankings
- The player’s ATP ranking
Concatenate two columns in pyspark without a separator
There are several methods to concatenate two or more columns without a separator. The first one is to use the pyspark select() function.
Concatenate two columns using select()
df.select("*", concat(col("FirstName"),col("LastName")).alias("Player")).show()
The result is :
+---------+--------+------+----+---------------+
|FirstName|LastName|Points|Rank| Player|
+---------+--------+------+----+---------------+
| Roger| Federer| 6630| 5| RogerFederer|
| Rafael| Nadal| 9850| 2| RafaelNadal|
| Novak|Djokovic| 11830| 1| NovakDjokovic|
| Dominic| Thiem| 9125| 3| DominicThiem|
| Daniil|Medvedev| 6970| 4| DaniilMedvedev|
|Alexander| Zverev| 5525| 6|AlexanderZverev|
| Andrey| Rublev| 3919| 7| AndreyRublev|
+---------+--------+------+----+---------------+
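The "*" in select() keeps all of the original columns alongside the new one; to return only the concatenated column, select just the concat() expression:
#Keep only the concatenated column
df.select(concat(col("FirstName"), col("LastName")).alias("Player")).show()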
The second method is to use the withColumn() function:
Concatenate two columns using withColumn()
df.withColumn('Player', concat(col('FirstName'),col('LastName'))).show()
It produces the same result as the first method.
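Note that withColumn(), like every DataFrame transformation, returns a new DataFrame rather than modifying df in place, so assign the result if you want to reuse it later:
#Keep the concatenated column for later use
df_player = df.withColumn('Player', concat(col('FirstName'), col('LastName')))
df_player.show()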
Concatenate columns in pyspark with a single space
To add a space between the values of each column, we can use the concat() function with the lit() function.
If you want to have more information about the lit() function, read this post.
Concatenate two columns with a single space using lit()
df.withColumn('Player', concat(col('FirstName'),lit(" "),col('LastName'))).show()
+---------+--------+------+----+----------------+
|FirstName|LastName|Points|Rank| Player|
+---------+--------+------+----+----------------+
| Roger| Federer| 6630| 5| Roger Federer|
| Rafael| Nadal| 9850| 2| Rafael Nadal|
| Novak|Djokovic| 11830| 1| Novak Djokovic|
| Dominic| Thiem| 9125| 3| Dominic Thiem|
| Daniil|Medvedev| 6970| 4| Daniil Medvedev|
|Alexander| Zverev| 5525| 6|Alexander Zverev|
| Andrey| Rublev| 3919| 7| Andrey Rublev|
+---------+--------+------+----+----------------+
To avoid using the lit() function, it is possible to use the concat_ws() function, which lets you specify the separator directly.
#Concatenate two columns with a single space using concat_ws()
df.withColumn('Player', concat_ws(" ", df.FirstName, df.LastName)).show()
Concatenate columns with a comma as separator in pyspark
Previously we used a space as the separator, but any string can be used. In the following example we will use a comma as the separator:
#Concatenate two columns with a comma separator using lit()
df.withColumn('Player', concat(col('FirstName'),lit(","),col('LastName'))).show()
+---------+--------+------+----+----------------+
|FirstName|LastName|Points|Rank| Player|
+---------+--------+------+----+----------------+
| Roger| Federer| 6630| 5| Roger,Federer|
| Rafael| Nadal| 9850| 2| Rafael,Nadal|
| Novak|Djokovic| 11830| 1| Novak,Djokovic|
| Dominic| Thiem| 9125| 3| Dominic,Thiem|
| Daniil|Medvedev| 6970| 4| Daniil,Medvedev|
|Alexander| Zverev| 5525| 6|Alexander,Zverev|
| Andrey| Rublev| 3919| 7| Andrey,Rublev|
+---------+--------+------+----+----------------+
It is also possible to use the concat_ws() function, which produces the same result:
#Concatenate two columns with a comma separator using concat_ws()
df.withColumn('Player', concat_ws(",",df.FirstName,df.LastName)).show()
Concatenate columns by removing spaces at the beginning and end of strings
To remove blanks at the beginning and the end of the resulting string, the concat() function can be combined with the pyspark trim() function:
#remove blanks using trim()
df.withColumn('Player', trim(concat(col('FirstName'),col('LastName')))).show()
+---------+--------+------+----+---------------+
|FirstName|LastName|Points|Rank| Player|
+---------+--------+------+----+---------------+
| Roger| Federer| 6630| 5| RogerFederer|
| Rafael| Nadal| 9850| 2| RafaelNadal|
| Novak|Djokovic| 11830| 1| NovakDjokovic|
| Dominic| Thiem| 9125| 3| DominicThiem|
| Daniil|Medvedev| 6970| 4| DaniilMedvedev|
|Alexander| Zverev| 5525| 6|AlexanderZverev|
| Andrey| Rublev| 3919| 7| AndreyRublev|
+---------+--------+------+----+---------------+
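In the example above, trim() is applied to the already concatenated string, so it only removes spaces at the very beginning and very end of the result. If the individual columns themselves may contain leading or trailing spaces, trim() can also be applied to each column before concatenating (a variation on the same idea):
#trim each column before concatenating
df.withColumn('Player', concat(trim(col('FirstName')), lit(' '), trim(col('LastName')))).show()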
Concatenate numeric and string column in pyspark
The concat() function also handles columns of different types. In our example we will concatenate the LastName column (string) with the Rank column (integer):
Concatenate String and Integer Columns
df.withColumn('PlayerRank', concat(col('LastName'),lit(","),col('Rank'))).show()
+---------+--------+------+----+----------+
|FirstName|LastName|Points|Rank|PlayerRank|
+---------+--------+------+----+----------+
| Roger| Federer| 6630| 5| Federer,5|
| Rafael| Nadal| 9850| 2| Nadal,2|
| Novak|Djokovic| 11830| 1|Djokovic,1|
| Dominic| Thiem| 9125| 3| Thiem,3|
| Daniil|Medvedev| 6970| 4|Medvedev,4|
|Alexander| Zverev| 5525| 6| Zverev,6|
| Andrey| Rublev| 3919| 7| Rublev,7|
+---------+--------+------+----+----------+
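Here concat() relies on Spark's implicit casting to turn the integer Rank column into a string. If you prefer to make the conversion explicit, or want to use concat_ws(), the column can be cast to a string first (a sketch producing the same result):
#Explicitly cast the Rank column to string before concatenating
df.withColumn('PlayerRank', concat_ws(',', col('LastName'), col('Rank').cast('string'))).show()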
Conclusion
In this tutorial you have learned how to use the concat() and concat_ws() functions in Pyspark SQL. These functions are simple to learn and come up constantly in day-to-day work. I hope this has helped you get a better understanding of these two functions. If you have any questions, please do not hesitate to leave a comment. I would be happy to answer them 🙂
Thanks for reading, and see you soon for new tutorials!