Pyspark Concat – Concatenate two columns in pyspark


Pyspark concat: In this tutorial, we will learn how to concatenate two or more columns in a pyspark dataframe.

Introduction

To concatenate several columns from a dataframe, pyspark.sql.functions provides two functions: concat() and concat_ws().

The Pyspark SQL concat() function is mainly used to concatenate several DataFrame columns into one column. It is possible to concatenate string, binary and array columns.

pyspark.sql.functions.concat(*cols)

The Pyspark SQL concat_ws() function concatenates several string columns into one column with a given separator or delimiter. Unlike the concat() function, concat_ws() allows you to specify a separator without using the lit() function.

pyspark.sql.functions.concat_ws(sep, *cols)

In the rest of this tutorial, we will see different examples of the use of these two functions:

  • Concatenate two columns in pyspark without a separator.
  • Concatenate columns in pyspark with a single space.
  • Concatenate columns with a comma as separator in pyspark.
  • Concatenate columns by removing spaces at the beginning and end of strings
  • Concatenate two columns of different types (string and integer)

To illustrate these different points, we will use the following pyspark dataframe:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, col, trim
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('pyspark - concat and concat_ws').getOrCreate()
sc = spark.sparkContext
  
tennis = [
    ("Roger","Federer",6630,5),
    ("Rafael","Nadal",9850,2),
    ("Novak","Djokovic",11830,1),
    ("Dominic","Thiem",9125,3),
    ("Daniil","Medvedev",6970,4),
    ("Alexander","Zverev",5525,6),
    ("Andrey","Rublev",3919,7)
]

schema = StructType([
    StructField('FirstName', StringType(), True),
    StructField('LastName', StringType(), True),
    StructField('Points', IntegerType(), True),
    StructField('Rank', IntegerType(), True)
])

df = spark.createDataFrame(data=tennis, schema=schema)
df.printSchema()
df.show(truncate=False)

This dataframe has 4 columns:

  • The tennis player’s first name
  • The tennis player’s last name
  • The player’s number of points in the ATP rankings
  • The player’s ATP ranking

Concatenate two columns in pyspark without a separator

There are several ways to concatenate two or more columns without a separator. The first is to use the pyspark select() function.

Concatenate two columns using select()

df.select("*", concat(col("FirstName"),col("LastName")).alias("Player")).show()

The result is:

+---------+--------+------+----+---------------+
|FirstName|LastName|Points|Rank|         Player|
+---------+--------+------+----+---------------+
|    Roger| Federer|  6630|   5|   RogerFederer|
|   Rafael|   Nadal|  9850|   2|    RafaelNadal|
|    Novak|Djokovic| 11830|   1|  NovakDjokovic|
|  Dominic|   Thiem|  9125|   3|   DominicThiem|
|   Daniil|Medvedev|  6970|   4| DaniilMedvedev|
|Alexander|  Zverev|  5525|   6|AlexanderZverev|
|   Andrey|  Rublev|  3919|   7|   AndreyRublev|
+---------+--------+------+----+---------------+

The second method is to use the withColumn() function:

Concatenate two columns using withColumn()

df.withColumn('Player', concat(col('FirstName'),col('LastName'))).show()

It produces the same result as the first method.

Concatenate columns in pyspark with a single space

To add a space between the values of each column, we can use the concat() function with the lit() function.

If you want to have more information about the lit() function, read this post.

Concatenate two columns with a single space using lit()

df.withColumn('Player', concat(col('FirstName'),lit(" "),col('LastName'))).show()
+---------+--------+------+----+----------------+
|FirstName|LastName|Points|Rank|          Player|
+---------+--------+------+----+----------------+
|    Roger| Federer|  6630|   5|   Roger Federer|
|   Rafael|   Nadal|  9850|   2|    Rafael Nadal|
|    Novak|Djokovic| 11830|   1|  Novak Djokovic|
|  Dominic|   Thiem|  9125|   3|   Dominic Thiem|
|   Daniil|Medvedev|  6970|   4| Daniil Medvedev|
|Alexander|  Zverev|  5525|   6|Alexander Zverev|
|   Andrey|  Rublev|  3919|   7|   Andrey Rublev|
+---------+--------+------+----+----------------+

To avoid using the lit() function, you can use the concat_ws() function, which inserts the given separator between each value.

#Concatenate two columns with a single space using concat_ws()

df.withColumn('Player', concat_ws(" ",df.FirstName,df.LastName)).show()

Concatenate columns with a comma as separator in pyspark

Previously we used a space as the separator, but any character (or string of characters) can be used. In the following example we will use a comma:

#Concatenate two columns with a comma separator using lit() 

df.withColumn('Player', concat(col('FirstName'),lit(","),col('LastName'))).show()

+---------+--------+------+----+----------------+
|FirstName|LastName|Points|Rank|          Player|
+---------+--------+------+----+----------------+
|    Roger| Federer|  6630|   5|   Roger,Federer|
|   Rafael|   Nadal|  9850|   2|    Rafael,Nadal|
|    Novak|Djokovic| 11830|   1|  Novak,Djokovic|
|  Dominic|   Thiem|  9125|   3|   Dominic,Thiem|
|   Daniil|Medvedev|  6970|   4| Daniil,Medvedev|
|Alexander|  Zverev|  5525|   6|Alexander,Zverev|
|   Andrey|  Rublev|  3919|   7|   Andrey,Rublev|
+---------+--------+------+----+----------------+

It is also possible to use the concat_ws() function which produces the same result:

#Concatenate two columns with a comma separator using concat_ws() 

df.withColumn('Player', concat_ws(",",df.FirstName,df.LastName)).show()

Concatenate columns by removing spaces at the beginning and end of strings

To remove blanks at the beginning and the end of the concatenated string, you can combine the concat() function with the pyspark trim() function:

#remove blanks using trim()

df.withColumn('Player', trim(concat(col('FirstName'),col('LastName')))).show()
+---------+--------+------+----+---------------+
|FirstName|LastName|Points|Rank|         Player|
+---------+--------+------+----+---------------+
|    Roger| Federer|  6630|   5|   RogerFederer|
|   Rafael|   Nadal|  9850|   2|    RafaelNadal|
|    Novak|Djokovic| 11830|   1|  NovakDjokovic|
|  Dominic|   Thiem|  9125|   3|   DominicThiem|
|   Daniil|Medvedev|  6970|   4| DaniilMedvedev|
|Alexander|  Zverev|  5525|   6|AlexanderZverev|
|   Andrey|  Rublev|  3919|   7|   AndreyRublev|
+---------+--------+------+----+---------------+

Concatenate numeric and string column in pyspark

The concat() function can also concatenate columns of different types. In our example we will concatenate the LastName column (String) with the Rank column (Integer):

Concatenate String and Integer Columns

df.withColumn('PlayerRank', concat(col('LastName'),lit(","),col('Rank'))).show()

+---------+--------+------+----+----------+
|FirstName|LastName|Points|Rank|PlayerRank|
+---------+--------+------+----+----------+
|    Roger| Federer|  6630|   5| Federer,5|
|   Rafael|   Nadal|  9850|   2|   Nadal,2|
|    Novak|Djokovic| 11830|   1|Djokovic,1|
|  Dominic|   Thiem|  9125|   3|   Thiem,3|
|   Daniil|Medvedev|  6970|   4|Medvedev,4|
|Alexander|  Zverev|  5525|   6|  Zverev,6|
|   Andrey|  Rublev|  3919|   7|  Rublev,7|
+---------+--------+------+----+----------+

Conclusion

In this tutorial you have learned how to use the concat() and concat_ws() functions in Pyspark SQL. These functions are simple to learn and very useful in everyday data work. I hope this has helped you get a better understanding of the two functions. If you have any questions, please do not hesitate to leave me a comment. I would be happy to answer them 🙂

Thanks for reading, and see you soon for new tutorials!



By ayed_amira

