PySpark reduceByKey : In this tutorial we will learn how to use the reducebykey function in spark.
If you want to learn more about spark, you can read this book : (As an Amazon Partner, I make a profit on qualifying purchases) :
The reduceByKey() function only applies to RDDs that contain key and value pairs. This is the case for RDDS with a map or a tuple as given elements.It uses an asssociative and commutative reduction function to merge the values of each key, which means that this function produces the same result when applied repeatedly to the same data set. The results are returned directly in the form of a dictionary.
When reduceByKey () is executed, the output will be partitioned either by numPartitions or by the default parallelism level (default partitioner is hash-partition). This is a function that can be very expensive in terms of resources since the data is often available on several partitions.
In this tutorial we are going to use the reduceByKey() function on the RDD of the following figure :
This example is equivalent to counting the number of occurrences of each key in our RDD.
PySpark reduceByKey() Syntax
The function takes up to 3 input parameters :
# Syntax reduceByKey() function reduceByKey( func , numPartitions = None , partitionFunc = <function portable_hash> )
As we saw in introduction, the output will be partitioned with numPartitions partitions.
ReduceByKey() Example Using PySpark
To test the reduceByKey() function we will use the following RDD :
from pyspark.sql import SparkSession from pyspark.sql import functions as f from pyspark.sql.types import StructType, StructField, StringType,IntegerType spark = SparkSession.builder.appName('pyspark - example reducebykey()').getOrCreate() sc = spark.sparkContext avengers = [ ("Hulk",1), ("Iron man",1), ("Hulk",1), ("Thor",1), ("Hulk",1), ("Iron man",1), ("Thor",1), ("Iron man",1), ("Spiderman",1), ("Thor",1) ] schema = StructType([ StructField('Name', StringType(), True), StructField('Index', StringType(), True) ]) rdd=spark.sparkContext.parallelize(avengers) print(rdd.take(10))
# Result [('Hulk', 1), ('Iron man', 1), ('Hulk', 1), ('Thor', 1), ('Hulk', 1), ('Iron man', 1), ('Thor', 1), ('Iron man', 1), ('Spiderman', 1), ('Thor', 1)]
We will then use the function to reduce the word string by applying the sum function on the values. This allows to count the number of values for each key.
# Apply reduceByKey() function rdd2=rdd.reduceByKey(lambda a,b: a+b) print(rdd2.collect())
# Result [('Thor', 3), ('Hulk', 3), ('Iron man', 3), ('Spiderman', 1)]
In this tutorial, we learned how to use the reduceByKey() function present in PySpark to merge the values of each key using an associative reduction function. This is very powerful but very resource consuming too.
I hope this tutorial has helped you to better understand the use of this function. Do not hesitate to tell me in comments if you have any problems using it.
If you want to learn more about spark, you can read one of those books : (As an Amazon Partner, I make a profit on qualifying purchases) :