PySpark reduceByKey With Example

pyspark reducebykey

PySpark reduceByKey : In this tutorial we will learn how to use the reducebykey function in spark.

If you want to learn more about spark, you can read this book : (As an Amazon Partner, I make a profit on qualifying purchases) :

Introduction

The reduceByKey() function only applies to RDDs that contain key and value pairs. This is the case for RDDS with a map or a tuple as given elements.It uses an asssociative and commutative reduction function to merge the values of each key, which means that this function produces the same result when applied repeatedly to the same data set. The results are returned directly in the form of a dictionary.

When reduceByKey () is executed, the output will be partitioned either by numPartitions or by the default parallelism level (default partitioner is hash-partition). This is a function that can be very expensive in terms of resources since the data is often available on several partitions.

In this tutorial we are going to use the reduceByKey() function on the RDD of the following figure :

pyspark reducebykey example
Apply PySpark reduceByKey() function – Example

This example is equivalent to counting the number of occurrences of each key in our RDD.

PySpark reduceByKey() Syntax

The function takes up to 3 input parameters :

# Syntax reduceByKey() function

reduceByKey( func , numPartitions = None , partitionFunc = <function portable_hash> )

As we saw in introduction, the output will be partitioned with numPartitions partitions.

ReduceByKey() Example Using PySpark

To test the reduceByKey() function we will use the following RDD :

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
spark = SparkSession.builder.appName('pyspark - example reducebykey()').getOrCreate()
sc = spark.sparkContext
  
avengers = [
    ("Hulk",1),
    ("Iron man",1),
    ("Hulk",1),
    ("Thor",1),
    ("Hulk",1),
    ("Iron man",1),
    ("Thor",1),
    ("Iron man",1),
    ("Spiderman",1),
    ("Thor",1)
]

schema = StructType([
        
         StructField('Name', StringType(), True),
         StructField('Index', StringType(), True)
         ])

rdd=spark.sparkContext.parallelize(avengers)
print(rdd.take(10))
# Result

[('Hulk', 1), ('Iron man', 1), ('Hulk', 1), ('Thor', 1), ('Hulk', 1), ('Iron man', 1), ('Thor', 1), ('Iron man', 1), ('Spiderman', 1), ('Thor', 1)]

We will then use the function to reduce the word string by applying the sum function on the values. This allows to count the number of values for each key.

# Apply reduceByKey() function

rdd2=rdd.reduceByKey(lambda a,b: a+b)
print(rdd2.collect())
# Result
[('Thor', 3), ('Hulk', 3), ('Iron man', 3), ('Spiderman', 1)]

Conclusion

In this tutorial, we learned how to use the reduceByKey() function present in PySpark to merge the values of each key using an associative reduction function. This is very powerful but very resource consuming too.

I hope this tutorial has helped you to better understand the use of this function. Do not hesitate to tell me in comments if you have any problems using it.


If you want to learn more about spark, you can read one of those books : (As an Amazon Partner, I make a profit on qualifying purchases) :

Back to the python section

Published
Categorized as Python

By ayed_amira

I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)

Leave a comment

Your email address will not be published.