PySpark parallelize() – Create an RDD from a list collection

PySpark parallelize: In this tutorial, we will see how to use the parallelize() function to create an RDD from a Python list.

Introduction

The PySpark parallelize() function is a SparkContext method that creates an RDD from a Python list. An RDD (Resilient Distributed Dataset) is the fundamental PySpark data structure: an immutable, partitioned collection of elements that can be operated on in parallel.

Each RDD is characterized by five fundamental properties:

  • A list of partitions
  • A function for computing each partition
  • A list of dependencies on other RDDs
  • Optionally, a partitioner for keyed RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations for computing each split (for example, block locations for an HDFS file)

In this tutorial, I will explain how to use the parallelize() function with a list, and then how to create an empty RDD with the same function.

Using the PySpark parallelize() Function to Create an RDD

To use the parallelize() function, we first need to create a SparkSession and get its SparkContext. Here is how to create them:

# Create SparkSession and SparkContext

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('pyspark - parallelize').getOrCreate()
sc = spark.sparkContext

We will then create a list of elements to create our RDD.

# Create RDD using a list collection 

data = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]  # avoid naming this "list": it shadows the Python builtin
rdd = sc.parallelize(data)

To find out the number of partitions used and the number of elements present in our RDD, we can use the getNumPartitions() and count() functions respectively:

print("Partitions: " + str(rdd.getNumPartitions()))
print("1st element: " + str(rdd.first()))
print(rdd.collect())
print("Number of elements in RDD -> {counts}".format(counts=rdd.count()))

# Result

Partitions: 2
1st element: 100
[100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
Number of elements in RDD -> 10 
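The number of partitions (2 here) comes from the cluster's default parallelism, and you can override it by passing an explicit numSlices argument, e.g. sc.parallelize(data, 4). The way parallelize() splits a Python list into contiguous, roughly equal-sized slices can be sketched in plain Python; the slice_list helper below is hypothetical, for illustration only, and only approximates PySpark's internal slicing:

```python
def slice_list(data, num_slices):
    """Approximate how PySpark's parallelize() splits a list into
    num_slices contiguous, roughly equal-sized partitions."""
    n = len(data)
    return [data[int(n * i / num_slices):int(n * (i + 1) / num_slices)]
            for i in range(num_slices)]

data = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(slice_list(data, 2))
# [[100, 200, 300, 400, 500], [600, 700, 800, 900, 1000]]
```

Note that every element lands in exactly one slice, so concatenating the slices gives back the original list.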

In the example above, we created an RDD from a simple list of integers, but it is also possible to create an RDD from a list of tuples:

# Create RDD using a list of tuples

myRDD = sc.parallelize([('Hulk', 1), ('Thor', 2), ('Ironman', 3), ('Black Panther', 5), ('Captain America', 6)])
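For a keyed RDD like this one, Spark can use a partitioner to decide which partition each key-value pair goes to; a hash partitioner sends a pair to partition hash(key) % numPartitions, so equal keys always end up together. Here is a plain-Python sketch of that idea (the hash_partition helper is hypothetical, not a Spark API, and Spark's actual hash function differs from Python's built-in hash):

```python
def hash_partition(pairs, num_partitions):
    """Sketch of hash partitioning for key-value pairs, similar in
    spirit to Spark's HashPartitioner: same key -> same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

heroes = [('Hulk', 1), ('Thor', 2), ('Ironman', 3),
          ('Black Panther', 5), ('Captain America', 6)]
for i, part in enumerate(hash_partition(heroes, 2)):
    print(i, part)
```

This co-location of equal keys is what lets operations such as reduceByKey() combine values without shuffling a key's pairs across partitions a second time.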

Create an Empty RDD Using the PySpark parallelize() Function

In some cases we need to create an empty RDD. The SparkContext.parallelize() function allows you to do this by passing an empty list as its argument:

# Create an empty RDD

rdd = sc.parallelize([])
print(str(rdd.isEmpty()))

Note: You can also use the SparkContext emptyRDD() function instead of this method.

Conclusion

In this article, we have seen how to use the SparkContext.parallelize() function to create an RDD from a Python list. This function lets Spark distribute the data across multiple nodes instead of relying on a single node to process it, which improves the processing time of your program.

If you have any questions about how SparkContext.parallelize() works, please feel free to send me feedback.

See you soon for more tutorials.

By ayed_amira

