PySpark parallelize() – Create an RDD from a list collection

PySpark parallelize: In this tutorial, we will see how to use the parallelize() function to create an RDD from a Python list.
Introduction
The PySpark parallelize() function is a SparkContext method that creates an RDD from a Python list. An RDD (Resilient Distributed Dataset) is the core PySpark data structure: an immutable, partitioned collection of elements that can be operated on in parallel.
Each RDD is characterized by five fundamental properties:
- A list of partitions
- A function for computing each partition (split)
- A list of dependencies on other RDDs
- Optionally, a partitioner for keyed RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations on which to compute each split (for example, block locations for an HDFS file)
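A few of these properties can be inspected directly on an RDD object. The short sketch below (which assumes a SparkContext named sc, as created in the next section) shows the number of partitions, how the elements are spread across them, and the partitioner:
# Sketch: inspecting a few RDD properties (assumes a SparkContext `sc`, created below)
rdd = sc.parallelize(range(10))
print(rdd.getNumPartitions())  # number of partitions
print(rdd.glom().collect())    # elements grouped by partition
print(rdd.partitioner)         # None: a plain, non-keyed RDD has no partitioner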
In this tutorial, I will first explain how to use parallelize() to create an RDD from a list, and then show how to create an empty RDD with the same function.
Using Pyspark Parallelize() Function to Create RDD
To use the parallelize() function, we first need to create a SparkSession and get its SparkContext. Here’s how to create them:
# Create SparkSession and SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('pyspark - parallelize').getOrCreate()
sc = spark.sparkContext
Next, we create a list of elements to build our RDD from.
# Create RDD using a list collection
data = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
rdd = sc.parallelize(data)
To find out how many partitions were used and how many elements our RDD contains, we can use getNumPartitions() and count() respectively; first() returns the first element, and collect() brings all the elements back to the driver:
print("Partitions: "+str(rdd.getNumPartitions()))
print("1st element: "+str(rdd.first()))
print(rdd.collect())
print("Number of elements in RDD -> {counts} ".format(counts=rdd.count()))
# Result
Partitions: 2
1st element: 100
[100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
Number of elements in RDD -> 10
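The number of partitions (2 in the output above) comes from Spark's default parallelism, but parallelize() also accepts an optional second argument, numSlices, to set it explicitly. A quick sketch:
# Explicitly set the number of partitions with the optional numSlices argument
rdd4 = sc.parallelize(data, 4)
print("Partitions: " + str(rdd4.getNumPartitions()))
# Partitions: 4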
In the example above, we created an RDD from a simple list of numbers, but it is also possible to create an RDD from a list of tuples (key-value pairs):
# Create RDD using a list of tuples
myRDD = sc.parallelize([('Hulk', 1), ('Thor', 2), ('Ironman', 3), ('Black Panther', 5), ('Captain America', 6)])
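Since this RDD holds (key, value) tuples, pair-RDD operations become available on it. A minimal sketch using a few of them:
# Sketch: a few pair-RDD operations on the keyed RDD above
print(myRDD.count())                                # 5
print(myRDD.collectAsMap())                         # {'Hulk': 1, 'Thor': 2, ...}
print(myRDD.mapValues(lambda v: v * 10).collect())  # values multiplied by 10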
Create an empty RDD using the PySpark parallelize() function
In some cases we need to create an empty RDD. The SparkContext.parallelize() function allows you to do this by passing an empty list as its parameter:
# Create an empty RDD
rdd = sc.parallelize([])
print(str(rdd.isEmpty()))
Note: it is also possible to use the SparkContext.emptyRDD() function instead of this method.
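For reference, here is a short sketch of that alternative; unlike parallelize([]), emptyRDD() creates an RDD with no partitions at all:
# Alternative: create an empty RDD with emptyRDD()
rdd_empty = sc.emptyRDD()
print(str(rdd_empty.isEmpty()))      # True
print(rdd_empty.getNumPartitions())  # 0, while parallelize([]) keeps the default number of partitions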
Conclusion
In this article, we have seen how to use the SparkContext.parallelize() function to create an RDD from a Python list. This function lets Spark distribute the data across multiple nodes instead of relying on a single node to process it, which improves the processing time of your program.
If you have any questions about how SparkContext.parallelize() works, please feel free to send me feedback.
See you soon for more tutorials.