
PySpark Create an Empty Dataframe Using emptyRDD()

By ayed_amira , on 09/14/2020 , updated on 09/14/2020 - 3 minutes to read

This article explains how to create an empty dataframe in PySpark.

Introduction

In some cases you need an empty dataframe. For example, when a stream delivers no data, the remaining operations and transformations on the dataframe should still run even though it is empty. To handle such cases, we create an empty dataframe with the same schema as the data we expected.

There are two ways to create an empty dataframe:

  • By using the emptyRDD() function
  • By passing an empty list [] together with the dataframe schema

The rest of this tutorial explains both methods.

Create PySpark empty DataFrame using emptyRDD()

To create an empty dataframe this way, we first create an empty RDD. The easiest way to do that is to call spark.sparkContext.emptyRDD().

Once we have an empty RDD, we specify the schema of the dataframe we want to create. Here is the code that creates our empty PySpark dataframe:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Start (or reuse) a Spark session and get its SparkContext
spark = SparkSession.builder.appName('pyspark - create empty dataframe').getOrCreate()
sc = spark.sparkContext

# Each StructField gives a column name, a type, and whether nulls are allowed
schema = StructType([
    StructField('Pokemon', StringType(), True),
    StructField('PrimaryType', StringType(), True),
    StructField('Index', IntegerType(), True)
])

# An empty RDD plus an explicit schema gives an empty dataframe
df = spark.createDataFrame(sc.emptyRDD(), schema)
df.printSchema()

Our dataframe now exists and contains no data. The output of printSchema() shows that the schema was applied:

root
 |-- Pokemon: string (nullable = true)
 |-- PrimaryType: string (nullable = true)
 |-- Index: integer (nullable = true)

Create PySpark empty DataFrame using syntax []

If you don’t want to use the emptyRDD() function, you can pass an empty list [] instead, which produces the same result:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Start (or reuse) a Spark session; no SparkContext is needed this time
spark = SparkSession.builder.appName('pyspark - create empty dataframe').getOrCreate()

schema = StructType([
    StructField('Pokemon', StringType(), True),
    StructField('PrimaryType', StringType(), True),
    StructField('Index', IntegerType(), True)
])

# An empty list plus an explicit schema also gives an empty dataframe
df = spark.createDataFrame([], schema)
df.printSchema()

The printed schema is the same as before:

root
 |-- Pokemon: string (nullable = true)
 |-- PrimaryType: string (nullable = true)
 |-- Index: integer (nullable = true)

Conclusion

As you have seen, it is easy to create an empty dataframe in PySpark. You do not need to start from an empty dataframe to build a regular one, but the technique is useful when downstream code expects a dataframe with a known schema and no data is available.



I hope this tutorial has helped you a lot and we’ll see you soon for new courses!


ayed_amira

I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)
