How to Convert a PySpark DataFrame to Pandas

In this tutorial, we will see how to convert a PySpark DataFrame into a pandas DataFrame using the toPandas() method.

Introduction

After processing data in PySpark, we sometimes need to convert the PySpark DataFrame back to pandas in order to use certain machine learning libraries (some models, such as XGBoost, are not part of PySpark's built-in MLlib library).

However, toPandas() is one of the most expensive operations in PySpark and should therefore be used with care, especially when dealing with large volumes of data.

Pandas DataFrames are stored directly in RAM. This makes operations fast, but the size of a DataFrame is limited by the memory of a single machine.

Spark DataFrames, on the other hand, are distributed across the nodes of a Spark cluster, which consists of at least one machine, so the size of a DataFrame is limited by the size of the cluster. If we need to store more data, we simply add more nodes to the cluster.

Pandas DataFrames are stored in the memory of a single machine, while Spark DataFrames are spread across the nodes of a cluster.

It is therefore important to understand that when toPandas() is called on a Spark DataFrame, all of its data is collected to the driver program, which must have enough memory to hold it; otherwise an error will be raised.
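One way to protect the driver is to check the size of the DataFrame before converting it. The following is only a minimal sketch: the helper name `to_pandas_if_small` and the `max_rows` threshold are hypothetical, and a row count is just a rough proxy for memory use, since wide rows take more space than narrow ones.

```python
def to_pandas_if_small(df, max_rows=100_000):
    """Convert a Spark DataFrame to pandas only if it has at most max_rows rows.

    `df` is expected to be a pyspark.sql.DataFrame. `max_rows` is a
    hypothetical threshold that should be tuned to the driver's memory.
    """
    # limit() caps the number of rows that count() has to look at,
    # so we never count more than max_rows + 1 rows.
    row_count = df.limit(max_rows + 1).count()
    if row_count > max_rows:
        raise ValueError(
            f"DataFrame has more than {max_rows} rows; "
            "reduce it before calling toPandas()."
        )
    return df.toPandas()
```

With a small DataFrame `df`, `to_pandas_if_small(df, max_rows=1000)` returns the pandas DataFrame; with a larger one it raises a ValueError instead of risking an out-of-memory error on the driver.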

Convert Pyspark Dataframe to pandas using toPandas()

Convert to Pandas DataFrame

First of all, we will create a PySpark DataFrame:

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for this example
spark = SparkSession.builder.appName('pyspark - example toPandas()').getOrCreate()
  
pokedex = [
    ("Bulbasaur","Grass",1),
    ("Ivysaur","Grass",2),
    ("Venusaur","Grass",3),
    ("Charmeleon","Fire",5),
    ("Charizard","Fire",6),
    ("Wartortle","Water",8),
    ("Blastoise","Water",9)
]
 
schema = ["Name","PrimaryType","Index"]
df = spark.createDataFrame(data=pokedex, schema = schema)
df.printSchema()
df.show(truncate=False)
root
 |-- Name: string (nullable = true)
 |-- PrimaryType: string (nullable = true)
 |-- Index: long (nullable = true)

+----------+-----------+-----+
|Name      |PrimaryType|Index|
+----------+-----------+-----+
|Bulbasaur |Grass      |1    |
|Ivysaur   |Grass      |2    |
|Venusaur  |Grass      |3    |
|Charmeleon|Fire       |5    |
|Charizard |Fire       |6    |
|Wartortle |Water      |8    |
|Blastoise |Water      |9    |
+----------+-----------+-----+

As we saw in the introduction, PySpark provides a toPandas() method that converts a Spark DataFrame into a pandas DataFrame. Calling toPandas() collects all records of the PySpark DataFrame to the driver program. Running it on a dataset that does not fit in the driver's memory will cause an out-of-memory error and crash the application.

If your DataFrame is of a suitable size, you can use the method like this:

# Convert pyspark dataframe to pandas dataframe

dfPandas = df.toPandas()
print(dfPandas)
         Name PrimaryType  Index
0   Bulbasaur       Grass      1
1     Ivysaur       Grass      2
2    Venusaur       Grass      3
3  Charmeleon        Fire      5
4   Charizard        Fire      6
5   Wartortle       Water      8
6   Blastoise       Water      9

Note: pandas adds a row index (0, 1, 2, ...) to each record.
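To illustrate the row index that pandas adds, here is a small sketch. It builds a stand-in for the first three rows of the converted DataFrame directly in pandas (the name `df_pandas` is an assumption, used so that the example runs without a Spark session) and shows how a column can be used as the row label instead.

```python
import pandas as pd

# Stand-in for the result of df.toPandas() (first three rows only).
df_pandas = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur"],
    "PrimaryType": ["Grass", "Grass", "Grass"],
    "Index": [1, 2, 3],
})

# The automatically added row labels start at 0.
print(df_pandas.index.tolist())  # [0, 1, 2]

# Use the "Index" column as the row label instead.
by_number = df_pandas.set_index("Index")
print(by_number.loc[2, "Name"])  # Ivysaur
```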

Convert nested struct DataFrame

The data in a PySpark DataFrame often has a nested structure. Here is an example with a nested struct column:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create (or reuse) a Spark session for this example
spark = SparkSession.builder.appName('pyspark - example toPandas()').getOrCreate()
  
pokedex = [
    ("Bulbasaur",("Grass","Poison"),1),
    ("Ivysaur",("Grass","Poison"),2),
    ("Venusaur",("Grass","Poison"),3),
    ("Charmeleon",("Fire","Fire"),5),
    ("Charizard",("Fire","Flying"),6),
    ("Wartortle",("Water","Water"),8),
    ("Blastoise",("Water","Water"),9)
]

# "Type" is a nested struct with two string fields
schema = StructType([
    StructField('Name', StringType(), True),
    StructField('Type', StructType([
        StructField('Primary', StringType(), True),
        StructField('Secondary', StringType(), True)
    ])),
    StructField('Index', StringType(), True)
])
 
df = spark.createDataFrame(data=pokedex, schema = schema)
df.printSchema()
df.show(truncate=False)
root
 |-- Name: string (nullable = true)
 |-- Type: struct (nullable = true)
 |    |-- Primary: string (nullable = true)
 |    |-- Secondary: string (nullable = true)
 |-- Index: string (nullable = true)

+----------+---------------+-----+
|Name      |Type           |Index|
+----------+---------------+-----+
|Bulbasaur |[Grass, Poison]|1    |
|Ivysaur   |[Grass, Poison]|2    |
|Venusaur  |[Grass, Poison]|3    |
|Charmeleon|[Fire, Fire]   |5    |
|Charizard |[Fire, Flying] |6    |
|Wartortle |[Water, Water] |8    |
|Blastoise |[Water, Water] |9    |
+----------+---------------+-----+

And this is what we get when we use the toPandas() method:

dfPandas = df.toPandas()
print(dfPandas)
         Name             Type Index
0   Bulbasaur  (Grass, Poison)     1
1     Ivysaur  (Grass, Poison)     2
2    Venusaur  (Grass, Poison)     3
3  Charmeleon     (Fire, Fire)     5
4   Charizard   (Fire, Flying)     6
5   Wartortle   (Water, Water)     8
6   Blastoise   (Water, Water)     9
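As the output shows, the struct ends up in a single pandas column whose values hold both fields. If separate columns are more convenient, the column can be split on the pandas side. The sketch below assumes that each value in the Type column behaves like a tuple, as the printed output suggests. It uses a hand-built stand-in named `df_pandas` with two rows so that it runs without a Spark session.

```python
import pandas as pd

# Stand-in for the result of df.toPandas() on the nested DataFrame.
df_pandas = pd.DataFrame({
    "Name": ["Bulbasaur", "Charizard"],
    "Type": [("Grass", "Poison"), ("Fire", "Flying")],
    "Index": ["1", "6"],
})

# Expand each (Primary, Secondary) pair into two columns,
# keeping the same row labels as the original DataFrame.
type_cols = pd.DataFrame(
    df_pandas["Type"].tolist(),
    columns=["Primary", "Secondary"],
    index=df_pandas.index,
)

# Replace the nested column with the two flat columns.
flat = pd.concat([df_pandas.drop(columns="Type"), type_cols], axis=1)
print(flat)
```

The resulting DataFrame has the columns Name, Index, Primary and Secondary.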

Conclusion

In this article, you have learned how to convert a PySpark DataFrame into a pandas DataFrame using the toPandas() method.

As we have already mentioned, the toPandas() method is a very expensive operation that must be used sparingly in order to minimize the impact on the performance of our Spark applications.
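A simple way to keep the conversion cheap is to shrink the data on the Spark side first, so that only the columns and rows that are really needed reach the driver. This is a minimal sketch, not a definitive recipe: the helper name `to_pandas_reduced` and its parameters are hypothetical, while select(), limit() and toPandas() are the standard PySpark DataFrame methods.

```python
def to_pandas_reduced(df, columns, max_rows):
    """Keep only `columns` and at most `max_rows` rows, then convert to pandas.

    `df` is expected to be a pyspark.sql.DataFrame. The selection and the
    row cap are applied by Spark before anything is sent to the driver.
    """
    return df.select(*columns).limit(max_rows).toPandas()
```

For example, `to_pandas_reduced(df, ["Name", "Index"], 1000)` would bring back at most 1000 rows with only the Name and Index columns.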

Feel free to leave a comment if you need help using this feature; I would be happy to answer 🙂


By ayed_amira
