Pandas describe(): Compute Summary Statistics From Your Dataframe

By ayed_amira, on 08/18/2021

Pandas describe(): In this article, I will explain how to compute summary statistics of your dataframe using the pandas describe() function.

Introduction

A pandas dataframe is a two-dimensional, size-mutable tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both the row and the column labels.

The pandas library for Python lets you import data and quickly analyze it once it is loaded.

As its name suggests, the pandas describe() function describes your data: it returns basic summary statistics for a dataframe or a series.

The function can provide you with all this information:

  • The count of values
  • The number of unique values (non-numeric columns)
  • The top (most frequent) value (non-numeric columns)
  • The frequency of the top value (non-numeric columns)
  • The mean, standard deviation, min and max values (numeric columns)
  • The percentiles of your data, by default 25%, 50% and 75% (numeric columns)

In the rest of the tutorial, we will go through these different points:

  • Describe a dataframe (by default)
  • Describe a single column
  • Specify what percentiles to include in the summary
  • Including all columns via ‘include’

To discuss these different points, we will use the following pandas dataframe:

import pandas as pd

university = [("Harvard", 20409, 10200, 61),
              ("MIT", 18389, 12000, 55),
              ("Oxford", 2209, 1200, 23),
              ("University of Montreal", 1545, 721, 12),
              ("Stanford", 7309, 82, None)  # number of courses is missing
              ]

df = pd.DataFrame(university, columns=['Name','number_students', 'int_students', 'number_courses'])

print(df)

Output:

                     Name  number_students  int_students  number_courses
0                 Harvard            20409         10200            61.0
1                     MIT            18389         12000            55.0
2                  Oxford             2209          1200            23.0
3  University of Montreal             1545           721            12.0
4                Stanford             7309            82             NaN

Pandas describe()

Pandas describe() – Syntax and parameters

Syntax

DataFrame.describe(percentiles=None, include=None, exclude=None)

Parameters

The function can take 3 parameters which are the following:

Name         Description                                  Type                Default value   Required
percentiles  Percentiles to return, each between 0 and 1  list-like           [.25, .5, .75]  No
include      Data types to include when describing        list-like or 'all'  None            No
exclude      Data types to exclude when describing        list-like           None            No
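To make the ‘include’ and ‘exclude’ parameters more concrete, here is a minimal sketch using the dataframe defined above. Selecting by np.number is just one possible choice of data type, and the variable names numeric_summary and text_summary are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's dataframe; the missing value is written as None.
university = [("Harvard", 20409, 10200, 61),
              ("MIT", 18389, 12000, 55),
              ("Oxford", 2209, 1200, 23),
              ("University of Montreal", 1545, 721, 12),
              ("Stanford", 7309, 82, None)]
df = pd.DataFrame(university,
                  columns=["Name", "number_students", "int_students", "number_courses"])

# Keep only the numeric columns (for this dataframe, the same columns as the default).
numeric_summary = df.describe(include=[np.number])

# Leave out the numeric columns, so only the text column 'Name' is described.
text_summary = df.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

With ‘exclude’, the summary of the remaining text column contains the count, unique, top and freq rows instead of the numeric statistics.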

Pandas Describe a dataframe

By default, the describe function provides all the descriptive statistics of the numerical columns:

# Default 

print(df.describe())

Output:

       number_students  int_students  number_courses
count         5.000000      5.000000        4.000000
mean       9972.200000   4840.600000       37.750000
std        8918.336796   5763.018376       23.935678
min        1545.000000     82.000000       12.000000
25%        2209.000000    721.000000       20.250000
50%        7309.000000   1200.000000       39.000000
75%       18389.000000  10200.000000       56.500000
max       20409.000000  12000.000000       61.000000

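As you can see, the first column, ‘Name’, does not appear in the output because it contains strings rather than numbers. Note also that the count for number_courses is 4 and not 5: missing values (NaN) are ignored.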
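If you want to check where these numbers come from, the following sketch recomputes a few of them by hand on the same data. It assumes that selecting the numeric columns with select_dtypes() is a fair stand-in for what describe() does by default; the manual_* names are only illustrative.

```python
import pandas as pd

# Same data as the tutorial's dataframe; the missing value is written as None.
university = [("Harvard", 20409, 10200, 61),
              ("MIT", 18389, 12000, 55),
              ("Oxford", 2209, 1200, 23),
              ("University of Montreal", 1545, 721, 12),
              ("Stanford", 7309, 82, None)]
df = pd.DataFrame(university,
                  columns=["Name", "number_students", "int_students", "number_courses"])

summary = df.describe()

# Recompute some statistics directly on the numeric columns.
numeric = df.select_dtypes(include="number")
manual_count = numeric.count()  # counts non-missing values only
manual_mean = numeric.mean()    # NaN is skipped by default
manual_std = numeric.std()      # sample standard deviation (ddof=1)

print(manual_count)
print(manual_mean)
print(manual_std)
```

The count of 4 for number_courses comes from count() skipping the NaN in the Stanford row.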

Pandas Describe a single column

If you want to compute statistics only on a single column, you should proceed as follows:

# Describe a single column

print(df.number_students.describe())

Output:

count        5.000000
mean      9972.200000
std       8918.336796
min       1545.000000
25%       2209.000000
50%       7309.000000
75%      18389.000000
max      20409.000000
Name: number_students, dtype: float64
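The same idea works with bracket notation, with several columns at once, and with a text column. The sketch below is one way to write it on the tutorial's data; the variable names are only illustrative.

```python
import pandas as pd

# Same data as the tutorial's dataframe; the missing value is written as None.
university = [("Harvard", 20409, 10200, 61),
              ("MIT", 18389, 12000, 55),
              ("Oxford", 2209, 1200, 23),
              ("University of Montreal", 1545, 721, 12),
              ("Stanford", 7309, 82, None)]
df = pd.DataFrame(university,
                  columns=["Name", "number_students", "int_students", "number_courses"])

# Bracket notation is equivalent to df.number_students and also works
# for column names that are not valid Python identifiers.
students = df["number_students"].describe()

# A list of column names returns a dataframe restricted to those columns.
subset = df[["number_students", "number_courses"]].describe()

# A text column gets count, unique, top and freq instead of mean, std, etc.
names = df["Name"].describe()

print(students)
print(subset)
print(names)
```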

Specify what percentiles to include in the summary

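If you want to analyze your data in more detail, you can replace the default quartiles with other percentiles, such as deciles. Each value must be between 0 and 1. As the output below shows, the median (50%) is still reported:

# Modify the percentiles

print(df.describe(percentiles=[0.01, 0.1, 0.3]))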

Output:

       number_students  int_students  number_courses
count         5.000000      5.000000        4.000000
mean       9972.200000   4840.600000       37.750000
std        8918.336796   5763.018376       23.935678
min        1545.000000     82.000000       12.000000
1%         1571.560000    107.560000       12.330000
10%        1810.600000    337.600000       15.300000
30%        3229.000000    816.800000       21.900000
50%        7309.000000   1200.000000       39.000000
max       20409.000000  12000.000000       61.000000
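To see how a percentile row is obtained, you can compare it with Series.quantile(), which interpolates linearly between neighbouring values by default. This is a small sketch on the same data; the choice of the 10% and 90% percentiles is only for illustration.

```python
import pandas as pd

# Same data as the tutorial's dataframe; the missing value is written as None.
university = [("Harvard", 20409, 10200, 61),
              ("MIT", 18389, 12000, 55),
              ("Oxford", 2209, 1200, 23),
              ("University of Montreal", 1545, 721, 12),
              ("Stanford", 7309, 82, None)]
df = pd.DataFrame(university,
                  columns=["Name", "number_students", "int_students", "number_courses"])

# 10th percentile of a single column, computed directly.
p10 = df["number_students"].quantile(0.10)

# The same value appears in the '10%' row of describe().
summary = df.describe(percentiles=[0.1, 0.9])

print(p10)
print(summary.loc["10%", "number_students"])
```

For number_students, the 10th percentile falls between the two smallest values, 1545 and 2209, which is why it is not one of the values in the column.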

Including all columns via ‘include’

We saw earlier that describe() only considers numeric columns by default. The ‘include’ parameter lets you take other data types into account as well:

# including all columns 

print(df.describe(include='all'))

Output:

                          Name  number_students  int_students  number_courses
count                        5         5.000000      5.000000        4.000000
unique                       5              NaN           NaN             NaN
top     University of Montreal              NaN           NaN             NaN
freq                         1              NaN           NaN             NaN
mean                       NaN      9972.200000   4840.600000       37.750000
std                        NaN      8918.336796   5763.018376       23.935678
min                        NaN      1545.000000     82.000000       12.000000
25%                        NaN      2209.000000    721.000000       20.250000
50%                        NaN      7309.000000   1200.000000       39.000000
75%                        NaN     18389.000000  10200.000000       56.500000
max                        NaN     20409.000000  12000.000000       61.000000
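The NaN cells in this table simply mark statistics that do not apply to a column. As a sketch of one way to read the result, you can drop them column by column, or transpose the table so that each original column becomes a row. The variable names are only illustrative.

```python
import pandas as pd

# Same data as the tutorial's dataframe; the missing value is written as None.
university = [("Harvard", 20409, 10200, 61),
              ("MIT", 18389, 12000, 55),
              ("Oxford", 2209, 1200, 23),
              ("University of Montreal", 1545, 721, 12),
              ("Stanford", 7309, 82, None)]
df = pd.DataFrame(university,
                  columns=["Name", "number_students", "int_students", "number_courses"])

full = df.describe(include="all")

# Keep only the statistics that apply to each column.
name_stats = full["Name"].dropna()              # count, unique, top, freq
course_stats = full["number_courses"].dropna()  # count, mean, std, min, percentiles, max

print(name_stats)
print(course_stats)

# Transposed view: one row per original column, one column per statistic.
print(full.T)
```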

Conclusion

In this tutorial, we have seen how to use the describe() function from the pandas library. It is a simple way to get a lot of information about the data in a dataframe, which makes it a good starting point for analyzing and understanding that data.

If you have any questions about this function, don’t hesitate to tell me in comments 🙂

See you soon for new tutorials.

ayed_amira

I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)
