Pandas describe(): Compute Summary Statistics From Your Dataframe

In this article, I will explain how to compute summary statistics for your dataframe using the pandas describe() function.
Introduction
A pandas dataframe is a two-dimensional, size-mutable tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both the row and column labels.
The pandas library for Python lets you import data and quickly analyze it once it is loaded.
As its name suggests, the pandas describe() function describes your data: it returns basic descriptive statistics for a dataframe or a series.
Depending on the data type of each column, the function can report the following:
- The count of values
- The number of unique values
- The top (most frequent) value
- The frequency of your top value
- The mean, standard deviation, min and max values
- The percentiles of your data (by default 25%, 50% and 75%)
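
Which of these statistics you get depends on the type of data being described. As a minimal sketch, using two small made-up series (illustrative values, not the dataset used in this tutorial), numeric data returns the mean, standard deviation, min, max and percentiles, while text data returns the unique, top and freq entries:

```python
import pandas as pd

# Two small illustrative Series (made-up values, not the tutorial's dataset)
numbers = pd.Series([1, 2, 3, 4])
letters = pd.Series(["a", "b", "a", "c"])

numeric_summary = numbers.describe()
text_summary = letters.describe()

# Numeric data: count, mean, std, min, quartiles, max
print(list(numeric_summary.index))

# Text data: count, unique, top, freq
print(list(text_summary.index))
```

Here "a" is the top value because it appears twice, so its freq is 2.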

In the rest of the tutorial, we will go through these different points:
- Describe a dataframe (by default)
- Describe a single column
- Specify what percentiles to include in the summary
- Including all columns via ‘include’
To discuss these different points, we will use the following pandas dataframe:
import pandas as pd

# Each tuple is one row: (Name, number_students, int_students, number_courses)
university = [("Harvard", 20409, 10200, 61),
              ("MIT", 18389, 12000, 55),
              ("Oxford", 2209, 1200, 23),
              ("University of Montreal", 1545, 721, 12),
              ("Stanford", 7309, 82, None)  # number_courses is missing and becomes NaN
              ]
df = pd.DataFrame(university, columns=['Name', 'number_students', 'int_students', 'number_courses'])
print(df)
Output:
                     Name  number_students  int_students  number_courses
0                 Harvard            20409         10200            61.0
1                     MIT            18389         12000            55.0
2                  Oxford             2209          1200            23.0
3  University of Montreal             1545           721            12.0
4                Stanford             7309            82             NaN
Pandas describe()
Pandas describe() – Syntax and parameters
Syntax
#Syntax
DataFrame.describe(percentiles=None,
include=None,
exclude=None)
Parameters
The function takes three optional parameters:
Name | Description | Type | Default Value | Required |
---|---|---|---|---|
percentiles | numbers between 0 and 1 giving the percentiles to include in the output | list-like | None (uses [.25, .5, .75]) | No |
include | data types to include in the summary, or 'all' to describe every column | 'all' or list-like of dtypes | None | No |
exclude | data types to leave out of the summary | list-like of dtypes | None | No |
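
As a rough illustration of how include and exclude filter by data type, here is a small sketch on a shortened copy of the dataframe (three rows and two columns only, to keep it brief). Using np.number as the dtype selector is one common choice, not the only one:

```python
import numpy as np
import pandas as pd

# Shortened copy of the tutorial's dataframe, for illustration only
df = pd.DataFrame(
    {
        "Name": ["Harvard", "MIT", "Oxford"],
        "number_students": [20409, 18389, 2209],
    }
)

# Keep only the numeric columns (the same selection as the default here)
numeric_only = df.describe(include=[np.number])

# Drop the numeric columns, so only the text column "Name" is summarized
non_numeric = df.describe(exclude=[np.number])

print(numeric_only.columns.tolist())
print(non_numeric.columns.tolist())
```

The second summary contains the count, unique, top and freq rows, since those are the statistics that apply to text.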
Pandas Describe a dataframe
By default, the describe function provides all the descriptive statistics of the numerical columns:
# Default
print(df.describe())
Output:
       number_students  int_students  number_courses
count         5.000000      5.000000        4.000000
mean       9972.200000   4840.600000       37.750000
std        8918.336796   5763.018376       23.935678
min        1545.000000     82.000000       12.000000
25%        2209.000000    721.000000       20.250000
50%        7309.000000   1200.000000       39.000000
75%       18389.000000  10200.000000       56.500000
max       20409.000000  12000.000000       61.000000
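As you can see, the first column ‘Name’ does not appear in the output of the function, since it contains strings rather than numbers. Note also that the count for number_courses is 4 and not 5: missing values (NaN) are not counted.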
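
To check in advance which columns the default call will summarize, and how many values are actually counted, something like the following can help. It is a sketch that rebuilds the same dataframe, with the missing Stanford value written explicitly as None:

```python
import pandas as pd

university = [
    ("Harvard", 20409, 10200, 61),
    ("MIT", 18389, 12000, 55),
    ("Oxford", 2209, 1200, 23),
    ("University of Montreal", 1545, 721, 12),
    ("Stanford", 7309, 82, None),  # number_courses is missing
]
df = pd.DataFrame(
    university,
    columns=["Name", "number_students", "int_students", "number_courses"],
)

# Columns that describe() summarizes by default
numeric_columns = df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)

# count ignores missing values, hence 4 instead of 5 for number_courses
courses_count = df["number_courses"].count()
courses_missing = df["number_courses"].isna().sum()
print(courses_count, courses_missing)
```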
Pandas Describe a single column
If you want to compute statistics only on a single column, you should proceed as follows:
# Describe a single column
print(df.number_students.describe())
Output:
count        5.000000
mean      9972.200000
std       8918.336796
min       1545.000000
25%       2209.000000
50%       7309.000000
75%      18389.000000
max      20409.000000
Name: number_students, dtype: float64
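
The result for a single column is itself a pandas Series indexed by statistic name, so individual values can be read from it. A minimal sketch, using bracket notation for the column (which also works for column names containing spaces) and only the number_students values from the dataframe above:

```python
import pandas as pd

# Only the number_students column of the tutorial's dataframe
df = pd.DataFrame({"number_students": [20409, 18389, 2209, 1545, 7309]})

# Bracket notation is equivalent to df.number_students here
summary = df["number_students"].describe()

# Read single statistics from the resulting Series by label
mean_students = summary["mean"]
max_students = summary["max"]
print(mean_students, max_students)
```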
Specify what percentiles to include in the summary
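If you want to analyze the distribution of your data in more detail, you can replace the default quartiles with any percentiles you like (deciles, for example) through the percentiles parameter. Each value must lie between 0 and 1. In the output below, the median (50%) is still reported even though it was not requested:
# Modify the percentiles
print(df.describe(percentiles=[0.01, 0.1, 0.3]))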
Output:
       number_students  int_students  number_courses
count         5.000000      5.000000        4.000000
mean       9972.200000   4840.600000       37.750000
std        8918.336796   5763.018376       23.935678
min        1545.000000     82.000000       12.000000
1%         1571.560000    107.560000       12.330000
10%        1810.600000    337.600000       15.300000
30%        3229.000000    816.800000       21.900000
50%        7309.000000   1200.000000       39.000000
max       20409.000000  12000.000000       61.000000
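
The percentile rows can be cross-checked against the quantile() method, which uses linear interpolation by default. The sketch below does this for the 10% row of number_students; the variable names are only illustrative:

```python
import pandas as pd

students = pd.Series([20409, 18389, 2209, 1545, 7309], name="number_students")

summary = students.describe(percentiles=[0.01, 0.1, 0.3])

# The 10% row should match Series.quantile(0.1)
p10_from_describe = summary["10%"]
p10_from_quantile = students.quantile(0.1)

print(p10_from_describe, p10_from_quantile)
```

Both values are 1810.6: the 10th percentile falls 40% of the way between the two smallest values, 1545 and 2209.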
Including all columns via ‘include’
As we saw earlier, the describe() function summarizes only the numeric columns by default. With the ‘include’ parameter, other data types can be taken into account as well. Statistics that do not apply to a column's type are shown as NaN:
# including all columns
print(df.describe(include='all'))
Output:
                          Name  number_students  int_students  number_courses
count                        5         5.000000      5.000000        4.000000
unique                       5              NaN           NaN             NaN
top     University of Montreal              NaN           NaN             NaN
freq                         1              NaN           NaN             NaN
mean                       NaN      9972.200000   4840.600000       37.750000
std                        NaN      8918.336796   5763.018376       23.935678
min                        NaN      1545.000000     82.000000       12.000000
25%                        NaN      2209.000000    721.000000       20.250000
50%                        NaN      7309.000000   1200.000000       39.000000
75%                        NaN     18389.000000  10200.000000       56.500000
max                        NaN     20409.000000  12000.000000       61.000000
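
Because the combined summary is a regular dataframe, single cells can be read with .loc. The following sketch, on a shortened three-row copy of the data (for illustration only), shows where the NaN entries come from:

```python
import pandas as pd

# Shortened copy of the tutorial's dataframe, for illustration only
df = pd.DataFrame(
    {
        "Name": ["Harvard", "MIT", "Oxford"],
        "number_students": [20409, 18389, 2209],
    }
)

summary = df.describe(include="all")

# A text statistic on the text column
unique_names = summary.loc["unique", "Name"]
print(unique_names)

# A numeric statistic is not defined for a text column, so it is NaN
print(pd.isna(summary.loc["mean", "Name"]))

# And the reverse: a text statistic on a numeric column is NaN as well
print(pd.isna(summary.loc["unique", "number_students"]))
```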
Conclusion
Throughout this tutorial, we have seen how to use the describe() function from the pandas library. It is a simple way to get a lot of information about the data in a dataframe, which makes it a good starting point for analyzing and understanding that data.
If you have any questions about this function, feel free to ask in the comments 🙂
See you soon for new tutorials.