What is cohort analysis?
Cohort analysis consists in studying the characteristics of cohorts / vintages / generations, united by common temporal characteristics..
A cohort/vintage/generation is a group formed in a specific way based on time: for example, the month of registration, the month of the first transaction, or the first visit to the site. Cohorts are very similar to segments, with the difference that a cohort includes groups of a certain period of time, while a segment can be based on any other characteristics.
Why is it valuable?
This kind of analysis can be helpful when it comes to understanding the health of your business and the stickiness of your customers. Stickiness is critical, as it is much cheaper and easier to retain a customer than it is to acquire new ones. Also, your product evolves over time. New features are added and removed, design changes, etc. Observing individual groups over time is the starting point for understanding how these changes affect user/group behavior.
The problem
For this task, we use the public dataset https://www.kaggle.com/olistbr/brazilian-ecommerce and the olist_orders_dataset.csv and olist_order_payments_dataset.csv files. You can connect them by order_id
.
Answer two questions:
- How many orders on average and how much do all cohorts make in the first year
- Compare any two cohorts by profit and orders
Implementing Cohort Analysis in Python
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker
import matplotlib as mpl
pd.set_option('max_columns', 50)
mpl.rcParams['lines.linewidth'] = 2
%matplotlib inline
sns.set_context(
"notebook",
font_scale = 1.5,
rc = {
"figure.figsize" : (30, 30),
"axes.titlesize" : 18
}
)
df_orders = pd.read_csv('data/olist_orders_dataset.csv')
df_payments = pd.read_csv('data/olist_order_payments_dataset.csv')
We import all the necessary libraries and read the files.
We will build cohorts on the date of purchase (order_purchase_timestamp
), so we will convert it to datetime64. Let’s create an order_id
and connect our datasets by it.
df_orders['order_purchase_timestamp'] = pd.to_datetime(df_orders['order_purchase_timestamp'])
df_payments = df_payments.set_index('order_id')
df_orders = df_orders.set_index('order_id')
order_payment = df_orders.join(df_payments)
Since we are doing monthly cohorts, we will be looking at the overall monthly behavior of our clients. So we don’t need granular order_purchase_timestamp
.
order_payment.reset_index(inplace=True)
order_payment['Period'] = order_payment.order_purchase_timestamp.dt.strftime('%Y-%m')
Let’s create a new column, CohortGroup
, which is the year and month when the order was first delivered to the customer.
order_payment.set_index('customer_id', inplace=True)
order_payment['CohortGroup'] = order_payment.groupby(level=0)['order_purchase_timestamp'].min().dt.strftime('%Y-%m')#.dt.apply(lambda x: x.strftime('%Y-%m'))
order_payment.reset_index(inplace=True)
Considering monthly cohorts, we need to aggregate customers, purchases and purchase amounts for each CohortGroup
within a month (Period
).
grouped = order_payment.groupby(['CohortGroup', 'Period'])
cohorts = grouped.agg({'customer_id': pd.Series.nunique,
'payment_value': 'sum',
'order_id': 'count'})
cohorts.rename(columns={'customer_id': 'TotalClients',
'order_id': 'TotalOrders'}, inplace=True)
We want to see how each cohort behaved in the months following their first purchase, so we will need to index each cohort before their first month. For example, CohortPeriod = 1 would be the first month of the cohort, CohortPeriod = 2 would be their second, and so on. This allows us to compare cohorts at different stages of their lives.
def cohort_period(df):
df['CohortPeriod'] = np.arange(len(df)) + 1
return df
cohorts = cohorts.groupby(level=0).apply(cohort_period)
print(cohorts.head(10))
print(cohorts.tail(12))
Unfortunately, this dataset contains only the first month of the cohort. Based on these results, we can already compare cohorts. For example, the 2018-06 and 2018-07 cohort: The total number of customers and purchases has increased. But, let’s count the purchases per customer:
- 2018-06 – 6419/6167 = 1.04
- 2018-07 – 6507/6292 = 1.03
Conclusion: on average, customers began to buy less often. For homework, write code that counts purchases per customer.
Now let’s calculate the average number of orders, customers and the average amount for each year.
cohorts = cohorts.reset_index()
cohorts['CohortGroupYear'] = cohorts['CohortGroup'].apply(lambda x: x.split('-')[0])
tt = cohorts.groupby('CohortGroupYear').agg({'payment_value': 'mean','TotalOrders':'mean','TotalClients':'mean'})
tt['ratio'] = tt['TotalOrders'] / tt['TotalClients']
print(tt)
For 2016, unfortunately, we have only 3 months with very strange values, and for 2018, only 10 months. We are already seeing growth in 2018 compared to 2017, but purchases per customer are declining.
Conclusion
Analyzed cohort analysis. It allows us to compare different groups of customers united by some criteria over time and understand how different versions of the product affect it.