Data Science, ML and Analytics Engineering

Pandas for Data Science

With this note, I am launching a series of articles for beginners in Data Science and Machine Learning. We’ll start by exploring Pandas. While there are many articles on Pandas available online, I want to focus on practical techniques for using Pandas in Data Science projects and model building.

Dataset: We will use the German Credit Risk dataset from Kaggle.

The dataset contains the following credit data fields:

  • Age
  • Sex
  • Job
  • Housing
  • Saving accounts
  • Checking account
  • Credit amount
  • Duration
  • Purpose
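
As a minimal sketch of a first look at such a table, the snippet below builds a tiny frame with the columns listed above and runs two common checks: missing values per column and the average credit amount per purpose. The three rows are hypothetical placeholders, not records from the Kaggle dataset; with the real file you would load it with `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical rows that only mimic the column layout listed above.
df = pd.DataFrame({
    "Age": [30, 45, 27],
    "Sex": ["male", "female", "male"],
    "Job": [2, 1, 2],
    "Housing": ["own", "rent", "own"],
    "Saving accounts": [None, "little", "moderate"],
    "Checking account": ["little", None, "moderate"],
    "Credit amount": [1000, 3000, 2000],
    "Duration": [6, 24, 12],
    "Purpose": ["car", "car", "education"],
})

# Count missing values in every column.
missing = df.isna().sum()

# Average credit amount for each credit purpose.
mean_by_purpose = df.groupby("Purpose")["Credit amount"].mean()

print(missing)
print(mean_by_purpose)
```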

Calculating Monthly Recurring Revenue (MRR) in Python

What is Monthly Recurring Revenue?

Monthly Recurring Revenue (MRR) is regular monthly income. The metric is used primarily in subscription models. Income from plans longer than a month must be normalized to a monthly amount.

Why is it valuable?

If we run a subscription service with regular or periodic payments, MRR shows how much money we can expect to earn and how effective the business is. We can then increase MRR by switching customers to a more expensive plan, or protect it by reducing customer churn.

The problem

For this task, use the new dataset: https://alimbekov.com/wp-content/uploads/2021/03/mrr.csv

Structure:

  • customer_id – customer ID
  • first_order – subscription start date
  • EndDate – subscription end date
  • rate – subscription plan (monthly, semi-annual, annual)
  • Amount – amount paid
  • commission – payment system commission
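
One way to bring payments for different plans to a monthly basis is to divide `Amount` by the plan length in months. The sketch below assumes the `rate` column holds the labels "monthly", "semi-annual" and "annual"; the actual values in the file may differ, and the rows shown are hypothetical.

```python
import pandas as pd

# Assumed plan lengths in months; the label spelling is a guess.
MONTHS_PER_PLAN = {"monthly": 1, "semi-annual": 6, "annual": 12}

# Hypothetical payments using the column names from the structure above.
payments = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "rate": ["monthly", "semi-annual", "annual"],
    "Amount": [10.0, 54.0, 96.0],
})

# Spread each payment evenly over the months the plan covers.
payments["months"] = payments["rate"].map(MONTHS_PER_PLAN)
payments["monthly_amount"] = payments["Amount"] / payments["months"]

print(payments)
```

Whether the commission should be subtracted before dividing depends on how revenue is reported, so it is left out here.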

We will use the following formula to calculate MRR: MRR = new + old + expansion + reactivation – churn – contraction

  1. new MRR – the first payment of a new customer
  2. old MRR – a recurring payment from an existing customer
  3. expansion MRR – the increase in MRR when a customer moves to a more expensive plan
  4. contraction MRR – the decrease in MRR when a customer moves to a cheaper plan
  5. churn MRR – the MRR lost when a customer stops paying
  6. reactivation MRR – the MRR regained when a churned customer returns
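
The formula itself is plain arithmetic once the six components are known. A minimal sketch, with a hypothetical helper name and made-up component values for a single month:

```python
def net_mrr(new, old, expansion, reactivation, churn, contraction):
    """Combine the six MRR components using the formula above."""
    return new + old + expansion + reactivation - churn - contraction


# Hypothetical component values for one month, for illustration only.
components = {
    "new": 500,
    "old": 4000,
    "expansion": 300,
    "reactivation": 100,
    "churn": 400,
    "contraction": 150,
}

mrr = net_mrr(**components)
print(mrr)
```

The harder part in practice is assigning each customer's payment to the right component, which this sketch does not attempt.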

Cohort Analysis in Python

What is cohort analysis?

Cohort analysis studies the characteristics of cohorts (also called vintages or generations): groups united by a common time-based attribute.

A cohort/vintage/generation is a group formed in a specific way based on time: for example, the month of registration, the month of the first transaction, or the first visit to the site. Cohorts are very similar to segments. The difference is that a cohort is always defined by a period of time, while a segment can be based on any other characteristic.
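
As an illustration of "month of the first transaction" cohorts, the sketch below labels each order with its customer's cohort and counts how many distinct customers from each cohort are active in later months. The column names `customer_id` and `order_date` and all rows are assumptions made for the example.

```python
import pandas as pd

# Hypothetical transactions; names and values are illustrative only.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2021-01-05", "2021-02-10", "2021-01-20", "2021-03-02", "2021-02-14",
    ]),
})

# Date of each customer's first order, repeated on every row of that customer.
first_order = tx.groupby("customer_id")["order_date"].transform("min")

# The cohort is the month of the first order.
tx["cohort"] = first_order.dt.strftime("%Y-%m")

# Number of whole calendar months between the first order and this order.
tx["period"] = (
    (tx["order_date"].dt.year - first_order.dt.year) * 12
    + (tx["order_date"].dt.month - first_order.dt.month)
)

# Rows are cohorts, columns are months since the first order,
# values are distinct active customers.
cohort_table = (
    tx.groupby(["cohort", "period"])["customer_id"]
    .nunique()
    .unstack(fill_value=0)
)

print(cohort_table)
```

Dividing each row by its value in column 0 would turn these counts into retention rates.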

Why is it valuable?

This kind of analysis helps you understand the health of your business and the stickiness of your customers. Stickiness is critical, because it is much cheaper and easier to retain a customer than to acquire a new one. Your product also evolves over time: features are added and removed, the design changes, and so on. Observing individual groups over time is the starting point for understanding how these changes affect user and group behavior.

RFM analysis in Python

The problem

Perform an RFM analysis. It divides users into segments by how recently they last paid (Recency), how often they pay (Frequency), and the total amount of their payments (Monetary).

  • Recency – the difference between the current date and the date of the last payment
  • Frequency – the number of transactions
  • Monetary – the total amount of purchases

These three indicators must be calculated separately for each customer. Each indicator then gets a score from 1 to 3 or from 1 to 5. The wider the scale, the narrower the segments we get.

Scores can be assigned using quantiles: we sort the data by one of the criteria and divide it into equal-sized groups.
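
A minimal sketch of this scoring on a 1 to 3 scale, using `pd.qcut`. The payment rows, the column names `paid_at` and `amount`, the fixed "current date" and the helper `score` are all assumptions made for the example. Values are ranked before binning so that ties do not produce duplicate bin edges; the side effect is that customers with equal values can land in different groups.

```python
import pandas as pd

# Hypothetical payments for six customers.
payments = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4, 5, 6, 6],
    "paid_at": pd.to_datetime([
        "2021-01-10", "2021-03-30", "2021-03-15",
        "2021-01-05", "2021-02-05", "2021-03-05",
        "2020-12-01", "2021-02-20", "2021-03-01", "2021-03-25",
    ]),
    "amount": [30.0, 50.0, 20.0, 100.0, 100.0, 100.0, 10.0, 60.0, 40.0, 40.0],
})
today = pd.Timestamp("2021-04-01")  # assumed "current date"

# One row per customer with the three raw indicators.
rfm = payments.groupby("customer_id").agg(
    last_payment=("paid_at", "max"),
    frequency=("paid_at", "count"),
    monetary=("amount", "sum"),
)
rfm["recency"] = (today - rfm["last_payment"]).dt.days


def score(values, labels):
    """Split values into equal-sized quantile groups and label them."""
    ranked = values.rank(method="first")  # break ties so bin edges are unique
    return pd.qcut(ranked, q=len(labels), labels=labels).astype(int)


# A small recency is good, so its labels run in reverse.
rfm["r_score"] = score(rfm["recency"], [3, 2, 1])
rfm["f_score"] = score(rfm["frequency"], [1, 2, 3])
rfm["m_score"] = score(rfm["monetary"], [1, 2, 3])

# Concatenate the three scores into a segment code.
rfm["segment"] = (
    rfm["r_score"].astype(str)
    + rfm["f_score"].astype(str)
    + rfm["m_score"].astype(str)
)

print(rfm[["recency", "frequency", "monetary", "segment"]])
```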

For this task, we use the public dataset https://www.kaggle.com/olistbr/brazilian-ecommerce and its olist_orders_dataset.csv and olist_order_payments_dataset.csv files. You can join them on order_id.
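
The join on `order_id` could look roughly like the sketch below. To keep it self-contained, two tiny hypothetical frames stand in for the CSV files. Apart from `order_id`, the column names (`customer_id`, `order_purchase_timestamp`, `payment_value`) are assumptions about the files and should be checked against the actual headers.

```python
import pandas as pd

# Stand-in for pd.read_csv("olist_orders_dataset.csv"); rows are hypothetical.
orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "customer_id": ["c1", "c2", "c1"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-03", "2018-01-15", "2018-02-20"]
    ),
})

# Stand-in for pd.read_csv("olist_order_payments_dataset.csv").
# One order may have several payment rows.
payments = pd.DataFrame({
    "order_id": ["a", "a", "b", "c"],
    "payment_value": [10.0, 5.0, 20.0, 7.5],
})

# Attach order details to every payment row.
merged = orders.merge(payments, on="order_id", how="inner")

# Total paid per customer, ready to be used as the Monetary indicator.
total_paid = merged.groupby("customer_id")["payment_value"].sum()

print(merged)
print(total_paid)
```

Because an order can carry more than one payment row, the merged frame has one row per payment, so order counts for Frequency should use distinct `order_id` values.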
