
Pandas for Data Science

With this note, I am launching a series of articles for beginners in Data Science and Machine Learning. We’ll start by exploring Pandas. While there are many articles on Pandas available online, I want to focus on practical techniques for using Pandas in Data Science projects and model building.

Dataset: We will use the German Credit Risk dataset from Kaggle.

The dataset contains the following columns:

  • Age
  • Sex
  • Job
  • Housing
  • Saving accounts
  • Checking account
  • Credit amount
  • Duration
  • Purpose
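
All the examples below assume the dataset has already been loaded into a DataFrame called data. A minimal loading sketch, assuming the Kaggle CSV is saved locally as german_credit_data.csv and that its first column is a row index (adjust both to your copy of the file):

import pandas as pd

# Load the German Credit Risk dataset (filename and index column are assumptions; adjust to your copy)
data = pd.read_csv('german_credit_data.csv', index_col=0)

print(data.shape)    # number of rows and columns
print(data.columns)  # the column names listed above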

Working with Jupyter Notebook

Before diving into Pandas for complete beginners, I’d like to discuss working in the Jupyter Notebook environment briefly. When you execute code in a Jupyter Notebook, the results are stored in memory. Even if you delete a cell, the data it created remains accessible until you restart the kernel.

In Jupyter Notebook, there are two main types of cells:

  1. Code: Used for writing and executing Python code.
  2. Markdown: Used for formatting text, creating headers, lists, links, and other formatted text.

Jupyter Notebook offers keyboard shortcuts to speed up your workflow. For example:

  • Shift + Enter: Execute the current cell and move to the next one.
  • Ctrl + Enter: Execute the current cell and stay in it.
  • B: Add a new cell below the current one.
  • A: Add a new cell above the current one.

Python Basics for Pandas

Now, let’s cover some useful Python principles and constructs for working with Pandas.

Loops

Loops in Python, such as for and while, allow you to perform the same operations multiple times. However, iterating over a DataFrame with explicit loops is often inefficient, especially for large datasets, and can significantly slow down code execution. For example, consider the task of lemmatizing text (reducing words to their base form):

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)

lemmas = []
for text in data['Purpose']:
    lemmas.append(lemmatize(text))
data['Lemmatized'] = lemmas

Now, the same example using the apply() method:

data['Lemmatized'] = data['Purpose'].apply(lemmatize)

Using apply() requires less code and is usually faster than an explicit Python loop, since the iteration is handled internally by Pandas.

Functions

Functions in Python are key tools for reusing code. They allow you to break down code into smaller parts, each performing a specific task.

def is used to define named functions, which can contain multiple statements and perform complex operations. Suppose we need to convert clients’ ages (column Age) into age groups:

def age_group(age):
    if age < 30:
        return 'Young'
    elif 30 <= age < 60:
        return 'Middle-aged'
    else:
        return 'Senior'

data['Age Group'] = data['Age'].apply(age_group)

lambda is used to create anonymous functions, which usually contain a single expression. For example, suppose we need to add 10% to each value in the Credit amount column:

data['Adjusted Credit Amount'] = data['Credit amount'].apply(lambda x: x * 1.1)

def is preferable when the function is complex or needs documentation, while lambda is handy for simple one-off tasks.

try-except

The try-except block helps prevent errors from interrupting program execution. Suppose we try to convert the values in the Credit amount column to integers, but some values might be incorrect (e.g., strings). We use try-except to handle possible errors:

def safe_convert_to_int(value):
    try:
        return int(value)
    except ValueError:
        return None 

data['Credit amount (int)'] = data['Credit amount'].apply(safe_convert_to_int)

In this example, the safe_convert_to_int() function tries to convert a value to an integer. If an error occurs (e.g., if the value is not a number), try-except catches the error, and the function returns None.

Pandas for Data Science

And finally, we have arrived at Pandas. Let’s quickly go over the basics.

Pandas allows you to load data from files with various delimiters (field separators). By default, pd.read_csv() assumes that the delimiter is a comma. If the data is separated by other characters, such as semicolons, tabs, or spaces, you need to specify this explicitly when loading the file:

df_semicolon = pd.read_csv('data_semicolon.csv', sep=';')

df_tab = pd.read_csv('data_tab.csv', sep='\t')

df_space = pd.read_csv('data_space.csv', sep=' ')

File encoding is crucial, especially when dealing with data containing special characters or text in other languages. If the file is not encoded in UTF-8, you need to specify its encoding:

df_iso = pd.read_csv('data_iso.csv', encoding='ISO-8859-1')
df_utf16 = pd.read_csv('data_utf16.csv', encoding='utf-16')

Data Overview Functions

After loading the data, it’s important to understand its structure and quality quickly. Pandas provides several useful functions for this:

  • .info() provides a summary of the DataFrame: the number of records, number of columns, their names, data types, and the number of non-null values.
  • .describe() outputs statistical summaries of the numeric columns: mean, standard deviation, minimum and maximum values, quartile values.
  • .head() shows the first 5 rows of the DataFrame. You can specify the number of rows to display.
  • .tail() shows the last 5 rows of the DataFrame.
  • .sample() selects random rows from the DataFrame, useful for a quick data overview.
  • .shape (an attribute, not a method) returns the dimensions of the DataFrame, i.e., the number of rows and columns.
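
A quick sketch applying these functions to the data DataFrame loaded earlier:

# General information: columns, data types, non-null counts
data.info()

# Statistics for the numeric columns
print(data.describe())

# First 10 rows, last 5 rows, and 5 random rows
print(data.head(10))
print(data.tail())
print(data.sample(5))

# Dimensions: (number of rows, number of columns)
print(data.shape)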

Data Types

Casting data types in Pandas is important for several reasons:

  • Memory Optimization: Some data types take up less memory. For example, casting columns with text values to the category type significantly reduces the memory usage compared to the object type.
  • Faster Computation: Working with numeric data types (e.g., int, float) is faster and more efficient than with the object type. This is especially important for large datasets.
  • Correctness of Operations: Type casting helps avoid errors when performing operations on data. For example, comparison or arithmetic operations might not work correctly if the data type is incorrect.

Binary variables are those that can only take two values, such as 0 and 1, True and False, or other equivalents (e.g., Yes and No):

data['Is Male'] = data['Sex'].apply(lambda x: 1 if x == 'male' else 0)

In this example, a new column Is Male is created, which contains 1 if the customer is male, and 0 if female. Binary variables are often used in modeling to indicate the presence or absence of a feature.

The category data type has several advantages over object:

  • Data Understanding: Casting to category helps explicitly indicate categorical features, which is important for analysis and modeling.
  • Reduced Memory Usage: If a column has a small number of unique values, the category type stores these values as categories rather than strings, reducing memory usage.
  • Faster Operations: Comparison, sorting, and grouping operations are faster with the category type due to its optimized internal data representation.

data['Sex'] = data['Sex'].astype('category')
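
To see the effect, you can compare the column’s memory footprint before and after the cast; a small sketch (the exact numbers depend on your data):

# Memory usage of the 'Sex' column stored as plain strings vs. as a categorical
print(data['Sex'].astype('object').memory_usage(deep=True))
print(data['Sex'].astype('category').memory_usage(deep=True))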

int and float Data Types

The int and float data types differ as follows:

  • int (integers): Used to represent whole numbers with no fractional part. Takes up less memory than float.
  • float (floating-point numbers): Used to represent numbers with a fractional part. Takes up more memory and may be less accurate due to how floating-point numbers are stored.

When to use int and float?

  • Use int if the data will always be whole numbers (e.g., the number of items, age).
  • Use float if fractional values are possible (e.g., prices, percentages).
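
If a column is already numeric, .astype() switches between these types; a small sketch (assuming the column has no missing values, since NaN cannot be stored in a plain int column):

# Whole numbers: duration in months
data['Duration'] = data['Duration'].astype('int64')

# Fractional values allowed: credit amount
data['Credit amount'] = data['Credit amount'].astype('float64')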

The pd.to_numeric() method is used to cast data to numeric types (e.g., int or float). This is useful if the data contains strings that can be converted to numbers.

data['Credit amount'] = pd.to_numeric(data['Credit amount'], errors='coerce')

If there is a value in the column that cannot be converted to a number, the method will replace it with NaN (if the errors='coerce' parameter is used). This helps to avoid errors during analysis and calculation.

Accessing Columns

In Pandas, you can access DataFrame columns in two main ways: via dot notation (.) and square brackets ([]). Both methods have their own features and limitations.

Dot Notation: This method is convenient and compact, but it has some limitations:

  • It only works for columns whose names do not contain spaces or special characters.
  • It doesn’t work if the column name matches a method or attribute of the DataFrame.

data.Age

Here, data.Age returns the Age column as a Series.

Square Brackets: This is a more universal method, since it is not restricted by the column name:

  • It supports accessing columns with any names, including those containing spaces, special characters, or matching DataFrame methods/attributes.
  • You can use it to access multiple columns at once (a list of columns is passed).

Examples:

# Accessing a single column
data['Age']

# Accessing a column with a name containing a space
data['Saving accounts']

# Accessing multiple columns
data[['Age', 'Credit amount']]

Indexing

Pandas offers several methods for indexing and selecting data from a DataFrame:

  • iloc is used for selecting data by their positions (numeric indexes). It works for both rows and columns.

Examples:

# Selecting the first row (indexing starts from 0)
data.iloc[0]

# Selecting the first five rows
data.iloc[:5]

# Selecting the value in the first row and second column
data.iloc[0, 1]

# Selecting the first three rows and the first two columns
data.iloc[:3, :2]

  • loc is used to select data by labels (names of rows and columns).

Examples:

# Selecting the row with index 0
data.loc[0]

# Selecting all rows where the value in the 'Age' column equals 30
data.loc[data['Age'] == 30]

# Selecting the value in the row with index 0 and the 'Age' column
data.loc[0, 'Age']

# Selecting the first three rows and the 'Age' and 'Credit amount' columns
data.loc[:2, ['Age', 'Credit amount']]

When using [] with a row slice (like data[0:5]), Pandas returns the rows at positions 0 through 4. When accessing by column name (like data['Age']), the specified column is returned.

Examples:

# Selecting the first five rows
data[0:5]

# Selecting the 'Age' column
data['Age']

Data Filtering

Data filtering allows you to select rows that meet certain conditions.

The .drop() method is used to delete rows or columns. You can specify the axis for deletion (axis=0 for rows, axis=1 for columns).

Examples:

# Deleting the 'Age' column
data.drop('Age', axis=1)

# Deleting rows with indexes 0, 1, and 2
data.drop([0, 1, 2], axis=0)

The .query() method allows you to filter rows based on logical conditions written as a string.

Example:

# Selecting all rows where 'Age' is greater than 30 and 'Credit amount' is less than 5000
data.query('Age > 30 & `Credit amount` < 5000')

You can use .loc[] to filter rows based on a condition:

Example:

# Selecting all rows where 'Age' is greater than 30
data.loc[data['Age'] > 30]

In Pandas, most methods, such as .drop(), .fillna(), and .replace(), have an inplace parameter which, if set to True, modifies the original DataFrame directly instead of creating a copy.

Example of using inplace:

# Deleting a column using inplace
data.drop('Age', axis=1, inplace=True)

Dangers of inplace:

  • Loss of Flexibility: Without inplace, you can keep both the original data and the new data, allowing you to compare the results or revert to the original data if something goes wrong.
  • Irreversibility of Changes: When using inplace=True, changes to the DataFrame are made in place, and if you accidentally delete or modify data, it cannot be easily undone.
  • Problems with Method Chaining: If you use multiple methods in a chain, inplace=True can cause errors or unexpected behavior, as the modified DataFrame may no longer meet the expected conditions.
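
A safer pattern is to assign the result to a new variable (or back to the same name), which keeps the original data available and makes method chaining predictable; a small sketch:

# Keep the original DataFrame intact and work on the result
data_without_age = data.drop('Age', axis=1)

# Method chaining without inplace: each step returns a new DataFrame
data_clean = (
    data
    .drop('Age', axis=1)
    .fillna({'Checking account': 'Unknown'})
    .reset_index(drop=True)
)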

Grouping and Sorting Data

The .agg() method is used to perform various aggregate functions on data grouped by one or more columns. You can apply multiple aggregate functions to several columns simultaneously.

Example of using .agg():

# Grouping data by the 'Sex' column and aggregating using multiple metrics
aggregated_data = data.groupby('Sex').agg({
    'Age': ['mean', 'min', 'max'],     # Calculate mean, minimum, and maximum for the 'Age' column
    'Credit amount': ['sum', 'mean']   # Calculate sum and mean for the 'Credit amount' column
})

As a result, you’ll get a DataFrame where the specified aggregate functions are calculated for each group (Sex) for the corresponding columns.
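
Because several functions are applied per column, the result has a two-level (MultiIndex) column header. If flat names are more convenient, the levels can be joined; a small sketch:

# Flatten the MultiIndex columns into names like 'Age_mean' or 'Credit amount_sum'
aggregated_data.columns = ['_'.join(col) for col in aggregated_data.columns]
print(aggregated_data)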


The .sort_values() method allows you to sort data by multiple columns at once, with the option to specify the sorting order for each column.

Example:

# Sorting data by 'Age' in ascending order and 'Credit amount' in descending order
sorted_data = data.sort_values(by=['Age', 'Credit amount'], ascending=[True, False])

Here, the rows will be sorted first by ascending age and then by descending credit amount.

Categorization

The .map() method allows you to replace values in a column based on a provided mapping (dictionary) or function.

Examples:

# Replacing values in the 'Sex' column with numerical values
data['Sex'] = data['Sex'].map({'male': 1, 'female': 0})

# Transforming values in the 'Age' column using a function
data['Age group'] = data['Age'].map(lambda x: 'young' if x < 30 else 'adult')

The .cut() method divides numerical data into intervals, creating a new categorical variable.

Example:

# Dividing age into three groups
data['Age group'] = pd.cut(data['Age'], bins=[0, 30, 50, 100], labels=['young', 'middle-aged', 'senior'])

Here, age is divided into three categories: young (0 to 30 years), middle-aged (30 to 50 years), and senior (50 to 100 years).


Categorization by String Matching

Sometimes, you need to create a new category based on the presence of a specific word in a string. This can be done using Pandas methods.

Example:

# Creating a new column that indicates whether the 'Purpose' column contains the word 'car'
data['Car related'] = data['Purpose'].apply(lambda x: 'car' in x.lower())

In this example, the Car related column will have True if the word “car” appears in the Purpose column, and False otherwise.
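
The same check can be done without apply() using Pandas’ vectorized string methods, which is usually preferable; a small sketch (na=False guards against possible missing values in the column):

# Vectorized alternative to the apply() version above
data['Car related'] = data['Purpose'].str.contains('car', case=False, na=False)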

Conditional Data Replacement

The np.where() function allows you to replace values based on a condition.

Example:

import numpy as np

# Replacing values in the 'Age' column with 'young' if age is less than 30, otherwise 'adult'
data['Age category'] = np.where(data['Age'] < 30, 'young', 'adult')

The np.select() function allows you to set multiple conditions for replacing values.

Example:

conditions = [
    (data['Age'] < 30),
    (data['Age'] >= 30) & (data['Age'] < 50),
    (data['Age'] >= 50)
]

choices = ['young', 'middle-aged', 'senior']

# Assigning values based on which condition is met
data['Age group'] = np.select(conditions, choices, default='unknown')

Here, np.select() checks the conditions and replaces values in the Age group column based on which condition is met.

You can use a standard if-else expression within a function and apply it to the data using the .apply() method.

Example:

# Defining a function
def age_category(age):
    if age < 30:
        return 'young'
    elif age < 50:
        return 'middle-aged'
    else:
        return 'senior'

# Applying the function to the 'Age' column
data['Age group'] = data['Age'].apply(age_category)

The .where() method (called on a Series or DataFrame) leaves values unchanged where the condition is True and replaces them where the condition is False.

Example:

# Replacing all values in the 'Age' column where age is greater than 30 with '30+'
data['Age'] = data['Age'].where(data['Age'] <= 30, '30+')

Here, all values in the Age column where age is greater than 30 will be replaced with the string ’30+’ (for these rows the condition Age <= 30 is False), while the other values remain unchanged.

Dealing with Bad Data

When analyzing data, situations often arise where certain values are anomalous or erroneous. For example, in one analysis of a related dataset, the average work experience of two categories of clients turned out to significantly exceed plausible values. This raises a valid question: how can such data be processed correctly to improve the quality of the analysis and make the results more reliable?

Causes of Data Anomalies:

  • Data Entry Errors: Mistakes during data entry may result in incorrect values. For example, 50 years of work experience might have been recorded instead of 5 years.
  • Validation Rule Violations: If the data collection system lacks adequate checks, incorrect values may appear in the final dataset.
  • Rare Events: An anomaly may reflect a real but rare situation. For instance, some clients may have a long work experience, though there are few cases.
  • System Failures: System errors during data collection can lead to incorrect values.

Approaches to Handling Anomalous or Erroneous Data:

  • Removing Anomalous Values: If a value is clearly incorrect and cannot be corrected, it can be removed from the dataset. However, this method may lead to loss of important information.
  • Replacing Values with Mean or Median: If anomalous values significantly distort the overall picture, they can be replaced with the mean or median values for the relevant category.
  • Using Machine Learning Methods: Some algorithms can automatically detect and handle anomalies, leaving only meaningful data for analysis.
  • Correcting Values Based on Business Rules: For example, if it’s known that the maximum work experience in an industry does not exceed 40 years, all values exceeding this limit can be replaced with 40.
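
For the last approach, the .clip() method caps values at a given limit; an illustrative sketch, assuming a business rule that loan duration should not exceed 72 months (both the column and the limit are chosen purely for illustration):

# Cap values at the business limit: anything above 72 becomes 72
data['Duration'] = data['Duration'].clip(upper=72)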

Practical Application:

In practice, combined methods are often used. First, a detailed analysis of anomalous data is conducted to investigate possible causes and their impact on the final result. Depending on the context, a decision is made on removing, correcting, or retaining such data.

If the anomaly is linked to a real but rare situation, it may be kept in the data but flagged as an exception. The data are corrected or removed if the anomaly is due to an input error or system failure.

Example of Identifying and Handling Anomalous Values:

Anomalous values can be identified using descriptive statistics. For example, values that exceed 3 standard deviations from the mean can be considered anomalous.

import numpy as np

# Calculating the mean and standard deviation
mean_duration = data['Duration'].mean()
std_duration = data['Duration'].std()

# Defining boundaries for anomalous values
upper_bound = mean_duration + 3 * std_duration
lower_bound = mean_duration - 3 * std_duration

# Filtering anomalous values
anomalies = data[(data['Duration'] > upper_bound) | (data['Duration'] < lower_bound)]

After identifying anomalies, they can be handled in various ways, such as replacing them with the median value.

# Replacing anomalous values with the median value
median_duration = data['Duration'].median()
data['Duration'] = np.where((data['Duration'] > upper_bound) | (data['Duration'] < lower_bound),
                            median_duration, data['Duration'])

Handling Missing Data

Missing data can occur for various reasons: errors during data collection, lack of information, or technical failures. It is important to consider that missing values can significantly impact analysis results, so they must be handled correctly. Missing values should be replaced if:

  • They are significant in number: If there are many missing values in a column, ignoring them might skew the analysis.
  • They are in key variables: If the missing values are in variables critical to the analysis, they should be replaced to maintain data integrity.
  • The data is needed for further modeling. In machine learning, complete data is often necessary, so missing values must be filled.

Methods for Replacing Missing Values depend on the data type and context:

  • Replacing with the Mean or Median: Used for numerical data; the mean suits roughly symmetric distributions, while the median is more robust when the data contain outliers.
  • Replacing with Mode (Most Frequent Value): Suitable for categorical data.
  • Replacing Based on Data from Other Columns: For example, a missing salary could be filled in based on the job title listed in another column.
  • Using Models to Predict Missing Values: A more complex method where missing values are filled using predictive models.
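
For the last, model-based approach, one option is scikit-learn’s KNNImputer, which fills a gap using the most similar rows; a minimal sketch, assuming scikit-learn is installed and applied only to numeric columns (in this dataset the gaps are in categorical columns, so this is purely illustrative):

from sklearn.impute import KNNImputer

# Fill missing numeric values using the 5 nearest neighbours in feature space
numeric_cols = ['Age', 'Credit amount', 'Duration']
imputer = KNNImputer(n_neighbors=5)
data[numeric_cols] = imputer.fit_transform(data[numeric_cols])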

Pandas Commands for Replacing Missing Values:

data.fillna(value)  # Replaces all missing values in the DataFrame with the specified value.

data['column'].fillna(data['column'].mean(), inplace=True)  # Replaces missing values in a specific column with the mean.

data['column'].fillna(data['column'].median(), inplace=True)  # Replaces missing values with the median.

data.fillna(method='ffill')  # Fills missing values with the previous value (newer Pandas versions also offer data.ffill()).

data.fillna(method='bfill')  # Fills missing values with the next value (newer Pandas versions also offer data.bfill()).

To count the missing values in each column:

missing_values = data.isnull().sum()
print(missing_values)

After running the code, we see that the “Saving accounts” and “Checking account” columns contain missing values. Let’s explore different approaches to handling the missing values in the “Checking account” column.

Replacing with the Most Frequent Value (Mode): If one category is more common than others, the missing values can be replaced with it.

# Replacing missing values with the most frequent value
most_frequent = data['Checking account'].mode()[0]
data['Checking account'].fillna(most_frequent, inplace=True)

Replacing with a Special Value: Missing values can be replaced with a specific value, such as “Unknown” or “No account,” if that makes sense in the data context.

# Replacing missing values with 'Unknown'
data['Checking account'].fillna('Unknown', inplace=True)

Replacing Based on Data from Other Columns: If there are other columns in the dataset that may correlate with the presence or type of account, these data can be used to fill the missing values.

data['Checking account'] = data.groupby('Job')['Checking account'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

Handling Duplicates in Data

Duplicates in data are records that are entirely or partially repeated. The presence of duplicates can distort the results of analysis, so it is important to identify and remove them in a timely manner.

In Pandas, the duplicated() method is used to find duplicates. It returns a boolean array (True or False) for each row, indicating whether that row is a duplicate.

data.duplicated() — by default (equivalent to keep='first') returns True for repeated rows, excluding the first occurrence of each.

data.duplicated(keep='first') — marks all repeated rows as duplicates except for the first occurrence.

data.duplicated(keep='last') — marks all repeated rows as duplicates except for the last occurrence.

data.duplicated(keep=False) — marks all repeated rows as duplicates, including both the first and last occurrences.

duplicates = data[data.duplicated()]
print(f'Number of duplicates: {duplicates.shape[0]}')

# Viewing duplicates
print(duplicates)

After identifying duplicates, you can choose from several methods to handle them:

Removing Duplicates: If duplicates do not contain unique information, they can be removed.

# Removing all duplicates, keeping the first occurrence
data = data.drop_duplicates(keep='first')

# Checking the number of rows after removing duplicates
print(f'Number of rows after removing duplicates: {data.shape[0]}')

Aggregating Data: If duplicates have different values in some fields, they can be combined, for example, by taking the mean or sum.

# Example of aggregation: group by the identifying columns and take the mean of the numerical ones
df_aggregated = data.groupby(['Age', 'Sex', 'Job', 'Housing', 'Purpose'], as_index=False)[['Credit amount', 'Duration']].mean()

Marking and Further Analysis: In some cases, duplicates can be retained but flagged for consideration in further analysis.

# Adding a flag for duplicates
data['is_duplicate'] = data.duplicated(keep=False)

Duplicates can occur for various reasons:

  • Lack of Unique Identifiers: If the data lacks unique identifiers (e.g., customer IDs), the system may not recognize records as unique, even if they represent different people or events.
  • Data Collection Errors: For example, if the same client was entered into the system twice by mistake.
  • Merging Multiple Data Sources: The same records may appear multiple times when combining data from different sources.
  • Repeated Data Loading: Sometimes, data may be loaded again, resulting in duplicates.

Conclusion

In this note, I aimed to cover various use cases of Pandas in Data Science projects, providing practical techniques and examples. All of this will be useful in upcoming articles in the series designed for beginners entering the fields of Data Science and Machine Learning.

If you liked the article, subscribe to my Telegram channel: https://t.me/renat_alimbekov

