Leveraging Data and Feature Engineering for Credit Risk Assessment: A Comprehensive Analysis

--

Introduction

Credit risk assessment plays a pivotal role in financial institutions, enabling them to evaluate the creditworthiness of potential borrowers and minimize the likelihood of default. In this project, we embark on a journey to develop a predictive model that can effectively identify individuals at a higher risk of defaulting on their credit obligations. Leveraging advanced machine learning algorithms and rigorous feature engineering, we aim to enhance the accuracy and reliability of our credit risk assessment framework.

Data Reading and Preprocessing

Our journey begins with the acquisition of raw data comprising information about borrowers’ credit profiles. We utilize the pandas library in Python to read the data from CSV files and perform initial data exploration. The preprocessing stage involves handling missing values, duplications, and outlier detection.

import pandas as pd

# Read the training and test data
df_train = pd.read_csv("cs-training.csv").drop(['Unnamed: 0'], axis=1)
df_test = pd.read_csv("cs-test.csv").drop(['Unnamed: 0'], axis=1)

# Display the training data
print("Training data:")
print(df_train.head())

# Display the test data
print("Test data:")
print(df_test.head())

Handle Missing Values

We use several strategies to handle missing values, including imputation based on statistical measures such as the mean, median, and mode. We also analyze how missing values are distributed across features and tailor the approach to each one.

# Compute the percentage of missing values per column
def findMiss(df):
    return round(df.isnull().sum() / df.shape[0] * 100, 2)

missing_percentage = findMiss(df_train)
print("Percentage of missing values:")
print(missing_percentage)
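
As a small, self-contained sketch of the statistical imputation described above (the toy frame stands in for df_train; which strategy suits each column is decided during exploration):

import pandas as pd

# Toy frame with gaps, standing in for df_train
df = pd.DataFrame({
    "MonthlyIncome": [5000.0, None, 7000.0, None, 6000.0],
    "NumberOfDependents": [2.0, None, 1.0, 0.0, None],
})

# Median imputation is robust to a skewed income distribution
df["MonthlyIncome"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].median())

# Mode imputation suits a discrete count like number of dependents
df["NumberOfDependents"] = df["NumberOfDependents"].fillna(
    df["NumberOfDependents"].mode()[0]
)

print(df.isnull().sum().sum())  # 0 missing values remain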

Feature Engineering

Feature engineering emerges as a crucial aspect of our project, as it involves the transformation and creation of new features to improve the predictive performance of our model. We delve into the following key feature engineering techniques:

1. Imputation of Missing Values

Missing values in crucial features such as ‘NumberOfDependents’ and ‘MonthlyIncome’ can significantly impact model performance. By imputing them carefully, we keep the dataset robust and representative of the underlying population. We apply a different strategy to each feature:

# Impute missing values with appropriate techniques.
# fam_miss / fam_nmiss split the rows by whether 'NumberOfDependents'
# is missing (the split is reconstructed here; the names follow the original).
fam_miss = df_train[df_train['NumberOfDependents'].isnull()].copy()
fam_nmiss = df_train[df_train['NumberOfDependents'].notnull()].copy()

# A missing 'NumberOfDependents' is taken to signify no dependents
fam_miss['NumberOfDependents'] = fam_miss['NumberOfDependents'].fillna(0)

# Where dependents are unrecorded, a missing 'MonthlyIncome' is set to 0;
# otherwise it is filled with the median of the observed incomes
fam_miss['MonthlyIncome'] = fam_miss['MonthlyIncome'].fillna(0)
fam_nmiss['MonthlyIncome'] = fam_nmiss['MonthlyIncome'].fillna(fam_nmiss['MonthlyIncome'].median())

2. Handling Outliers

Outliers, if left unaddressed, can skew the model’s learning process and hurt both predictive accuracy and generalization. We identify and handle them in critical features such as ‘RevolvingUtilizationOfUnsecuredLines’, ‘NumberOfTimes90DaysLate’, and ‘DebtRatio’:

# Handling outliers
# Revolving utilization is a ratio, so values above 10 are treated as
# data errors and the affected rows are dropped
util_dropped = df_train.drop(
    df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] > 10].index)

# Inspect the distribution of 'NumberOfTimes90DaysLate' after the drop
print(util_dropped.groupby('NumberOfTimes90DaysLate').size())
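
For ‘DebtRatio’, capping at a high quantile is one option that keeps the affected borrowers in the sample instead of dropping them; a minimal sketch (the 97.5th-percentile cutoff and the toy values are illustrative, not the thresholds used above):

import pandas as pd

# Toy DebtRatio column with a few extreme values
df = pd.DataFrame({"DebtRatio": [0.1, 0.3, 0.5, 0.8, 120.0, 3500.0]})

# Cap values above the 97.5th percentile instead of removing rows,
# so the affected borrowers stay in the training set
cap = df["DebtRatio"].quantile(0.975)
df["DebtRatio"] = df["DebtRatio"].clip(upper=cap)

print(df["DebtRatio"].max() <= cap)  # True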

3. Balancing the Target Variable

The imbalance in the target variable (‘SeriousDlqin2yrs’) can bias the model towards the majority class. To address this, we employ oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique) to balance the distribution and enhance the model’s ability to learn from minority class instances.

# Balancing the target variable
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first so synthetic samples never leak into the evaluation data
X_train, X_test, y_train, y_test = train_test_split(
    df_train.drop(['SeriousDlqin2yrs'], axis=1), df_train['SeriousDlqin2yrs'],
    test_size=0.2, random_state=2, stratify=df_train['SeriousDlqin2yrs'])

# Oversample the minority (default) class in the training split only
X_train_res, y_train_res = SMOTE(random_state=2).fit_resample(X_train, y_train)

These feature engineering techniques are crucial in preparing a robust dataset for model training. By handling missing values, addressing outliers, and balancing the target variable, we ensure that our predictive model learns meaningful patterns from the data, ultimately enhancing its predictive performance and generalization capability.

Model Building and Evaluation

With a meticulously curated dataset and enriched feature set, we proceed to build and train our predictive model. We opt for the XGBoost algorithm, known for its robustness and efficiency in handling complex datasets.

# Model Building
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Separate features and target variable
X = df_train.drop(['SeriousDlqin2yrs'], axis=1)
y = df_train['SeriousDlqin2yrs']

# Hold out part of the data so the metrics reflect unseen borrowers
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y)

# Instantiate and fit the XGBoost classifier
# (the resampled X_train_res, y_train_res can be substituted here)
model = XGBClassifier(tree_method='exact')
model.fit(X_train, y_train)

# Make predictions on the held-out set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

After training the XGBoost classifier, we evaluate its performance using metrics such as accuracy, precision, recall, and F1-score, and we visualize the results with a confusion matrix to gain deeper insight into the model’s predictive behavior.
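
The confusion-matrix visualization mentioned above can be sketched with matplotlib (assumed available; the cell counts below are made up for illustration, not results from the model):

import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Illustrative 2x2 confusion matrix: rows = actual, columns = predicted
conf_matrix = np.array([[9200, 300],
                        [450, 550]])
labels = ["No default", "Default"]

fig, ax = plt.subplots()
im = ax.imshow(conf_matrix, cmap="Blues")
ax.set_xticks(range(2))
ax.set_xticklabels(labels)
ax.set_yticks(range(2))
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")

# Annotate each cell with its count
for i in range(2):
    for j in range(2):
        ax.text(j, i, conf_matrix[i, j], ha="center", va="center")

fig.colorbar(im)
fig.savefig("confusion_matrix.png")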

Business Implications

Beyond the technical intricacies, it is imperative to underscore the business implications of our credit risk assessment framework. By accurately identifying individuals at a higher risk of default, financial institutions can proactively mitigate potential losses, optimize lending strategies, and uphold regulatory compliance standards. Moreover, the integration of advanced analytics into credit risk assessment processes empowers businesses to foster trust and credibility among stakeholders while driving sustainable growth and profitability.

--

Written by Mariam Kili Bechir/ Techgirl_235