1. Introduction
·
The " The Loan
Detective: Identifying Risks and Anomalies in Financial Data" aims to evaluate the risk associated with loan
applicants using a comprehensive dataset containing customer and loan details.
In the financial industry, accurately assessing loan risk is crucial for
minimizing potential losses and ensuring the stability of lending institutions.
This project leverages SQL to explore and analyze key risk factors, providing
insights into customer profiles, creditworthiness, and potential default risks.
The dataset used in this project includes a range of attributes such as customer demographics, loan amounts, credit scores, and repayment history. By examining these factors, we aim to identify high-risk customers, understand the characteristics of defaulted loans, and propose strategies for improving loan approval processes.
The project not only focuses on the technical aspects of data analysis but also incorporates an auditing perspective to ensure data integrity and compliance. This involves checking for data completeness, consistency, and identifying any anomalies or irregularities. The findings from this project will be valuable for internal auditors, risk managers, and decision-makers in the banking sector, providing them with actionable insights to enhance their risk management strategies.
2. Dataset Description
Data Source: The
dataset from Kaggle contains information on bank loan applications and their
statuses, intended for credit scoring and risk analysis. It includes features
like credit score, loan amount, term, income, and more information useful for
evaluating applicants' creditworthiness.
· Loan ID - A unique
identifier for each loan.
· Customer ID - A
unique identifier for each customer.
· Loan Status - The
current status of the loan (e.g., Fully Paid, Charged Off).
· Current Loan Amount
- The amount currently owed on the loan.
· Term - The term or duration
of the loan (e.g., Short Term, Long Term).
· Credit Score - The credit
score of the customer.
· Annual Income - The
customer's annual income.
· Years in current job - The
number of years the customer has been in their current job.
· Home Ownership - The
customer's home ownership status (e.g., Rent, Own Home, Home Mortgage).
· Purpose - The purpose for
which the loan was taken (e.g., Debt Consolidation, Home Improvements).
· Monthly Debt - The
customer's monthly debt obligations.
· Years of Credit History
- The number of years the customer has had credit.
· Months since last delinquent
- The number of months since the customer last had a delinquency.
· Number of Open Accounts -
The number of open credit accounts the customer has.
· Number of Credit Problems -
The number of credit-related issues (e.g., late payments, defaults).
· Current Credit Balance - The
customer's current outstanding balance on all credit accounts.
· Maximum Open Credit - The
maximum amount of credit available to the customer.
· Bankruptcies - The number of
bankruptcies filed by the customer.
· Tax Liens - The number of
tax liens on the customer's assets.
3. Objective
The objective of this project is to analyze a
dataset containing various attributes related to bank loans, including loan
status, customer information, credit score, and financial history. The primary
focus is to assess the risk associated with each loan by identifying key
indicators of creditworthiness and potential default. This analysis aims to
provide insights into patterns and trends that can assist in making informed
lending decisions and improving the overall credit risk management process. The
project will also highlight important metrics that influence loan outcomes and
recommend strategies for mitigating risk.
4. Methodology
4.1 Data Acquisition
Dataset Source: Obtain the dataset from the relevant source, ensuring it
includes all necessary attributes related to bank loans, customer information,
credit score, and financial history.
Database Setup: A new database named CREDIT RISK was created using
Microsoft SQL Server Management Studio.
4.2
Data
Preparation
Data
Cleaning: During
the initial data exploration, missing values were detected in several columns,
including sensitive fields such as Bankruptcies, Credit Score, and Months Since
Last Delinquent.
Approach for Handling Missing
Values: In columns like Bankruptcies and
Credit Score, missing values were retained as-is due to regulatory constraints
and to ensure data integrity. For example: In Bankruptcies Column null values were not imputed,
preserving the original state of the data. The Credit Score Column Entries marked as "N/A" or
missing were left unchanged.
Standardizing Formats: Monetary values across columns such as Current Loan Amount
and Monthly Debt were checked for consistent formatting (e.g., no mixed
currency symbols or decimal separators). Credit-related fields like Credit
Score were normalized to a consistent numerical range where possible.
Data Import: Using SQL commands the data was imported into the CREDIT
RISK database. The data is correctly imported and structured into tables with
appropriate data types.
4.3 Risk Assessment
Credit Score
The
credit score is a numerical measure of a borrower’s creditworthiness, derived
from their credit history. This score plays a crucial role in assessing the
risk associated with lending to an individual. A lower credit score is
indicative of higher risk, as it suggests a greater likelihood of default. For
instance, borrowers with credit scores below a certain threshold, such as 600,
are generally categorized as high-risk. This is because a lower score often
reflects a history of financial difficulties or poor credit management, which
increases the probability of missed payments or defaults on the loan.
Debt-to-Income
The
Debt-to-Income (DTI) ratio is a metric that assesses a borrower’s total monthly
debt payments in relation to their gross monthly income. This ratio is
instrumental in evaluating the borrower’s ability to manage additional debt. A
higher DTI ratio indicates that a borrower is carrying a substantial amount of
debt relative to their income, which can heighten the risk of default. For
instance, a DTI ratio exceeding 40% is commonly viewed as high risk, as it
signifies that a significant portion of the borrower’s income is already
committed to existing debt obligations, leaving less capacity to handle new
loan payments.
Current Loan
Amount:
The
current loan amount refers to the outstanding balance of the loan at any given
time. This figure is crucial in assessing loan risk, as larger loan amounts can
signal increased risk, particularly if they surpass the borrower’s ability to
repay based on their financial profile. When a borrower has a substantial loan
balance relative to their income and financial stability, it may strain their
resources and elevate the likelihood of default. Thus, higher loan amounts can
indicate greater financial pressure and potential repayment issues.
Annual Income
The annual
income of a borrower is a key factor in evaluating their capacity to fulfill
loan obligations. It provides insight into their overall financial stability
and ability to manage loan repayments. When the annual income is relatively low
compared to the loan amount and monthly debt payments, it can heighten the risk
of default. A lower income means that a borrower may struggle to cover both
existing debts and new loan payments, increasing the likelihood of repayment
issues. Thus, a lower annual income relative to debt obligations can signal
greater risk for lenders.
Years
of Credit History
Years of credit history measures the
length of time a borrower has been using credit. This metric is important for
assessing risk, as a shorter credit history may be associated with higher risk.
Limited credit behavior data can make it challenging to evaluate the borrower’s
creditworthiness and repayment reliability. A shorter credit history provides
less insight into the borrower’s long-term financial behavior and credit
management, which can increase the uncertainty and potential risk of default.
Therefore, borrowers with less established credit histories might be viewed as
higher risk due to the lack of extensive credit performance data.
Number of Credit Problems
The number of credit problems encompasses
issues such as missed payments, defaults, or bankruptcies. This measure is
crucial for assessing risk, as a higher count of credit problems signals a
greater likelihood of default. Each instance of a credit problem reflects past
difficulties in meeting financial obligations, which can predict future
repayment issues. Consequently, borrowers with a history of multiple credit
problems are considered higher risk, as their past financial behavior suggests
a greater probability of continuing difficulties in managing debt.
How
can we segment the customers on the basis of their risk levels?
To
assess and manage loan risk, customers can be segmented into three distinct
risk levels based on several criteria.
·
High Risk customers
typically have low credit scores, high debt-to-income (DTI) ratios, large loan
amounts relative to their income, and multiple credit problems. These borrowers
are more likely to default, and risk mitigation strategies for them might
include imposing higher interest rates or additional requirements to safeguard
against potential losses.
·
Medium
Risk
customers generally exhibit moderate credit scores and manageable DTI ratios,
with some credit issues but not severe. They present a moderate risk,
necessitating closer monitoring and periodic reviews of their financial
situation to prevent potential defaults.
·
Low Risk customers have
high credit scores, low DTI ratios, and clean credit histories, indicating they
are less likely to default. These borrowers may be offered better loan terms
and conditions due to their favorable risk profile, reflecting their strong
financial stability and reliability.
4.4 Query Development
Data Consistency Checks
This
query identifies the number of records with missing values in critical fields
like Credit_Score, Current_Loan_Amount,
Annual_Income, and Number_of_Credit_Problems.
It also calculates the percentage of such records relative to the total
dataset. Handling missing values is crucial for ensuring the accuracy and
completeness of the data.
In
the dataset, about 19.57% of the data is missing, totaling 19,668 missing values.
This means that almost one-fifth of the information isn’t available, which
could affect the accuracy and reliability of any analysis we perform.
· Validating Data Ranges
With 4,551 records falling outside the expected ranges, this indicates a significant portion of the dataset may contain errors or inconsistencies. These out-of-range values could suggest issues such as data entry mistakes, anomalies, or problems with the data collection process. Addressing these records is crucial for improving data quality and ensuring that your analyses are based on accurate and reliable information.
· Check for Logical Consistency
The presence of 11,486 records in the credit_train dataset where the Current_Loan_Amount exceeds the Annual_Income highlights a significant
data inconsistency issue. This large number of inconsistencies suggests
potential errors in data entry or collection, which could distort credit risk
assessments and lead to inaccurate conclusions about borrowers'
creditworthiness. Such discrepancies might also indicate possible fraudulent
activity or deliberate misrepresentation, warranting further investigation.
Additionally, the high volume of inconsistencies underscores the need for
improved data validation and quality assurance processes to prevent similar
issues in the future. Addressing these concerns is crucial for maintaining the
reliability of credit risk analyses and ensuring the accuracy of the dataset.
· Check for Duplicate Entries
The detection of 18,002 duplicate entries in
the credit_train dataset
reveals a major data redundancy issue. These duplicates can skew analysis
results by inflating data counts and distorting conclusions, thereby impacting
the accuracy of credit risk assessments and modeling. Additionally, managing
such a high volume of duplicates can strain system resources and reduce
processing efficiency.
Descriptive
Statistics
· Average and Distribution
The stats from the credit_train dataset show some
interesting numbers. The average loan amount is about $11.76 million, which
might indicate either very large loans or some outliers in the data. The
average credit score is quite high at 1,076, which is unusual since most scores
fall between 300 and 850—this could point to some data entry issues. Similarly,
the average annual income stands at $1.38 million, which is quite hefty and
might suggest either very high earners or some anomalies. With 100,514 records,
there's a lot of data to work with, but these high averages suggest it might be
worth double-checking for any inconsistencies or errors.
·
Distribution of Loan Amounts, Credit Scores,
and Income
The dataset reveals a lot of variety. Loan amounts range from $10,802 to a whopping $100 million, showing a huge difference in loan sizes. Credit scores go from as low as 585 to an unusually high 7,510, which might suggest some data issues since typical scores don’t go beyond 850. Incomes also vary widely, from $76,627 to $165 million, reflecting a broad spectrum of earnings. These wide ranges point to diverse financial situations but also hint that there might be some data inconsistencies or outliers worth looking into.
Risk
Assessment
High-Risk Customers
The SQL query classifies
customers into risk categories based on several financial and credit
attributes. It calculates the debt-to-income ratio and converts categorical job
tenure data into numerical values for comparison. Customers are categorized as
'High Risk' if they have a credit score below 600, a debt-to-income ratio over
50%, more than 2 credit problems, less than 12 months since their last
delinquency, or a home ownership status of 'rent' or 'mortgage'. Those with a
credit score between 600 and 699, a debt-to-income ratio between 30% and 50%, 1
to 2 credit problems, or 12 to 24 months since the last delinquency are
classified as 'Medium Risk'. All other customers are considered 'Low Risk'.
This classification helps in assessing and managing lending risks by
identifying high-risk customers who may require closer scrutiny or different
loan terms.
4.5 Data Visualization
Distribution of Credit Scores
The graph depicts the distribution of credit scores within a dataset, revealing a right-skewed pattern where the majority of scores are concentrated at the lower end, particularly below 1000. The high frequency of low scores suggests that most individuals in the dataset have relatively poor credit, while the long tail extending to higher scores indicates a smaller number of individuals with significantly better credit. This distribution suggests potential challenges in credit risk assessment, with a focus on a largely high-risk population and a few cases of exceptionally good credit.
Counts of Loan Statuses
The
bar chart shows the counts of loan statuses in a dataset, comparing the number
of loans that were "Fully Paid" versus those that were "Charged
Off." The "Fully Paid" category has a significantly higher count,
with around 75,000 loans, indicating a majority of loans were successfully
repaid. In contrast, the "Charged Off" category, representing loans
that were defaulted on or not fully repaid, has a lower count of approximately
20,000 loans. This distribution suggests a relatively higher success rate in
loan repayment, with a smaller proportion of loans resulting in default.
5. Results
1. Risk
Classification: The analysis reveals a diverse range of risk profiles among customers:
- High Risk (58.7%):
The majority of customers are classified as high risk. This group
typically has low credit scores, high debt-to-income (DTI) ratios, and
multiple credit problems. These attributes significantly increase the
likelihood of loan defaults.
- Medium Risk (15.2%): A
smaller segment of customers falls into the medium-risk category. They
generally exhibit moderate credit scores and manageable DTI ratios but
have some credit issues or a recent history of delinquency.
- Low Risk (25.9%):
This group shows strong financial stability with high credit scores, low
DTI ratios, and clean credit histories, indicating a lower likelihood of
default.
2. Data
Quality Issues:
- Missing Values:
19.57% of records have missing values for key attributes like
Credit_Score, Current_Loan_Amount, Annual_Income, and
Number_of_Credit_Problems. This missing data could compromise the accuracy
of the risk assessment.
- Inconsistent Records:
4,551 records fall outside expected ranges, and 11,486 records show
inconsistencies where the Current_Loan_Amount exceeds Annual_Income,
suggesting data entry errors.
- Duplicates:
The dataset includes 18,002 duplicate entries, which could skew the
analysis results.
- Unusual Averages:
Average values for loan amounts ($11.76 million), credit scores (1,076),
and annual incomes ($1.38 million) are unusually high, indicating
potential anomalies or errors.
3. Anomalies and
Insights
Unexpected Patterns:
- The
unusually high credit scores (up to 7,510) and average annual incomes
($1.38 million) suggest possible data entry errors or outliers. Typical
credit scores and income figures are significantly lower.
- The
presence of 11,486 records where the Current_Loan_Amount exceeds the
Annual_Income points to significant data inconsistencies, potentially
indicating incorrect data entries or fraudulent activity.
Further
Investigation Areas:
- Investigate
and correct anomalies in credit scores and income values to ensure data
accuracy.
- Examine records with Current_Loan_Amount exceeding Annual_Income to understand whether they represent errors, fraudulent activity, or unusual but valid financial situations.
6. Recommendations
1. Risk
Mitigation Strategies:
- Tighten Credit Score
Requirements: Consider increasing the minimum credit
score required for loan approval to reduce the proportion of high-risk
customers.
- Review DTI Ratios:
Implement stricter guidelines for acceptable debt-to-income ratios to
mitigate risk.
- Financial Counseling:
Offer counseling and financial planning services to medium and high-risk
customers to improve their financial stability.
2. Data
Quality Improvements:
- Complete Customer
Profiles: Ensure that all critical fields (e.g.,
Credit_Score, Current_Loan_Amount, Annual_Income) are consistently filled
to avoid missing values.
- Enhanced Data
Validation: Implement more robust data entry
validation rules to prevent inconsistencies such as Current_Loan_Amount
exceeding Annual_Income.
- Duplicate Detection:
Use algorithms to identify and remove duplicate entries to maintain data
integrity.
7. Conclusion
In order to
evaluate lending risks and pinpoint important variables influencing
creditworthiness, we examined a bank loan dataset for this project. 58.7% of
clients are classified as high risk, 15.2% as medium risk, and 25.9% as low
risk, according to our data. Significant problems with missing values and
inconsistent data were also found, highlighting the need for better data
validation and quality control.
In order to resolve these data problems going ahead, anomalies must be fixed
and missing values must be filled in. Enhancing risk assessment and management
techniques can be achieved by creating more sophisticated risk models and
further segmenting the client base. Furthermore, continuous external data
analysis and integration can improve lending process decision-making and yield
deeper insights.
To access the full SQL code:
https://drive.google.com/file/d/11wJtdKrxfNVhsliOmaKB6cogYGhC0CPt/view?usp=sharing
Data Source :
https://www.kaggle.com/datasets/zaurbegiev/my-dataset
Comments
Post a Comment