Skip to main content

The Loan Detective: Identifying Risks and Anomalies in Financial Data

1. Introduction

·         The " The Loan Detective: Identifying Risks and Anomalies in Financial Data" aims to evaluate the risk associated with loan applicants using a comprehensive dataset containing customer and loan details. In the financial industry, accurately assessing loan risk is crucial for minimizing potential losses and ensuring the stability of lending institutions. This project leverages SQL to explore and analyze key risk factors, providing insights into customer profiles, creditworthiness, and potential default risks.

The dataset used in this project includes a range of attributes such as customer demographics, loan amounts, credit scores, and repayment history. By examining these factors, we aim to identify high-risk customers, understand the characteristics of defaulted loans, and propose strategies for improving loan approval processes.

The project not only focuses on the technical aspects of data analysis but also incorporates an auditing perspective to ensure data integrity and compliance. This involves checking for data completeness, consistency, and identifying any anomalies or irregularities. The findings from this project will be valuable for internal auditors, risk managers, and decision-makers in the banking sector, providing them with actionable insights to enhance their risk management strategies.


2. Dataset Description

Data Source: The dataset from Kaggle contains information on bank loan applications and their statuses, intended for credit scoring and risk analysis. It includes features like credit score, loan amount, term, income, and more information useful for evaluating applicants' creditworthiness.

· Loan ID - A unique identifier for each loan.

· Customer ID - A unique identifier for each customer.

· Loan Status - The current status of the loan (e.g., Fully Paid, Charged Off).

· Current Loan Amount - The amount currently owed on the loan.

· Term - The term or duration of the loan (e.g., Short Term, Long Term).

· Credit Score - The credit score of the customer.

· Annual Income - The customer's annual income.

· Years in current job - The number of years the customer has been in their current job.

· Home Ownership - The customer's home ownership status (e.g., Rent, Own Home, Home Mortgage).

· Purpose - The purpose for which the loan was taken (e.g., Debt Consolidation, Home Improvements).

· Monthly Debt - The customer's monthly debt obligations.

· Years of Credit History - The number of years the customer has had credit.

· Months since last delinquent - The number of months since the customer last had a delinquency.

· Number of Open Accounts - The number of open credit accounts the customer has.

· Number of Credit Problems - The number of credit-related issues (e.g., late payments, defaults).

· Current Credit Balance - The customer's current outstanding balance on all credit accounts.

· Maximum Open Credit - The maximum amount of credit available to the customer.

· Bankruptcies - The number of bankruptcies filed by the customer.

· Tax Liens - The number of tax liens on the customer's assets.


3. Objective

The objective of this project is to analyze a dataset containing various attributes related to bank loans, including loan status, customer information, credit score, and financial history. The primary focus is to assess the risk associated with each loan by identifying key indicators of creditworthiness and potential default. This analysis aims to provide insights into patterns and trends that can assist in making informed lending decisions and improving the overall credit risk management process. The project will also highlight important metrics that influence loan outcomes and recommend strategies for mitigating risk.


4. Methodology

4.1 Data Acquisition

Dataset Source: Obtain the dataset from the relevant source, ensuring it includes all necessary attributes related to bank loans, customer information, credit score, and financial history.

Database Setup: A new database named CREDIT RISK was created using Microsoft SQL Server Management Studio.


 

4.2 Data Preparation

Data Cleaning: During the initial data exploration, missing values were detected in several columns, including sensitive fields such as Bankruptcies, Credit Score, and Months Since Last Delinquent.

Approach for Handling Missing Values: In columns like Bankruptcies and Credit Score, missing values were retained as-is due to regulatory constraints and to ensure data integrity. For example: In Bankruptcies Column null values were not imputed, preserving the original state of the data. The Credit Score Column Entries marked as "N/A" or missing were left unchanged.

Standardizing Formats: Monetary values across columns such as Current Loan Amount and Monthly Debt were checked for consistent formatting (e.g., no mixed currency symbols or decimal separators). Credit-related fields like Credit Score were normalized to a consistent numerical range where possible.

Data Import: Using SQL commands the data was imported into the CREDIT RISK database. The data is correctly imported and structured into tables with appropriate data types.

4.3 Risk Assessment

Credit Score

The credit score is a numerical measure of a borrower’s creditworthiness, derived from their credit history. This score plays a crucial role in assessing the risk associated with lending to an individual. A lower credit score is indicative of higher risk, as it suggests a greater likelihood of default. For instance, borrowers with credit scores below a certain threshold, such as 600, are generally categorized as high-risk. This is because a lower score often reflects a history of financial difficulties or poor credit management, which increases the probability of missed payments or defaults on the loan.

Debt-to-Income

The Debt-to-Income (DTI) ratio is a metric that assesses a borrower’s total monthly debt payments in relation to their gross monthly income. This ratio is instrumental in evaluating the borrower’s ability to manage additional debt. A higher DTI ratio indicates that a borrower is carrying a substantial amount of debt relative to their income, which can heighten the risk of default. For instance, a DTI ratio exceeding 40% is commonly viewed as high risk, as it signifies that a significant portion of the borrower’s income is already committed to existing debt obligations, leaving less capacity to handle new loan payments.

Current Loan Amount:

The current loan amount refers to the outstanding balance of the loan at any given time. This figure is crucial in assessing loan risk, as larger loan amounts can signal increased risk, particularly if they surpass the borrower’s ability to repay based on their financial profile. When a borrower has a substantial loan balance relative to their income and financial stability, it may strain their resources and elevate the likelihood of default. Thus, higher loan amounts can indicate greater financial pressure and potential repayment issues.

Annual Income

The annual income of a borrower is a key factor in evaluating their capacity to fulfill loan obligations. It provides insight into their overall financial stability and ability to manage loan repayments. When the annual income is relatively low compared to the loan amount and monthly debt payments, it can heighten the risk of default. A lower income means that a borrower may struggle to cover both existing debts and new loan payments, increasing the likelihood of repayment issues. Thus, a lower annual income relative to debt obligations can signal greater risk for lenders.

Years of Credit History

Years of credit history measures the length of time a borrower has been using credit. This metric is important for assessing risk, as a shorter credit history may be associated with higher risk. Limited credit behavior data can make it challenging to evaluate the borrower’s creditworthiness and repayment reliability. A shorter credit history provides less insight into the borrower’s long-term financial behavior and credit management, which can increase the uncertainty and potential risk of default. Therefore, borrowers with less established credit histories might be viewed as higher risk due to the lack of extensive credit performance data.

Number of Credit Problems

The number of credit problems encompasses issues such as missed payments, defaults, or bankruptcies. This measure is crucial for assessing risk, as a higher count of credit problems signals a greater likelihood of default. Each instance of a credit problem reflects past difficulties in meeting financial obligations, which can predict future repayment issues. Consequently, borrowers with a history of multiple credit problems are considered higher risk, as their past financial behavior suggests a greater probability of continuing difficulties in managing debt.

How can we segment the customers on the basis of their risk levels?

To assess and manage loan risk, customers can be segmented into three distinct risk levels based on several criteria.

·        High Risk customers typically have low credit scores, high debt-to-income (DTI) ratios, large loan amounts relative to their income, and multiple credit problems. These borrowers are more likely to default, and risk mitigation strategies for them might include imposing higher interest rates or additional requirements to safeguard against potential losses.

·        Medium Risk customers generally exhibit moderate credit scores and manageable DTI ratios, with some credit issues but not severe. They present a moderate risk, necessitating closer monitoring and periodic reviews of their financial situation to prevent potential defaults.

·        Low Risk customers have high credit scores, low DTI ratios, and clean credit histories, indicating they are less likely to default. These borrowers may be offered better loan terms and conditions due to their favorable risk profile, reflecting their strong financial stability and reliability.

 

4.4 Query Development

Data Consistency Checks

             Checking for missing values

This query identifies the number of records with missing values in critical fields like Credit_Score, Current_Loan_Amount, Annual_Income, and Number_of_Credit_Problems. It also calculates the percentage of such records relative to the total dataset. Handling missing values is crucial for ensuring the accuracy and completeness of the data.

In the dataset, about 19.57% of the data is missing, totaling 19,668 missing values. This means that almost one-fifth of the information isn’t available, which could affect the accuracy and reliability of any analysis we perform.

·                  Validating Data Ranges

   

With 4,551 records falling outside the expected ranges, this indicates a significant portion of the dataset may contain errors or inconsistencies. These out-of-range values could suggest issues such as data entry mistakes, anomalies, or problems with the data collection process. Addressing these records is crucial for improving data quality and ensuring that your analyses are based on accurate and reliable information.

·                   Check for Logical Consistency

   

The presence of 11,486 records in the credit_train dataset where the Current_Loan_Amount exceeds the Annual_Income highlights a significant data inconsistency issue. This large number of inconsistencies suggests potential errors in data entry or collection, which could distort credit risk assessments and lead to inaccurate conclusions about borrowers' creditworthiness. Such discrepancies might also indicate possible fraudulent activity or deliberate misrepresentation, warranting further investigation. Additionally, the high volume of inconsistencies underscores the need for improved data validation and quality assurance processes to prevent similar issues in the future. Addressing these concerns is crucial for maintaining the reliability of credit risk analyses and ensuring the accuracy of the dataset.

·                   Check for Duplicate Entries

    

The detection of 18,002 duplicate entries in the credit_train dataset reveals a major data redundancy issue. These duplicates can skew analysis results by inflating data counts and distorting conclusions, thereby impacting the accuracy of credit risk assessments and modeling. Additionally, managing such a high volume of duplicates can strain system resources and reduce processing efficiency.

Descriptive Statistics

·                   Average and Distribution

        

The stats from the credit_train dataset show some interesting numbers. The average loan amount is about $11.76 million, which might indicate either very large loans or some outliers in the data. The average credit score is quite high at 1,076, which is unusual since most scores fall between 300 and 850—this could point to some data entry issues. Similarly, the average annual income stands at $1.38 million, which is quite hefty and might suggest either very high earners or some anomalies. With 100,514 records, there's a lot of data to work with, but these high averages suggest it might be worth double-checking for any inconsistencies or errors.

·                   Distribution of Loan Amounts, Credit Scores, and Income

The dataset reveals a lot of variety. Loan amounts range from $10,802 to a whopping $100 million, showing a huge difference in loan sizes. Credit scores go from as low as 585 to an unusually high 7,510, which might suggest some data issues since typical scores don’t go beyond 850. Incomes also vary widely, from $76,627 to $165 million, reflecting a broad spectrum of earnings. These wide ranges point to diverse financial situations but also hint that there might be some data inconsistencies or outliers worth looking into.

Risk Assessment

     High-Risk Customers


The SQL query classifies customers into risk categories based on several financial and credit attributes. It calculates the debt-to-income ratio and converts categorical job tenure data into numerical values for comparison. Customers are categorized as 'High Risk' if they have a credit score below 600, a debt-to-income ratio over 50%, more than 2 credit problems, less than 12 months since their last delinquency, or a home ownership status of 'rent' or 'mortgage'. Those with a credit score between 600 and 699, a debt-to-income ratio between 30% and 50%, 1 to 2 credit problems, or 12 to 24 months since the last delinquency are classified as 'Medium Risk'. All other customers are considered 'Low Risk'. This classification helps in assessing and managing lending risks by identifying high-risk customers who may require closer scrutiny or different loan terms.

4.5 Data Visualization

Distribution of Credit Scores

The graph depicts the distribution of credit scores within a dataset, revealing a right-skewed pattern where the majority of scores are concentrated at the lower end, particularly below 1000. The high frequency of low scores suggests that most individuals in the dataset have relatively poor credit, while the long tail extending to higher scores indicates a smaller number of individuals with significantly better credit. This distribution suggests potential challenges in credit risk assessment, with a focus on a largely high-risk population and a few cases of exceptionally good credit.


Counts of Loan Statuses

The bar chart shows the counts of loan statuses in a dataset, comparing the number of loans that were "Fully Paid" versus those that were "Charged Off." The "Fully Paid" category has a significantly higher count, with around 75,000 loans, indicating a majority of loans were successfully repaid. In contrast, the "Charged Off" category, representing loans that were defaulted on or not fully repaid, has a lower count of approximately 20,000 loans. This distribution suggests a relatively higher success rate in loan repayment, with a smaller proportion of loans resulting in default.


 5. Results

1. Risk Classification: The analysis reveals a diverse range of risk profiles among customers:

  • High Risk (58.7%): The majority of customers are classified as high risk. This group typically has low credit scores, high debt-to-income (DTI) ratios, and multiple credit problems. These attributes significantly increase the likelihood of loan defaults.
  • Medium Risk (15.2%): A smaller segment of customers falls into the medium-risk category. They generally exhibit moderate credit scores and manageable DTI ratios but have some credit issues or a recent history of delinquency.
  • Low Risk (25.9%): This group shows strong financial stability with high credit scores, low DTI ratios, and clean credit histories, indicating a lower likelihood of default.

2. Data Quality Issues:

  • Missing Values: 19.57% of records have missing values for key attributes like Credit_Score, Current_Loan_Amount, Annual_Income, and Number_of_Credit_Problems. This missing data could compromise the accuracy of the risk assessment.
  • Inconsistent Records: 4,551 records fall outside expected ranges, and 11,486 records show inconsistencies where the Current_Loan_Amount exceeds Annual_Income, suggesting data entry errors.
  • Duplicates: The dataset includes 18,002 duplicate entries, which could skew the analysis results.
  • Unusual Averages: Average values for loan amounts ($11.76 million), credit scores (1,076), and annual incomes ($1.38 million) are unusually high, indicating potential anomalies or errors.

3. Anomalies and Insights

Unexpected Patterns:

  • The unusually high credit scores (up to 7,510) and average annual incomes ($1.38 million) suggest possible data entry errors or outliers. Typical credit scores and income figures are significantly lower.
  • The presence of 11,486 records where the Current_Loan_Amount exceeds the Annual_Income points to significant data inconsistencies, potentially indicating incorrect data entries or fraudulent activity.

Further Investigation Areas:

  • Investigate and correct anomalies in credit scores and income values to ensure data accuracy.
  • Examine records with Current_Loan_Amount exceeding Annual_Income to understand whether they represent errors, fraudulent activity, or unusual but valid financial situations. 

6. Recommendations

1. Risk Mitigation Strategies:

  • Tighten Credit Score Requirements: Consider increasing the minimum credit score required for loan approval to reduce the proportion of high-risk customers.
  • Review DTI Ratios: Implement stricter guidelines for acceptable debt-to-income ratios to mitigate risk.
  • Financial Counseling: Offer counseling and financial planning services to medium and high-risk customers to improve their financial stability.

2. Data Quality Improvements:

  • Complete Customer Profiles: Ensure that all critical fields (e.g., Credit_Score, Current_Loan_Amount, Annual_Income) are consistently filled to avoid missing values.
  • Enhanced Data Validation: Implement more robust data entry validation rules to prevent inconsistencies such as Current_Loan_Amount exceeding Annual_Income.
  • Duplicate Detection: Use algorithms to identify and remove duplicate entries to maintain data integrity.

 

7. Conclusion

In order to evaluate lending risks and pinpoint important variables influencing creditworthiness, we examined a bank loan dataset for this project. 58.7% of clients are classified as high risk, 15.2% as medium risk, and 25.9% as low risk, according to our data. Significant problems with missing values and inconsistent data were also found, highlighting the need for better data validation and quality control.

In order to resolve these data problems going ahead, anomalies must be fixed and missing values must be filled in. Enhancing risk assessment and management techniques can be achieved by creating more sophisticated risk models and further segmenting the client base. Furthermore, continuous external data analysis and integration can improve lending process decision-making and yield deeper insights.

 

To access the full SQL code: 

 https://drive.google.com/file/d/11wJtdKrxfNVhsliOmaKB6cogYGhC0CPt/view?usp=sharing

Data Source : 

https://www.kaggle.com/datasets/zaurbegiev/my-dataset


 

Comments

Popular posts from this blog

Learnings from Excel Skills For Business: Essentials by Macquarie University, Coursera Certificate

    Eyes are a great gift of God to humankind. We should not subject them to unnecessary strain by looking at dirty data. Instead, we should focus on cleaning up any dirty data we may have. I have recently finished Excel Skills For Business: Essentials by Macquarie University certificate course from Coursera. Best way to check your learnings is to implement them in real life. This is the snapshot of a dataset, that I have taken from Kaggle. It represents the details of laptops of different companies with their prices, memory, GPU and operating systems. Snapshot of uncleaned dataset   LEARNINGS  Use of Fill Handle A plus sign is used as a fill handle; Excel's fill handle is a strong tool that lets you swiftly insert formulas or data into cells. It is shown as a little square in the selected cell or range's lower-right corner.      Adjusted Column Width It helps in making the data more presenta...

The Mysterious Case of the 35% Dal Makhni

When Harry met Sally...oops  When Laziness Meets Data There comes a day in your life where you glance at your expenses and you realize that you have spent  more on food delivery than on groceries or even savings. It happened to me when I did the math and found that my total monthly spending was being utilized by 35% of feeding laziness. Yes, more than one-third of my money went to ordering food online. If my Swiggy account had a loyalty program, I’d be its Platinum Ambassador by now. Armed with this alarming realization, I dived into the data hoping for some clarity—or at least a good laugh at my own expense. What I found was a budgetary roast hotter than the dal makhni I paid a premium for.  The Pie Chart of Poor Decisions When I sorted through my expenses, I was shocked to see food delivery as the absolute winner, consuming 35% of my budget. Rent and bills comprised 18%, and then came those impulsive buys from Nykaa that tallied at 15%. Everything else added up to 32%...

Exploring IBM Data Analytics: A Comprehensive Review

Data Analytics is a skill that is becoming a need in every profession. A skill so common that you'll find multiple courses on multiple websites guaranteeing you success. Choosing the best fit from such a huge lot is not an easy task. It needs thorough research about the course, the instructors and the platform. One such course is  IBM Data Analytics with Excel and R  on Coursera.   This professional certificate is a 9 course series including  1.      Introduction to Data Analytics  2.      Excel basics for Data Analysis 3.      Data visualization and Dashboards with Excel and Cognos 4.      Assessment for Data Analysis and Visualization Foundation 5.      Introduction to R programming for Data science 6.      SQL for Data Science with R 7.      Data analysis with R 8.      Data visu...