Reference_Notebook_Milestone_1_Classification+FINAL.html

School: Rutgers University | Course: 700 | Subject: Economics | Date: Apr 30, 2024 | Type: html | Pages: 37
Uploaded by DeanQuetzalPerson1017 on coursehero.com

Milestone 1

Problem Definition
- The context: Why is this problem important to solve?
- The objectives: What is the intended goal?
- The key questions: What are the key questions that need to be answered?
- The problem formulation: What is it that we are trying to solve using data science?

Data Description:
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant ultimately defaulted or was severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were recorded for each applicant.

- BAD: 1 = client defaulted on loan, 0 = loan repaid
- LOAN: Amount of loan approved.
- MORTDUE: Amount due on the existing mortgage.
- VALUE: Current value of the property.
- REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, which means taking out a new loan to pay off other liabilities and consumer debts).
- JOB: The type of job the loan applicant has, such as manager, self-employed, etc.
- YOJ: Years at present job.
- DEROG: Number of major derogatory reports (which indicate a serious delinquency or late payments).
- DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
- CLAGE: Age of the oldest credit line in months.
- NINQ: Number of recent credit inquiries.
- CLNO: Number of existing credit lines.
- DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income; this number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow).

Important Notes
This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone.
- Unlike previous courses, it does not follow the pattern of graded questions in different sections. This notebook gives you a direction on what steps need to be taken in order to get a viable solution to the problem.
- Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem, and we urge you to feel free to explore them as an 'optional' exercise.
- In the notebook, there are markdown cells called Observations and Insights. It is good practice to provide observations and extract insights from the outputs.
- The naming convention for different variables can vary. Please consider the code provided in this notebook as sample code.
- All the outputs in the notebook are just for reference and can be different if you follow a different approach.
- There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.
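Before diving in, the 20 percent default rate quoted in the Data Description can be reproduced with a one-line check on the target column. The snippet below uses a small synthetic Series in place of the real `hmeq.csv` target, so the numbers are illustrative only:

```python
import pandas as pd

# Synthetic stand-in for the HMEQ target column (2 defaults out of 10 loans);
# on the real data you would use hm["BAD"] after reading hmeq.csv.
bad = pd.Series([1] * 2 + [0] * 8, name="BAD")

default_rate = bad.mean()        # fraction of BAD == 1
counts = bad.value_counts()      # absolute counts per class
print(default_rate)              # 0.2 for this toy series
```

On the real data, `hm['BAD'].value_counts(normalize=True)` should show roughly 0.80 / 0.20 for classes 0 and 1, matching the 1,189-of-5,960 figure above.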
Import the necessary libraries

In [109]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
import scipy.stats as stats
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

Read the dataset

In [5]:
hm = pd.read_csv("hmeq.csv")

In [6]:
# Copying data to another variable to avoid any changes to original data
data = hm.copy()

Print the first and last 5 rows of the dataset

In [7]:
# Display first five rows
# Remove ___________ and complete the code
hm.head()
Out[7]: (columns to the right of CLAGE are truncated in this preview)
   BAD  LOAN  MORTDUE  VALUE     REASON   JOB     YOJ   DEROG  DELINQ  CLAGE
0  1    1100  25860.0  39025.0   HomeImp  Other   10.5  0.0    0.0     94.366667
1  1    1300  70053.0  68400.0   HomeImp  Other   7.0   0.0    2.0     121.833333
2  1    1500  13500.0  16700.0   HomeImp  Other   4.0   0.0    0.0     149.466667
3  1    1500  NaN      NaN       NaN      NaN     NaN   NaN    NaN     NaN
4  0    1700  97800.0  112000.0  HomeImp  Office  3.0   0.0    0.0     93.333333

In [8]:
# Display last 5 rows
# Remove ___________ and complete the code
hm.tail()
Out[8]: (columns to the right of CLAGE are truncated in this preview)
      BAD  LOAN   MORTDUE  VALUE    REASON   JOB    YOJ   DEROG  DELINQ  CLAGE
5955  0    88900  57264.0  90185.0  DebtCon  Other  16.0  0.0    0.0     221.808718
5956  0    89000  54576.0  92937.0  DebtCon  Other  16.0  0.0    0.0     208.692070
5957  0    89200  54045.0  92924.0  DebtCon  Other  15.0  0.0    0.0     212.279697
5958  0    89800  50370.0  91861.0  DebtCon  Other  14.0  0.0    0.0     213.892709
5959  0    89900  48811.0  88934.0  DebtCon  Other  15.0  0.0    0.0     219.601002

Understand the shape of the dataset

In [9]:
# Check the shape of the data
# Remove ___________ and complete the code
print(hm.shape)
(5960, 13)

Insights: The dataset has 5,960 rows and 13 columns.

Check the data types of the columns

In [10]:
# Check info of the data
# Remove ___________ and complete the code
hm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB

Insights: BAD and LOAN are int64, REASON and JOB are object, while the rest are float64.

Check for missing values

In [11]:
# Analyse missing values - Hint: use isnull() function
# Remove ___________ and complete the code
print(hm.isnull().sum())
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64

In [12]:
# Check the percentage of missing values in each column.
# Hint: divide the result from the previous code by the number of rows in the dataset
# Remove ___________ and complete the code
percent_missing = hm.isnull().sum() * 100 / len(hm)
missing_value_hm = pd.DataFrame({'column_name': hm.columns,
                                 'percent_missing': percent_missing})
print(missing_value_hm)
        column_name  percent_missing
BAD             BAD         0.000000
LOAN           LOAN         0.000000
MORTDUE     MORTDUE         8.691275
VALUE         VALUE         1.879195
REASON       REASON         4.228188
JOB             JOB         4.681208
YOJ             YOJ         8.640940
DEROG         DEROG        11.879195
DELINQ       DELINQ         9.731544
CLAGE         CLAGE         5.167785
NINQ           NINQ         8.557047
CLNO           CLNO         3.724832
DEBTINC     DEBTINC        21.258389
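A common way to act on these percentages is to drop columns whose missingness exceeds some threshold and impute the remainder, e.g. with the median. The sketch below uses a made-up toy frame (column names mirror HMEQ, values are invented, and the 50% cutoff is purely illustrative, not a recommendation from the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values; names mirror HMEQ but the data is synthetic.
df = pd.DataFrame({
    "LOAN":    [1100, 1300, 1500, 1700],
    "DEBTINC": [np.nan, np.nan, np.nan, 34.8],    # 75% missing
    "VALUE":   [39025.0, np.nan, 16700.0, 112000.0],
})

threshold = 50  # illustrative cutoff: drop columns with > 50% missing
pct_missing = df.isnull().mean() * 100
keep = pct_missing[pct_missing <= threshold].index
df_reduced = df[keep]

# Columns kept below the threshold can then be imputed, e.g. with the median.
df_imputed = df_reduced.fillna(df_reduced.median(numeric_only=True))
print(list(df_reduced.columns))  # ['LOAN', 'VALUE']
```

Dropping discards information along with the noise, while imputing keeps every row but can bias the distribution, which is exactly the trade-off the Think About It prompt asks about.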
Insights: REASON and JOB carry significant information. DEBTINC and DEROG have the most null values, with DEBTINC (about 21% missing) passing over the threshold.

Think about it: We found the total number of missing values and the percentage of missing values; which is better to consider? What can be the limit for the % of missing values in a column in order to avoid it, and what are the challenges associated with filling them versus avoiding them?

We can convert the object-type columns to categories: converting "object" to "category" reduces the data space required to store the dataframe.

Convert the data types

In [13]:
cols = data.select_dtypes(['object']).columns.tolist()
# adding target variable to this list as this is a classification problem and the target variable is categorical
cols.append('BAD')

In [14]:
cols
Out[14]: ['REASON', 'JOB', 'BAD']

In [15]:
# Changing the data type of object type columns to category. Hint: use astype() function
# Remove ___________ and complete the code
hm = hm.astype({"BAD": 'category', "REASON": 'category', "JOB": 'category'})

In [16]:
# Checking the info again and the datatype of different variables
# Remove ___________ and complete the code
print(hm.dtypes)
BAD        category
LOAN          int64
MORTDUE     float64
VALUE       float64
REASON     category
JOB        category
YOJ         float64
DEROG       float64
DELINQ      float64
CLAGE       float64
NINQ        float64
CLNO        float64
DEBTINC     float64
dtype: object

Analyze Summary Statistics of the dataset

In [17]:
# Analyze the summary statistics for numerical variables
# Remove ___________ and complete the code
hm.describe().T
Out[17]: (columns to the right of the 50% quantile are truncated in this preview)
         count   mean           std           min          25%           50%
LOAN     5960.0  18607.969799   11207.480417  1100.000000  11100.000000  16300.0000
MORTDUE  5442.0  73760.817200   44457.609458  2063.000000  46276.000000  65019.0000
VALUE    5848.0  101776.048741  57385.775334  8000.000000  66075.500000  89235.5000
YOJ      5445.0  8.922268       7.573982      0.000000     3.000000      7.000000
DEROG    5252.0  0.254570       0.846047      0.000000     0.000000      0.000000
DELINQ   5380.0  0.449442       1.127266      0.000000     0.000000      0.000000
CLAGE    5652.0  179.766275     85.810092     0.000000     115.116702    173.466667
NINQ     5450.0  1.186055       1.728675      0.000000     0.000000      1.000000
CLNO     5738.0  21.296096      10.138933     0.000000     15.000000     20.000000
DEBTINC  4693.0  33.779915      8.601746      0.524499     29.140031     34.818262

Insights: The means look reasonable for most variables, with the mean loan amount being about $18,600. VALUE has a high standard deviation, so there must be a large range in the value of the houses. The data looks right-skewed (or roughly normal) for most variables, because the mean is higher than the median for almost all columns except DEBTINC. There appear to be some outliers in MORTDUE, VALUE, and DEBTINC, and perhaps in LOAN as well.

In [18]:
# Check summary for categorical data - Hint: inside describe function you can use the argument include=['category']
# Remove ___________ and complete the code
hm.describe(include=['category']).T
Out[18]:
        count  unique  top      freq
BAD     5960   2       0        4771
REASON  5708   2       DebtCon  3928
JOB     5681   6       Other    2388

Insights: There are more 0s than 1s for BAD (repaid loans outnumber defaults), and the most common reason is DebtCon.
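The skewness and outlier observations above can be checked numerically: a right-skewed column has mean greater than median, and the conventional 1.5×IQR fences flag outliers. A minimal sketch on a synthetic series (not the HMEQ data):

```python
import pandas as pd

# Toy right-skewed series standing in for a column like LOAN or VALUE;
# one large value drags the mean above the median.
vals = pd.Series([10, 12, 13, 14, 15, 90])

mean, median = vals.mean(), vals.median()
q1, q3 = vals.quantile([0.25, 0.75])
iqr = q3 - q1

# 1.5 * IQR fences: values outside [q1 - 1.5*iqr, q3 + 1.5*iqr] are outliers.
outliers = vals[(vals < q1 - 1.5 * iqr) | (vals > q3 + 1.5 * iqr)]
print(mean > median, list(outliers))  # True [90]
```

Applied column by column to the numeric HMEQ variables (or visualized with `sns.boxplot`, already imported above), this makes the "mean > median" and outlier claims in the insights concrete rather than eyeballed.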