Homework 2

Aleksei Parm

2022-09-24

About

The homework is about fitting linear models to data and about using classical methods for classification.

The homework uses real data from https://www.bondora.com/en/public-reports (only selected variables and a reduced number of observations).

Libraries

The packages tidyverse, broom and tidymodels are loaded.
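For completeness, the loading calls (assuming the packages are installed):

```r
library(tidyverse)   # data wrangling and ggplot2
library(broom)       # tidy(), glance(), augment()
library(tidymodels)  # conf_mat() and other yardstick helpers
```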

Task 1

The aim is to form a model for predicting variable EAD1, which describes loss from a loan in the case of default.

Data

Use the data from Loss.csv.

loss_df <- read_delim("Loss.csv", delim=",")
## Rows: 4982 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Gender, NrOfDependants, Education, ExistingLiabilities, EAD1, HomeO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(loss_df)
## # A tibble: 6 × 6
##   Gender NrOfDependants Education ExistingLiabilities  EAD1 HomeOwnershipType
##    <dbl>          <dbl>     <dbl>               <dbl> <dbl>             <dbl>
## 1      1              0         4                   8 1252.                 1
## 2      0              1         4                   7 6723.                 8
## 3      0              0         2                   1  546.                 4
## 4      1              0         5                   9  496.                 1
## 5      1              0         2                   2  498.                 3
## 6      0              0         4                   9  602.                 2

Info about variables

  • Gender - 0 Male, 1 Woman, 2 Undefined
  • NrOfDependants - Number of children or other dependants
  • Education - 1 Primary education, 2 Basic education, 3 Vocational education, 4 Secondary education, 5 Higher education
  • ExistingLiabilities - Borrower’s number of existing liabilities
  • HomeOwnershipType - 0 Homeless, 1 Owner, 2 Living with parents, 3 Tenant pre-furnished property, 4 Tenant unfurnished property, 5 Council house, 6 Joint tenant, 7 Joint ownership, 8 Mortgage, 9 Owner with encumbrance, 10 Other
  • EAD1 - Exposure at default, outstanding principal at default

Note that variables Education, Gender and HomeOwnershipType are actually nominal variables which are encoded by numbers, so before fitting models transform them to factor variables.

loss_df <- loss_df %>%
  mutate(
    Education = as.factor(Education), 
    Gender = as.factor(Gender), 
    HomeOwnershipType = as.factor(HomeOwnershipType)
  )

head(loss_df)
## # A tibble: 6 × 6
##   Gender NrOfDependants Education ExistingLiabilities  EAD1 HomeOwnershipType
##   <fct>           <dbl> <fct>                   <dbl> <dbl> <fct>            
## 1 1                   0 4                           8 1252. 1                
## 2 0                   1 4                           7 6723. 8                
## 3 0                   0 2                           1  546. 4                
## 4 1                   0 5                           9  496. 1                
## 5 1                   0 2                           2  498. 3                
## 6 0                   0 4                           9  602. 2
summary(loss_df)
##  Gender   NrOfDependants   Education ExistingLiabilities      EAD1      
##  0:2939   Min.   :0.0000   1: 309    Min.   : 0.000      Min.   :    0  
##  1:1564   1st Qu.:0.0000   2: 405    1st Qu.: 1.000      1st Qu.:  934  
##  2: 479   Median :0.0000   3:1073    Median : 2.000      Median : 2008  
##           Mean   :0.7102   4:1978    Mean   : 3.395      Mean   : 2425  
##           3rd Qu.:1.0000   5:1217    3rd Qu.: 5.000      3rd Qu.: 3107  
##           Max.   :8.0000             Max.   :30.000      Max.   :10630  
##                                                                         
##  HomeOwnershipType
##  1      :1803     
##  2      : 918     
##  3      : 861     
##  8      : 562     
##  4      : 281     
##  7      : 177     
##  (Other): 380

Model

Fit a model with all predictors for EAD1.

model <- lm(EAD1 ~ ., data = loss_df)


Analyze the fit: are the assumptions underlying the fit statistics satisfied?

summary(model)
## 
## Call:
## lm(formula = EAD1 ~ ., data = loss_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3453.5 -1415.0  -454.4   704.4  8667.6 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2184.736    128.192  17.043  < 2e-16 ***
## Gender1              -38.785     64.360  -0.603 0.546784    
## Gender2             -640.611    104.171  -6.150 8.38e-10 ***
## NrOfDependants        85.202     30.470   2.796 0.005189 ** 
## Education2            39.434    156.473   0.252 0.801037    
## Education3           279.518    132.888   2.103 0.035480 *  
## Education4            31.032    125.882   0.247 0.805295    
## Education5           504.239    131.099   3.846 0.000121 ***
## ExistingLiabilities   44.899      8.456   5.310 1.15e-07 ***
## HomeOwnershipType2  -524.865     84.877  -6.184 6.76e-10 ***
## HomeOwnershipType3   -40.083     84.722  -0.473 0.636152    
## HomeOwnershipType4    69.635    131.352   0.530 0.596040    
## HomeOwnershipType5  -230.445    241.929  -0.953 0.340873    
## HomeOwnershipType6  -231.806    206.508  -1.123 0.261702    
## HomeOwnershipType7  -177.065    161.014  -1.100 0.271520    
## HomeOwnershipType8   216.127     99.340   2.176 0.029630 *  
## HomeOwnershipType9   363.909    314.158   1.158 0.246772    
## HomeOwnershipType10 -342.993    169.637  -2.022 0.043237 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2031 on 4964 degrees of freedom
## Multiple R-squared:  0.0446, Adjusted R-squared:  0.04133 
## F-statistic: 13.63 on 17 and 4964 DF,  p-value: < 2.2e-16
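The assumptions can be checked with the standard residual diagnostics (a sketch using base-R plots on the fitted model):

```r
# Residuals vs fitted (linearity, constant variance), Q-Q plot (normality),
# scale-location and residuals-vs-leverage plots
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
```

The skewed residual quantiles in the summary above (median -454, max 8668) already suggest that the normality assumption is doubtful.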

If we consider variables with a p-value smaller than 0.05 to be significant, then both numeric variables (NrOfDependants, ExistingLiabilities) are significant.

There are also significant differences between Gender = 0 (Male) and Gender = 2 (Undefined), between Education = 1 and Education = 3 or 5, and for some HomeOwnershipType levels.

Accuracy

What can you say about the accuracy of the model compared to predicting just the average loss for all claims?

knitr::kable(glance(model))
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0446004 | 0.0413285 | 2031.249 | 13.63128 | 0 | 17 | -45005.07 | 90048.14 | 90171.9 | 20481336507 | 4964 | 4982 |

The predictions of the model are only very slightly better than predicting the average for everyone (adjusted R2 is 0.041), so the model is of little use for prediction.

Interaction

Is it a good idea to add the interaction of Gender and Education to the model? Why?

boxplot(EAD1 ~ Gender * Education, data=loss_df)

It isn't a good idea: the box plot shows no large differences between males and females within each education level.

We should also have enough data in each of the groups (15 groups in total).

loss_df %>%
  group_by(Gender, Education) %>%
  count()
## # A tibble: 15 × 3
## # Groups:   Gender, Education [15]
##    Gender Education     n
##    <fct>  <fct>     <int>
##  1 0      1           186
##  2 0      2           238
##  3 0      3           716
##  4 0      4          1176
##  5 0      5           623
##  6 1      1            73
##  7 1      2           157
##  8 1      3           330
##  9 1      4           621
## 10 1      5           383
## 11 2      1            50
## 12 2      2            10
## 13 2      3            27
## 14 2      4           181
## 15 2      5           211


But we can try.

model2 <- lm(EAD1 ~ Gender * Education + NrOfDependants + ExistingLiabilities + HomeOwnershipType, data = loss_df)

knitr::kable(glance(model2))
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0480645 | 0.0432626 | 2029.199 | 10.00941 | 0 | 25 | -44996.02 | 90046.05 | 90221.91 | 20407074192 | 4956 | 4982 |
knitr::kable(tidy(model2))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2097.04836 | 158.330392 | 13.2447621 | 0.0000000 |
| Gender1 | 17.53192 | 281.325504 | 0.0623190 | 0.9503113 |
| Gender2 | -145.55828 | 323.961538 | -0.4493073 | 0.6532296 |
| Education2 | 394.24928 | 200.719959 | 1.9641758 | 0.0495655 |
| Education3 | 349.96641 | 168.292058 | 2.0795183 | 0.0376210 |
| Education4 | 73.76601 | 161.543392 | 0.4566328 | 0.6479550 |
| Education5 | 654.70059 | 170.938478 | 3.8300364 | 0.0001297 |
| NrOfDependants | 80.50834 | 31.260590 | 2.5753940 | 0.0100413 |
| ExistingLiabilities | 44.25847 | 8.511936 | 5.1995766 | 0.0000002 |
| HomeOwnershipType2 | -531.70611 | 84.989223 | -6.2561592 | 0.0000000 |
| HomeOwnershipType3 | -41.76838 | 84.770693 | -0.4927219 | 0.6222309 |
| HomeOwnershipType4 | 76.27188 | 131.429467 | 0.5803256 | 0.5617215 |
| HomeOwnershipType5 | -251.41838 | 242.090333 | -1.0385313 | 0.2990735 |
| HomeOwnershipType6 | -250.40675 | 206.424058 | -1.2130696 | 0.2251610 |
| HomeOwnershipType7 | -195.08660 | 161.143593 | -1.2106383 | 0.2260918 |
| HomeOwnershipType8 | 210.45681 | 99.379252 | 2.1177137 | 0.0342490 |
| HomeOwnershipType9 | 385.07624 | 314.691139 | 1.2236641 | 0.2211372 |
| HomeOwnershipType10 | -322.17508 | 169.871938 | -1.8965762 | 0.0579420 |
| Gender1:Education2 | -729.96131 | 351.715796 | -2.0754294 | 0.0379982 |
| Gender2:Education2 | -345.28324 | 731.268685 | -0.4721701 | 0.6368261 |
| Gender1:Education3 | 20.97986 | 312.236753 | 0.0671921 | 0.9464315 |
| Gender2:Education3 | -493.00340 | 514.821001 | -0.9576210 | 0.3383006 |
| Gender1:Education4 | 69.15368 | 299.663918 | 0.2307708 | 0.8175024 |
| Gender2:Education4 | -339.81263 | 363.156140 | -0.9357205 | 0.3494627 |
| Gender1:Education5 | -92.00894 | 311.730014 | -0.2951559 | 0.7678872 |
| Gender2:Education5 | -749.38420 | 362.881355 | -2.0650942 | 0.0389660 |

Adjusted R2 improves only from 0.041 to 0.043, which is not a meaningful difference.

Moreover, the standard errors and p-values of most interaction terms (Gender1:Education2 … Gender2:Education5) are very large.


I think that removing some outliers from the data set could improve the linear model more than such an interaction, for example observations with NrOfDependants >= 6 or ExistingLiabilities >= 20.

ggplot(loss_df, aes(NrOfDependants, EAD1)) + geom_point()

ggplot(loss_df, aes(ExistingLiabilities, EAD1)) + geom_point()
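A sketch of this idea (the cutoffs 6 and 20 are illustrative, not tuned):

```r
# Drop the suspected outliers and refit the full model
loss_trim <- loss_df %>%
  filter(NrOfDependants < 6, ExistingLiabilities < 20)

model_trim <- lm(EAD1 ~ ., data = loss_trim)
knitr::kable(glance(model_trim))
```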


It also seems that merging several HomeOwnershipType levels could be useful, because some of them are quite similar.

ggplot(loss_df, aes(HomeOwnershipType, EAD1)) + geom_boxplot()
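One way to sketch such a merge is to pool the rare levels with forcats (the threshold of 300 observations is arbitrary, chosen only to match the counts in the summary above):

```r
# Lump all HomeOwnershipType levels with fewer than 300 observations into "Other"
loss_merged <- loss_df %>%
  mutate(HomeOwnershipType = fct_lump_min(HomeOwnershipType, min = 300))

model3 <- lm(EAD1 ~ ., data = loss_merged)
knitr::kable(glance(model3))
```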

Task 2

Use the data from Loan.csv.

The aim is to predict probability of default of new applications.

loan_df_num <- read_delim("Loan.csv", delim=",")
## Rows: 7141 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): Gender, NrOfDependants, Education, ExistingLiabilities, HomeOwnersh...
## lgl (1): default
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(loan_df_num)
## # A tibble: 6 × 6
##   Gender NrOfDependants Education ExistingLiabilities HomeOwnershipType default
##    <dbl>          <dbl>     <dbl>               <dbl>             <dbl> <lgl>  
## 1      1              0         4                   8                 1 TRUE   
## 2      0              1         4                   7                 8 TRUE   
## 3      0              0         2                   1                 4 TRUE   
## 4      0              1         4                   9                 6 FALSE  
## 5      0              0         4                   7                 3 FALSE  
## 6      1              0         4                   7                 9 FALSE

Note that again Education, Gender and HomeOwnershipType are actually nominal variables.

loan_df <- loan_df_num %>%
  mutate(
    Education = as.factor(Education), 
    Gender = as.factor(Gender), 
    HomeOwnershipType = as.factor(HomeOwnershipType)
  )

Logistic regression model

Fit the best logistic regression model you can find in a reasonable amount of time, and describe what you tried and why the final model is the best of the ones you compared.

model <- glm(default ~ ., data = loan_df, family = "binomial")

summary(model)
## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = loan_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3764  -0.9881  -0.8944   1.3319   1.8017  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.031356   0.268926  -0.117  0.90718    
## Gender1             -0.089752   0.050699  -1.770  0.07668 .  
## Gender2              0.443773   0.167528   2.649  0.00807 ** 
## NrOfDependants       0.035874   0.025329   1.416  0.15669    
## Education2          -0.132447   0.273110  -0.485  0.62771    
## Education3          -0.087568   0.269528  -0.325  0.74526    
## Education4          -0.366691   0.266583  -1.376  0.16897    
## Education5          -0.513985   0.268264  -1.916  0.05537 .  
## ExistingLiabilities -0.014910   0.007474  -1.995  0.04607 *  
## HomeOwnershipType2   0.046811   0.074251   0.630  0.52841    
## HomeOwnershipType3   0.019163   0.080126   0.239  0.81098    
## HomeOwnershipType4  -0.040422   0.088840  -0.455  0.64911    
## HomeOwnershipType5   0.412402   0.191561   2.153  0.03133 *  
## HomeOwnershipType6  -0.050029   0.135264  -0.370  0.71149    
## HomeOwnershipType7  -0.351173   0.103841  -3.382  0.00072 ***
## HomeOwnershipType8  -0.217967   0.084712  -2.573  0.01008 *  
## HomeOwnershipType9  -0.610535   0.201174  -3.035  0.00241 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9504.9  on 7140  degrees of freedom
## Residual deviance: 9398.1  on 7124  degrees of freedom
## AIC: 9432.1
## 
## Number of Fisher Scoring iterations: 4

I tried other model specifications, but they were even worse.
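One way to make such comparisons concrete is via AIC; the reduced model below is only an illustration of the kind of comparison, not a specification the homework prescribes:

```r
# Candidate model without NrOfDependants, compared by AIC (lower is better)
model_small <- glm(default ~ Gender + Education + ExistingLiabilities + HomeOwnershipType,
                   data = loan_df, family = "binomial")
AIC(model, model_small)
```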


Produce a box plot, which shows predicted probabilities for defaulting loans and for non-defaulting loans.

ggplot(augment(model, loan_df, type.predict="response")) + 
  geom_boxplot(aes(x=default, y=.fitted))


Produce confusion matrix when using the model for classification with default cutoff.

augment(model, type.predict = "response") %>%
  mutate(
    truth = factor(default),
    prediction = factor(if_else(.fitted > 0.5, TRUE, FALSE))
  ) %>%
  conf_mat(truth, prediction)
##           Truth
## Prediction FALSE TRUE
##      FALSE  4347 2672
##      TRUE     59   63
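The matrix can also be summarised into the usual metrics with yardstick (a sketch; `cm` is just the confusion matrix above stored in a variable):

```r
cm <- augment(model, type.predict = "response") %>%
  mutate(
    truth = factor(default),
    prediction = factor(.fitted > 0.5)
  ) %>%
  conf_mat(truth, prediction)

# Note: sens/spec refer to the first factor level (FALSE)
# unless event_level = "second" is passed to the metric functions
summary(cm)
```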


Can defaulting loans be accurately predicted?

No. With the default cutoff, only 63 of the 2735 defaulting loans (about 2.3%) are classified correctly, so defaulting loans cannot be predicted accurately from these variables.


LDA

Fit LDA model to the data using all variables as numeric variables and produce box plot of probabilities of default for defaulting loans and non-defaulting loans.

(predicted probabilities for classes can be obtained by predict(model)$posterior)

model_LDA <- MASS::lda(default ~ ., data = loan_df_num)

# Posterior probability of the TRUE (default) class for each observation
pred <- predict(model_LDA)$posterior[, "TRUE"]

ggplot(loan_df_num, aes(x = factor(default), y = pred)) + geom_boxplot()


Compute confusion matrix for predictions of LDA model


loan_df_num %>% 
  mutate(
    truth = factor(default),
    prediction = predict(model_LDA)$class
  ) %>% 
  conf_mat(truth, prediction)
##           Truth
## Prediction FALSE TRUE
##      FALSE  4385 2719
##      TRUE     21   16