Homework 2

Aleksei Parm

2022-09-24

About

The homework is about fitting linear models to data and about using classical methods for classification.

The homework uses real data from https://www.bondora.com/en/public-reports (only selected variables and a reduced number of observations).

Libraries

The packages tidyverse, broom and tidymodels are loaded.
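For completeness, the loading calls (assuming the packages are installed):

```r
library(tidyverse)   # data wrangling and ggplot2
library(broom)       # tidy(), glance(), augment()
library(tidymodels)  # conf_mat() and other yardstick helpers
```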

Task 1

The aim is to form a model for predicting variable EAD1, which describes loss from a loan in the case of default.

Data

Use the data from Loss.csv.

loss_df <- read_delim("Loss.csv", delim=",")
## Rows: 4982 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Gender, NrOfDependants, Education, ExistingLiabilities, EAD1, HomeO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(loss_df)
## # A tibble: 6 × 6
##   Gender NrOfDependants Education ExistingLiabilities  EAD1 HomeOwnershipType
##    <dbl>          <dbl>     <dbl>               <dbl> <dbl>             <dbl>
## 1      1              0         4                   8 1252.                 1
## 2      0              1         4                   7 6723.                 8
## 3      0              0         2                   1  546.                 4
## 4      1              0         5                   9  496.                 1
## 5      1              0         2                   2  498.                 3
## 6      0              0         4                   9  602.                 2

Info about variables

  • Gender - 0 Male, 1 Woman, 2 Undefined
  • NrOfDependants - Number of children or other dependants
  • Education - 1 Primary education, 2 Basic education, 3 Vocational education, 4 Secondary education, 5 Higher education
  • ExistingLiabilities - Borrower’s number of existing liabilities
  • HomeOwnershipType - 0 Homeless, 1 Owner, 2 Living with parents, 3 Tenant pre-furnished property, 4 Tenant unfurnished property, 5 Council house, 6 Joint tenant, 7 Joint ownership, 8 Mortgage, 9 Owner with encumbrance, 10 Other
  • EAD1 - Exposure at default, outstanding principal at default

Note that variables Education, Gender and HomeOwnershipType are actually nominal variables which are encoded by numbers, so before fitting models transform them to factor variables.

loss_df <- loss_df %>%
  mutate(
    Education = as.factor(Education), 
    Gender = as.factor(Gender), 
    HomeOwnershipType = as.factor(HomeOwnershipType)
  )

head(loss_df)
## # A tibble: 6 × 6
##   Gender NrOfDependants Education ExistingLiabilities  EAD1 HomeOwnershipType
##   <fct>           <dbl> <fct>                   <dbl> <dbl> <fct>            
## 1 1                   0 4                           8 1252. 1                
## 2 0                   1 4                           7 6723. 8                
## 3 0                   0 2                           1  546. 4                
## 4 1                   0 5                           9  496. 1                
## 5 1                   0 2                           2  498. 3                
## 6 0                   0 4                           9  602. 2
summary(loss_df)
##  Gender   NrOfDependants   Education ExistingLiabilities      EAD1      
##  0:2939   Min.   :0.0000   1: 309    Min.   : 0.000      Min.   :    0  
##  1:1564   1st Qu.:0.0000   2: 405    1st Qu.: 1.000      1st Qu.:  934  
##  2: 479   Median :0.0000   3:1073    Median : 2.000      Median : 2008  
##           Mean   :0.7102   4:1978    Mean   : 3.395      Mean   : 2425  
##           3rd Qu.:1.0000   5:1217    3rd Qu.: 5.000      3rd Qu.: 3107  
##           Max.   :8.0000             Max.   :30.000      Max.   :10630  
##                                                                         
##  HomeOwnershipType
##  1      :1803     
##  2      : 918     
##  3      : 861     
##  8      : 562     
##  4      : 281     
##  7      : 177     
##  (Other): 380

Model

Fit a model with all predictors for EAD1.

model <- lm(EAD1 ~ ., data = loss_df)


Analyze the fit: are the assumptions underlying the fit statistics satisfied?

summary(model)
## 
## Call:
## lm(formula = EAD1 ~ ., data = loss_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3453.5 -1415.0  -454.4   704.4  8667.6 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2184.736    128.192  17.043  < 2e-16 ***
## Gender1              -38.785     64.360  -0.603 0.546784    
## Gender2             -640.611    104.171  -6.150 8.38e-10 ***
## NrOfDependants        85.202     30.470   2.796 0.005189 ** 
## Education2            39.434    156.473   0.252 0.801037    
## Education3           279.518    132.888   2.103 0.035480 *  
## Education4            31.032    125.882   0.247 0.805295    
## Education5           504.239    131.099   3.846 0.000121 ***
## ExistingLiabilities   44.899      8.456   5.310 1.15e-07 ***
## HomeOwnershipType2  -524.865     84.877  -6.184 6.76e-10 ***
## HomeOwnershipType3   -40.083     84.722  -0.473 0.636152    
## HomeOwnershipType4    69.635    131.352   0.530 0.596040    
## HomeOwnershipType5  -230.445    241.929  -0.953 0.340873    
## HomeOwnershipType6  -231.806    206.508  -1.123 0.261702    
## HomeOwnershipType7  -177.065    161.014  -1.100 0.271520    
## HomeOwnershipType8   216.127     99.340   2.176 0.029630 *  
## HomeOwnershipType9   363.909    314.158   1.158 0.246772    
## HomeOwnershipType10 -342.993    169.637  -2.022 0.043237 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2031 on 4964 degrees of freedom
## Multiple R-squared:  0.0446, Adjusted R-squared:  0.04133 
## F-statistic: 13.63 on 17 and 4964 DF,  p-value: < 2.2e-16
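The assumptions can be checked with the standard residual diagnostics (a sketch using base-R plots on the fitted model):

```r
# Residuals vs fitted (linearity, constant variance), Q-Q plot (normality),
# scale-location and residuals-vs-leverage plots
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
```

The skewed residual quantiles in the summary above (median -454, max 8668) already suggest that the normality assumption is doubtful.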

If we consider variables with a p-value smaller than 0.05 to be significant, then both numeric variables (NrOfDependants, ExistingLiabilities) are significant.

There are also significant differences between Gender = 0 (Male) and Gender = 2 (Undefined), between Education = 1 and Education = 3 or 5, and for some HomeOwnershipType levels.

Accuracy

What can you say about the accuracy of the model compared to predicting just the average loss for all claims?

knitr::kable(glance(model))
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0446004 | 0.0413285 | 2031.249 | 13.63128 | 0 | 17 | -45005.07 | 90048.14 | 90171.9 | 20481336507 | 4964 | 4982 |

The predictions of the model are only very slightly better than predicting the average for everyone (adjusted R2 is 0.041), so the model is of little use for prediction.

Interaction

Is it a good idea to add the interaction of Gender and Education to the model? Why?

boxplot(EAD1 ~ Gender * Education, data=loss_df)

It isn't a good idea: the box plot shows no large differences between males and females within each education level.

We should also have enough data in each of the groups (15 groups in total).

loss_df %>%
  group_by(Gender, Education) %>%
  count()
## # A tibble: 15 × 3
## # Groups:   Gender, Education [15]
##    Gender Education     n
##    <fct>  <fct>     <int>
##  1 0      1           186
##  2 0      2           238
##  3 0      3           716
##  4 0      4          1176
##  5 0      5           623
##  6 1      1            73
##  7 1      2           157
##  8 1      3           330
##  9 1      4           621
## 10 1      5           383
## 11 2      1            50
## 12 2      2            10
## 13 2      3            27
## 14 2      4           181
## 15 2      5           211


But we can try.

model2 <- lm(EAD1 ~ Gender * Education + NrOfDependants + ExistingLiabilities + HomeOwnershipType, data = loss_df)

knitr::kable(glance(model2))
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0480645 | 0.0432626 | 2029.199 | 10.00941 | 0 | 25 | -44996.02 | 90046.05 | 90221.91 | 20407074192 | 4956 | 4982 |
knitr::kable(tidy(model2))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2097.04836 | 158.330392 | 13.2447621 | 0.0000000 |
| Gender1 | 17.53192 | 281.325504 | 0.0623190 | 0.9503113 |
| Gender2 | -145.55828 | 323.961538 | -0.4493073 | 0.6532296 |
| Education2 | 394.24928 | 200.719959 | 1.9641758 | 0.0495655 |
| Education3 | 349.96641 | 168.292058 | 2.0795183 | 0.0376210 |
| Education4 | 73.76601 | 161.543392 | 0.4566328 | 0.6479550 |
| Education5 | 654.70059 | 170.938478 | 3.8300364 | 0.0001297 |
| NrOfDependants | 80.50834 | 31.260590 | 2.5753940 | 0.0100413 |
| ExistingLiabilities | 44.25847 | 8.511936 | 5.1995766 | 0.0000002 |
| HomeOwnershipType2 | -531.70611 | 84.989223 | -6.2561592 | 0.0000000 |
| HomeOwnershipType3 | -41.76838 | 84.770693 | -0.4927219 | 0.6222309 |
| HomeOwnershipType4 | 76.27188 | 131.429467 | 0.5803256 | 0.5617215 |
| HomeOwnershipType5 | -251.41838 | 242.090333 | -1.0385313 | 0.2990735 |
| HomeOwnershipType6 | -250.40675 | 206.424058 | -1.2130696 | 0.2251610 |
| HomeOwnershipType7 | -195.08660 | 161.143593 | -1.2106383 | 0.2260918 |
| HomeOwnershipType8 | 210.45681 | 99.379252 | 2.1177137 | 0.0342490 |
| HomeOwnershipType9 | 385.07624 | 314.691139 | 1.2236641 | 0.2211372 |
| HomeOwnershipType10 | -322.17508 | 169.871938 | -1.8965762 | 0.0579420 |
| Gender1:Education2 | -729.96131 | 351.715796 | -2.0754294 | 0.0379982 |
| Gender2:Education2 | -345.28324 | 731.268685 | -0.4721701 | 0.6368261 |
| Gender1:Education3 | 20.97986 | 312.236753 | 0.0671921 | 0.9464315 |
| Gender2:Education3 | -493.00340 | 514.821001 | -0.9576210 | 0.3383006 |
| Gender1:Education4 | 69.15368 | 299.663918 | 0.2307708 | 0.8175024 |
| Gender2:Education4 | -339.81263 | 363.156140 | -0.9357205 | 0.3494627 |
| Gender1:Education5 | -92.00894 | 311.730014 | -0.2951559 | 0.7678872 |
| Gender2:Education5 | -749.38420 | 362.881355 | -2.0650942 | 0.0389660 |

Adjusted R2 improves only from 0.041 to 0.043, which is not a meaningful difference.

Moreover, the standard errors and p-values of most interaction terms (Gender1:Education2 … Gender2:Education5) are very large.


I think that removing some outliers from the data set could improve the linear model more than such an interaction, for example observations with NrOfDependants >= 6 or ExistingLiabilities >= 20.

ggplot(loss_df, aes(NrOfDependants, EAD1)) + geom_point()

ggplot(loss_df, aes(ExistingLiabilities, EAD1)) + geom_point()
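A sketch of this idea (the cutoffs 6 and 20 are illustrative, not tuned):

```r
# Drop the suspected outliers and refit the full model
loss_trim <- loss_df %>%
  filter(NrOfDependants < 6, ExistingLiabilities < 20)

model_trim <- lm(EAD1 ~ ., data = loss_trim)
knitr::kable(glance(model_trim))
```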


It also seems that merging several HomeOwnershipType levels could be useful, because some of them are quite similar.

ggplot(loss_df, aes(HomeOwnershipType, EAD1)) + geom_boxplot()
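One way to sketch such a merge is to pool the rare levels with forcats (the threshold of 300 observations is arbitrary, chosen only to match the counts in the summary above):

```r
# Lump all HomeOwnershipType levels with fewer than 300 observations into "Other"
loss_merged <- loss_df %>%
  mutate(HomeOwnershipType = fct_lump_min(HomeOwnershipType, min = 300))

model3 <- lm(EAD1 ~ ., data = loss_merged)
knitr::kable(glance(model3))
```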

Task 2

Use the data from Loan.csv.

The aim is to predict probability of default of new applications.

loan_df_num <- read_delim("Loan.csv", delim=",")
## Rows: 7141 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): Gender, NrOfDependants, Education, ExistingLiabilities, HomeOwnersh...
## lgl (1): default
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(loan_df_num)
## # A tibble: 6 × 6
##   Gender NrOfDependants Education ExistingLiabilities HomeOwnershipType default
##    <dbl>          <dbl>     <dbl>               <dbl>             <dbl> <lgl>  
## 1      1              0         4                   8                 1 TRUE   
## 2      0              1         4                   7                 8 TRUE   
## 3      0              0         2                   1                 4 TRUE   
## 4      0              1         4                   9                 6 FALSE  
## 5      0              0         4                   7                 3 FALSE  
## 6      1              0         4                   7                 9 FALSE

Note that again Education, Gender and HomeOwnershipType are actually nominal variables.

loan_df <- loan_df_num %>%
  mutate(
    Education = as.factor(Education), 
    Gender = as.factor(Gender), 
    HomeOwnershipType = as.factor(HomeOwnershipType)
  )

Logistic regression model

Fit the best logistic regression model you can find in a reasonable amount of time, and describe what you tried and why the final model is the best of the ones you compared.

model <- glm(default ~ ., data = loan_df, family = "binomial")

summary(model)
## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = loan_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3764  -0.9881  -0.8944   1.3319   1.8017  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.031356   0.268926  -0.117  0.90718    
## Gender1             -0.089752   0.050699  -1.770  0.07668 .  
## Gender2              0.443773   0.167528   2.649  0.00807 ** 
## NrOfDependants       0.035874   0.025329   1.416  0.15669    
## Education2          -0.132447   0.273110  -0.485  0.62771    
## Education3          -0.087568   0.269528  -0.325  0.74526    
## Education4          -0.366691   0.266583  -1.376  0.16897    
## Education5          -0.513985   0.268264  -1.916  0.05537 .  
## ExistingLiabilities -0.014910   0.007474  -1.995  0.04607 *  
## HomeOwnershipType2   0.046811   0.074251   0.630  0.52841    
## HomeOwnershipType3   0.019163   0.080126   0.239  0.81098    
## HomeOwnershipType4  -0.040422   0.088840  -0.455  0.64911    
## HomeOwnershipType5   0.412402   0.191561   2.153  0.03133 *  
## HomeOwnershipType6  -0.050029   0.135264  -0.370  0.71149    
## HomeOwnershipType7  -0.351173   0.103841  -3.382  0.00072 ***
## HomeOwnershipType8  -0.217967   0.084712  -2.573  0.01008 *  
## HomeOwnershipType9  -0.610535   0.201174  -3.035  0.00241 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9504.9  on 7140  degrees of freedom
## Residual deviance: 9398.1  on 7124  degrees of freedom
## AIC: 9432.1
## 
## Number of Fisher Scoring iterations: 4

I tried other model specifications, but they were even worse.
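One way to make such comparisons concrete is via AIC; the reduced model below is only an illustration of the kind of comparison, not a specification the homework prescribes:

```r
# Candidate model without NrOfDependants, compared by AIC (lower is better)
model_small <- glm(default ~ Gender + Education + ExistingLiabilities + HomeOwnershipType,
                   data = loan_df, family = "binomial")
AIC(model, model_small)
```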


Produce a box plot, which shows predicted probabilities for defaulting loans and for non-defaulting loans.

ggplot(augment(model, loan_df, type.predict="response")) + 
  geom_boxplot(aes(x=default, y=.fitted))


Produce confusion matrix when using the model for classification with default cutoff.

augment(model, type.predict = "response") %>%
  mutate(
    truth = factor(default),
    prediction = factor(if_else(.fitted > 0.5, TRUE, FALSE))
  ) %>%
  conf_mat(truth, prediction)
##           Truth
## Prediction FALSE TRUE
##      FALSE  4347 2672
##      TRUE     59   63
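The matrix can also be summarised into the usual metrics with yardstick (a sketch; `cm` is just the confusion matrix above stored in a variable):

```r
cm <- augment(model, type.predict = "response") %>%
  mutate(
    truth = factor(default),
    prediction = factor(.fitted > 0.5)
  ) %>%
  conf_mat(truth, prediction)

# Note: sens/spec refer to the first factor level (FALSE)
# unless event_level = "second" is passed to the metric functions
summary(cm)
```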


Can defaulting loans be accurately predicted?

No. With the default cutoff, only 63 of the 2735 defaulting loans (about 2.3%) are classified correctly, so defaulting loans cannot be predicted accurately from these variables.


LDA

Fit LDA model to the data using all variables as numeric variables and produce box plot of probabilities of default for defaulting loans and non-defaulting loans.

(predicted probabilities for classes can be obtained by predict(model)$posterior)

model_LDA <- MASS::lda(default ~ ., data = loan_df_num)

# Posterior probability of the TRUE (default) class for each observation
pred <- predict(model_LDA)$posterior[, "TRUE"]

ggplot(loan_df_num, aes(x = factor(default), y = pred)) + geom_boxplot()


Compute confusion matrix for predictions of LDA model


loan_df_num %>% 
  mutate(
    truth = factor(default),
    prediction = predict(model_LDA)$class
  ) %>% 
  conf_mat(truth, prediction)
##           Truth
## Prediction FALSE TRUE
##      FALSE  4385 2719
##      TRUE     21   16