About
The homework is about fitting linear models to data and about using classical methods for classification.
It uses real data from https://www.bondora.com/en/public-reports (only selected variables and a reduced number of observations).
Libraries
The packages tidyverse, broom and tidymodels are loaded.
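A minimal setup chunk, assuming the packages are installed:

```r
# Load the packages used throughout the homework
library(tidyverse)   # read_delim(), dplyr, ggplot2
library(broom)       # glance(), tidy(), augment()
library(tidymodels)  # conf_mat() and other yardstick metrics
```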
Task 1
The aim is to form a model for predicting the variable EAD1, which describes the loss from a loan in the case of default.
Data
Use the data from Loss.csv.
loss_df <- read_delim("Loss.csv", delim=",")
## Rows: 4982 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Gender, NrOfDependants, Education, ExistingLiabilities, EAD1, HomeO...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(loss_df)
## # A tibble: 6 × 6
## Gender NrOfDependants Education ExistingLiabilities EAD1 HomeOwnershipType
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 4 8 1252. 1
## 2 0 1 4 7 6723. 8
## 3 0 0 2 1 546. 4
## 4 1 0 5 9 496. 1
## 5 1 0 2 2 498. 3
## 6 0 0 4 9 602. 2
Info about variables
- Gender - 0 Male, 1 Woman, 2 Undefined
- NrOfDependants - Number of children or other dependants
- Education - 1 Primary education, 2 Basic education, 3 Vocational education, 4 Secondary education, 5 Higher education
- ExistingLiabilities - Borrower’s number of existing liabilities
- HomeOwnershipType - 0 Homeless, 1 Owner, 2 Living with parents, 3 Tenant pre-furnished property, 4 Tenant unfurnished property, 5 Council house, 6 Joint tenant, 7 Joint ownership, 8 Mortgage, 9 Owner with encumbrance, 10 Other
- EAD1 - Exposure at default, outstanding principal at default
Note that the variables Education, Gender and HomeOwnershipType are actually nominal variables encoded by numbers, so transform them to factor variables before fitting models.
loss_df <- loss_df %>%
mutate(
Education = as.factor(Education),
Gender = as.factor(Gender),
HomeOwnershipType = as.factor(HomeOwnershipType)
)
head(loss_df)
## # A tibble: 6 × 6
## Gender NrOfDependants Education ExistingLiabilities EAD1 HomeOwnershipType
## <fct> <dbl> <fct> <dbl> <dbl> <fct>
## 1 1 0 4 8 1252. 1
## 2 0 1 4 7 6723. 8
## 3 0 0 2 1 546. 4
## 4 1 0 5 9 496. 1
## 5 1 0 2 2 498. 3
## 6 0 0 4 9 602. 2
summary(loss_df)
## Gender NrOfDependants Education ExistingLiabilities EAD1
## 0:2939 Min. :0.0000 1: 309 Min. : 0.000 Min. : 0
## 1:1564 1st Qu.:0.0000 2: 405 1st Qu.: 1.000 1st Qu.: 934
## 2: 479 Median :0.0000 3:1073 Median : 2.000 Median : 2008
## Mean :0.7102 4:1978 Mean : 3.395 Mean : 2425
## 3rd Qu.:1.0000 5:1217 3rd Qu.: 5.000 3rd Qu.: 3107
## Max. :8.0000 Max. :30.000 Max. :10630
##
## HomeOwnershipType
## 1 :1803
## 2 : 918
## 3 : 861
## 8 : 562
## 4 : 281
## 7 : 177
## (Other): 380
Model
Fit a model with all predictors for EAD1.
model <- lm(EAD1 ~ ., data = loss_df)
Analyze the fit - are the assumptions underlying the fit statistics satisfied?
summary(model)
##
## Call:
## lm(formula = EAD1 ~ ., data = loss_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3453.5 -1415.0 -454.4 704.4 8667.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2184.736 128.192 17.043 < 2e-16 ***
## Gender1 -38.785 64.360 -0.603 0.546784
## Gender2 -640.611 104.171 -6.150 8.38e-10 ***
## NrOfDependants 85.202 30.470 2.796 0.005189 **
## Education2 39.434 156.473 0.252 0.801037
## Education3 279.518 132.888 2.103 0.035480 *
## Education4 31.032 125.882 0.247 0.805295
## Education5 504.239 131.099 3.846 0.000121 ***
## ExistingLiabilities 44.899 8.456 5.310 1.15e-07 ***
## HomeOwnershipType2 -524.865 84.877 -6.184 6.76e-10 ***
## HomeOwnershipType3 -40.083 84.722 -0.473 0.636152
## HomeOwnershipType4 69.635 131.352 0.530 0.596040
## HomeOwnershipType5 -230.445 241.929 -0.953 0.340873
## HomeOwnershipType6 -231.806 206.508 -1.123 0.261702
## HomeOwnershipType7 -177.065 161.014 -1.100 0.271520
## HomeOwnershipType8 216.127 99.340 2.176 0.029630 *
## HomeOwnershipType9 363.909 314.158 1.158 0.246772
## HomeOwnershipType10 -342.993 169.637 -2.022 0.043237 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2031 on 4964 degrees of freedom
## Multiple R-squared: 0.0446, Adjusted R-squared: 0.04133
## F-statistic: 13.63 on 17 and 4964 DF, p-value: < 2.2e-16
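Before trusting these p-values, the usual linear-model assumptions (linearity, constant variance, approximately normal residuals) should be checked visually. A standard sketch with base-R diagnostics:

```r
# Residuals vs fitted, normal QQ-plot, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
```

A clear funnel shape in the residuals-vs-fitted plot or heavy tails in the QQ-plot would cast doubt on the reported standard errors and p-values.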
If we consider variables with p-values smaller than 0.05 to be significant, then both numeric variables (NrOfDependants, ExistingLiabilities) are significant.
There are also significant differences between Gender = 0 (Male) and Gender = 2 (Undefined), between Education = 1 and Education = 3 or 5, and for some HomeOwnershipType levels.
Accuracy
What can you say about the accuracy of the model compared to predicting just the average loss for all loans?
knitr::kable(glance(model))
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.0446004 | 0.0413285 | 2031.249 | 13.63128 | 0 | 17 | -45005.07 | 90048.14 | 90171.9 | 20481336507 | 4964 | 4982 |
The model's predictions are only very slightly better than predicting the average (adjusted R2 is 0.041), so the model is of little use for prediction.
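To make the comparison concrete, the model's RMSE can be put next to the error of the trivial "predict the mean" baseline; a sketch, assuming `model` and `loss_df` from above:

```r
# RMSE of the fitted model vs. always predicting mean(EAD1)
rmse_model    <- sqrt(mean(residuals(model)^2))
rmse_baseline <- sqrt(mean((loss_df$EAD1 - mean(loss_df$EAD1))^2))
c(model = rmse_model, baseline = rmse_baseline)
```

With an adjusted R2 of about 0.04, the two numbers are expected to be very close.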
Interaction
Is it a good idea to add the interaction of Gender and Education to the model? Why?
boxplot(EAD1 ~ Gender * Education, data=loss_df)
It isn’t a good idea, because the boxplots do not show large differences between genders within each education level.
We would also need enough data in each of the 15 Gender × Education groups.
loss_df %>%
group_by(Gender, Education) %>%
count()
## # A tibble: 15 × 3
## # Groups: Gender, Education [15]
## Gender Education n
## <fct> <fct> <int>
## 1 0 1 186
## 2 0 2 238
## 3 0 3 716
## 4 0 4 1176
## 5 0 5 623
## 6 1 1 73
## 7 1 2 157
## 8 1 3 330
## 9 1 4 621
## 10 1 5 383
## 11 2 1 50
## 12 2 2 10
## 13 2 3 27
## 14 2 4 181
## 15 2 5 211
But we can try.
model2 <- lm(EAD1 ~ Gender * Education + NrOfDependants + ExistingLiabilities + HomeOwnershipType, data = loss_df)
knitr::kable(glance(model2))
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.0480645 | 0.0432626 | 2029.199 | 10.00941 | 0 | 25 | -44996.02 | 90046.05 | 90221.91 | 20407074192 | 4956 | 4982 |
knitr::kable(tidy(model2))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2097.04836 | 158.330392 | 13.2447621 | 0.0000000 |
Gender1 | 17.53192 | 281.325504 | 0.0623190 | 0.9503113 |
Gender2 | -145.55828 | 323.961538 | -0.4493073 | 0.6532296 |
Education2 | 394.24928 | 200.719959 | 1.9641758 | 0.0495655 |
Education3 | 349.96641 | 168.292058 | 2.0795183 | 0.0376210 |
Education4 | 73.76601 | 161.543392 | 0.4566328 | 0.6479550 |
Education5 | 654.70059 | 170.938478 | 3.8300364 | 0.0001297 |
NrOfDependants | 80.50834 | 31.260590 | 2.5753940 | 0.0100413 |
ExistingLiabilities | 44.25847 | 8.511936 | 5.1995766 | 0.0000002 |
HomeOwnershipType2 | -531.70611 | 84.989223 | -6.2561592 | 0.0000000 |
HomeOwnershipType3 | -41.76838 | 84.770693 | -0.4927219 | 0.6222309 |
HomeOwnershipType4 | 76.27188 | 131.429467 | 0.5803256 | 0.5617215 |
HomeOwnershipType5 | -251.41838 | 242.090333 | -1.0385313 | 0.2990735 |
HomeOwnershipType6 | -250.40675 | 206.424058 | -1.2130696 | 0.2251610 |
HomeOwnershipType7 | -195.08660 | 161.143593 | -1.2106383 | 0.2260918 |
HomeOwnershipType8 | 210.45681 | 99.379252 | 2.1177137 | 0.0342490 |
HomeOwnershipType9 | 385.07624 | 314.691139 | 1.2236641 | 0.2211372 |
HomeOwnershipType10 | -322.17508 | 169.871938 | -1.8965762 | 0.0579420 |
Gender1:Education2 | -729.96131 | 351.715796 | -2.0754294 | 0.0379982 |
Gender2:Education2 | -345.28324 | 731.268685 | -0.4721701 | 0.6368261 |
Gender1:Education3 | 20.97986 | 312.236753 | 0.0671921 | 0.9464315 |
Gender2:Education3 | -493.00340 | 514.821001 | -0.9576210 | 0.3383006 |
Gender1:Education4 | 69.15368 | 299.663918 | 0.2307708 | 0.8175024 |
Gender2:Education4 | -339.81263 | 363.156140 | -0.9357205 | 0.3494627 |
Gender1:Education5 | -92.00894 | 311.730014 | -0.2951559 | 0.7678872 |
Gender2:Education5 | -749.38420 | 362.881355 | -2.0650942 | 0.0389660 |
Adjusted R2 rises only from 0.041 to 0.043, which is not a meaningful improvement.
Moreover, the interaction terms Gender1:Education2 … Gender2:Education5 have very large standard errors and p-values.
I think that removing some outliers from the data set (for example, rows with NrOfDependants >= 6 or ExistingLiabilities >= 20) could improve the linear model more than this interaction.
ggplot(loss_df, aes(NrOfDependants, EAD1)) + geom_point()
ggplot(loss_df, aes(ExistingLiabilities, EAD1)) + geom_point()
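A sketch of the suggested outlier removal and refit; the thresholds are the ones proposed above, not tuned:

```r
# Drop the few extreme observations and refit the full model
loss_trim <- loss_df %>%
  filter(NrOfDependants < 6, ExistingLiabilities < 20)
model_trim <- lm(EAD1 ~ ., data = loss_trim)
glance(model_trim)
```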
It also seems that merging several HomeOwnershipType levels could be useful, because some of them are quite similar.
ggplot(loss_df, aes(HomeOwnershipType, EAD1)) + geom_boxplot()
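Merging levels could be sketched with forcats::fct_collapse(); the grouping below is hypothetical and would need to be justified by the boxplots:

```r
# Hypothetical grouping of HomeOwnershipType levels into broader classes;
# the actual grouping should follow the boxplot above
loss_grouped <- loss_df %>%
  mutate(HomeGroup = fct_collapse(HomeOwnershipType,
    owner  = c("1", "7", "8", "9"),
    tenant = c("3", "4", "5", "6"),
    other_level = "other"
  ))
model_grouped <- lm(EAD1 ~ Gender + NrOfDependants + Education +
                      ExistingLiabilities + HomeGroup, data = loss_grouped)
glance(model_grouped)
```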
Task 2
Use the data from Loan.csv.
The aim is to predict the probability of default for new applications.
loan_df_num <- read_delim("Loan.csv", delim=",")
## Rows: 7141 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): Gender, NrOfDependants, Education, ExistingLiabilities, HomeOwnersh...
## lgl (1): default
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(loan_df_num)
## # A tibble: 6 × 6
## Gender NrOfDependants Education ExistingLiabilities HomeOwnershipType default
## <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 1 0 4 8 1 TRUE
## 2 0 1 4 7 8 TRUE
## 3 0 0 2 1 4 TRUE
## 4 0 1 4 9 6 FALSE
## 5 0 0 4 7 3 FALSE
## 6 1 0 4 7 9 FALSE
Note that again Education, Gender and HomeOwnershipType are actually nominal variables.
loan_df <- loan_df_num %>%
mutate(
Education = as.factor(Education),
Gender = as.factor(Gender),
HomeOwnershipType = as.factor(HomeOwnershipType)
)
Logistic regression model
Fit the best logistic regression model you can find in a reasonable amount of time and describe what you tried and why the final model is the best of those you compared.
model <- glm(default ~ ., data = loan_df, family = "binomial")
summary(model)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = loan_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3764 -0.9881 -0.8944 1.3319 1.8017
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.031356 0.268926 -0.117 0.90718
## Gender1 -0.089752 0.050699 -1.770 0.07668 .
## Gender2 0.443773 0.167528 2.649 0.00807 **
## NrOfDependants 0.035874 0.025329 1.416 0.15669
## Education2 -0.132447 0.273110 -0.485 0.62771
## Education3 -0.087568 0.269528 -0.325 0.74526
## Education4 -0.366691 0.266583 -1.376 0.16897
## Education5 -0.513985 0.268264 -1.916 0.05537 .
## ExistingLiabilities -0.014910 0.007474 -1.995 0.04607 *
## HomeOwnershipType2 0.046811 0.074251 0.630 0.52841
## HomeOwnershipType3 0.019163 0.080126 0.239 0.81098
## HomeOwnershipType4 -0.040422 0.088840 -0.455 0.64911
## HomeOwnershipType5 0.412402 0.191561 2.153 0.03133 *
## HomeOwnershipType6 -0.050029 0.135264 -0.370 0.71149
## HomeOwnershipType7 -0.351173 0.103841 -3.382 0.00072 ***
## HomeOwnershipType8 -0.217967 0.084712 -2.573 0.01008 *
## HomeOwnershipType9 -0.610535 0.201174 -3.035 0.00241 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9504.9 on 7140 degrees of freedom
## Residual deviance: 9398.1 on 7124 degrees of freedom
## AIC: 9432.1
##
## Number of Fisher Scoring iterations: 4
I also tried other models, but they performed even worse than this one.
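Model comparison can be partly automated by AIC; a sketch using stepwise selection starting from the full model:

```r
# Stepwise selection by AIC on the full logistic regression model
model_step <- step(model, direction = "both", trace = FALSE)
AIC(model, model_step)
```

If `step()` keeps all predictors, that supports using the full model as the final one.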
Produce a box plot, which shows predicted probabilities for defaulting loans and for non-defaulting loans.
ggplot(augment(model, loan_df, type.predict="response")) +
geom_boxplot(aes(x=default, y=.fitted))
Produce confusion matrix when using the model for classification with default cutoff.
augment(model, type.predict = "response") %>%
mutate(
truth = factor(default),
prediction = factor(if_else(.fitted > 0.5, TRUE, FALSE))
) %>%
conf_mat(truth, prediction)
## Truth
## Prediction FALSE TRUE
## FALSE 4347 2672
## TRUE 59 63
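The counts in the matrix can be turned into standard classification metrics with yardstick; a sketch reusing the pipeline above:

```r
# Accuracy, sensitivity, specificity etc. from the same predictions
augment(model, type.predict = "response") %>%
  mutate(
    truth = factor(default),
    prediction = factor(.fitted > 0.5)
  ) %>%
  conf_mat(truth, prediction) %>%
  summary()
```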
Can defaulting loans be accurately predicted?
No. With the default cutoff the model identifies only 63 of the 2735 defaulting loans (sensitivity about 2.3%), and its overall accuracy, (4347 + 63)/7141 ≈ 61.7%, is essentially the same as always predicting "no default" (4406/7141 ≈ 61.7%).
LDA
Fit an LDA model to the data using all variables as numeric variables and produce a box plot of the probabilities of default for defaulting and non-defaulting loans.
(Predicted probabilities for the classes can be obtained via predict(model)$posterior.)
model_LDA <- MASS::lda(default ~ ., data = loan_df_num)
# predict()$posterior is a matrix with one column per class;
# take the column of probabilities for default = TRUE
pred <- predict(model_LDA)$posterior[, "TRUE"]
ggplot(data = loan_df_num) + geom_boxplot(aes(factor(default), pred))
Compute confusion matrix for predictions of LDA model
loan_df_num %>%
mutate(
truth = factor(default),
prediction = predict(model_LDA)$class
) %>%
conf_mat(truth, prediction)
## Truth
## Prediction FALSE TRUE
## FALSE 4385 2719
## TRUE 21 16