The Basel II/III risk models are must-know material for credit risk practitioners in the banking industry. For the two key types of Basel II risk models, the Probability of Default (PD) model and the Loss Given Default (LGD) model, several statistical measures are used to validate stability, performance, and calibration.

In general, the model validation measures can be grouped into the following three categories:

Category | Description |
---|---|
Model Stability | Tracks the change in distribution between the modeling data and the scoring data. |
Model Performance | Measures the ability of a model to discriminate between customers with accounts that have defaulted and customers with accounts that have not defaulted. The score difference between non-default and default accounts helps to determine the required cutoff score, which in turn helps to predict whether a credit exposure is a default account. Also measures the relationship between the actual default probability and the predicted default probability, which helps you understand the performance of a model over a time period. |
Model Calibration | Checks the accuracy of the PD and LGD models by comparing the correct quantification of the risk components with the available standards. |

The user’s guide for SAS Model Manager provides a comprehensive list of the statistical measures included in its Basel II reports. Although these features are available only to SAS users who subscribe to the product, I find the list a good reference for newcomers who may be struggling to pick the “right” statistical measure. The tables below are copied from the SAS support site. Each one gives the name of the statistical measure, a short description, and whether the measure appears in the PD report or the LGD report, which is a useful guide when making our own choices. I will spend some time adding more detailed descriptions and will also include example implementations in multiple programming languages.

The following tables describe the statistical tests that cover the measurement of model stability, model performance and model calibration.

## Model Stability Measure

The following table describes the model stability measure that is used to create the PD and LGD reports.

Measure | Description | PD Report | LGD Report |
---|---|---|---|
System Stability Index (SSI) | SSI monitors the score distribution over a time period. | Yes | Yes |
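
The table does not give a formula for SSI. The sketch below assumes it is computed like the widely used population stability index: the sum over score bands of `(actual - expected) * ln(actual / expected)`, where "expected" is the share of accounts in a band in the modeling data and "actual" is the share in the scoring data. The function name, the `eps` floor, and the sample counts are illustrative assumptions, not part of the SAS product.

```python
import math

def stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI-style stability index over pre-binned score counts (assumed formula)."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both samples must use the same score bands")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    index = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor each proportion so an empty band does not produce log(0).
        p_expected = max(expected / expected_total, eps)
        p_actual = max(actual / actual_total, eps)
        index += (p_actual - p_expected) * math.log(p_actual / p_expected)
    return index

# Hypothetical account counts per score band.
modeling = [100, 200, 400, 200, 100]  # development (modeling) sample
scoring = [120, 210, 380, 190, 100]   # recent (scoring) sample
print(round(stability_index(modeling, scoring), 4))
```

Identical distributions give an index of zero, and the index grows as the scoring population drifts away from the modeling population. A commonly quoted rule of thumb treats values below about 0.1 as stable and values above about 0.25 as a significant shift, but those thresholds are conventions rather than anything stated in the table above.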

## Model Performance Measures and Statistics

The following table describes the model performance measures that are used to create the PD and LGD reports.

Measure | Description | PD Report | LGD Report |
---|---|---|---|
Accuracy | Accuracy is the proportion of the total number of predictions that were correct. | Yes | No |
Accuracy Ratio (AR) | AR is the summary index of the Cumulative Accuracy Profile (CAP) and is also known as the Gini coefficient. It shows the performance of the model that is being evaluated by depicting the percentage of defaulted accounts that are captured by the model across different scores. | Yes | Yes |
Area Under Curve (AUC) | AUC can be interpreted as the average ability of the rating model to accurately classify non-default accounts and default accounts. It represents the discrimination between the two populations. A higher area denotes higher discrimination. When AUC is 0.5, non-default accounts and default accounts are randomly classified, and when AUC is 1, the scoring model accurately classifies non-default accounts and default accounts. Thus, for a model that performs no worse than random, the AUC ranges between 0.5 and 1. | Yes | No |
Bayesian Error Rate (BER) | BER is the proportion of the whole sample that is misclassified when the rating system is in optimal use. For a perfect rating model, the BER has a value of zero. A model's BER depends on the probability of default. The lower the BER, and the lower the classification error, the better the model. | Yes | No |
D Statistic | The D Statistic is the mean difference of scores between default accounts and non-default accounts, weighted by the relative distribution of those scores. | Yes | No |
Error Rate | The Error Rate is the proportion of the total number of predictions that were incorrect. | Yes | No |
Information Statistic (I) | The Information Statistic value is a weighted sum of the difference between conditional default and conditional non-default rates. The higher the value, the more likely a model can predict a default account. | Yes | No |
Kendall’s Tau-b | Kendall's tau-b is a nonparametric measure of association based on the number of concordances and discordances in paired observations. Kendall's tau values range between -1 and +1, with a positive association indicating that the ranks of both variables increase together. A negative association indicates that as the rank of one variable increases, the rank of the other variable decreases. | Yes | No |
Kullback-Leibler Statistic (KL) | KL is a non-symmetric measure of the difference between the distributions of default accounts and non-default accounts. This score has similar properties to the information value. | Yes | No |
Kolmogorov-Smirnov Statistic (KS) | KS is the maximum distance between two population distributions. This statistic helps discriminate default accounts from non-default accounts. It is also used to determine the best cutoff in application scoring. The best cutoff maximizes KS, which becomes the best differentiator between the two populations. The KS value can range between 0 and 1, where 1 implies that the model is perfectly accurate in predicting default accounts or separating the two populations. A higher KS denotes a better model. | Yes | No |
1–PH Statistic (1–PH) | 1-PH is the percentage of cumulative non-default accounts for the cumulative 50% of the default accounts. | Yes | No |
Mean Square Error (MSE), Mean Absolute Deviation (MAD), and Mean Absolute Percent Error (MAPE) | MSE, MAD, and MAPE are generated for LGD reports. These statistics measure the differences between the actual LGD and predicted LGD. | No | Yes |
Pietra Index | The Pietra Index is a summary index of Receiver Operating Characteristic (ROC) statistics because the Pietra Index is defined as the maximum area of a triangle that can be inscribed between the ROC curve and the diagonal of the unit square. The Pietra Index can take values between 0 and 0.353. As a rating model's performance improves, the value is closer to 0.353. This expression is interpreted as the maximum difference between the cumulative frequency distributions of default accounts and non-default accounts. | Yes | No |
Precision | Precision is the proportion of predicted default accounts that actually defaulted. | Yes | No |
Sensitivity | Sensitivity is the ability to correctly classify default accounts that have actually defaulted. | Yes | No |
Somers’ D (p-value) | Somers' D is a nonparametric measure of association that is based on the number of concordances and discordances in paired observations. It is an asymmetric modification of Kendall's tau. Somers' D differs from Kendall’s tau in that it uses a correction only for pairs that are tied on the independent variable. Values range between -1 and +1. A positive association indicates that the ranks for both variables increase together. A negative association indicates that as the rank of one variable increases, the rank of the other variable decreases. | Yes | No |
Specificity | Specificity is the ability to correctly classify non-default accounts that have not defaulted. | Yes | No |
Validation Score | The Validation Score is the average scaled value of seven distance measures, anchored to a scale of 1 to 13, lowest to highest. The seven measures are the mean difference (D), the percentage of cumulative non-default accounts for the cumulative 50% of the default accounts (1-PH), the maximum deviation (KS), the Gini coefficient (G), the Information Statistic (I), the Area Under the Curve (AUC), or Receiver Operating Characteristic (ROC) statistic, and the Kullback-Leibler statistic (KL). | Yes | No |
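
To make three of the discrimination measures above concrete, here is a minimal pure-Python sketch of AUC, AR, and KS. It assumes the score is a predicted PD, so a higher score means a riskier account, and it uses the standard identity AR = 2 × AUC − 1. The function names and the sample scores are hypothetical, and the pairwise loop is quadratic, which is fine for illustration but not for large portfolios.

```python
def auc(default_scores, nondefault_scores):
    """Share of (default, non-default) pairs ranked correctly; ties count half."""
    wins = 0.0
    for d in default_scores:
        for n in nondefault_scores:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(default_scores) * len(nondefault_scores))

def accuracy_ratio(default_scores, nondefault_scores):
    """Gini coefficient derived from AUC: AR = 2 * AUC - 1."""
    return 2.0 * auc(default_scores, nondefault_scores) - 1.0

def ks_statistic(default_scores, nondefault_scores):
    """Largest gap between the two empirical cumulative score distributions."""
    largest_gap = 0.0
    for threshold in sorted(set(default_scores) | set(nondefault_scores)):
        cdf_default = sum(s <= threshold for s in default_scores) / len(default_scores)
        cdf_nondefault = sum(s <= threshold for s in nondefault_scores) / len(nondefault_scores)
        largest_gap = max(largest_gap, abs(cdf_default - cdf_nondefault))
    return largest_gap

# Hypothetical predicted PDs for accounts that did and did not default.
defaults = [0.9, 0.8, 0.6, 0.4]
nondefaults = [0.7, 0.5, 0.3, 0.2, 0.1]
print(auc(defaults, nondefaults))
print(accuracy_ratio(defaults, nondefaults))
print(ks_statistic(defaults, nondefaults))
```

The threshold at which the KS gap is largest is the candidate cutoff score that the KS row refers to.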

## Model Calibration Measures and Tests

The following table describes the model calibration measures and tests that are used to create the PD and LGD reports.

Measure | Description | PD Report | LGD Report |
---|---|---|---|
Binomial Test | The Binomial Test evaluates whether the PD of a pool is underestimated. If the number of default accounts per pool exceeds either the low limit (binomial test at 0.95 confidence) or high limit (binomial test at 0.99 confidence), the test suggests that the model is poorly calibrated. | Yes | No |
Brier Skill Score (BSS) | BSS measures the accuracy of probability assessments at the account level. It measures the average squared deviation between predicted probabilities for a set of events and their outcomes. Therefore, a lower score represents a higher accuracy. | Yes | No |
Confidence Interval | The Confidence Interval indicates the confidence interval band of the PD or LGD for a pool. The Probability of Default report compares the actual and estimated PD rates with the CI limit of the estimate. If the estimated PD lies within the CI limits of the actual PD, the PD model performs better in estimating actual outcomes. For the Loss Given Default (LGD) report, confidence intervals are based on the pool-level average of the estimated LGD, plus or minus the pool-level standard deviation multiplied by the 1-(alpha/2) quantile of the standard normal distribution. | Yes | Yes |
Correlation Analysis | The model validation report for LGD provides a correlation analysis of the estimated LGD with the actual LGD. This correlation analysis is an important measure of a model’s usefulness. Pearson correlation coefficients are provided at the pool and overall levels for each time period that is examined. | No | Yes |
Hosmer-Lemeshow Test (p-value) | The Hosmer-Lemeshow test is a statistical test for goodness-of-fit for classification models. The test assesses whether the observed event rates match the expected event rates in pools. Models for which expected and observed event rates in pools are similar are well calibrated. The p-value of this test is a measure of the accuracy of the estimated default probabilities. The closer the p-value is to zero, the poorer the calibration of the model. | Yes | No |
Mean Absolute Deviation (MAD) | MAD is the distance between the account-level estimated and actual LGD, averaged at the pool level. | No | Yes |
Mean Absolute Percent Error (MAPE) | MAPE is the absolute value of the account-level difference between the estimated and the actual LGD, divided by the estimated LGD, and averaged at the pool level. | No | Yes |
Mean Squared Error (MSE) | MSE is the squared distance between the account-level estimated and actual LGD, averaged at the pool level. | No | Yes |
Normal Test | The Normal Test compares the normalized difference of predicted and actual default rates per pool with two limits estimated over multiple observation periods. This test measures the pool stability over time. If a majority of the pools lie in the rejection region, to the right of the limits, then the pooling strategy should be revisited. | Yes | No |
Observed versus Estimated Index | The observed versus estimated index is a measure of closeness of the observed and estimated default rates. It measures the model's ability to predict default rates. The closer the index is to zero, the better the model performs in predicting default rates. | Yes | No |
Traffic Lights Test | The Traffic Lights Test evaluates whether the PD of a pool is underestimated, but unlike the binomial test, it does not assume that cross-pool performance is statistically independent. If the number of default accounts per pool exceeds either the low limit (Traffic Lights Test at 0.95 confidence) or high limit (Traffic Lights Test at 0.99 confidence), the test suggests the model is poorly calibrated. | Yes | No |
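
As an illustration of the Binomial Test row, the sketch below derives the low (0.95) and high (0.99) limits for a single pool and checks an observed default count against them. It assumes defaults within the pool are independent and follow a Binomial(n, PD) distribution, and it takes each limit to be the smallest default count whose cumulative probability reaches the confidence level. The function names and the example pool are hypothetical, and SAS may define the limits slightly differently.

```python
import math

def binomial_critical_value(n_accounts, pd_estimate, confidence):
    """Smallest k with P(defaults <= k) >= confidence under Binomial(n, PD)."""
    cumulative = 0.0
    for k in range(n_accounts + 1):
        cumulative += (math.comb(n_accounts, k)
                       * pd_estimate ** k
                       * (1.0 - pd_estimate) ** (n_accounts - k))
        if cumulative >= confidence:
            return k
    return n_accounts

def binomial_test(n_accounts, n_defaults, pd_estimate):
    """Compare observed defaults in one pool with the 0.95 and 0.99 limits."""
    low_limit = binomial_critical_value(n_accounts, pd_estimate, 0.95)
    high_limit = binomial_critical_value(n_accounts, pd_estimate, 0.99)
    return {
        "low_limit": low_limit,
        "high_limit": high_limit,
        "exceeds_low": n_defaults > low_limit,
        "exceeds_high": n_defaults > high_limit,
    }

# Hypothetical pool: 100 accounts, estimated PD of 2%, 6 observed defaults.
result = binomial_test(n_accounts=100, n_defaults=6, pd_estimate=0.02)
print(result)
```

A pool whose observed defaults exceed the low limit but not the high limit is a warning sign, while exceeding the high limit is stronger evidence that the PD for that pool is underestimated.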

*Reference: SAS(R) Model Manager 12.1: User’s Guide*