Journal of Risk Model Validation
ISSN: 1753-9579 (print), 1753-9587 (online)
Editor-in-chief: Steve Satchell
Need to know
- This paper introduces several methods to calculate the sample variance in the accuracy ratio and the area under the curve.
- The first method is based on numerical integration and gives the best estimate of the sample variance;
- The method based on assuming normally distributed scores and the method that is free from score distribution assumptions give reasonable estimates of the sample variance;
- The accuracy ratio and the area under the curve are normally distributed.
Abstract
The receiver operating curve and the cumulative accuracy profile visualize the ability of a credit scoring model to distinguish defaulting from nondefaulting counterparties. These curves lead to performance metrics such as the accuracy ratio and the area under the curve. Since these performance metrics are sample properties, we cannot draw firm conclusions on the model performance without knowing the sampling distribution or the sample variance. We present four methods to estimate the sample variance of the accuracy ratio and the area under the curve. The first method is based on numerical integration, the second and third methods assume specific score distributions, and the fourth method uses a correlation, leading to a distribution-independent equation for the sample variance. We demonstrate by simulations that the first method gives the best estimation of the sample variance. The distribution-independent equation gives reasonable estimations of the sample variance, but ignores higher-order effects that are distribution dependent.
1 Introduction
Recently, the European Central Bank (ECB) published its instructions for reporting the validation results of internal models for credit risk under the internal ratings-based approach (European Central Bank 2019). One of the key validation subjects in this reporting standard is the testing of the discriminatory power of credit ratings or scores. Discriminatory power is the ability to discriminate ex ante between defaulting and nondefaulting borrowers (Basel Committee on Banking Supervision 2005, Chapter 3). The analysis of discriminatory power aims to ensure that the ranking of customers by ratings or scores appropriately separates riskier and less risky customers.
The receiver operating curve (ROC) and the cumulative accuracy profile (CAP) are frequently used to visualize the discriminatory power of a credit scoring or credit rating model. The CAP depicts the cumulative percentage of all clients versus the cumulative percentage of defaulters. The ROC shows the performance of a credit scoring model at various score thresholds. This makes the ROC applicable in other disciplines such as machine learning, medical statistics, geosciences and biosciences (Krzanowski and Hand 2009). A numerical metric of discriminatory power is the area under the curve (AUC), which can be derived from the ROC and, indirectly, from the CAP. Another metric is the accuracy ratio (AR), which relates to the AUC as $\mathrm{AR} = 2\,\mathrm{AUC} - 1$, as explained in the next section.
The ECB instructions frequently refer to the AUC as a metric of discriminatory power. One of the tests in the instructions is the comparison between the AUC at the time of initial validation and the AUC at the end of the relevant observation period. These tests require the standard error in the observed AUC, which is greatly influenced by the number of defaults (Stein 2007). Data flaws, such as coupling information of a defaulted counterparty to a nondefault status, lead directly to an underestimated AUC (Stein 2016; Russell et al 2012). Further, metrics such as the AR and the AUC depend on the portfolio, and comparing these metrics over time or between different portfolios might be misleading. Several authors give the variance or standard error in the AUC in terms of the Mann–Whitney U-statistic (Engelmann et al 2003; Basel Committee on Banking Supervision 2005, Chapter 3; European Central Bank 2019). However, these calculations are cumbersome, as they require a score or rating comparison of every defaulting counterparty with every pair of nondefaulting counterparties (and vice versa). For a loan portfolio of 10 000 obligors and a default rate of 1%, this already leads to on the order of $10^9$ comparisons.
We deduce closed-form equations for the sample variance in the observed AR and AUC, starting in Section 2 with a brief description of the CAP and the ROC and their relation to metrics such as the AR and the AUC. Sections 3 and 4 introduce equations for the sample variance in the AR and the AUC, based on numerical integration and on assuming specific score distributions of the defaulting and nondefaulting counterparties. Section 5 presents an equation for the sample variance without specific distribution assumptions. Then, in Section 6, we demonstrate by simulations how the observed AR is distributed and which method provides the best estimation of the sample variance in the AR or AUC. Section 7 concludes.
2 Metrics of discriminatory power
This section describes the ROC and the CAP, also called the power curves (Engelmann et al 2003; Tasche 2010; Krzanowski and Hand 2009). To demonstrate these curves, we assume a credit scoring model, generating credit scores that increase with decreasing default risk: the higher the score $s$, the higher the credit quality as perceived by the lender and the lower the default probability. We also assume a continuous credit score of sufficient granularity such that the probability of obtaining ties can be neglected.
The CAP and the ROC are constructed by first arranging the counterparties by increasing score $s$, ie, from high default risk to low default risk. The ROC is a concave curve that results from plotting the cumulative percentage of defaulting counterparties versus the cumulative percentage of nondefaulting counterparties. Figure 1 shows ROCs for credit scoring systems with high, moderate or no discriminatory power. The more concave the ROC is, the better the discriminatory power of the underlying credit scores from which it is constructed. Defining $F_D(s)$ as the cumulative percentage of defaulting counterparties with a score lower than $s$, and $F_{ND}(s)$ as the cumulative percentage of nondefaulting counterparties with a score lower than $s$, the ROC follows from
$$\text{ROC} = \big\{\big(F_{ND}(s),\,F_D(s)\big) : -\infty < s < \infty\big\}. \qquad (2.1)$$
An alternative to the ROC is the CAP, which results from plotting the cumulative percentage of defaulting counterparties versus the cumulative percentage of counterparties. Given the portfolio default probability $D$, we can convert the ROC into the CAP by applying a linear transformation of the horizontal axis. Since the number of defaults is much lower than the number of nondefaults for a common credit portfolio, the CAP and the ROC have a similar concave shape. Both curves are monotonically increasing functions and visualize the discriminatory power of a credit scoring system, but they have their advantages and disadvantages.
- The CAP has the attractive property that its slope relates directly to the probability of default, which makes the CAP applicable for calibration of credit rating and scoring models (van der Burgt 2008, 2019).
- The ROC is useful in decision analysis, as it relates to the type I and type II errors: a type I error means that a counterparty defaults unexpectedly, and a type II error means that a counterparty does not default although a default was expected (Stein 2007; Tasche 2008). The area under the ROC gives the AUC and the AR directly. Further, as shown below, the area under the ROC has a simple interpretation, and the integration of $F_D$ and $F_{ND}$ gives the score probabilities of defaulting and nondefaulting counterparties.
The ROC and the CAP visualize the discriminatory power of the credit scoring model. In this paper we focus on the ROC, because it relates directly to score distributions of defaulting and nondefaulting counterparties. To show this, we first introduce the following equations, which are derived in Appendix A online:
$$P\big(S_{D,1},\ldots,S_{D,m} < S_{ND} < S_{D,m+1},\ldots,S_{D,m+n}\big) = \int_{-\infty}^{\infty} F_D(s)^m\,\big(1-F_D(s)\big)^n\,\mathrm{d}F_{ND}(s), \qquad (2.2)$$
$$P\big(S_{ND,1},\ldots,S_{ND,m} < S_{D} < S_{ND,m+1},\ldots,S_{ND,m+n}\big) = \int_{-\infty}^{\infty} F_{ND}(s)^m\,\big(1-F_{ND}(s)\big)^n\,\mathrm{d}F_{D}(s). \qquad (2.3)$$
Equation (2.2) gives the probability that the scores of $m$ defaulting counterparties are lower and the scores of $n$ defaulting counterparties are higher than the score of a nondefaulting counterparty. Equation (2.3) gives the probability that the scores of $m$ nondefaulting counterparties are lower and the scores of $n$ nondefaulting counterparties are higher than the score of a defaulting counterparty. Both probabilities follow directly from integrating the distributions $F_D$ and $F_{ND}$, which are used to construct the ROC. These equations are powerful in deriving performance metrics and their sample variance from the ROC. The area under the ROC, denoted by AUC, follows directly from (2.2) with $m = 1$ and $n = 0$:
$$\mathrm{AUC} = \int_{-\infty}^{\infty} F_D(s)\,\mathrm{d}F_{ND}(s) = P(S_D < S_{ND}). \qquad (2.4)$$
This equation shows that the AUC can be interpreted as the probability that the score of a defaulting counterparty is lower than the score of a nondefaulting counterparty (Krzanowski and Hand 2009). Equation (2.4) also suggests a closed form for the AUC when the distributions $F_D$ and $F_{ND}$ are known. It also shows the following property of the ROC (Krzanowski and Hand 2009).
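The two readings of (2.4) suggest two equally valid ways to estimate the AUC from data: compare the score of every default with the score of every nondefault, or integrate the empirical curve $F_D$ versus $F_{ND}$ with the trapezium rule. A minimal Python sketch (the function names and the synthetic scores are illustrative, not taken from the paper):

```python
import numpy as np

def auc_rank(scores_default, scores_nondefault):
    """AUC as P(S_D < S_ND): compare every default score with every nondefault score."""
    d = np.asarray(scores_default)[:, None]
    nd = np.asarray(scores_nondefault)[None, :]
    return float(np.mean(d < nd))

def auc_trapezium(scores_default, scores_nondefault):
    """AUC as the trapezium-rule integral of F_D dF_ND over a pooled score grid."""
    d, nd = np.sort(scores_default), np.sort(scores_nondefault)
    grid = np.sort(np.concatenate([d, nd]))
    f_d = np.searchsorted(d, grid, side="right") / len(d)      # empirical F_D
    f_nd = np.searchsorted(nd, grid, side="right") / len(nd)   # empirical F_ND
    return float(np.sum(0.5 * (f_d[1:] + f_d[:-1]) * np.diff(f_nd)))

rng = np.random.default_rng(0)
s_d = rng.normal(-0.5, 1.0, 200)    # defaults tend to receive lower scores
s_nd = rng.normal(0.5, 1.0, 5000)   # nondefaults tend to receive higher scores
print(auc_rank(s_d, s_nd), auc_trapezium(s_d, s_nd))  # the two estimates should agree closely
```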
Property 2.1.
The ROC remains invariant if the credit scores undergo a strictly monotonic transformation $h(s)$, since
$$P\big(h(S_D) < h(s)\big) = P(S_D < s) = F_D(s) \quad\text{and}\quad P\big(h(S_{ND}) < h(s)\big) = P(S_{ND} < s) = F_{ND}(s),$$
so the pairs $(F_{ND}(s), F_D(s))$ that make up the ROC remain unaffected under the monotonic transformation $h(s)$.
The AUC also leads to another numerical metric of discriminatory power, the accuracy ratio:
$$\mathrm{AR} = 2\,\mathrm{AUC} - 1. \qquad (2.5)$$
The accuracy ratio is sometimes called the Gini index (Krzanowski and Hand 2009) and can also be derived from the area under the CAP (Tasche 2010). Combining (2.4) and (2.5) gives the following interpretation of the AR:
$$\mathrm{AR} = P(S_D < S_{ND}) - P(S_{ND} < S_D), \qquad (2.6)$$
which is also shown in Appendix A online by applying (2.2) and (2.3). Figure 1 shows the ROC for several AUC and AR values and reveals that $0 \le \mathrm{AUC} \le 1$ and $-1 \le \mathrm{AR} \le 1$.
- (1) When the scores of all defaulting counterparties are lower than the scores of all nondefaulting counterparties, the credit scoring model perfectly discriminates between defaults and nondefaults, giving $\mathrm{AUC} = 1$ and $\mathrm{AR} = 1$. The ROC resembles the horizontal line $y = 1$, as shown in Figure 1.
- (2) If the credit scoring model has no discriminatory power, the number of nondefaulting counterparties increases proportionally with the number of defaulting counterparties and the ROC resembles the diagonal line $y = x$, as shown by the dashed line in Figure 1. It means that $\mathrm{AUC} = \tfrac{1}{2}$ and $\mathrm{AR} = 0$.
- (3) The dotted curve in Figure 1 shows an inversion: the credit scoring model discriminates between defaulting and nondefaulting counterparties, but it ranks the counterparties in terms of increasing risk rather than decreasing risk. In the case of a perfect inversion, we have $P(S_D < S_{ND}) = 0$, resulting in $\mathrm{AUC} = 0$ and $\mathrm{AR} = -1$.
The AR and the AUC measure the discriminatory power of a credit scoring model and are easy to interpret in terms of score probabilities of the defaulting and nondefaulting counterparties. But they are sample statistics for a portfolio or a sample of counterparties. We refer to these metrics as the sample AR and the sample AUC, denoted by $\mathrm{AR}_{\mathrm{obs}}$ and $\mathrm{AUC}_{\mathrm{obs}}$, respectively. Their values will vary from sample to sample, whereas the true AR and AUC are often unknown. We often assume that the sample AR and AUC sufficiently approximate the true AR and AUC for a large sample or portfolio. Often conclusions on model performance are based on these sample statistics, but the sample AR and sample AUC do not make sense if the uncertainty in these numbers is unknown. For example, Iyer et al (2015) state that a 0.01 improvement in the AUC is considered a noteworthy gain in the credit scoring industry. However, we can only perceive this as an improvement if the uncertainty in the AUC is less than 0.01. Ideally, we would like to know the distribution of all possible values of $\mathrm{AR}_{\mathrm{obs}}$ or $\mathrm{AUC}_{\mathrm{obs}}$ under random sampling, or at least to have an indication of the sample variance of these estimations.
Stein (2007) argues that the sample variance depends strongly on the number of defaults. This is supported by the theoretical upper bound of the variance in the observed AUC (Tasche 2010),
$$\sigma^2_{\mathrm{AUC}} \le \frac{\mathrm{AUC}\,(1-\mathrm{AUC})}{\min(N_D,\,N_{ND})}, \qquad (2.7)$$
with $N_D$ the number of defaults, $N_{ND}$ the number of nondefaults and AUC the true value, which is unknown. Equation (2.7) shows that the minority class (the class with the fewest observations) of a sample will influence the variance most dramatically (Stein 2007).
Equation (2.7) only provides an upper bound on the sample variance in the AUC. The early approaches to estimating the sample variance in the AUC rely on the relation between the AUC and the Mann–Whitney U-statistic. This relation gives the following result for the sample variance (Hanley and McNeil 1982; Cortes and Mohri 2004; Krzanowski and Hand 2009; Wu et al 2016):
$$\sigma^2_{\mathrm{AUC}} = \frac{\mathrm{AUC}(1-\mathrm{AUC}) + (N_D - 1)\big(P_{D,D,ND} - \mathrm{AUC}^2\big) + (N_{ND} - 1)\big(P_{ND,ND,D} - \mathrm{AUC}^2\big)}{N_D\,N_{ND}}, \qquad (2.8)$$
in which $P_{D,D,ND}$ is the probability that the credit score model ranks two randomly chosen defaults lower than a nondefault and $P_{ND,ND,D}$ is the probability that the credit score model ranks two randomly chosen nondefaults higher than a default. The estimation of these probabilities can be quite cumbersome for a large credit portfolio; for example, the estimation of $P_{D,D,ND}$ requires a comparison of the score of every nondefaulting counterparty with the scores of every pair of defaulting counterparties. Since there are $N_{ND}$ nondefaulting counterparties and $\tfrac{1}{2}N_D(N_D-1)$ pairs of defaulting counterparties, this gives $\tfrac{1}{2}N_{ND}N_D(N_D-1)$ comparisons. We also need $\tfrac{1}{2}N_D N_{ND}(N_{ND}-1)$ comparisons to calculate $P_{ND,ND,D}$. In the following sections, we present four alternatives we use to calculate the probabilities $P_{D,D,ND}$ and $P_{ND,ND,D}$, based on numerical integration or parametric methods.
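For a small sample, (2.8) can be evaluated by brute force, which also illustrates why the pairwise and triple comparisons become costly for realistic portfolios. A hedged Python sketch (function and variable names are illustrative; the sample is deliberately kept small so that the explicit comparisons stay tractable):

```python
import numpy as np
from itertools import combinations

def auc_variance_bruteforce(scores_default, scores_nondefault):
    """Sample variance of the AUC from (2.8), with P_{D,D,ND} and P_{ND,ND,D}
    estimated by explicit comparisons over all pairs of defaults / nondefaults."""
    d = np.asarray(scores_default)
    nd = np.asarray(scores_nondefault)
    n_d, n_nd = len(d), len(nd)
    auc = float(np.mean(d[:, None] < nd[None, :]))          # P(S_D < S_ND)
    # P_{D,D,ND}: two randomly chosen defaults both score below a nondefault
    p_ddn = float(np.mean([np.mean((nd > d1) & (nd > d2))
                           for d1, d2 in combinations(d, 2)]))
    # P_{ND,ND,D}: two randomly chosen nondefaults both score above a default
    p_nnd = float(np.mean([np.mean((d < n1) & (d < n2))
                           for n1, n2 in combinations(nd, 2)]))
    var_auc = (auc * (1 - auc)
               + (n_d - 1) * (p_ddn - auc ** 2)
               + (n_nd - 1) * (p_nnd - auc ** 2)) / (n_d * n_nd)
    return auc, var_auc

rng = np.random.default_rng(1)
print(auc_variance_bruteforce(rng.normal(-0.5, 1, 50), rng.normal(0.5, 1, 300)))
```

Even for 50 defaults and 300 nondefaults this already involves roughly 1225 + 44 850 pair comparisons, which motivates the alternatives presented below.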
3 Sample variance of the area under the curve based on numerical integration
Equation (2.8) is our starting point to calculate the sample variance. We can use (2.2) and (2.3) to derive the probabilities $P_{D,D,ND}$ and $P_{ND,ND,D}$ from the distributions $F_D$ and $F_{ND}$. Equation (2.2) gives the probability $P_{D,D,ND}$ for $m = 2$ and $n = 0$:
$$P_{D,D,ND} = \int_{-\infty}^{\infty} F_D(s)^2\,\mathrm{d}F_{ND}(s). \qquad (3.1)$$
In the same way, (2.3) gives the probability $P_{ND,ND,D}$ for $m = 0$ and $n = 2$:
$$P_{ND,ND,D} = \int_{-\infty}^{\infty} \big(1 - F_{ND}(s)\big)^2\,\mathrm{d}F_{D}(s). \qquad (3.2)$$
Equations (3.1) and (3.2) are important results: they show how the probabilities $P_{D,D,ND}$ and $P_{ND,ND,D}$, and therefore the sample variance $\sigma^2_{\mathrm{AUC}}$, can be derived from the underlying score distributions of defaulting and nondefaulting counterparties. In general, these distributions are unknown, but we can use numerical integration, eg, the trapezium rule, to calculate these probabilities. The areas under the curves $F_D^2$ (plotted against $F_{ND}$) and $(1-F_{ND})^2$ (plotted against $F_D$) give the probabilities $P_{D,D,ND}$ and $P_{ND,ND,D}$ directly, in the same way as the AUC gives the probability $P(S_D < S_{ND})$. We will refer to this method as the numerical integration (NI) method. If we know the analytical functions $F_D$ and $F_{ND}$ for the defaulting and nondefaulting counterparties, we can use these functions to calculate the probabilities from (3.1) and (3.2) directly. In the next section, we derive equations for the sample variance, assuming specific score distributions $F_D$ and $F_{ND}$.
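A sketch of the NI method in Python, under the same empirical-distribution conventions as the earlier sketch (names are illustrative): build $F_D$ and $F_{ND}$ on a pooled score grid, apply the trapezium rule to (2.4), (3.1) and (3.2), and plug the results into (2.8).

```python
import numpy as np

def trapezium(y, x):
    """Trapezium rule for the integral of y with respect to x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def ni_auc_variance(scores_default, scores_nondefault):
    """NI method: AUC, P_{D,D,ND} and P_{ND,ND,D} by numerical integration of the
    empirical score distributions, followed by the sample variance from (2.8)."""
    d, nd = np.sort(scores_default), np.sort(scores_nondefault)
    grid = np.sort(np.concatenate([d, nd]))
    f_d = np.searchsorted(d, grid, side="right") / len(d)     # empirical F_D on the grid
    f_nd = np.searchsorted(nd, grid, side="right") / len(nd)  # empirical F_ND on the grid
    auc = trapezium(f_d, f_nd)                 # (2.4): integral of F_D dF_ND
    p_ddn = trapezium(f_d ** 2, f_nd)          # (3.1): integral of F_D^2 dF_ND
    p_nnd = trapezium((1.0 - f_nd) ** 2, f_d)  # (3.2): integral of (1 - F_ND)^2 dF_D
    n_d, n_nd = len(d), len(nd)
    var_auc = (auc * (1 - auc)
               + (n_d - 1) * (p_ddn - auc ** 2)
               + (n_nd - 1) * (p_nnd - auc ** 2)) / (n_d * n_nd)
    return auc, p_ddn, p_nnd, var_auc
```

The cost is a single pass over the sorted scores rather than the pairwise comparisons of the previous sketch; the variance of the AR follows as four times the variance of the AUC.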
4 Sample variance of the area under the curve and accuracy ratio based on score distribution assumptions
In this section, we assume exponentially distributed scores that lead to the equation of Hanley and McNeil (1982) for the sample variance, and normally distributed scores that lead to the sample variance of the so-called binormal model (Krzanowski and Hand 2009). These distribution assumptions may be strong, but Property 2.1 states that the ROC is invariant under a monotonic transformation of the scores. Due to this transformation invariance, the equations derived in this section may also be applicable to other score distributions that can be transformed into normal or exponential distributions.
Hanley and McNeil (1982) introduced equations for $P_{D,D,ND}$ and $P_{ND,ND,D}$ in terms of the true (unobserved) AUC, assuming exponential score distributions. Here, we reproduce their results using (3.1) and (3.2), assuming that the scores of the defaulting and nondefaulting counterparties are exponentially distributed: $F_D(s) = 1 - \mathrm{e}^{-\lambda_D s}$ and $F_{ND}(s) = 1 - \mathrm{e}^{-\lambda_{ND} s}$ for $s \ge 0$ (with $\lambda_D > \lambda_{ND}$, since defaults tend to receive lower scores). Under these assumptions, (2.2) gives
$$\mathrm{AUC} = \int_{0}^{\infty} F_D(s)\,\mathrm{d}F_{ND}(s) = \frac{\lambda_D}{\lambda_D + \lambda_{ND}}. \qquad (4.1)$$
Using (3.1) and (3.2), the distributions give the following probabilities:
$$P_{D,D,ND} = \int_{0}^{\infty} F_D(s)^2\,\mathrm{d}F_{ND}(s) = \frac{2\,\mathrm{AUC}^2}{1 + \mathrm{AUC}}, \qquad (4.2)$$
$$P_{ND,ND,D} = \int_{0}^{\infty} \big(1 - F_{ND}(s)\big)^2\,\mathrm{d}F_{D}(s) = \frac{\mathrm{AUC}}{2 - \mathrm{AUC}}, \qquad (4.3)$$
where we used (4.1) to express the probabilities in the AUC. Hanley and McNeil (1982) proposed these probabilities to calculate the sample variance. Using these probabilities in (2.8) and combining with $\sigma^2_{\mathrm{AR}} = 4\,\sigma^2_{\mathrm{AUC}}$, which follows from $\mathrm{AR} = 2\,\mathrm{AUC} - 1$, we find the following equation for the variance in the sample AR:
$$\sigma^2_{\mathrm{AR}} = \frac{4}{N_D\,N_{ND}}\left[\mathrm{AUC}(1-\mathrm{AUC}) + (N_D-1)\left(\frac{2\,\mathrm{AUC}^2}{1+\mathrm{AUC}} - \mathrm{AUC}^2\right) + (N_{ND}-1)\left(\frac{\mathrm{AUC}}{2-\mathrm{AUC}} - \mathrm{AUC}^2\right)\right], \qquad (4.4)$$
with $\mathrm{AUC} = \tfrac{1}{2}(1 + \mathrm{AR})$.
We will refer to (4.4) as the Hanley–McNeil (HM) method.
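Because (4.4) is a closed form, the HM variance is trivial to evaluate. A short sketch, following the reconstruction of (4.2)–(4.4) above (so it reflects this article's notation rather than a library routine):

```python
def hm_variance_ar(ar, n_defaults, n_nondefaults):
    """HM method: sample variance of the AR from (4.4), assuming exponential scores."""
    auc = 0.5 * (1.0 + ar)
    p_ddn = 2.0 * auc ** 2 / (1.0 + auc)   # (4.2)
    p_nnd = auc / (2.0 - auc)              # (4.3)
    var_auc = (auc * (1.0 - auc)
               + (n_defaults - 1) * (p_ddn - auc ** 2)
               + (n_nondefaults - 1) * (p_nnd - auc ** 2)) / (n_defaults * n_nondefaults)
    return 4.0 * var_auc                   # sigma^2_AR = 4 * sigma^2_AUC

print(hm_variance_ar(0.6, 100, 9900))      # example: AR = 0.6, 100 defaults, 9900 nondefaults
```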
Another parametric method to calculate the AUC is the binormal model. This model assumes that the scores of the nondefaulting and defaulting counterparties are normally distributed: $F_{ND}(s) = \Phi\big((s - \mu_{ND})/\sigma_{ND}\big)$ and $F_D(s) = \Phi\big((s - \mu_D)/\sigma_D\big)$, in which $\Phi$ denotes the standard normal cumulative distribution function. Given these cumulative distributions, we can express $F_D$ in $F_{ND}$ as
$$F_D(s) = \Phi\!\left(\frac{\mu_{ND} - \mu_D + \sigma_{ND}\,\Phi^{-1}\big(F_{ND}(s)\big)}{\sigma_D}\right) \qquad (4.5)$$
and $F_{ND}$ in $F_D$ as
$$F_{ND}(s) = \Phi\!\left(\frac{\mu_{D} - \mu_{ND} + \sigma_{D}\,\Phi^{-1}\big(F_{D}(s)\big)}{\sigma_{ND}}\right). \qquad (4.6)$$
The AUC follows from (2.4):
$$\mathrm{AUC} = \Phi\!\left(\frac{\mu_{ND} - \mu_D}{\sqrt{\sigma_D^2 + \sigma_{ND}^2}}\right). \qquad (4.7)$$
Equation (4.7) was found earlier (Tasche 2010) and makes sense intuitively: when $\mu_D = \mu_{ND}$, $\mathrm{AUC} = \tfrac{1}{2}$, as is the case for a nondiscriminatory scoring model. The AUC approaches $1$ for $\mu_{ND} - \mu_D \to \infty$, and $0$ for $\mu_{ND} - \mu_D \to -\infty$. The probability $P_{D,D,ND}$ follows from (3.1):
$$P_{D,D,ND} = \mathrm{AUC} - 2\,T\!\left(\Phi^{-1}(\mathrm{AUC}),\,\frac{\sigma_D}{\sqrt{\sigma_D^2 + 2\sigma_{ND}^2}}\right), \qquad (4.8)$$
in which
$$T(h, a) = \frac{1}{2\pi}\int_0^a \frac{\exp\big\{-\tfrac{1}{2}h^2(1 + x^2)\big\}}{1 + x^2}\,\mathrm{d}x$$
is Owen's function (Patefield and Tandy 2000). The probability $P_{ND,ND,D}$ follows from (3.2):
$$P_{ND,ND,D} = \mathrm{AUC} - 2\,T\!\left(\Phi^{-1}(\mathrm{AUC}),\,\frac{\sigma_{ND}}{\sqrt{\sigma_{ND}^2 + 2\sigma_{D}^2}}\right), \qquad (4.9)$$
in which we used (4.6) to calculate the integrals. This shows how powerful (2.2) and (2.3) are: they lead directly to (4.8) and (4.9), with the integrals for $P_{D,D,ND}$ and $P_{ND,ND,D}$ following from (3.1) and (3.2). Using these probabilities $P_{D,D,ND}$ and $P_{ND,ND,D}$ in (2.8) gives the sample variance in the AUC:
$$\sigma^2_{\mathrm{AUC}} = \frac{\mathrm{AUC}(1-\mathrm{AUC}) + (N_D-1)\big(P_{D,D,ND} - \mathrm{AUC}^2\big) + (N_{ND}-1)\big(P_{ND,ND,D} - \mathrm{AUC}^2\big)}{N_D\,N_{ND}}, \qquad (4.10)$$
with $P_{D,D,ND}$ and $P_{ND,ND,D}$ given by (4.8) and (4.9).
We obtain the following equation for the sample variance in the AR:
$$\sigma^2_{\mathrm{AR}} = \frac{4}{N_D\,N_{ND}}\Big[\mathrm{AUC}(1-\mathrm{AUC}) + (N_D-1)\big(P_{D,D,ND} - \mathrm{AUC}^2\big) + (N_{ND}-1)\big(P_{ND,ND,D} - \mathrm{AUC}^2\big)\Big], \qquad (4.11)$$
with $\mathrm{AUC} = \tfrac{1}{2}(1 + \mathrm{AR})$ and the probabilities given by (4.8) and (4.9).
We will refer to (4.11) as the binormal (BN) method. The HM and BN methods are valid for specific assumptions about the underlying distributions. However, these equations may still be applicable to other score distributions as well, due to the transformation invariance property of the ROC as given in Section 2. At least both methods align with our intuition, as they both give a nonzero sample variance for a nondiscriminatory scoring model and a sample variance of zero for a perfectly discriminatory scoring model. In the next section, we introduce a distribution-independent approach.
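The binormal probabilities are evaluated above with Owen's $T$ function; an equivalent route is to note that $P_{D,D,ND}$ and $P_{ND,ND,D}$ are orthant probabilities of a bivariate normal vector (two score differences sharing one common score). A Python sketch along those lines, with illustrative parameter names; it is a numerical check of the BN method rather than the paper's own implementation:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def bn_auc_variance(mu_d, sigma_d, mu_nd, sigma_nd, n_d, n_nd):
    """BN method: AUC and its sample variance for normally distributed scores.
    The joint probabilities are computed from a bivariate normal CDF, which is
    numerically equivalent to the Owen's T expressions in (4.8) and (4.9)."""
    mu = mu_nd - mu_d
    s2 = sigma_d ** 2 + sigma_nd ** 2
    auc = norm.cdf(mu / np.sqrt(s2))                                   # (4.7)
    # P_{D,D,ND} = P(S_ND - S_D1 > 0, S_ND - S_D2 > 0): the nondefault score is shared
    cov = np.array([[s2, sigma_nd ** 2], [sigma_nd ** 2, s2]])
    p_ddn = multivariate_normal.cdf([0.0, 0.0], mean=[-mu, -mu], cov=cov)
    # P_{ND,ND,D} = P(S_ND1 - S_D > 0, S_ND2 - S_D > 0): the default score is shared
    cov = np.array([[s2, sigma_d ** 2], [sigma_d ** 2, s2]])
    p_nnd = multivariate_normal.cdf([0.0, 0.0], mean=[-mu, -mu], cov=cov)
    var_auc = (auc * (1 - auc)
               + (n_d - 1) * (p_ddn - auc ** 2)
               + (n_nd - 1) * (p_nnd - auc ** 2)) / (n_d * n_nd)
    return auc, var_auc

print(bn_auc_variance(mu_d=-0.5, sigma_d=1.0, mu_nd=0.5, sigma_nd=1.0, n_d=100, n_nd=9900))
```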
5 Sample variance of the area under the curve and accuracy ratio without distribution assumptions
Table 1 The three possible events when the scores of two defaulting counterparties are compared with the score of one nondefaulting counterparty.

| Event | | | |
|---|---|---|---|
| | 1 | 0 | |
| | 0 | 1 | |
| | 0 | 0 | |
| Total | 1 | 1 | 1 |
In the previous section, we described methods for calculating $P_{D,D,ND}$ and $P_{ND,ND,D}$. These methods have in common that the underlying score distributions $F_D$ and $F_{ND}$ need to be known. We can infer $P_{D,D,ND}$ and $P_{ND,ND,D}$ without distribution assumptions in some cases. For example, Table 1 shows that the event in which two defaulting counterparties both score below a nondefaulting counterparty is one of three possible events. If the credit scoring model has no discriminatory power, then it is equally likely that all three events occur, each with a probability of $\tfrac{1}{3}$: $P_{D,D,ND} = \tfrac{1}{3}$ when $\mathrm{AUC} = \tfrac{1}{2}$. When the scoring model discriminates perfectly between defaults and nondefaults, $P_{D,D,ND} = 1$ and $\mathrm{AUC} = 1$. Similarly, $P_{ND,ND,D} = \tfrac{1}{3}$ for $\mathrm{AUC} = \tfrac{1}{2}$ and $P_{ND,ND,D} = 1$ for $\mathrm{AUC} = 1$.
Table 1 can be used to calculate the correlation between these score comparisons. The probability $P_{D,D,ND}$ is the likelihood of two simultaneous events: $S_{D,1} < S_{ND}$ and $S_{D,2} < S_{ND}$. Assuming independence gives $P_{D,D,ND} = \mathrm{AUC}^2$. However, these events are correlated, since the scores of two different defaults are compared with the same score of a nondefaulting counterparty. This correlation follows from
$$\rho = \frac{P_{D,D,ND} - \mathrm{AUC}^2}{\mathrm{AUC}\,(1 - \mathrm{AUC})}. \qquad (5.1)$$
Using (2.4) and writing $P_{D,D,ND}$ explicitly gives
$$P_{D,D,ND} = \mathrm{AUC}^2 + \rho\,\mathrm{AUC}\,(1 - \mathrm{AUC}). \qquad (5.2)$$
To derive the event correlation $\rho$, we consider a scoring model with no discriminatory power. This means that $\mathrm{AUC} = \tfrac{1}{2}$. Table 1 shows that the event $\{S_{D,1} < S_{ND},\, S_{D,2} < S_{ND}\}$ is one of three possible events. Since the scoring model has no discriminatory power, it is equally likely that all three events occur, each with a probability of $\tfrac{1}{3}$. This means that $P_{D,D,ND} = \tfrac{1}{3}$, which gives $\rho = \tfrac{1}{3}$. Using this correlation in (5.2) gives
$$P_{D,D,ND} = \mathrm{AUC}^2 + \tfrac{1}{3}\,\mathrm{AUC}\,(1 - \mathrm{AUC}). \qquad (5.3)$$
A similar approach gives the same result for $P_{ND,ND,D}$, so that we have
$$P_{D,D,ND} = P_{ND,ND,D} = \mathrm{AUC}^2 + \tfrac{1}{3}\,\mathrm{AUC}\,(1 - \mathrm{AUC}). \qquad (5.4)$$
Substituting (5.4) into (2.8) gives the sample variance in the observed AUC:
$$\sigma^2_{\mathrm{AUC}} = \frac{\mathrm{AUC}\,(1 - \mathrm{AUC})\,(N_D + N_{ND} + 1)}{3\,N_D\,N_{ND}}. \qquad (5.5)$$
Equation (5.5) supports the view of Stein (2007), ie, the uncertainty in the AUC is mainly determined by the minority class. The sample variance is high at a low default rate, in which case the number of defaults is low. Equation (5.5) can be written in the following form:
$$\sigma^2_{\mathrm{AUC}} = \frac{\mathrm{AUC}\,(1 - \mathrm{AUC})}{3}\left(\frac{1}{N_D} + \frac{1}{N_{ND}} + \frac{1}{N_D\,N_{ND}}\right), \qquad (5.6)$$
which aligns with the upper bound in (2.7). The sample variance is also high at a high default rate, in which case the number of nondefaults is low.
We can express (5.5) in probabilities using (2.4) and (5.4):
$$\sigma^2_{\mathrm{AUC}} = \frac{N_D + N_{ND} + 1}{3\,N_D\,N_{ND}}\,P(S_D < S_{ND})\,P(S_{ND} < S_D). \qquad (5.7)$$
The AUC is known as a proper metric of discriminatory power, but it only varies between 0 and 1. The AR has a wider range than the AUC as it varies from $-1$ for inversion, through $0$ for nondiscriminatory models, to $+1$ for perfectly discriminatory models. Therefore, we prefer an equation for the sample variance in the AR. Using $\mathrm{AUC} = \tfrac{1}{2}(1 + \mathrm{AR})$ and $\sigma^2_{\mathrm{AR}} = 4\,\sigma^2_{\mathrm{AUC}}$ gives an equation for the sample variance in the observed AR:
$$\sigma^2_{\mathrm{AR}} = \frac{(1 - \mathrm{AR}^2)\,(N_D + N_{ND} + 1)}{3\,N_D\,N_{ND}}. \qquad (5.8)$$
Since (5.8) is derived without any assumption regarding the underlying score distributions, we refer to this equation as the distribution-free (DF) method. The AUC and AR in (5.5) and (5.8) represent the true AUC and AR and not those observed in the data. The observations $\mathrm{AUC}_{\mathrm{obs}}$ and $\mathrm{AR}_{\mathrm{obs}}$ converge to their true values in the case of an infinite number of observations, since $N_D \to \infty$ and $N_{ND} \to \infty$ lead to $\sigma^2_{\mathrm{AUC}} \to 0$ and $\sigma^2_{\mathrm{AR}} \to 0$.
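The DF expressions (5.5) and (5.8) need only the numbers of defaults and nondefaults and the (true or hypothesized) AUC or AR, which makes them convenient in day-to-day validation work. A minimal sketch (the portfolio figures in the example are illustrative):

```python
import math

def df_variance_auc(auc, n_defaults, n_nondefaults):
    """DF method, (5.5): sample variance of the AUC without distribution assumptions."""
    return auc * (1.0 - auc) * (n_defaults + n_nondefaults + 1) / (3.0 * n_defaults * n_nondefaults)

def df_variance_ar(ar, n_defaults, n_nondefaults):
    """DF method, (5.8): sample variance of the AR without distribution assumptions."""
    return (1.0 - ar ** 2) * (n_defaults + n_nondefaults + 1) / (3.0 * n_defaults * n_nondefaults)

# Example: standard error of an AR of 0.6 for 100 defaults and 9900 nondefaults
print(math.sqrt(df_variance_ar(0.6, 100, 9900)))   # roughly 0.046
```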
The DF method is free from assumptions about the underlying score distributions, but this comes at the cost of ignoring higher-order terms. This becomes clear if we apply a Taylor expansion in the AR to (4.4), giving
$$\sigma^2_{\mathrm{AR}} = \frac{1 - \mathrm{AR}^2}{N_D\,N_{ND}}\left(\frac{N_D + N_{ND} + 1}{3} + \gamma_1\,\mathrm{AR} + \gamma_2\,\mathrm{AR}^2 + \cdots\right), \qquad (5.9)$$
with $\gamma_1 = \tfrac{2}{9}(N_D - N_{ND})$ and $\gamma_2 = -\tfrac{2}{27}(N_D + N_{ND} - 2)$. When we ignore these coefficients, (5.9) changes into the DF method. This gives equal probabilities ($P_{D,D,ND} = P_{ND,ND,D}$) and an equation for $\sigma^2_{\mathrm{AR}}$ that is symmetric around $\mathrm{AR} = 0$. The DF method does not account for possible nonlinear or asymmetric effects by excluding the higher-order terms in the AR. These terms require assumptions on the underlying score distributions, but they become small for small values of the AR. The next section demonstrates which method gives the most appropriate results.
6 Demonstration by simulations
We performed simulations in R/RStudio version 1.1.423 to investigate the distribution of the AR and to reveal which of the four methods (NI, HM, BN or DF) agrees best with the sample variance in the AR. In the simulations, we assume two situations.
Case 1 The scores of the counterparties in the credit portfolio vary between a minimum and a maximum score. We use the Cornish–Fisher expansion to simulate the score distribution:
$$s = \mu + \sigma\left(z + \frac{\gamma_1}{6}\,(z^2 - 1) + \frac{\gamma_2}{24}\,(z^3 - 3z) - \frac{\gamma_1^2}{36}\,(2z^3 - 5z)\right), \qquad (6.1)$$
in which $z$ is a standard normally distributed random number. The score distribution has a mean score $\mu$, a standard deviation $\sigma$, a skewness $\gamma_1$ and an excess kurtosis $\gamma_2$. We consider a distribution with some skewness and kurtosis to be more realistic than a normal distribution. The conditional default probability $p(s)$ is the default probability of a counterparty conditional on its score $s$. We assume a logit function for the conditional probability of default $p(s)$.
Case 2 The score varies uniformly between a minimum score $s_{\min}$ and a maximum score $s_{\max}$. Further, the conditional probability of default $p(s)$ is an exponential function of the score $s$.
In both cases, the default/nondefault state of a counterparty with score $s$ is simulated by a Bernoulli variable, which equals $1$ (default) with probability $p(s)$ and $0$ (nondefault) with probability $1 - p(s)$. We define $N$ as the total number of counterparties, $U$ as the Mann–Whitney U-statistic and $D$ as the overall default rate of the portfolio. The variance $\sigma^2_{\mathrm{AUC}}$ in the AUC and the variance $\sigma^2_U$ in the Mann–Whitney U-statistic are calculated from the simulations. We used 1000 simulations of a portfolio of 100 000 counterparties for cases 1 and 2.
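The paper's simulations were run in R; the sketch below reproduces the logic of case 1 in Python. The Cornish–Fisher form and all parameter values (mean, standard deviation, skewness, kurtosis and the logit coefficients) are illustrative assumptions, not the values used in the paper.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(42)

def cornish_fisher_scores(n, mu=0.0, sigma=1.0, skew=0.5, ekurt=1.0):
    """Scores from a Cornish-Fisher expansion of a standard normal draw, as in (6.1)."""
    z = rng.standard_normal(n)
    w = (z + skew / 6.0 * (z ** 2 - 1.0)
           + ekurt / 24.0 * (z ** 3 - 3.0 * z)
           - skew ** 2 / 36.0 * (2.0 * z ** 3 - 5.0 * z))
    return mu + sigma * w

def simulate_portfolio(n=100_000, a=4.6, b=1.0):
    """One case 1 portfolio: logit conditional PD p(s) = 1 / (1 + exp(a + b*s)),
    so a higher score means a lower default probability; defaults are Bernoulli draws."""
    s = cornish_fisher_scores(n)
    p = 1.0 / (1.0 + np.exp(a + b * s))
    return s, rng.binomial(1, p)

def sample_ar(scores, defaults):
    """Sample AR via the rank-sum (Mann-Whitney) form of the AUC."""
    r = rankdata(scores)
    n_d = int(defaults.sum())
    n_nd = len(scores) - n_d
    u = r[defaults == 0].sum() - n_nd * (n_nd + 1) / 2.0   # pairs with S_ND > S_D
    return 2.0 * u / (n_d * n_nd) - 1.0

ars = [sample_ar(*simulate_portfolio()) for _ in range(50)]   # the paper uses 1000 runs
print(np.mean(ars), np.var(ars, ddof=1))
```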
The methods described above are based on the connection between the Mann–Whitney U-statistic and the AUC, given by (European Central Bank 2019)
$$\mathrm{AUC} = \frac{U}{N_D\,N_{ND}}. \qquad (6.2)$$
Thus, we first compare the variance $\sigma^2_U$ with the variance $(N_D\,N_{ND})^2\,\sigma^2_{\mathrm{AUC}}$, in which $N_D$ and $N_{ND}$ follow from the portfolio default rate $\bar{D}$, averaged over all simulations. However, the portfolio default rate varies around $\bar{D}$ in the simulations, introducing an extra variance in $\sigma^2_U$, which to first order is
$$\sigma^2_U \approx (N_D\,N_{ND})^2\,\sigma^2_{\mathrm{AUC}} + \big(\mathrm{AUC}\,N^2(1 - 2\bar{D})\big)^2\,\sigma^2_{D} + 2\,\mathrm{AUC}\,N_D\,N_{ND}\,N^2(1 - 2\bar{D})\,\rho_{\mathrm{AUC},D}\,\sigma_{\mathrm{AUC}}\,\sigma_{D}, \qquad (6.3)$$
in which $\sigma_D$ is the standard deviation of the portfolio default rate over the simulations
and $\rho_{\mathrm{AUC},D}$ is the correlation between the observed AUC and the default rate.
The correlation $\rho_{\mathrm{AUC},D}$ and the standard deviation $\sigma_D$ are both small in the simulations. Although these quantities are quite small, we correct the variance by subtracting the two extra terms in (6.3). Figures 2 and 3 show that, after correction, the variance $(N_D\,N_{ND})^2\,\sigma^2_{\mathrm{AUC}}$ resembles the variance $\sigma^2_U$ for different areas under the curve for both case 1 and case 2.
Next, we show by simulation whether the probabilities $P_{D,D,ND}$ and $P_{ND,ND,D}$ align with those of the HM, BN and DF methods. The simulations confirm (3.1) and (3.2); numerical integration leads to the same probabilities as counting the events in which two defaults score below a nondefault or two nondefaults score above a default, but it requires less computation time. Therefore, we used the integrals in (3.1) and (3.2) to derive the probabilities from the simulations. These integrals were calculated with the trapezium rule.
Figures 4 and 5 compare the probabilities from simulations with their theoretical counterparts. The dots in Figure 4 are the probabilities $P_{D,D,ND}$, derived from the simulations, whereas the solid, dashed and dotted lines represent the probabilities as calculated by the HM method using (4.2), the BN method using (4.8) and the DF method using (5.4). Similarly, Figure 5 compares the probabilities $P_{ND,ND,D}$, resulting from the simulations, with the probabilities as calculated by the HM, BN and DF methods. The figures reveal that the simulated probabilities are close to those calculated by the DF, HM and BN methods. To make this more quantitative, we calculated the deviation between the simulated probabilities and the theoretical probabilities resulting from each method as the sum of their squared differences. Table 2 compares these deviations for the probabilities $P_{D,D,ND}$ and $P_{ND,ND,D}$ for the HM, BN and DF methods and for cases 1 and 2. The table shows that the DF method gives the smallest deviations, except for the probability $P_{ND,ND,D}$ in case 2, where the BN method gives the smallest deviation. We conclude that the DF and BN methods give more accurate estimations of the probabilities $P_{D,D,ND}$ and $P_{ND,ND,D}$ than the HM method.
Table 2 Deviation between the simulated probabilities and the probabilities calculated by the distribution-free, binormal and Hanley–McNeil methods.

(a) Deviation for $P_{D,D,ND}$

| Method | Case 1 | Case 2 |
|---|---|---|
| Distribution-free, (5.4) | 0.0009 | 0.0007 |
| Binormal, (4.8) | 0.0011 | 0.0014 |
| Hanley–McNeil, (4.2) | 0.0031 | 0.0056 |

(b) Deviation for $P_{ND,ND,D}$

| Method | Case 1 | Case 2 |
|---|---|---|
| Distribution-free, (5.4) | 0.0005 | 0.0029 |
| Binormal, (4.9) | 0.0015 | 0.0007 |
| Hanley–McNeil, (4.3) | 0.0028 | 0.0057 |
We verified the NI, HM, BN and DF methods by performing 1000 simulations for a portfolio of 100 000 counterparties with a fixed target portfolio default rate. In each simulation, we construct an ROC and calculate the AUC by the trapezium rule for integration. The AR follows from the AUC by $\mathrm{AR} = 2\,\mathrm{AUC} - 1$. We obtain the true AR by averaging over all 1000 simulations. Since the variance of the mean decreases with the number of simulations, this average closely approximates the true AR for 1000 simulations. We performed these simulations for several accuracy ratios, ranging from strongly negative values for an inverted credit scoring model to values close to $1$ for a credit scoring model with perfect discriminatory power.
Figure 6 presents the sample variances, resulting from the simulations and the different methods for case 1, as a function of the true AR: the black diamonds represent the sample variance in the AR resulting from the 1000 simulations; the white dots represent the sample variance of the NI method, calculated by (2.8), (3.1) and (3.2); the solid line represents the sample variance of the HM method as calculated by (4.4); the gray dashed line represents the sample variance of the BN method calculated by (4.11); the dotted line represents the sample variance of the DF method calculated by (5.8).
Figure 7 shows the same results for case 2. These figures show that the NI method closely resembles the sample variance as calculated from the simulations. The HM and BN methods result in a sample variance that is an asymmetric function of the real AR, whereas the DF method gives a symmetric function. We use the $R^2$ as a measure of the extent to which the HM, BN or DF method fits the sample variance $\sigma^2_{\mathrm{AR}}$ resulting from the simulations. Table 3 shows the $R^2$ for the NI, HM, BN and DF methods. For both cases 1 and 2, the NI method gives the highest $R^2$ and the HM method gives the lowest $R^2$. Considering the parametric methods, the DF method gives the most accurate sample variance for case 1, whereas the BN method is most accurate for case 2. The DF method overestimates the sample variance and is therefore conservative in case 2.
Methodology | Case 1 | Case 2 |
---|---|---|
Numerical integration | 99.3 | 99.5 |
Distribution-free | 96.2 | 83.2 |
Binormal | 89.2 | 96.3 |
Hanley–McNeil | 82.0 | 65.8 |
We consider case 1 to be more realistic than case 2, since it includes effects such as skewness and kurtosis. Case 2 assumes homogeneously distributed scores, which will not often occur in practice. Since the DF method shows the best performance in case 1, we decided to investigate this method further. We regressed
$$\frac{3\,N_D\,N_{ND}}{N_D + N_{ND} + 1}\,\sigma^2_{\mathrm{AR}} = a_0 + a_1\,\mathrm{AR} + a_2\,\mathrm{AR}^2 \qquad (6.4)$$
against the AR, using the results for case 1. According to (5.8), $a_0 = 1$, $a_1 = 0$ and $a_2 = -1$.
Table 4 Regression results for (6.4) in case 1.

| | Coefficient | Standard error | t-value | p-value (%) |
|---|---|---|---|---|
| Intercept ($a_0$) | | 0.02 | 60.1 | 0.00 |
| Coefficient of AR ($a_1$) | | 0.02 | 2.7 | 0.02 |
| Coefficient of AR² ($a_2$) | | 0.03 | 34.3 | 0.00 |
| $R^2$ of regression (%) | 98.6 | | | |
Table 4 presents the regression results: $a_0$ and $a_2$ agree with these values within the 95% confidence interval, but $a_1$ gives a small but significant nonzero contribution. This comes from the higher-order term explained by (5.9) and ignored in the DF method. Including this term improves the $R^2$ from 96.2% to 98.6%. We conclude that the sample variance relates to $1 - \mathrm{AR}^2$ to a large extent, but the DF method may overestimate the sample variance by ignoring asymmetric and nonlinear effects. In addition, Figure 7 shows that the sample variance deviates from a parabolic function of the AR as a result of excluding higher-order terms.
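The check in (6.4) is an ordinary least-squares fit of the scaled sample variance on the AR and its square. A sketch under the reconstruction above; `sim_ar` and `sim_var` stand for the simulated accuracy ratios and their sample variances (placeholder inputs, not the paper's data):

```python
import numpy as np

def regress_scaled_variance(sim_ar, sim_var, n_defaults, n_nondefaults):
    """OLS fit of 3*N_D*N_ND*sigma^2_AR / (N_D + N_ND + 1) on [1, AR, AR^2].
    The DF method (5.8) predicts coefficients (a0, a1, a2) = (1, 0, -1)."""
    ar = np.asarray(sim_ar, dtype=float)
    y = (3.0 * n_defaults * n_nondefaults * np.asarray(sim_var, dtype=float)
         / (n_defaults + n_nondefaults + 1))
    x = np.column_stack([np.ones_like(ar), ar, ar ** 2])
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    return coef   # [a0, a1, a2]
```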
We constructed probability density functions of the AR from the simulations for case 1. Figure 8 presents the probability density functions for several AR values. The dashed line represents a normal distribution with the same mean and standard deviation as the AR density. The densities resemble a normal distribution, except when the AR is close to its extreme values of $-1$ or $+1$. In these cases, the AR densities become skewed and the sample accuracy ratios are not normally distributed. We used the Jarque–Bera statistics to test the normality of the AR densities. The Jarque–Bera statistics are calculated by
$$\mathrm{JB} = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right), \qquad (6.5)$$
with $n$ as the number of simulations and the symbols $S$ and $K$ denoting the skewness and excess kurtosis of the AR density, respectively. If the probability density functions are normal, the Jarque–Bera statistics are chi-squared distributed with two degrees of freedom. Based on a significance level of 5%, the null hypothesis of normality is rejected when the Jarque–Bera statistics exceed a threshold of 5.99. Figure 9 shows the Jarque–Bera statistics for the different densities versus the mean AR. The dashed line represents the threshold above which the null hypothesis of normality is rejected. The Jarque–Bera test rejects normality only when the AR lies close to the extremes of its range. We conclude that the probability density function of the sample AR follows a normal distribution away from these extreme values.
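Equation (6.5) is straightforward to apply to the simulated AR values. A small sketch (the input array `ars` is a placeholder for the simulated accuracy ratios):

```python
import numpy as np
from scipy.stats import skew, kurtosis, chi2

def jarque_bera_statistic(ars):
    """Jarque-Bera statistic (6.5) from the sample skewness and excess kurtosis."""
    x = np.asarray(ars, dtype=float)
    s = skew(x)
    k = kurtosis(x)          # Fisher definition, ie, excess kurtosis
    return len(x) / 6.0 * (s ** 2 + k ** 2 / 4.0)

threshold = chi2.ppf(0.95, df=2)   # approximately 5.99 at the 5% significance level
```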
7 Conclusions
Metrics such as the AR and the AUC measure the discriminatory power of a credit scoring model. However, these are often estimated for a sample of counterparties, and different samples may give different values. Firm conclusions on model performance can only be drawn if the sample distribution, or at least the sample variance, is known. Traditional equations for the sample variance are based on the equivalence with the Mann–Whitney U-statistic, but these calculations require a long computation time.
We derived equations for the sample variance in the AR and AUC using four methods: the NI method is based on numerical integration; the HM and BN methods assume specific score distributions; and the DF method does not depend on such an assumption, but excludes higher-order terms. Simulations show that the AR is normally distributed and the NI method gives the best estimation of the sample variance in the AUC or AR. However, the DF and BN methods provide closed-form equations, which fit the sample variance of the simulations quite well. The HM method performs worst in all cases. We conclude that the NI method is the preferred method to estimate the sample variance. The probabilities $P_{D,D,ND}$ and $P_{ND,ND,D}$ are easily calculated from the areas under the curves $F_D^2$ versus $F_{ND}$ and $(1 - F_{ND})^2$ versus $F_D$, respectively, in the same way as the AUC is estimated from the ROC.
An advantage of the BN and DF methods is that they provide closed-form equations for the sample variance. It is hard to conclude which method is most appropriate. The DF method performs better than the BN method in case 1, which is considered more realistic than case 2. In case 2, the BN method is the most accurate and the DF method overestimates the variance.
We can apply the NI, DF or BN methods in hypothesis testing, since the observed AR follows a normal distribution. For example, we can define the following test statistic with the DF method:
$$Z = \frac{\mathrm{AR}_{\mathrm{obs}} - \mathrm{AR}_0}{\sqrt{(1 - \mathrm{AR}_0^2)\,(N_D + N_{ND} + 1)/(3\,N_D\,N_{ND})}} \qquad (7.1)$$
to test a null hypothesis such as the following.
(H$_0$) The credit rating or scoring model has an accuracy ratio of $\mathrm{AR}_0$, ie, $\mathrm{AR} = \mathrm{AR}_0$.
Based on the normality of the AR distribution, the $p$-value is calculated by $p = 2\big(1 - \Phi(|Z|)\big)$ for a two-sided test, and the null hypothesis is rejected when $p < 5$%.
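A sketch of this test in Python, using the DF standard error from (5.8); the observed AR, the hypothesized AR and the portfolio counts in the example are illustrative:

```python
import numpy as np
from scipy.stats import norm

def ar_z_test(ar_obs, ar_null, n_defaults, n_nondefaults):
    """Two-sided z-test of H0: AR = ar_null, with the DF variance (5.8) at the null value."""
    se = np.sqrt((1.0 - ar_null ** 2) * (n_defaults + n_nondefaults + 1)
                 / (3.0 * n_defaults * n_nondefaults))
    z = (ar_obs - ar_null) / se
    p_value = 2.0 * (1.0 - norm.cdf(abs(z)))
    return z, p_value

# Example: an observed AR of 0.55 against a hypothesized AR of 0.60,
# for a portfolio with 100 defaults and 9900 nondefaults
print(ar_z_test(0.55, 0.60, 100, 9900))
```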
Recently, the ECB introduced a test for comparing the initial AUC versus the current AUC (European Central Bank 2019), in which the sample variance is derived from the Mann–Whitney statistics. Alternatively, we suggest the NI, DF or BN method to calculate the sample variance in the AUC. We presumed a credit scoring model that generates continuous scores, so that ties rarely occur, whereas the ECB introduces this test for credit ratings rather than credit scores. Credit ratings are discrete values and ties do occur. Often, rating models are derived from scoring models with a continuous outcome, such as logistic regression or discriminant analysis. As such, the test described here can be applied to the credit scores before they are mapped to credit ratings. Recent research reveals that the information loss that results from mapping credit scores to credit ratings is low (van der Burgt 2019).
Declaration of interest
The author reports no conflict of interest. The author alone is responsible for the content and writing of the paper. The views expressed in this article are those of the author and do not necessarily reflect those of Nationale Nederlanden Group.
References
- Basel Committee on Banking Supervision (2005). Studies on the validation of internal rating systems. Working Paper 14, Basel Committee on Banking Supervision, Bank for International Settlements. URL: http://www.bis.org/publ/bcbs_wp14.pdf.
- Cortes, C., and Mohri, M. (2004). Confidence intervals for the area under the ROC curve. In Proceedings of the 17th International Conference on Neural Information Processing Systems. Advances in Neural Information Processing Systems, Volume 17, pp. 305–312. MIT Press, Cambridge, MA.
- Engelmann, B., Hayden, E., and Tasche, D. (2003). Measuring the discriminative power of rating systems. Discussion Paper 01/2003, Series 2: Banking and Financial Supervision, Deutsche Bundesbank.
- European Central Bank (2019). Instructions for reporting the validation results of internal models: IRB Pillar I models for credit risk. Report, February, European Central Bank, Frankfurt. URL: https://bit.ly/3gTxMnM.
- Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (https://doi.org/10.1148/radiology.143.1.7063747).
- Iyer, R., Khwaja, A. I., Luttmer, E. F., and Shue, K. (2015). Screening peers softly: inferring the quality of small borrowers. Management Science 62(6), 1554–1577 (https://doi.org/10.1287/mnsc.2015.2181).
- Krzanowski, W., and Hand, D. J. (2009). ROC Curves for Continuous Data. Monographs on Statistics and Applied Probability, Volume 111. Chapman & Hall/CRC Press, Boca Raton, FL.
- Patefield, M., and Tandy, D. (2000). Fast and accurate calculation of Owen’s function. Journal of Statistical Software 5(5), 1–25 (https://doi.org/10.18637/jss.v005.i05).
- Russell, H., Tang, Q. K., and Dwyer, D. W. (2012). The effect of imperfect data on default prediction validation tests. The Journal of Risk Model Validation 6(1), 77–96 (https://doi.org/10.21314/JRMV.2012.085).
- Stein, R. M. (2007). Benchmarking default prediction models: pitfalls and remedies in model validation. The Journal of Risk Model Validation 1(1), 77–113 (https://doi.org/10.21314/JRMV.2007.002).
- Stein, R. M. (2016). Evaluating discrete choice prediction models when the evaluation data is corrupted: analytic results and bias corrections for the area under the ROC. Data Mining and Knowledge Discovery 30(4), 763–796 (https://doi.org/10.1007/s10618-015-0437-7).
- Tasche, D. (2008). Validation of internal rating systems and PD estimates. In The Analytics of Risk Model Validation, Christodoulakis, G., and Satchell, S. (eds), pp. 169–196. Academic Press (https://doi.org/10.1016/b978-075068158-2.50014-7).
- Tasche, D. (2010). Estimating discriminatory power and PD curves when the number of defaults is small. Preprint (arXiv:0905.3928v2). URL: http://arxiv.org/pdf/0905.3928.pdf.
- van der Burgt, M. J. (2008). Calibrating low-default portfolios, using the cumulative accuracy profile. The Journal of Risk Model Validation 1(4), 17–33 (https://doi.org/10.21314/JRMV.2008.016).
- van der Burgt, M. J. (2019). Calibration and mapping of credit scores by riding the cumulative accuracy profile. The Journal of Credit Risk 15(1), 1–25 (https://doi.org/10.21314/JCR.2018.240).
- Wu, J. C., Martin, A. F., and Kacker, R. N. (2016). Validation of nonparametric two-sample bootstrap in ROC analysis on large datasets. Communications in Statistics: Simulation and Computation 45(5), 1689–1703 (https://doi.org/10.1080/03610918.2015.1065327).