Journal of Risk Model Validation


How accurate is the accuracy ratio in credit risk model validation?

Marco van der Burgt

  • This paper introduces several methods to calculate the sample variance in the accuracy ratio and the area under the curve.
  • The first method is based on numerical integration and gives the best estimate of the sample variance;
  • The method based on assuming normally distributed scores and the method that is free from score distribution assumptions give reasonable estimates of the sample variance;
  • The accuracy ratio and area under the curve are normally distributed.

The receiver operating curve and the cumulative accuracy profile visualize the ability of a credit scoring model to distinguish defaulting from nondefaulting counterparties. These curves lead to performance metrics such as the accuracy ratio and the area under the curve. Since these performance metrics are sample properties, we cannot draw firm conclusions on the model performance without knowing the sampling distribution or the sample variance. We present four methods to estimate the sample variance of the accuracy ratio and the area under the curve. The first method is based on numerical integration, the second and third methods assume specific score distributions, and the fourth method uses a correlation, leading to a distribution-independent equation for the sample variance. We demonstrate by simulations that the first method gives the best estimate of the sample variance. The distribution-independent equation gives reasonable estimates of the sample variance, but ignores higher-order effects that are distribution dependent.

1 Introduction

Recently, the European Central Bank (ECB) published their instructions for reporting the validation results of internal models for credit risk under the internal ratings-based approach (European Central Bank 2019). One of the key validation subjects in this reporting standard is the testing of the discriminatory power of credit ratings or scores. Discriminatory power is the ability to discriminate ex ante between defaulting and nondefaulting borrowers (Basel Committee on Banking Supervision 2005, Chapter 3). The analysis of discriminatory power aims to ensure that the ranking of customers by ratings or scores appropriately separates riskier and less risky customers.

The receiver operating curve (ROC) and the cumulative accuracy profile (CAP) are frequently used to visualize the discriminatory power of a credit scoring or credit rating model. The CAP depicts the cumulative percentage of all clients versus the cumulative percentage of defaulters. The ROC shows the performance of a credit scoring model at various score thresholds. This makes the ROC applicable in other disciplines such as machine learning, medical statistics, geosciences and biosciences (Krzanowski and Hand 2009). A numerical metric of discriminatory power is the area under the curve (AUC), which can be derived from the ROC and, indirectly, from the CAP. Another metric is the accuracy ratio (AR), which relates to the AUC as AR=2AUC-1, as explained in the next section.

The ECB instructions frequently refer to the AUC as a metric of discriminatory power. One of the tests in the instructions is the comparison between the AUC at the time of initial validation and the AUC at the end of the relevant observation period. These tests require the standard error in the observed AUC, which is greatly influenced by the number of defaults (Stein 2007). Data flaws, such as coupling information of a defaulted counterparty to a nondefault status, lead directly to an underestimated AR (Stein 2016; Russell et al 2012). Further, metrics such as AUC and AR depend on the portfolio, and comparing these metrics over time or between different portfolios might be misleading. Several authors give the variance or standard error in the AUC in terms of the Mann–Whitney U-statistics (Engelmann et al 2003; Basel Committee on Banking Supervision 2005, Chapter 3; European Central Bank 2019). However, these calculations are cumbersome, as they require a score or rating comparison of every defaulting counterparty with two nondefaulting counterparties. For a loan portfolio of 10 000 obligors and a default rate of 1%, this already leads to on the order of $10^{10}$ comparisons.

We deduce closed-form equations for the sample variance in the observed AR and AUC, starting in Section 2 with a brief description of the CAP and the ROC and their relation to metrics such as AR and AUC. Sections 3 and 4 introduce equations for the sample variance in the AR and the AUC, based on numerical integration and on assuming specific score distributions of the defaulting and nondefaulting counterparties. Section 5 presents an equation for the sample variance without specific distribution assumptions. Then, in Section 6, we demonstrate by simulations how the observed AR is distributed and which method provides the best estimation of the sample variance in the AR or AUC. Section 7 concludes.

2 Metrics of discriminatory power

This section describes the ROC and the CAP, also called the power curves (Engelmann et al 2003; Tasche 2010; Krzanowski and Hand 2009). To demonstrate these curves, we assume a credit scoring model, generating credit scores that increase with decreasing default risk: the higher the score S, the higher the credit quality as perceived by the lender and the lower the default probability. We also assume a continuous credit score of sufficient granularity such that the probability of obtaining ties can be neglected.

Figure 1: ROC curve for a credit scoring model with perfect discriminatory power, with moderate discriminatory power, with no discriminatory power and with inversion.

The CAP and the ROC are constructed by first arranging the counterparties by increasing score S, ie, from high default risk to low default risk. The ROC results from plotting the cumulative percentage of defaulting counterparties versus the cumulative percentage of nondefaulting counterparties; for a model with discriminatory power it is a concave curve. Figure 1 shows ROCs for credit scoring systems with high, moderate or no discriminatory power. The more concave the ROC is, the better the discriminatory power of the underlying credit scores from which it is constructed. Defining $y = F_d(S)$ as the cumulative percentage of defaulting counterparties with a score lower than S, and $x = F_{nd}(S)$ as the cumulative percentage of nondefaulting counterparties with a score lower than S, the ROC follows from

  $y = F_d(F_{nd}^{-1}(x)).$   (2.1)

An alternative to the ROC is the CAP, which results from plotting the cumulative percentage of defaulting counterparties versus the cumulative percentage of all counterparties. Given the portfolio default probability P, we can convert the ROC into the CAP by applying the linear transformation $x \mapsto Py + (1-P)x$ to the horizontal axis. Since the number of defaults is much lower than the number of nondefaults for a common credit portfolio, the CAP and the ROC have a similar concave shape. Both curves are monotonically increasing functions and visualize the discriminatory power of a credit scoring system, but they have their advantages and disadvantages.

  • The CAP has the attractive property that its slope relates directly to the probability of default, which makes the CAP applicable for calibration of credit rating and scoring models (van der Burgt 2008, 2019).

  • The ROC is useful in decision analysis, as it relates to the type I and type II errors: a type I error means that a counterparty defaults unexpectedly, and a type II error means that a counterparty does not default although a default was expected (Stein 2007; Tasche 2008). The area under the ROC gives the AUC and the AR directly. Further, as shown below, the area under the ROC has a simple interpretation, and the integration of x and y gives the score probabilities of defaulting and nondefaulting counterparties.

The ROC and the CAP visualize the discriminatory power of the credit scoring model. In this paper we focus on the ROC, because it relates directly to score distributions of defaulting and nondefaulting counterparties. To show this, we first introduce the following equations, which are derived in Appendix A online:

  $P[S_{d,1},\ldots,S_{d,N} < S_{nd} < S_{d,N+1},\ldots,S_{d,N+M}] = \int_0^1 y^N (1-y)^M \, dx,$   (2.2)
  $P[S_{nd,1},\ldots,S_{nd,N} < S_d < S_{nd,N+1},\ldots,S_{nd,N+M}] = \int_0^1 x^N (1-x)^M \, dy.$   (2.3)

Equation (2.2) gives the probability that the scores $S_{d,1},\ldots,S_{d,N}$ of N defaulting counterparties are lower and the scores $S_{d,N+1},\ldots,S_{d,N+M}$ of M defaulting counterparties are higher than the score $S_{nd}$ of a nondefaulting counterparty. Equation (2.3) gives the probability that the scores $S_{nd,1},\ldots,S_{nd,N}$ of N nondefaulting counterparties are lower and the scores $S_{nd,N+1},\ldots,S_{nd,N+M}$ of M nondefaulting counterparties are higher than the score $S_d$ of a defaulting counterparty. Both probabilities follow directly from integrating the distributions x and y, which are used to construct the ROC. These equations are powerful in deriving performance metrics and their sample variance from the ROC. The area under the ROC, denoted by AUC, follows directly from (2.2) with N=1 and M=0:

  $\mathrm{AUC} = \int_0^1 y\,dx = P[S_d < S_{nd}].$   (2.4)

This equation shows that the AUC can be interpreted as the probability that the score Sd of a defaulting counterparty is lower than the score Snd of a nondefaulting counterparty (Krzanowski and Hand 2009). Equation (2.4) also suggests a closed form for the AUC when the distributions x and y are known. It also shows the following property of the ROC (Krzanowski and Hand 2009).

Property 2.1.

The ROC remains invariant if the credit scores undergo a strictly monotonic transformation, since

  $x(s) = P[S_{nd} < s] = P[\phi(S_{nd}) < \phi(s)]$
and
  $y(s) = P[S_d < s] = P[\phi(S_d) < \phi(s)]$

remain unaffected under a monotonic transformation $\phi$.

The AUC also leads to another numerical metric of discriminatory power, the accuracy ratio:

  $\mathrm{AR} = 2\,\mathrm{AUC} - 1 \quad\Longleftrightarrow\quad \mathrm{AUC} = \dfrac{\mathrm{AR}+1}{2}.$   (2.5)

The accuracy ratio is sometimes called the Gini index (Krzanowski and Hand 2009) and can also be derived from the area under the CAP (Tasche 2010). Combining (2.4) and (2.5) gives the following interpretation of AR:

  $\mathrm{AR} = P[S_d < S_{nd}] - P[S_d > S_{nd}],$   (2.6)

which is also shown in Appendix A online by applying (2.2) and (2.3). Figure 1 shows the ROC for several AR and AUC values and reveals that $\mathrm{AR} \in [-1,1]$ and $\mathrm{AUC} \in [0,1]$.

  (1) When $P[S_d < S_{nd}] = 1$, the credit scoring model perfectly discriminates between defaults and nondefaults, giving AR = 1 and AUC = 1. The ROC resembles the line y = 100% as shown in Figure 1.

  (2) If the credit scoring model has no discriminatory power, the number of nondefaulting counterparties increases proportionally with the number of defaulting counterparties and the ROC resembles the line y = x, as shown by the dashed line in Figure 1. This means that $\mathrm{AUC} = P[S_d < S_{nd}] = P[S_d > S_{nd}] = \tfrac{1}{2}$ and AR = 0.

  (3) The dotted curve in Figure 1 shows an inversion: the credit scoring model discriminates between defaulting and nondefaulting counterparties, but it ranks the counterparties in terms of increasing risk rather than decreasing risk. In the case of a perfect inversion, we have $P[S_d > S_{nd}] = 1$, resulting in AUC = 0 and AR = -1.

The AR and AUC measure the discriminatory power of a credit scoring model and are easy to interpret in terms of score probabilities of the defaulting and nondefaulting counterparties. But they are sample statistics for a portfolio or a sample of counterparties. We refer to these metrics as the sample AUC and sample AR, denoted by AUC^ and AR^, respectively. Their values will vary from sample to sample, whereas the true AUC and AR are often unknown. We often assume that the sample AUC and AR sufficiently approximate the true AUC and AR for a large sample or portfolio. Often conclusions on model performance are based on these sample statistics, but the sample AUC and sample AR do not make sense if the uncertainty in these numbers is unknown. For example, Iyer et al (2015) state that a 0.01 improvement in AUC is considered a noteworthy gain in the credit scoring industry. However, we can only perceive this as an improvement if the uncertainty in the AUC is less than 0.01. Ideally, we would like to know the distribution of all possible values AUC^ or AR^ under random sampling, or at least to have an indication of the sample variance of these estimations.

Stein (2007) argues that the sample variance depends strongly on the number of defaults. This is supported by the theoretical upper bound of the variance in the observed AUC^ (Tasche 2010),

  $\sigma^2_{\widehat{\mathrm{AUC}}} \leq \dfrac{\mathrm{AUC}(1-\mathrm{AUC})}{\min[N_d, N_{nd}]},$   (2.7)

with Nd the number of defaults, Nnd the number of nondefaults and AUC the true value, which is unknown. Equation (2.7) shows that the minority class (the class with the fewest observations) of a sample will influence the variance most dramatically (Stein 2007).

Equation (2.7) only provides an upper bound on the sample variance in the AUC. The early approaches to estimating the sample variance in the AUC rely on the relation between the AUC and the Mann–Whitney U-statistics. This relation gives the following result for the sample variance (Hanley and McNeil 1982; Cortes and Mohri 2004; Krzanowski and Hand 2009; Wu et al 2016):

  $\sigma^2_{\widehat{\mathrm{AUC}}} = \dfrac{\mathrm{AUC}(1-\mathrm{AUC}) + (N_d-1)(Q_1-\mathrm{AUC}^2) + (N_{nd}-1)(Q_2-\mathrm{AUC}^2)}{N_d N_{nd}},$   (2.8)

in which $Q_1$ is the probability that the credit scoring model ranks two randomly chosen defaults lower than a nondefault and $Q_2$ is the probability that it ranks two randomly chosen nondefaults higher than a default. The estimation of these probabilities can be quite cumbersome for a large credit portfolio; for example, the estimation of $Q_1$ requires a comparison of the score $S_{nd}$ of every nondefaulting counterparty with the scores $S_{d1}$ and $S_{d2}$ of every pair of defaulting counterparties. Since there are $N_{nd}$ nondefaulting counterparties and $\tfrac{1}{2}N_d(N_d-1)$ pairs of defaulting counterparties, this gives $\tfrac{1}{2}N_{nd}N_d(N_d-1)$ comparisons. We also need $\tfrac{1}{2}N_d N_{nd}(N_{nd}-1)$ comparisons to calculate $Q_2$. In the following sections, we present four alternative methods to calculate the probabilities $Q_1$ and $Q_2$, based on numerical integration or parametric methods.
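To make the counting concrete, the following minimal R sketch estimates the AUC, $Q_1$ and $Q_2$ by explicit pairwise comparison. It is illustrative only: the function and variable names are assumptions, and the enumeration of pairs makes the computational burden described above visible, so it is feasible only for small samples.

```r
# Sketch (R): brute-force estimation of AUC, Q1 and Q2 by pairwise counting.
# 's_d' and 's_nd' are assumed vectors of scores of defaulting and
# nondefaulting counterparties (higher score = lower perceived default risk).

auc_q1_q2_bruteforce <- function(s_d, s_nd) {
  # AUC = P[S_d < S_nd]: compare every default with every nondefault
  auc <- mean(outer(s_d, s_nd, "<"))

  # Q1 = P[S_d1, S_d2 < S_nd]: every pair of defaults against every nondefault
  pairs_d <- combn(s_d, 2)
  q1 <- mean(sapply(s_nd, function(s) pairs_d[1, ] < s & pairs_d[2, ] < s))

  # Q2 = P[S_d < S_nd1, S_nd2]: every default against every pair of nondefaults
  pairs_nd <- combn(s_nd, 2)
  q2 <- mean(sapply(s_d, function(s) s < pairs_nd[1, ] & s < pairs_nd[2, ]))

  list(AUC = auc, Q1 = q1, Q2 = q2)
}

# Example on a small synthetic sample (a full portfolio would be far too large):
set.seed(1)
res <- auc_q1_q2_bruteforce(s_d = rnorm(30, 40, 10), s_nd = rnorm(300, 60, 10))
```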

3 Sample variance of the area under the curve based on numerical integration

Equation (2.8) is our starting point to calculate the sample variance. We can use (2.2) and (2.3) to derive probabilities Q1 and Q2 from quantities y=Fd(S) and x=Fnd(S). Equation (2.2) gives the probability Q1 for N=2 and M=0:

  $Q_1 = P[S_{d1}, S_{d2} < S_{nd}] = \int_0^1 y^2\,dx.$   (3.1)

In the same way, (2.3) gives the probability Q2 for N=0 and M=2:

  $Q_2 = P[S_d < S_{nd1}, S_{nd2}] = \int_0^1 (1-x)^2\,dy.$   (3.2)

Equations (3.1) and (3.2) are important results: they show how the probabilities $Q_1$ and $Q_2$, and therefore the sample variance $\sigma^2_{\widehat{\mathrm{AUC}}}$, can be derived from the underlying score distributions of defaulting and nondefaulting counterparties. In general, these distributions are unknown, but we can use numerical integration, eg, the trapezium rule, to calculate these probabilities. The areas under the curves $(x, y^2)$ and $(y, (1-x)^2)$ give the probabilities $Q_1$ and $Q_2$ directly, in the same way as the AUC gives the probability $P[S_d < S_{nd}]$. We will refer to this method as the numerical integration (NI) method. If we know the analytical distribution functions of the defaulting and nondefaulting counterparties, we can use these functions to calculate the probabilities from (3.1) and (3.2) directly. In the next section, we derive equations for the sample variance, assuming specific score distributions x and y.
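As an illustration, a minimal R sketch of the NI method under these assumptions: the empirical curves $x = F_{nd}(S)$ and $y = F_d(S)$ are built from assumed score vectors and the integrals in (2.4), (3.1) and (3.2) are evaluated with the trapezium rule before being substituted into (2.8). All names are illustrative.

```r
# Sketch (R): NI method - AUC, Q1, Q2 and the sample variance from (2.8),
# with all integrals computed by the trapezium rule on a common score grid.

trapezium <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

ni_variance <- function(s_d, s_nd) {
  grid <- sort(unique(c(s_d, s_nd)))           # common score grid
  x <- ecdf(s_nd)(grid)                        # x = F_nd(S)
  y <- ecdf(s_d)(grid)                         # y = F_d(S)
  x <- c(0, x); y <- c(0, y)                   # start both curves in the origin

  auc <- trapezium(x, y)                       # AUC = integral of y dx        (2.4)
  q1  <- trapezium(x, y^2)                     # Q1  = integral of y^2 dx      (3.1)
  q2  <- trapezium(y, (1 - x)^2)               # Q2  = integral of (1-x)^2 dy  (3.2)

  nd <- length(s_d); nnd <- length(s_nd)
  var_auc <- (auc * (1 - auc) + (nd - 1) * (q1 - auc^2) +
              (nnd - 1) * (q2 - auc^2)) / (nd * nnd)        # (2.8)
  list(AUC = auc, AR = 2 * auc - 1, var_AUC = var_auc, var_AR = 4 * var_auc)
}
```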

4 Sample variance of the area under the curve and accuracy ratio based on score distribution assumptions

In this section, we assume exponentially distributed scores that lead to the equation of Hanley and McNeil (1982) for the sample variance, and normally distributed scores that lead to the sample variance of the so-called binormal model (Krzanowski and Hand 2009). These distribution assumptions may be strong, but Property 2.1 states that the ROC is invariant under a monotonous transformation of the scores. Due to this transformation invariance, the equations as derived in this section may also be applicable to other score distributions that can be transformed into normal or exponential distributions.

Hanley and McNeil (1982) introduced equations for $Q_1$ and $Q_2$ in terms of the true (unobserved) AUC, assuming exponential score distributions. Here, we reproduce their results using (3.1) and (3.2), assuming that the scores of the defaulting and nondefaulting counterparties are exponentially distributed: $x(s) = 1 - \exp(-\lambda s)$ and $y(s) = 1 - \exp(-\mu s)$. Under these assumptions, (2.4) gives

  $\mathrm{AUC} = \int_0^1 y(x)\,dx = \int_0^1 \{1 - (1-x)^{\mu/\lambda}\}\,dx = \dfrac{\mu}{\lambda+\mu}.$   (4.1)

Using (3.1) and (3.2), the distributions give the following probabilities:

  $Q_1 = \int_0^1 y^2\,dx = \dfrac{2\mu^2}{(2\mu+\lambda)(\lambda+\mu)} = \dfrac{2\,\mathrm{AUC}^2}{1+\mathrm{AUC}},$   (4.2)
  $Q_2 = \int_0^1 (1-x)^2\,dy = \dfrac{\mu}{2\lambda+\mu} = \dfrac{\mathrm{AUC}}{2-\mathrm{AUC}},$   (4.3)

where we used (4.1) to express the probabilities in the AUC. Hanley and McNeil (1982) proposed these probabilities to calculate $\sigma^2_{\widehat{\mathrm{AUC}}}$. Using these probabilities in (2.8) and combining with

  $\mathrm{AR} = 2\,\mathrm{AUC} - 1 \quad\text{and}\quad \sigma^2_{\widehat{\mathrm{AR}}} = 4\sigma^2_{\widehat{\mathrm{AUC}}},$

we find the following equation for the variance in the sample AR:

  $\sigma^2_{\widehat{\mathrm{AR}}} = \dfrac{1-\mathrm{AR}^2}{N_d N_{nd}}\left\{1 + (N_d-1)\dfrac{1+\mathrm{AR}}{3+\mathrm{AR}} + (N_{nd}-1)\dfrac{1-\mathrm{AR}}{3-\mathrm{AR}}\right\}.$   (4.4)

We will refer to (4.4) as the Hanley–McNeil (HM) method.
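For reference, (4.4) translates into a short R function; the argument names are illustrative.

```r
# Sketch (R): sample variance of the AR under the HM method, ie (4.4).
# 'ar' is the (true or estimated) accuracy ratio; 'nd' and 'nnd' are the
# numbers of defaulting and nondefaulting counterparties.

hm_var_ar <- function(ar, nd, nnd) {
  (1 - ar^2) / (nd * nnd) *
    (1 + (nd - 1) * (1 + ar) / (3 + ar) + (nnd - 1) * (1 - ar) / (3 - ar))
}

# Example: AR = 0.6 with 100 defaults and 9900 nondefaults;
# sqrt(hm_var_ar(0.6, 100, 9900)) gives the corresponding standard error.
```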

Another parametric method to calculate the AUC is the binormal model. This model assumes that the scores of the nondefaulting and defaulting counterparties are normally distributed: $x(s) = \Phi[(s-\mu_{nd})/\sigma_{nd}]$ and $y(s) = \Phi[(s-\mu_d)/\sigma_d]$. Given these cumulative distributions, we can express y in x as

  $y(x) = \Phi\!\left[\dfrac{\mu_{nd}-\mu_d}{\sigma_d} + \dfrac{\sigma_{nd}}{\sigma_d}\,\Phi^{-1}[x]\right]$   (4.5)

and x in y as

  $x(y) = \Phi\!\left[\dfrac{\mu_d-\mu_{nd}}{\sigma_{nd}} + \dfrac{\sigma_d}{\sigma_{nd}}\,\Phi^{-1}[y]\right].$   (4.6)

The AUC follows from (2.4):

  $\mathrm{AUC} = \int_0^1 y\,dx = \Phi\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_d^2+\sigma_{nd}^2}}\right].$   (4.7)

Equation (4.7) was found earlier (Tasche 2010) and makes sense intuitively: $\mathrm{AUC} = \tfrac{1}{2}$ when $\mu_{nd} = \mu_d$, as is the case for a nondiscriminatory scoring model. The AUC approaches 1 for $\mu_{nd} \gg \mu_d$, and 0 for $\mu_{nd} \ll \mu_d$. The probability $Q_1$ follows from (3.1):

  $Q_1 = \int_0^1 y^2\,dx = \Phi\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_{nd}^2+\sigma_d^2}}\right] - 2T\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_{nd}^2+\sigma_d^2}},\ \dfrac{\sigma_d}{\sqrt{\sigma_d^2+2\sigma_{nd}^2}}\right],$   (4.8)

in which

  $T[h,a] = \dfrac{1}{2\pi}\int_0^a \dfrac{\exp\{-\tfrac{1}{2}h^2(1+x^2)\}}{1+x^2}\,dx$

is Owen’s T function (Patefield and Tandy 2000). The probability Q2 follows from (3.2):

  $Q_2 = \int_0^1 (1-x)^2\,dy = 1 - 2\int_0^1 x\,dy + \int_0^1 x^2\,dy = \Phi\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_d^2+\sigma_{nd}^2}}\right] - 2T\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_{nd}^2+\sigma_d^2}},\ \dfrac{\sigma_{nd}}{\sqrt{\sigma_{nd}^2+2\sigma_d^2}}\right],$   (4.9)

in which we used (4.6) to calculate the integrals. This shows how powerful (2.2) and (2.3) are: they lead to (4.8) and (4.9), and the integrals $\int_0^1 x\,dy$ and $\int_0^1 x^2\,dy$ follow from (2.3). Using these probabilities $Q_1$ and $Q_2$ in (2.8) gives the sample variance in the AUC:

  $\sigma^2_{\widehat{\mathrm{AUC}}} = \mathrm{AUC}(1-\mathrm{AUC})\,\dfrac{N_{nd}+N_d-1}{N_d N_{nd}} - 2\left\{\dfrac{N_d-1}{N_d N_{nd}}\,T\!\left[\Phi^{-1}[\mathrm{AUC}],\ \dfrac{\sigma_d}{\sqrt{\sigma_d^2+2\sigma_{nd}^2}}\right] + \dfrac{N_{nd}-1}{N_d N_{nd}}\,T\!\left[\Phi^{-1}[\mathrm{AUC}],\ \dfrac{\sigma_{nd}}{\sqrt{\sigma_{nd}^2+2\sigma_d^2}}\right]\right\}.$   (4.10)

We obtain the following equation for the sample variance in the AR:

  $\sigma^2_{\widehat{\mathrm{AR}}} = (1-\mathrm{AR}^2)\,\dfrac{N_{nd}+N_d-1}{N_d N_{nd}} - \dfrac{8}{N_d N_{nd}}\left\{(N_d-1)\,T\!\left[\Phi^{-1}\!\left[\dfrac{1+\mathrm{AR}}{2}\right],\ \dfrac{\sigma_d}{\sqrt{\sigma_d^2+2\sigma_{nd}^2}}\right] + (N_{nd}-1)\,T\!\left[\Phi^{-1}\!\left[\dfrac{1+\mathrm{AR}}{2}\right],\ \dfrac{\sigma_{nd}}{\sqrt{\sigma_{nd}^2+2\sigma_d^2}}\right]\right\}.$   (4.11)

We will refer to (4.11) as the binormal (BN) method. The HM and BN methods are valid under specific assumptions about the underlying distributions. However, these equations may still be applicable to other score distributions, due to the transformation invariance property of the ROC as given in Section 2. Both methods also align with our intuition, as they give $Q_1 = Q_2 = \tfrac{1}{3}$ for a nondiscriminatory scoring model and $Q_1 = Q_2 = 1$ for a perfectly discriminatory scoring model. In the next section, we introduce a distribution-independent approach.
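As a sketch, (4.11) can be evaluated in R by computing Owen's T function through direct numerical integration of its definition; a dedicated implementation from a package could be substituted, and all function and argument names below are illustrative.

```r
# Sketch (R): sample variance of the AR under the BN method, ie (4.11).
# 'sigma_d' and 'sigma_nd' are the assumed score standard deviations of the
# defaulting and nondefaulting counterparties.

owens_t <- function(h, a) {
  # Owen's T function by numerical integration of its defining integral
  integrate(function(x) exp(-0.5 * h^2 * (1 + x^2)) / (1 + x^2),
            lower = 0, upper = a)$value / (2 * pi)
}

bn_var_ar <- function(ar, nd, nnd, sigma_d, sigma_nd) {
  h  <- qnorm((1 + ar) / 2)                         # Phi^{-1}[AUC]
  a1 <- sigma_d  / sqrt(sigma_d^2  + 2 * sigma_nd^2)
  a2 <- sigma_nd / sqrt(sigma_nd^2 + 2 * sigma_d^2)
  (1 - ar^2) * (nnd + nd - 1) / (nd * nnd) -
    8 / (nd * nnd) * ((nd - 1) * owens_t(h, a1) + (nnd - 1) * owens_t(h, a2))
}

# Example with equal score standard deviations, so both T terms coincide:
# bn_var_ar(0.6, 100, 9900, sigma_d = 1, sigma_nd = 1)
```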

5 Sample variance of the area under the curve and accuracy ratio without distribution assumptions

Table 1: Overview of the possible events and their corresponding probabilities for AUC = 1/2 (not discriminatory), AUC = 1 (perfectly discriminatory) or AUC = 0 (inversion).

  Event                                                      AUC = 1/2   AUC = 1   AUC = 0
  $Q_1 = P[S_{d1}, S_{d2} < S_{nd}]$                         1/3         1         0
  $P[S_{nd} < S_{d1}, S_{d2}]$                               1/3         0         1
  $P[S_{d1} < S_{nd} < S_{d2}$ or $S_{d2} < S_{nd} < S_{d1}]$ 1/3         0         0
  Total                                                      1           1         1

In the previous section, we described methods for calculating $Q_1$ and $Q_2$. These methods have in common that the underlying score distributions $y = F_d(S)$ and $x = F_{nd}(S)$ need to be known. In some cases we can infer $Q_1$ and $Q_2$ without distribution assumptions. For example, Table 1 shows that the event $[S_{d1}, S_{d2} < S_{nd}]$ is one of three possible events. If the credit scoring model has no discriminatory power, the three events are equally likely and each occurs with probability $\tfrac{1}{3}$: $Q_1 = \tfrac{1}{3}$ when $\mathrm{AUC} = \tfrac{1}{2}$. When the scoring model discriminates perfectly between defaults and nondefaults, $\mathrm{AUC} = 1$ and $Q_1 = 1$. Similarly, $Q_2 = \tfrac{1}{3}$ for $\mathrm{AUC} = \tfrac{1}{2}$ and $Q_2 = 1$ for $\mathrm{AUC} = 1$.

Table 1 can be used to calculate the correlation. The probability $Q_1 = P[S_{d1}, S_{d2} < S_{nd}]$ is the likelihood of two simultaneous events: $S_{d1} < S_{nd}$ and $S_{d2} < S_{nd}$. Assuming independence gives $P[S_{d1}, S_{d2} < S_{nd}] = P[S_{d1} < S_{nd}]\,P[S_{d2} < S_{nd}] = \mathrm{AUC}^2$. However, these events are correlated, since the scores of two different defaults are compared with the same score of a nondefaulting counterparty. This correlation follows from

  $\rho = \dfrac{P[S_{d1}, S_{d2} < S_{nd}] - P[S_{d1} < S_{nd}]\,P[S_{d2} < S_{nd}]}{\sqrt{P[S_{d1} < S_{nd}](1-P[S_{d1} < S_{nd}])}\,\sqrt{P[S_{d2} < S_{nd}](1-P[S_{d2} < S_{nd}])}}.$   (5.1)

Using (2.4) and writing P[Sd1,Sd2<Snd] explicitly gives

  $P[S_{d1}, S_{d2} < S_{nd}] = \mathrm{AUC}^2 + \rho(\mathrm{AUC} - \mathrm{AUC}^2).$   (5.2)

To derive the event correlation $\rho$, we consider a scoring model with no discriminatory power. This means that $\mathrm{AUC} = P[S_d < S_{nd}] = \tfrac{1}{2}$. Table 1 shows that $[S_{d1}, S_{d2} < S_{nd}]$ is one of three possible events. Since the scoring model has no discriminatory power, the three events are equally likely and each occurs with probability $\tfrac{1}{3}$. This means that $P[S_{d1}, S_{d2} < S_{nd}] = (\tfrac{1}{2})^2 + \rho(\tfrac{1}{2} - (\tfrac{1}{2})^2) = \tfrac{1}{3}$, which gives $\rho = \tfrac{1}{3}$. Using this correlation in (5.2) gives

  $Q_1 = P[S_{d1}, S_{d2} < S_{nd}] = \tfrac{1}{3}\,\mathrm{AUC}(1 + 2\,\mathrm{AUC}).$   (5.3)

A similar approach gives the same result for Q2, so that we have

  $Q_1 = P[S_{d1}, S_{d2} < S_{nd}] = Q_2 = P[S_d < S_{nd1}, S_{nd2}] = \tfrac{1}{3}\,\mathrm{AUC}(1 + 2\,\mathrm{AUC}).$   (5.4)

Substituting (5.4) into (2.8) gives the sample variance in the observed AUC^:

  $\sigma^2_{\widehat{\mathrm{AUC}}} = \dfrac{N_{nd}+N_d+1}{N_d N_{nd}}\,\dfrac{\mathrm{AUC}(1-\mathrm{AUC})}{3}.$   (5.5)

Equation (5.5) supports the view of Stein (2007), ie, the uncertainty in the AUC is mainly determined by the minority class. The sample variance σAUC^2 is high at a low default rate, in which case the number of defaults is low. Equation (5.5) can be written in the following form:

  $\sigma^2_{\widehat{\mathrm{AUC}}} \approx \left\{\dfrac{1}{N_d}+\dfrac{1}{N_{nd}}\right\}\dfrac{\mathrm{AUC}(1-\mathrm{AUC})}{3} < \dfrac{\mathrm{AUC}(1-\mathrm{AUC})}{\min[N_d, N_{nd}]},$   (5.6)

which aligns with the upper bound in (2.7). The sample variance σAUC^2 is also high at a high default rate, in which case the number of nondefaults is low.

We can express (5.5) in probabilities using (2.4) and (5.4):

  $\sigma^2_{\widehat{\mathrm{AUC}}} = \dfrac{N_{nd}+N_d+1}{N_d N_{nd}}\{P[S_d < S_{nd1}, S_{nd2}] - P[S_d < S_{nd1}]\,P[S_d < S_{nd2}]\}.$   (5.7)

The AUC is known as a proper metric of discriminatory power, but it only varies between 0 and 1. The AR has a wider range than the AUC, as it varies from -1 for an inverted model, through 0 for a nondiscriminatory model, to 1 for a perfectly discriminatory model. Therefore, we prefer an equation for the sample variance in the AR. Using $\mathrm{AR} = 2\,\mathrm{AUC} - 1$ and $\sigma^2_{\widehat{\mathrm{AR}}} = 4\sigma^2_{\widehat{\mathrm{AUC}}}$ gives an equation for the sample variance in the observed $\widehat{\mathrm{AR}}$:

  $\sigma^2_{\widehat{\mathrm{AR}}} = \dfrac{N_{nd}+N_d+1}{N_d N_{nd}}\,\dfrac{1-\mathrm{AR}^2}{3}.$   (5.8)

Since (5.8) is derived without any assumption regarding the underlying score distributions, we refer to this equation as the distribution-free (DF) method. The AUC and AR in (5.5) and (5.8) represent the true AUC and AR and not those observed in the data. The observations $\widehat{\mathrm{AUC}}$ and $\widehat{\mathrm{AR}}$ converge to their true values AUC and AR in the case of an infinite number of observations, since $N_d \to \infty$ and $N_{nd} \to \infty$ lead to

  $\sigma^2_{\widehat{\mathrm{AUC}}} \to 0 \quad\text{and}\quad \sigma^2_{\widehat{\mathrm{AR}}} \to 0.$

The DF method is free from the assumptions of the underlying score distributions, but this comes at the cost of ignoring higher-order terms. This becomes clear if we apply Taylor expansions to (4.4), giving

  $\sigma^2_{\widehat{\mathrm{AR}}} = \dfrac{1-\mathrm{AR}^2}{N_d N_{nd}}\left\{1 + (N_d-1)\left(\tfrac{1}{3} + c_1\mathrm{AR} + c_2\mathrm{AR}^2 + \cdots\right) + (N_{nd}-1)\left(\tfrac{1}{3} - c_1\mathrm{AR} + c_2\mathrm{AR}^2 + \cdots\right)\right\}$   (5.9)

with $c_1 = \tfrac{2}{9}$, $c_2 = -\tfrac{2}{27}, \ldots$. When we ignore these coefficients, (5.9) reduces to the DF method. This gives equal probabilities ($Q_1 = Q_2$) and an equation for $\sigma^2_{\widehat{\mathrm{AR}}}$ that is symmetric around $\mathrm{AR} = 0$. By excluding the higher-order terms in AR, the DF method does not account for possible nonlinear or asymmetric effects. These terms require assumptions on the underlying score distributions, but they become small for $|\mathrm{AR}| < 1$. The next section demonstrates which method gives the most appropriate results.
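For reference, a minimal R sketch of the DF formulas (5.5) and (5.8); the argument names are illustrative.

```r
# Sketch (R): sample variance of the AR and the AUC under the DF method,
# ie (5.8) and (5.5).

df_var_ar <- function(ar, nd, nnd) {
  (nnd + nd + 1) / (nd * nnd) * (1 - ar^2) / 3
}

df_var_auc <- function(auc, nd, nnd) {
  (nnd + nd + 1) / (nd * nnd) * auc * (1 - auc) / 3
}

# Example: standard error of an AR of 0.6 for 100 defaults and 9900 nondefaults
# sqrt(df_var_ar(0.6, 100, 9900))
```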

6 Demonstration by simulations

We performed simulations in R/RStudio version 1.1.423 to investigate the distribution of the AR and to reveal which of the four methods (NI, HM, BN or DF) agrees best with the sample variance in the AR. In the simulations, we assume two situations.

Case 1 The scores of the counterparties in the credit portfolio vary between $-\infty$ and $\infty$. We use the Cornish–Fisher expansion to simulate the score distribution:

  $s = \mu + \sigma\left\{z + \dfrac{z^2-1}{3!}SK + \dfrac{z^3-3z}{4!}EK - \dfrac{2z^3-5z}{36}SK^2\right\},$   (6.1)

in which z is a standard normally distributed random number. The score distribution has a mean score μ=50, a standard deviation σ=28, a skewness SK=0.5 and an excess kurtosis EK=0.8. We consider a distribution with some skewness and kurtosis to be more realistic than a normal distribution. The conditional default probability Pc(s) is the default probability of a counterparty conditional on its score s. We assume a logit function for the conditional probability of default Pc(s).

Case 2 The score varies uniformly between a minimum score Smin=0 and a maximum score Smax=100. Further, the conditional probability of default Pc(s) is an exponential function of the score s.

In both cases, the default/nondefault state of a counterparty i with score s is simulated by a Bernoulli variable B[i;Pc(s)], which equals 1 (default) with probability Pc(s) and 0 (nondefault) with probability 1-Pc(s). We define N as the total number of counterparties, U as the Mann–Whitney U-statistics and P as the overall default rate of the portfolio. The variance σAUC2 in the AUC and the variance σU2 in the Mann–Whitney U-statistics are calculated from the simulations. We used 1000 simulations of a portfolio of 100 000 counterparties for cases 1 and 2.
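A minimal R sketch of one simulation run for case 1 under the assumptions stated above: scores follow the Cornish–Fisher expansion (6.1) and the default state is drawn from a Bernoulli variable with a logit conditional default probability. The logit coefficients below are illustrative choices, not the values used for the results in this paper.

```r
# Sketch (R): one simulated portfolio for case 1.
# 'a' and 'b' are assumed logit coefficients; b < 0 so that a higher score
# implies a lower conditional default probability.

simulate_case1 <- function(n = 100000, mu = 50, sigma = 28, sk = 0.5, ek = 0.8,
                           a = -2.0, b = -0.05) {
  z <- rnorm(n)                                              # standard normal draws
  s <- mu + sigma * (z + (z^2 - 1) / factorial(3) * sk +
                       (z^3 - 3 * z) / factorial(4) * ek -
                       (2 * z^3 - 5 * z) / 36 * sk^2)        # Cornish-Fisher (6.1)
  pc <- 1 / (1 + exp(-(a + b * s)))                          # logit conditional PD
  d  <- rbinom(n, size = 1, prob = pc)                       # 1 = default, 0 = nondefault
  data.frame(score = s, default = d)
}

# One portfolio; the observed AR could then be computed with, eg, the NI sketch above.
set.seed(42)
pf <- simulate_case1()
```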

The methods described above are based on the connection between the Mann–Whitney U-statistics and the AUC, given by (European Central Bank 2019)

  $U = \mathrm{AUC}\,N_d N_{nd} = \mathrm{AUC}\,N^2 P(1-P).$   (6.2)

Thus, we first compare the variance $\sigma^2_{\mathrm{AUC}}$ with the variance $\sigma^2_U/\{N^2\hat{P}(1-\hat{P})\}^2$, in which $\hat{P}$ represents the portfolio default rate, averaged over all simulations. However, the portfolio default rate P varies around $\bar{P} = 1\%$ in the simulations, introducing an extra variance $\sigma^2_P$ in $\sigma^2_U$:

  $\sigma^2_U = \left\{\dfrac{\partial U}{\partial \mathrm{AUC}}\right\}^2\sigma^2_{\mathrm{AUC}} + \left\{\dfrac{\partial U}{\partial P}\right\}^2\sigma^2_P + 2\rho_{\mathrm{AUC},P}\left\{\dfrac{\partial U}{\partial \mathrm{AUC}}\,\sigma_{\mathrm{AUC}}\right\}\left\{\dfrac{\partial U}{\partial P}\,\sigma_P\right\},$   (6.3)

in which

  $\dfrac{\partial U}{\partial \mathrm{AUC}} = N^2 P(1-P)$

and

  $\dfrac{\partial U}{\partial P} = (1-2P)\,N^2\,\mathrm{AUC} = \dfrac{(1-2P)\,U}{P(1-P)}.$

The correlation $\rho_{\mathrm{AUC},P}$ generally varies between -0.05 and 0.05 in the simulations, and the standard deviation $\sigma_P$ is 0.003. Although these quantities are quite small, we correct the variance $\sigma^2_U$ by subtracting the two extra terms in (6.3). Figures 2 and 3 show that, after correction, the variance $\sigma^2_U/\{N^2\hat{P}(1-\hat{P})\}^2$ resembles the variance $\sigma^2_{\mathrm{AUC}}$ for different areas under the curve for both case 1 and case 2.

Next, we show by simulation whether the probabilities $Q_1$ and $Q_2$ align with those of the HM, BN and DF methods. The simulations confirm (3.1) and (3.2); numerical integration leads to the same probabilities as counting the events $[S_{d1}, S_{d2} < S_{nd}]$ and $[S_d < S_{nd1}, S_{nd2}]$, but it requires less computation time. Therefore, we used the integrals in (3.1) and (3.2) to derive the probabilities from the simulations. These integrals were calculated with the trapezium rule.

Figure 2: Comparison of the variance $\sigma^2_{\mathrm{AUC}}$ with the variance $\sigma^2_U/\{N^2 P(1-P)\}^2$ for case 1.
Figure 3: Comparison of the variance $\sigma^2_{\mathrm{AUC}}$ with the variance $\sigma^2_U/\{N^2 P(1-P)\}^2$ for case 2.
Figure 4: Probabilities $Q_1$ as a function of the AUC, based on simulations and on the Hanley–McNeil (HM), binormal (BN) and distribution-free (DF) methods for case 1.
Figure 5: Probabilities $Q_2$ as a function of the AUC, based on simulations and on the Hanley–McNeil (HM), binormal (BN) and distribution-free (DF) methods for case 1.

Figures 4 and 5 compare the probabilities from simulations with their theoretical counterparts. The dots in Figure 4 are the probabilities $Q_{1,\mathrm{sim}}$, derived from the simulations, whereas the solid, dashed and dotted lines represent the probabilities as calculated by the HM method using (4.2), the BN method using (4.8) and the DF method using (5.4). Similarly, Figure 5 compares the probabilities $Q_{2,\mathrm{sim}}$, resulting from the simulations, with the probabilities as calculated by the HM, BN and DF methods. The figures reveal that the probabilities $Q_{1,\mathrm{sim}}$ and $Q_{2,\mathrm{sim}}$ are close to $Q_1$ and $Q_2$ as calculated by the DF, HM and BN methods. To make this more quantitative, we calculated the deviation between the simulated probabilities and the probabilities calculated by each method as the sum of their squared differences. Table 2 compares the deviations $\Delta Q_1$ and $\Delta Q_2$ for probabilities $Q_1$ and $Q_2$, respectively, for the HM, BN and DF methods and for cases 1 and 2. The table shows that the DF method gives the smallest deviations, except for probability $Q_2$ in case 2, where the BN method gives the smallest deviation. We conclude that the DF and BN methods give more accurate estimations of probabilities $Q_1$ and $Q_2$ than the HM method.

Table 2: Deviation between probabilities $Q_{1,\mathrm{sim}}$ and $Q_{2,\mathrm{sim}}$, based on simulations, and the corresponding probabilities based on the DF, BN and HM methods.

(a) Deviation $\Delta Q_1 = \sum(Q_{1,m}-Q_{1,\mathrm{sim}})^2$

  Method                      Case 1   Case 2
  Distribution-free (m=DF)    0.0009   0.0007
  Binormal (m=BN)             0.0011   0.0014
  Hanley–McNeil (m=HM)        0.0031   0.0056

(b) Deviation $\Delta Q_2 = \sum(Q_{2,m}-Q_{2,\mathrm{sim}})^2$

  Method                      Case 1   Case 2
  Distribution-free (m=DF)    0.0005   0.0029
  Binormal (m=BN)             0.0015   0.0007
  Hanley–McNeil (m=HM)        0.0028   0.0057
Figure 6: Comparison of the variance resulting from simulations with the variance calculated by the numerical integration (NI), Hanley–McNeil (HM), binormal (BN) and distribution-free (DF) methods for case 1.
Figure 7: Comparison of the variance resulting from simulations with the variance calculated by the numerical integration (NI), Hanley–McNeil (HM), binormal (BN) and distribution-free (DF) methods for case 2.

We verified the NI, HM, BN and DF methods by performing 1000 simulations for a portfolio of 100 000 counterparties with a portfolio default rate P = 1%. In each simulation, we construct an ROC and calculate $\widehat{\mathrm{AUC}}$ by the trapezium rule for integration. $\widehat{\mathrm{AR}}$ follows from $\widehat{\mathrm{AUC}}$ by $\widehat{\mathrm{AR}} = 2\,\widehat{\mathrm{AUC}} - 1$. We obtain the true AR by averaging $\widehat{\mathrm{AR}}$ over all 1000 simulations. Since the variance of the mean decreases with 1/n, the average $\widehat{\mathrm{AR}}$ closely approximates the true AR for n = 1000 simulations. We performed these simulations for several accuracy ratios, ranging from -1 for an inverted credit scoring model to 1 for a credit scoring model with perfect discriminatory power.

Figure 6 presents the sample variances, resulting from the simulations and the different methods for case 1, as a function of the true AR: the black diamonds represent the sample variance in the AR resulting from the 1000 simulations; the white dots represent the sample variance of the NI method, calculated by (2.8), (3.1) and (3.2); the solid line represents the sample variance of the HM method as calculated by (4.4); the gray dashed line represents the sample variance of the BN method calculated by (4.11); the dotted line represents the sample variance of the DF method calculated by (5.8).

Figure 7 shows the same results for case 2. These figures show that the NI method closely resembles the sample variance σAR^2 as calculated from the simulations. The HM and BN methods result in a sample variance σAR^2 that is an asymmetric function of the real AR, whereas the DF method gives a symmetric function. We use the R2 as a measure of the extent to which the HM, BN or DF method fits the sample variance σAR^2, resulting from the simulations. Table 3 shows the R2 for the NI, HM, BN and DF methods. For both cases 1 and 2, the NI method gives the highest R2 and the HM method gives the lowest R2. Considering the parametric methods, the DF method gives the most accurate sample variance for case 1, whereas the BN method is most accurate for case 2. The DF method overestimates the sample variance and is therefore conservative in case 2.

Table 3: Values of R2 (in percent) for cases 1 and 2, based on the NI, DF, BN and HM methods.
Methodology Case 1 Case 2
Numerical integration 99.3 99.5
Distribution-free 96.2 83.2
Binormal 89.2 96.3
Hanley–McNeil 82.0 65.8

We consider case 1 to be more realistic than case 2, since it includes effects such as skewness and kurtosis. Case 2 assumes homogeneously distributed scores, which will not often occur in practice. Since the DF method shows the best performance in case 1, we decided to investigate this method further. We regressed

  $\dfrac{3N_d N_{nd}}{N_{nd}+N_d+1}\,\sigma^2_{\widehat{\mathrm{AR}}}$   (6.4)

against $a_0 + a_1\mathrm{AR} + a_2\mathrm{AR}^2$, using the results for case 1. According to (5.8), $a_0 = 1$, $a_1 = 0$ and $a_2 = -1$.

Table 4: Regression coefficients of (6.4) against AR and AR², with their 95% confidence intervals and standard errors.

  Coefficient             Estimate        Standard error   t-value   p-value (%)
  Intercept ($a_0$)       0.97 ± 0.03     0.02             60.1      0.00
  Coefficient AR ($a_1$)  -0.04 ± 0.03    0.02             2.7       0.02
  Coefficient AR² ($a_2$) -1.03 ± 0.06    0.03             34.3      0.00
  R² of regression (%)    98.6
Figure 8: Comparison of the probability density functions of the AR, based on 1000 simulations, with the normal distribution. The normal distribution is represented by the dashed line. (a) AR=-0.99. (b) AR=-0.36. (c) AR=-0.16. (d) AR=0.15. (e) AR=0.41. (f) AR=0.61. (g) AR=0.76. (h) AR=0.85. (i) AR=0.95.
Figure 9: Jarque–Bera statistics of several AR distributions, based on 1000 simulations of a portfolio that consists of 100 000 counterparties with a portfolio default rate of 1%, as a function of the mean AR.

Table 4 presents the regression results: a0 and a2 agree with these values within the 95% confidence interval, but a1 gives a small but significant nonzero contribution. This comes from the higher-order term as explained by (5.9) and ignored in the DF method. Including this term improves the R2 from 96.2% to 98.6%. We conclude that the sample variance σAR^2 relates to 1-AR2 to a large extent, but the DF method may overestimate the sample variance by ignoring asymmetric and nonlinear effects. In addition, Figure 7 shows that the sample variance deviates from a parabolic function of AR as a result of excluding higher-order terms.

We constructed probability density functions of the AR from the simulations for case 1. Figure 8 presents the probability density functions for several AR values. The dashed line represents a normal distribution with the same mean and standard deviation as the AR density. The AR densities resemble a normal distribution, except when the AR is close to 1 or -1. In these cases, the AR densities become skewed and the sample accuracy ratios are not normally distributed. We used the Jarque–Bera statistics to test the normality of the AR densities. The Jarque–Bera statistics are calculated by

  $JB = N_{\mathrm{sim}}\left(\dfrac{SK^2}{6} + \dfrac{EK^2}{24}\right),$   (6.5)

with $N_{\mathrm{sim}}$ the number of simulations and SK and EK denoting the skewness and excess kurtosis of the AR density, respectively. If the probability density functions are normal, the Jarque–Bera statistics are chi-squared distributed with two degrees of freedom. Based on a significance level of 5%, the null hypothesis of normality is rejected when the Jarque–Bera statistic exceeds the threshold $\chi^2_2(0.95) = 5.99$. Figure 9 shows the Jarque–Bera statistics for the different densities versus the mean AR. The dashed line represents the threshold above which the null hypothesis of normality is rejected. The Jarque–Bera test rejects normality for AR < -0.95 and AR > 0.95. We conclude that the probability density function of $\widehat{\mathrm{AR}}$ follows a normal distribution when |AR| < 0.95.
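A minimal R sketch of the Jarque–Bera statistic (6.5) applied to a vector of simulated accuracy ratios; the function and variable names are illustrative.

```r
# Sketch (R): Jarque-Bera normality check for a vector 'ar_sim' of observed
# accuracy ratios from repeated simulations.

jarque_bera <- function(ar_sim) {
  n  <- length(ar_sim)
  m  <- mean(ar_sim)
  s  <- sqrt(mean((ar_sim - m)^2))              # population standard deviation
  sk <- mean((ar_sim - m)^3) / s^3              # skewness
  ek <- mean((ar_sim - m)^4) / s^4 - 3          # excess kurtosis
  jb <- n * (sk^2 / 6 + ek^2 / 24)              # (6.5)
  list(JB = jb, reject_normality = jb > qchisq(0.95, df = 2))  # threshold 5.99
}
```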

7 Conclusions

Metrics such as the AR and the AUC measure the discriminatory power of a credit scoring model. However, these are often estimated for a sample of counterparties, and different samples may give different values. Firm conclusions on model performance can only be drawn if the sample distribution, or at least the sample variance, is known. Traditional equations for the sample variance are based on the equivalence with the Mann–Whitney statistics, but these calculations require a long computation time.

We derived equations for the sample variance in the AR and AUC using four methods: the NI method is based on numerical integration; the HM and BN methods assume specific score distributions; and the DF method does not depend on such an assumption, but excludes higher-order terms. Simulations show that the AR is normally distributed and the NI method gives the best estimation of the sample variance in the AUC or AR. However, the DF and BN methods provide closed-form equations, which fit the sample variance of the simulations quite well. The HM method performs worst in all cases. We conclude that the NI method is the preferred method to estimate the sample variance. The probabilities $Q_1$ and $Q_2$ are easily calculated from the areas under the curves $(x, y^2)$ and $(y, (1-x)^2)$, respectively, in the same way as the AUC is estimated from the ROC.

An advantage of the BN and DF methods is that they provide closed-form equations for the sample variance. It is hard to conclude which method is most appropriate. The DF method performs better than the BN method in case 1, which is considered more realistic than case 2. In case 2, the BN method is the most accurate and the DF method overestimates the variance.

We can apply the NI, DF or BN methods in hypothesis testing, since the observed AR^ follows a normal distribution. For example, we can define the following statistics with the DF method:

  $z_{\mathrm{AR}} = \dfrac{|\widehat{\mathrm{AR}} - \mathrm{AR}_0|}{\sqrt{(N_{nd}+N_d+1)(1-\mathrm{AR}_0^2)/(3N_d N_{nd})}}$   (7.1)

to test a null hypothesis such as the following.

(H0) The credit rating or scoring model has an accuracy ratio of AR0, ie, AR=AR0.

Based on the normality of the AR distribution, the p-value is calculated by $p = 1 - \Phi[z_{\mathrm{AR}}]$ and the null hypothesis is rejected when p < 5%.
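A minimal R sketch of this test based on the DF variance (5.8); the function name and example values are illustrative.

```r
# Sketch (R): test statistic (7.1) with the DF variance, testing
# H0: AR = AR0 for an observed accuracy ratio 'ar_hat'.

z_ar_test <- function(ar_hat, ar0, nd, nnd) {
  se0 <- sqrt((nnd + nd + 1) * (1 - ar0^2) / (3 * nd * nnd))  # DF standard error under H0
  z   <- abs(ar_hat - ar0) / se0                              # (7.1)
  p   <- 1 - pnorm(z)                                         # p-value as defined in the text
  list(z = z, p_value = p, reject = p < 0.05)
}

# Example: is an observed AR of 0.55 consistent with AR0 = 0.60
# for 100 defaults and 9900 nondefaults?
# z_ar_test(0.55, 0.60, 100, 9900)
```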

Recently, the ECB introduced a test for comparing the initial AUC with the current AUC (European Central Bank 2019), in which the sample variance is derived from the Mann–Whitney statistics. Alternatively, we suggest the NI, DF or BN method to calculate the sample variance in the AUC. We assumed a credit scoring model that generates continuous scores, so that ties rarely occur, whereas the ECB introduces this test for credit ratings rather than credit scores. Credit ratings are discrete values and ties do occur. Often, however, rating models are derived from scoring models with a continuous outcome, such as logistic regression or discriminant analysis. As such, the test described here can be applied to the credit scores before they are mapped to credit ratings. Recent research reveals that the information loss resulting from mapping credit scores to credit ratings is low (van der Burgt 2019).

Declaration of interest

The author reports no conflict of interest. The author alone is responsible for the content and writing of the paper. The views expressed in this article are those of the author and do not necessarily reflect those of Nationale Nederlanden Group.

References

  • Basel Committee on Banking Supervision (2005). Studies on the validation of internal rating systems. Working Paper 14, Basel Committee on Banking Supervision, Bank for International Settlements. URL: http://www.bis.org/publ/bcbs_wp14.pdf.
  • Cortes, C., and Mohri, M. (2004). Confidence intervals for the area under the ROC curve. In Proceedings of the 17th International Conference on Neural Information Processing Systems. Advances in Neural Information Processing Systems, Volume 17, pp. 305–312. MIT Press, Cambridge, MA.
  • Engelmann, B., Hayden, E., and Tasche, D. (2003). Measuring the discriminative power of rating systems. Discussion Paper 01/2003, Series 2: Banking and Financial Supervision, Deutsche Bundesbank.
  • European Central Bank (2019). Instructions for reporting the validation results of internal models: IRB Pillar I models for credit risk. Report, February, European Central Bank, Frankfurt. URL: https://bit.ly/3gTxMnM.
  • Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (https://doi.org/10.1148/radiology.143.1.7063747).
  • Iyer, R., Khwaja, A. I., Luttmer, E. F., and Shue, K. (2015). Screening peers softly: inferring the quality of small borrowers. Management Science 62(6), 1554–1577 (https://doi.org/10.1287/mnsc.2015.2181).
  • Krzanowski, W., and Hand, D. J. (2009). ROC Curves for Continuous Data. Monographs on Statistics and Applied Probability, Volume 111. Chapman & Hall/CRC Press, Boca Raton, FL.
  • Patefield, M., and Tandy, D. (2000). Fast and accurate calculation of Owen’s T function. Journal of Statistical Software 5(5), 1–25 (https://doi.org/10.18637/jss.v005.i05).
  • Russell, H., Tang, Q. K., and Dwyer, D. W. (2012). The effect of imperfect data on default prediction validation tests. The Journal of Risk Model Validation 6(1), 77–96 (https://doi.org/10.21314/JRMV.2012.085).
  • Stein, R. M. (2007). Benchmarking default prediction models: pitfalls and remedies in model validation. The Journal of Risk Model Validation 1(1), 77–113 (https://doi.org/10.21314/JRMV.2007.002).
  • Stein, R. M. (2016). Evaluating discrete choice prediction models when the evaluation data is corrupted: analytic results and bias corrections for the area under the ROC. Data Mining and Knowledge Discovery 30(4), 763–796 (https://doi.org/10.1007/s10618-015-0437-7).
  • Tasche, D. (2008). Validation of internal rating systems and PD estimates. In The Analytics of Risk Model Validation, Christodoulakis, G., and Satchell, S. (eds), pp. 169–196. Academic Press (https://doi.org/10.1016/b978-075068158-2.50014-7).
  • Tasche, D. (2010). Estimating discriminatory power and PD curves when the number of defaults is small. Preprint (arXiv:0905.3928v2). URL: http://arxiv.org/pdf/0905.3928.pdf.
  • van der Burgt, M. J. (2008). Calibrating low-default portfolios, using the cumulative accuracy profile. The Journal of Risk Model Validation 1(4), 17–33 (https://doi.org/10.21314/JRMV.2008.016).
  • van der Burgt, M. J. (2019). Calibration and mapping of credit scores by riding the cumulative accuracy profile. The Journal of Credit Risk 15(1), 1–25 (https://doi.org/10.21314/JCR.2018.240).
  • Wu, J. C., Martin, A. F., and Kacker, R. N. (2016). Validation of nonparametric two-sample bootstrap in ROC analysis on large datasets. Communications in Statistics: Simulation and Computation 45(5), 1689–1703 (https://doi.org/10.1080/03610918.2015.1065327).
