Journal of Risk Model Validation


How accurate is the accuracy ratio in credit risk model validation?

Marco van der Burgt

  • This paper introduces several methods to calculate the sample variance in the accuracy ratio and the area under the curve.
  • The first method is based on numerical integration and gives the best estimate of the sample variance;
  • The method based on assuming normally distributed scores and the method that is free from score distribution assumptions give reasonable estimates of the sample variance;
  • The accuracy ratio and area under the curve are normally distributed.

The receiver operating curve and the cumulative accuracy profile visualize the ability of a credit scoring model to distinguish defaulting from nondefaulting counterparties. These curves lead to performance metrics such as the accuracy ratio and the area under the curve. Since these performance metrics are sample properties, we cannot draw firm conclusions on the model performance without knowing the sampling distribution or the sample variance. We present four methods to estimate the sample variance of the accuracy ratio and the area under the curve. The first method is based on numerical integration, the second and third methods assume specific score distributions, and the fourth method uses a correlation, leading to a distribution-independent equation for the sample variance. We demonstrate by simulations that the first method gives the best estimate of the sample variance. The distribution-independent equation gives reasonable estimates of the sample variance, but ignores higher-order effects that are distribution dependent.

1 Introduction

Recently, the European Central Bank (ECB) published their instructions for reporting the validation results of internal models for credit risk under the internal ratings-based approach (European Central Bank 2019). One of the key validation subjects in this reporting standard is the testing of the discriminatory power of credit ratings or scores. Discriminatory power is the ability to discriminate ex ante between defaulting and nondefaulting borrowers (Basel Committee on Banking Supervision 2005, Chapter 3). The analysis of discriminatory power aims to ensure that the ranking of customers by ratings or scores appropriately separates riskier and less risky customers.

The receiver operating curve (ROC) and the cumulative accuracy profile (CAP) are frequently used to visualize the discriminatory power of a credit scoring or credit rating model. The CAP depicts the cumulative percentage of all clients versus the cumulative percentage of defaulters. The ROC shows the performance of a credit scoring model at various score thresholds. This makes the ROC applicable in other disciplines such as machine learning, medical statistics, geosciences and biosciences (Krzanowski and Hand 2009). A numerical metric of discriminatory power is the area under the curve (AUC), which can be derived from the ROC and, indirectly, from the CAP. Another metric is the accuracy ratio (AR), which relates to the AUC as AR=2AUC-1, as explained in the next section.

The ECB instructions frequently refer to the AUC as a metric of discriminatory power. One of the tests in the instructions is the comparison between the AUC at the time of initial validation and the AUC at the end of the relevant observation period. These tests require the standard error in the observed AUC, which is greatly influenced by the number of defaults (Stein 2007). Data flaws, such as coupling information of a defaulted counterparty to a nondefault status, lead directly to an underestimated AR (Stein 2016; Russell et al 2012). Further, metrics such as AUC and AR depend on the portfolio, and comparing these metrics over time or between different portfolios might be misleading. Several authors give the variance or standard error in the AUC in terms of the Mann–Whitney U-statistics (Engelmann et al 2003; Basel Committee on Banking Supervision 2005, Chapter 3; European Central Bank 2019). However, these calculations are cumbersome, as they require a score or rating comparison of every defaulting counterparty with two nondefaulting counterparties. For a loan portfolio of 10 000 obligors and a default rate of 1%, this already leads to on the order of $10^{10}$ comparisons.

We deduce closed-form equations for the sample variance in the observed AR and AUC, starting in Section 2 with a brief description of the CAP and the ROC and their relation to metrics such as AR and AUC. Sections 3 and 4 introduce equations for the sample variance in the AR and the AUC, based on numerical integration and on assuming specific score distributions of the defaulting and nondefaulting counterparties. Section 5 presents an equation for the sample variance without specific distribution assumptions. Then, in Section 6, we demonstrate by simulations how the observed AR is distributed and which method provides the best estimation of the sample variance in the AR or AUC. Section 7 concludes.

2 Metrics of discriminatory power

This section describes the ROC and the CAP, also called the power curves (Engelmann et al 2003; Tasche 2010; Krzanowski and Hand 2009). To demonstrate these curves, we assume a credit scoring model, generating credit scores that increase with decreasing default risk: the higher the score S, the higher the credit quality as perceived by the lender and the lower the default probability. We also assume a continuous credit score of sufficient granularity such that the probability of obtaining ties can be neglected.

Figure 1: ROC curve for a credit scoring model with perfect discriminatory power, with moderate discriminatory power, with no discriminatory power and with inversion.

The CAP and the ROC are constructed by first arranging the counterparties by increasing score S, ie, from high default risk to low default risk. The ROC results from plotting the cumulative percentage of defaulting counterparties versus the cumulative percentage of nondefaulting counterparties; for a model with discriminatory power it is a concave curve. Figure 1 shows ROCs for credit scoring systems with high, moderate or no discriminatory power. The more concave the ROC is, the better the discriminatory power of the underlying credit scores from which it is constructed. Defining $y = F_d(S)$ as the cumulative percentage of defaulting counterparties with a score lower than S, and $x = F_{nd}(S)$ as the cumulative percentage of nondefaulting counterparties with a score lower than S, the ROC follows from

  $y = F_d(F_{nd}^{-1}(x)).$   (2.1)

An alternative to the ROC is the CAP, which results from plotting the cumulative percentage of defaulting counterparties versus the cumulative percentage of all counterparties. Given the portfolio default probability P, we can convert the ROC into the CAP by applying the linear transformation $x \mapsto Py + (1-P)x$ to the horizontal axis. Since the number of defaults is much lower than the number of nondefaults for a common credit portfolio, the CAP and the ROC have a similar concave shape. Both curves are monotonically increasing functions and visualize the discriminatory power of a credit scoring system, but they have their advantages and disadvantages.

  • The CAP has the attractive property that its slope relates directly to the probability of default, which makes the CAP applicable for calibration of credit rating and scoring models (van der Burgt 2008, 2019).

  • The ROC is useful in decision analysis, as it relates to the type I and type II errors: a type I error means that a counterparty defaults unexpectedly, and a type II error means that a counterparty does not default although a default was expected (Stein 2007; Tasche 2008). The area under the ROC gives the AUC and the AR directly. Further, as shown below, the area under the ROC has a simple interpretation, and the integration of x and y gives the score probabilities of defaulting and nondefaulting counterparties.

The ROC and the CAP visualize the discriminatory power of the credit scoring model. In this paper we focus on the ROC, because it relates directly to score distributions of defaulting and nondefaulting counterparties. To show this, we first introduce the following equations, which are derived in Appendix A online:

  $P[S_{d,1},\ldots,S_{d,N} < S_{nd} < S_{d,N+1},\ldots,S_{d,N+M}] = \int_0^1 y^N (1-y)^M \, dx,$   (2.2)
  $P[S_{nd,1},\ldots,S_{nd,N} < S_d < S_{nd,N+1},\ldots,S_{nd,N+M}] = \int_0^1 x^N (1-x)^M \, dy.$   (2.3)

Equation (2.2) gives the probability that the scores $S_{d,1},\ldots,S_{d,N}$ of N defaulting counterparties are lower and the scores $S_{d,N+1},\ldots,S_{d,N+M}$ of M defaulting counterparties are higher than the score $S_{nd}$ of a nondefaulting counterparty. Equation (2.3) gives the probability that the scores $S_{nd,1},\ldots,S_{nd,N}$ of N nondefaulting counterparties are lower and the scores $S_{nd,N+1},\ldots,S_{nd,N+M}$ of M nondefaulting counterparties are higher than the score $S_d$ of a defaulting counterparty. Both probabilities follow directly from integrating the distributions x and y, which are used to construct the ROC. These equations are powerful in deriving performance metrics and their sample variance from the ROC. The area under the ROC, denoted by AUC, follows directly from (2.2) with N=1 and M=0:

  $\mathrm{AUC} = \int_0^1 y\,dx = P[S_d < S_{nd}].$   (2.4)

This equation shows that the AUC can be interpreted as the probability that the score Sd of a defaulting counterparty is lower than the score Snd of a nondefaulting counterparty (Krzanowski and Hand 2009). Equation (2.4) also suggests a closed form for the AUC when the distributions x and y are known. It also shows the following property of the ROC (Krzanowski and Hand 2009).

Property 2.1.

The ROC remains invariant if the credit scores undergo a strictly monotonic transformation, since

  $x(s) = P[S_{nd} < s] = P[\phi(S_{nd}) < \phi(s)]$
and
  $y(s) = P[S_d < s] = P[\phi(S_d) < \phi(s)]$

remain unaffected under a monotonic transformation $\phi$.

The AUC also leads to another numerical metric of discriminatory power, the accuracy ratio:

  $\mathrm{AR} = 2\,\mathrm{AUC} - 1 \quad\Longleftrightarrow\quad \mathrm{AUC} = \dfrac{\mathrm{AR}+1}{2}.$   (2.5)

The accuracy ratio is sometimes called the Gini index (Krzanowski and Hand 2009) and can also be derived from the area under the CAP (Tasche 2010). Combining (2.4) and (2.5) gives the following interpretation of AR:

  $\mathrm{AR} = P[S_d < S_{nd}] - P[S_d > S_{nd}],$   (2.6)

which is also shown in Appendix A online by applying (2.2) and (2.3). Figure 1 shows the ROC for several AR and AUC values and reveals that $\mathrm{AR} \in [-1,1]$ and $\mathrm{AUC} \in [0,1]$.

  (1) When $P[S_d < S_{nd}] = 1$, the credit scoring model perfectly discriminates between defaults and nondefaults, giving AR = 1 and AUC = 1. The ROC resembles the line y = 100% as shown in Figure 1.

  (2) If the credit scoring model has no discriminatory power, the number of nondefaulting counterparties increases proportionally with the number of defaulting counterparties and the ROC resembles the line y = x, as shown by the dashed line in Figure 1. This means that $\mathrm{AUC} = P[S_d < S_{nd}] = P[S_d > S_{nd}] = \tfrac{1}{2}$ and AR = 0.

  (3) The dotted curve in Figure 1 shows an inversion: the credit scoring model discriminates between defaulting and nondefaulting counterparties, but it ranks the counterparties in terms of increasing risk rather than decreasing risk. In the case of a perfect inversion, we have $P[S_d > S_{nd}] = 1$, resulting in AUC = 0 and AR = -1.

The AR and AUC measure the discriminatory power of a credit scoring model and are easy to interpret in terms of score probabilities of the defaulting and nondefaulting counterparties. But they are sample statistics for a portfolio or a sample of counterparties. We refer to these metrics as the sample AUC and sample AR, denoted by AUC^ and AR^, respectively. Their values will vary from sample to sample, whereas the true AUC and AR are often unknown. We often assume that the sample AUC and AR sufficiently approximate the true AUC and AR for a large sample or portfolio. Often conclusions on model performance are based on these sample statistics, but the sample AUC and sample AR do not make sense if the uncertainty in these numbers is unknown. For example, Iyer et al (2015) state that a 0.01 improvement in AUC is considered a noteworthy gain in the credit scoring industry. However, we can only perceive this as an improvement if the uncertainty in the AUC is less than 0.01. Ideally, we would like to know the distribution of all possible values AUC^ or AR^ under random sampling, or at least to have an indication of the sample variance of these estimations.

Stein (2007) argues that the sample variance depends strongly on the number of defaults. This is supported by the theoretical upper bound of the variance in the observed AUC^ (Tasche 2010),

  $\sigma^2_{\widehat{\mathrm{AUC}}} \leq \dfrac{\mathrm{AUC}(1-\mathrm{AUC})}{\min[N_d, N_{nd}]},$   (2.7)

with Nd the number of defaults, Nnd the number of nondefaults and AUC the true value, which is unknown. Equation (2.7) shows that the minority class (the class with the fewest observations) of a sample will influence the variance most dramatically (Stein 2007).

Equation (2.7) only provides an upper bound on the sample variance in the AUC. The early approaches to estimating the sample variance in the AUC rely on the relation between the AUC and the Mann–Whitney U-statistics. This relation gives the following result for the sample variance (Hanley and McNeil 1982; Cortes and Mohri 2004; Krzanowski and Hand 2009; Wu et al 2016):

  $\sigma^2_{\widehat{\mathrm{AUC}}} = \dfrac{\mathrm{AUC}(1-\mathrm{AUC}) + (N_d-1)(Q_1-\mathrm{AUC}^2) + (N_{nd}-1)(Q_2-\mathrm{AUC}^2)}{N_d N_{nd}},$   (2.8)

in which $Q_1$ is the probability that the credit scoring model ranks two randomly chosen defaults lower than a nondefault and $Q_2$ is the probability that it ranks two randomly chosen nondefaults higher than a default. The estimation of these probabilities can be quite cumbersome for a large credit portfolio; for example, the estimation of $Q_1$ requires a comparison of the score $S_{nd}$ of every nondefaulting counterparty with the scores $S_{d1}$ and $S_{d2}$ of every pair of defaulting counterparties. Since there are $N_{nd}$ nondefaulting counterparties and $\tfrac{1}{2}N_d(N_d-1)$ pairs of defaulting counterparties, this gives $\tfrac{1}{2}N_{nd}N_d(N_d-1)$ comparisons. We also need $\tfrac{1}{2}N_d N_{nd}(N_{nd}-1)$ comparisons to calculate $Q_2$. In the following sections, we present four alternative methods to calculate the probabilities $Q_1$ and $Q_2$, based on numerical integration or parametric methods.
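To make the counting concrete, the following minimal R sketch estimates the AUC, $Q_1$ and $Q_2$ by explicit pairwise comparison. It is illustrative only: the function and variable names are assumptions, and the enumeration of pairs makes the computational burden described above visible, so it is feasible only for small samples.

```r
# Sketch (R): brute-force estimation of AUC, Q1 and Q2 by pairwise counting.
# 's_d' and 's_nd' are assumed vectors of scores of defaulting and
# nondefaulting counterparties (higher score = lower perceived default risk).

auc_q1_q2_bruteforce <- function(s_d, s_nd) {
  # AUC = P[S_d < S_nd]: compare every default with every nondefault
  auc <- mean(outer(s_d, s_nd, "<"))

  # Q1 = P[S_d1, S_d2 < S_nd]: every pair of defaults against every nondefault
  pairs_d <- combn(s_d, 2)
  q1 <- mean(sapply(s_nd, function(s) pairs_d[1, ] < s & pairs_d[2, ] < s))

  # Q2 = P[S_d < S_nd1, S_nd2]: every default against every pair of nondefaults
  pairs_nd <- combn(s_nd, 2)
  q2 <- mean(sapply(s_d, function(s) s < pairs_nd[1, ] & s < pairs_nd[2, ]))

  list(AUC = auc, Q1 = q1, Q2 = q2)
}

# Example on a small synthetic sample (a full portfolio would be far too large):
set.seed(1)
res <- auc_q1_q2_bruteforce(s_d = rnorm(30, 40, 10), s_nd = rnorm(300, 60, 10))
```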

3 Sample variance of the area under the curve based on numerical integration

Equation (2.8) is our starting point to calculate the sample variance. We can use (2.2) and (2.3) to derive probabilities Q1 and Q2 from quantities y=Fd(S) and x=Fnd(S). Equation (2.2) gives the probability Q1 for N=2 and M=0:

  $Q_1 = P[S_{d1}, S_{d2} < S_{nd}] = \int_0^1 y^2\,dx.$   (3.1)

In the same way, (2.3) gives the probability Q2 for N=0 and M=2:

  $Q_2 = P[S_d < S_{nd1}, S_{nd2}] = \int_0^1 (1-x)^2\,dy.$   (3.2)

Equations (3.1) and (3.2) are important results: they show how the probabilities $Q_1$ and $Q_2$, and therefore the sample variance $\sigma^2_{\widehat{\mathrm{AUC}}}$, can be derived from the underlying score distributions of defaulting and nondefaulting counterparties. In general, these distributions are unknown, but we can use numerical integration, eg, the trapezium rule, to calculate these probabilities. The areas under the curves $(x, y^2)$ and $(y, (1-x)^2)$ give the probabilities $Q_1$ and $Q_2$ directly, in the same way as the AUC gives the probability $P[S_d < S_{nd}]$. We will refer to this method as the numerical integration (NI) method. If we know the analytical distribution functions of the defaulting and nondefaulting counterparties, we can use these functions to calculate the probabilities from (3.1) and (3.2) directly. In the next section, we derive equations for the sample variance, assuming specific score distributions x and y.
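As an illustration, a minimal R sketch of the NI method under these assumptions: the empirical curves $x = F_{nd}(S)$ and $y = F_d(S)$ are built from assumed score vectors and the integrals in (2.4), (3.1) and (3.2) are evaluated with the trapezium rule before being substituted into (2.8). All names are illustrative.

```r
# Sketch (R): NI method - AUC, Q1, Q2 and the sample variance from (2.8),
# with all integrals computed by the trapezium rule on a common score grid.

trapezium <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

ni_variance <- function(s_d, s_nd) {
  grid <- sort(unique(c(s_d, s_nd)))           # common score grid
  x <- ecdf(s_nd)(grid)                        # x = F_nd(S)
  y <- ecdf(s_d)(grid)                         # y = F_d(S)
  x <- c(0, x); y <- c(0, y)                   # start both curves in the origin

  auc <- trapezium(x, y)                       # AUC = integral of y dx        (2.4)
  q1  <- trapezium(x, y^2)                     # Q1  = integral of y^2 dx      (3.1)
  q2  <- trapezium(y, (1 - x)^2)               # Q2  = integral of (1-x)^2 dy  (3.2)

  nd <- length(s_d); nnd <- length(s_nd)
  var_auc <- (auc * (1 - auc) + (nd - 1) * (q1 - auc^2) +
              (nnd - 1) * (q2 - auc^2)) / (nd * nnd)        # (2.8)
  list(AUC = auc, AR = 2 * auc - 1, var_AUC = var_auc, var_AR = 4 * var_auc)
}
```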

4 Sample variance of the area under the curve and accuracy ratio based on score distribution assumptions

In this section, we assume exponentially distributed scores that lead to the equation of Hanley and McNeil (1982) for the sample variance, and normally distributed scores that lead to the sample variance of the so-called binormal model (Krzanowski and Hand 2009). These distribution assumptions may be strong, but Property 2.1 states that the ROC is invariant under a monotonous transformation of the scores. Due to this transformation invariance, the equations as derived in this section may also be applicable to other score distributions that can be transformed into normal or exponential distributions.

Hanley and McNeil (1982) introduced equations for $Q_1$ and $Q_2$ in terms of the true (unobserved) AUC, assuming exponential score distributions. Here, we reproduce their results using (3.1) and (3.2), assuming that the scores of the defaulting and nondefaulting counterparties are exponentially distributed: $x(s) = 1 - \exp(-\lambda s)$ and $y(s) = 1 - \exp(-\mu s)$. Under these assumptions, (2.4) gives

  $\mathrm{AUC} = \int_0^1 y(x)\,dx = \int_0^1 \{1 - (1-x)^{\mu/\lambda}\}\,dx = \dfrac{\mu}{\lambda+\mu}.$   (4.1)

Using (3.1) and (3.2), the distributions give the following probabilities:

  $Q_1 = \int_0^1 y^2\,dx = \dfrac{2\mu^2}{(2\mu+\lambda)(\lambda+\mu)} = \dfrac{2\,\mathrm{AUC}^2}{1+\mathrm{AUC}},$   (4.2)
  $Q_2 = \int_0^1 (1-x)^2\,dy = \dfrac{\mu}{2\lambda+\mu} = \dfrac{\mathrm{AUC}}{2-\mathrm{AUC}},$   (4.3)

where we used (4.1) to express the probabilities in the AUC. Hanley and McNeil (1982) proposed these probabilities to calculate $\sigma^2_{\widehat{\mathrm{AUC}}}$. Using these probabilities in (2.8) and combining with

  $\mathrm{AR} = 2\,\mathrm{AUC} - 1 \quad\text{and}\quad \sigma^2_{\widehat{\mathrm{AR}}} = 4\sigma^2_{\widehat{\mathrm{AUC}}},$

we find the following equation for the variance in the sample AR:

  $\sigma^2_{\widehat{\mathrm{AR}}} = \dfrac{1-\mathrm{AR}^2}{N_d N_{nd}}\left\{1 + (N_d-1)\dfrac{1+\mathrm{AR}}{3+\mathrm{AR}} + (N_{nd}-1)\dfrac{1-\mathrm{AR}}{3-\mathrm{AR}}\right\}.$   (4.4)

We will refer to (4.4) as the Hanley–McNeil (HM) method.
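For reference, (4.4) translates into a short R function; the argument names are illustrative.

```r
# Sketch (R): sample variance of the AR under the HM method, ie (4.4).
# 'ar' is the (true or estimated) accuracy ratio; 'nd' and 'nnd' are the
# numbers of defaulting and nondefaulting counterparties.

hm_var_ar <- function(ar, nd, nnd) {
  (1 - ar^2) / (nd * nnd) *
    (1 + (nd - 1) * (1 + ar) / (3 + ar) + (nnd - 1) * (1 - ar) / (3 - ar))
}

# Example: AR = 0.6 with 100 defaults and 9900 nondefaults;
# sqrt(hm_var_ar(0.6, 100, 9900)) gives the corresponding standard error.
```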

Another parametric method to calculate the AUC is the binormal model. This model assumes that the scores of the nondefaulting and defaulting counterparties are normally distributed: $x(s) = \Phi[(s-\mu_{nd})/\sigma_{nd}]$ and $y(s) = \Phi[(s-\mu_d)/\sigma_d]$. Given these cumulative distributions, we can express y in x as

  $y(x) = \Phi\!\left[\dfrac{\mu_{nd}-\mu_d}{\sigma_d} + \dfrac{\sigma_{nd}}{\sigma_d}\,\Phi^{-1}[x]\right]$   (4.5)

and x in y as

  $x(y) = \Phi\!\left[\dfrac{\mu_d-\mu_{nd}}{\sigma_{nd}} + \dfrac{\sigma_d}{\sigma_{nd}}\,\Phi^{-1}[y]\right].$   (4.6)

The AUC follows from (2.4):

  $\mathrm{AUC} = \int_0^1 y\,dx = \Phi\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_d^2+\sigma_{nd}^2}}\right].$   (4.7)

Equation (4.7) was found earlier (Tasche 2010) and makes sense intuitively: $\mathrm{AUC} = \tfrac{1}{2}$ when $\mu_{nd} = \mu_d$, as is the case for a nondiscriminatory scoring model. The AUC approaches 1 for $\mu_{nd} \gg \mu_d$, and 0 for $\mu_{nd} \ll \mu_d$. The probability $Q_1$ follows from (3.1):

  $Q_1 = \int_0^1 y^2\,dx = \Phi\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_{nd}^2+\sigma_d^2}}\right] - 2T\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_{nd}^2+\sigma_d^2}},\ \dfrac{\sigma_d}{\sqrt{\sigma_d^2+2\sigma_{nd}^2}}\right],$   (4.8)

in which

  $T[h,a] = \dfrac{1}{2\pi}\int_0^a \dfrac{\exp\{-\tfrac{1}{2}h^2(1+x^2)\}}{1+x^2}\,dx$

is Owen’s T function (Patefield and Tandy 2000). The probability Q2 follows from (3.2):

  $Q_2 = \int_0^1 (1-x)^2\,dy = 1 - 2\int_0^1 x\,dy + \int_0^1 x^2\,dy = \Phi\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_d^2+\sigma_{nd}^2}}\right] - 2T\!\left[\dfrac{\mu_{nd}-\mu_d}{\sqrt{\sigma_{nd}^2+\sigma_d^2}},\ \dfrac{\sigma_{nd}}{\sqrt{\sigma_{nd}^2+2\sigma_d^2}}\right],$   (4.9)

in which we used (4.6) to calculate the integrals. This shows how powerful (2.2) and (2.3) are: they lead to (4.8) and (4.9), and the integrals $\int_0^1 x\,dy$ and $\int_0^1 x^2\,dy$ follow from (2.3). Using these probabilities $Q_1$ and $Q_2$ in (2.8) gives the sample variance in the AUC:

  $\sigma^2_{\widehat{\mathrm{AUC}}} = \mathrm{AUC}(1-\mathrm{AUC})\,\dfrac{N_{nd}+N_d-1}{N_d N_{nd}} - 2\left\{\dfrac{N_d-1}{N_d N_{nd}}\,T\!\left[\Phi^{-1}[\mathrm{AUC}],\ \dfrac{\sigma_d}{\sqrt{\sigma_d^2+2\sigma_{nd}^2}}\right] + \dfrac{N_{nd}-1}{N_d N_{nd}}\,T\!\left[\Phi^{-1}[\mathrm{AUC}],\ \dfrac{\sigma_{nd}}{\sqrt{\sigma_{nd}^2+2\sigma_d^2}}\right]\right\}.$   (4.10)

We obtain the following equation for the sample variance in the AR:

  $\sigma^2_{\widehat{\mathrm{AR}}} = (1-\mathrm{AR}^2)\,\dfrac{N_{nd}+N_d-1}{N_d N_{nd}} - \dfrac{8}{N_d N_{nd}}\left\{(N_d-1)\,T\!\left[\Phi^{-1}\!\left[\dfrac{1+\mathrm{AR}}{2}\right],\ \dfrac{\sigma_d}{\sqrt{\sigma_d^2+2\sigma_{nd}^2}}\right] + (N_{nd}-1)\,T\!\left[\Phi^{-1}\!\left[\dfrac{1+\mathrm{AR}}{2}\right],\ \dfrac{\sigma_{nd}}{\sqrt{\sigma_{nd}^2+2\sigma_d^2}}\right]\right\}.$   (4.11)

We will refer to (4.11) as the binormal (BN) method. The HM and BN methods are valid under specific assumptions about the underlying distributions. However, these equations may still be applicable to other score distributions, due to the transformation invariance property of the ROC as given in Section 2. Both methods also align with our intuition, as they give $Q_1 = Q_2 = \tfrac{1}{3}$ for a nondiscriminatory scoring model and $Q_1 = Q_2 = 1$ for a perfectly discriminatory scoring model. In the next section, we introduce a distribution-independent approach.
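As a sketch, (4.11) can be evaluated in R by computing Owen's T function through direct numerical integration of its definition; a dedicated implementation from a package could be substituted, and all function and argument names below are illustrative.

```r
# Sketch (R): sample variance of the AR under the BN method, ie (4.11).
# 'sigma_d' and 'sigma_nd' are the assumed score standard deviations of the
# defaulting and nondefaulting counterparties.

owens_t <- function(h, a) {
  # Owen's T function by numerical integration of its defining integral
  integrate(function(x) exp(-0.5 * h^2 * (1 + x^2)) / (1 + x^2),
            lower = 0, upper = a)$value / (2 * pi)
}

bn_var_ar <- function(ar, nd, nnd, sigma_d, sigma_nd) {
  h  <- qnorm((1 + ar) / 2)                         # Phi^{-1}[AUC]
  a1 <- sigma_d  / sqrt(sigma_d^2  + 2 * sigma_nd^2)
  a2 <- sigma_nd / sqrt(sigma_nd^2 + 2 * sigma_d^2)
  (1 - ar^2) * (nnd + nd - 1) / (nd * nnd) -
    8 / (nd * nnd) * ((nd - 1) * owens_t(h, a1) + (nnd - 1) * owens_t(h, a2))
}

# Example with equal score standard deviations, so both T terms coincide:
# bn_var_ar(0.6, 100, 9900, sigma_d = 1, sigma_nd = 1)
```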

5 Sample variance of the area under the curve and accuracy ratio without distribution assumptions

Table 1: Overview of the possible events and their corresponding probabilities for AUC = 1/2 (not discriminatory), AUC = 1 (perfectly discriminatory) or AUC = 0 (inversion).

  Event                                                      AUC = 1/2   AUC = 1   AUC = 0
  $Q_1 = P[S_{d1}, S_{d2} < S_{nd}]$                         1/3         1         0
  $P[S_{nd} < S_{d1}, S_{d2}]$                               1/3         0         1
  $P[S_{d1} < S_{nd} < S_{d2}$ or $S_{d2} < S_{nd} < S_{d1}]$ 1/3         0         0
  Total                                                      1           1         1

In the previous section, we described methods for calculating $Q_1$ and $Q_2$. These methods have in common that the underlying score distributions $y = F_d(S)$ and $x = F_{nd}(S)$ need to be known. In some cases we can infer $Q_1$ and $Q_2$ without distribution assumptions. For example, Table 1 shows that the event $[S_{d1}, S_{d2} < S_{nd}]$ is one of three possible events. If the credit scoring model has no discriminatory power, the three events are equally likely and each occurs with probability $\tfrac{1}{3}$: $Q_1 = \tfrac{1}{3}$ when $\mathrm{AUC} = \tfrac{1}{2}$. When the scoring model discriminates perfectly between defaults and nondefaults, $\mathrm{AUC} = 1$ and $Q_1 = 1$. Similarly, $Q_2 = \tfrac{1}{3}$ for $\mathrm{AUC} = \tfrac{1}{2}$ and $Q_2 = 1$ for $\mathrm{AUC} = 1$.

Table 1 can be used to calculate the correlation. The probability $Q_1 = P[S_{d1}, S_{d2} < S_{nd}]$ is the likelihood of two simultaneous events: $S_{d1} < S_{nd}$ and $S_{d2} < S_{nd}$. Assuming independence gives $P[S_{d1}, S_{d2} < S_{nd}] = P[S_{d1} < S_{nd}]\,P[S_{d2} < S_{nd}] = \mathrm{AUC}^2$. However, these events are correlated, since the scores of two different defaults are compared with the same score of a nondefaulting counterparty. This correlation follows from

  $\rho = \dfrac{P[S_{d1}, S_{d2} < S_{nd}] - P[S_{d1} < S_{nd}]\,P[S_{d2} < S_{nd}]}{\sqrt{P[S_{d1} < S_{nd}](1-P[S_{d1} < S_{nd}])}\,\sqrt{P[S_{d2} < S_{nd}](1-P[S_{d2} < S_{nd}])}}.$   (5.1)

Using (2.4) and writing P[Sd1,Sd2<Snd] explicitly gives

  $P[S_{d1}, S_{d2} < S_{nd}] = \mathrm{AUC}^2 + \rho(\mathrm{AUC} - \mathrm{AUC}^2).$   (5.2)

To derive the event correlation $\rho$, we consider a scoring model with no discriminatory power. This means that $\mathrm{AUC} = P[S_d < S_{nd}] = \tfrac{1}{2}$. Table 1 shows that $[S_{d1}, S_{d2} < S_{nd}]$ is one of three possible events. Since the scoring model has no discriminatory power, the three events are equally likely and each occurs with probability $\tfrac{1}{3}$. This means that $P[S_{d1}, S_{d2} < S_{nd}] = (\tfrac{1}{2})^2 + \rho(\tfrac{1}{2} - (\tfrac{1}{2})^2) = \tfrac{1}{3}$, which gives $\rho = \tfrac{1}{3}$. Using this correlation in (5.2) gives

  $Q_1 = P[S_{d1}, S_{d2} < S_{nd}] = \tfrac{1}{3}\,\mathrm{AUC}(1 + 2\,\mathrm{AUC}).$   (5.3)

A similar approach gives the same result for Q2, so that we have

  $Q_1 = P[S_{d1}, S_{d2} < S_{nd}] = Q_2 = P[S_d < S_{nd1}, S_{nd2}] = \tfrac{1}{3}\,\mathrm{AUC}(1 + 2\,\mathrm{AUC}).$   (5.4)

Substituting (5.4) into (2.8) gives the sample variance in the observed AUC^:

  $\sigma^2_{\widehat{\mathrm{AUC}}} = \dfrac{N_{nd}+N_d+1}{N_d N_{nd}}\,\dfrac{\mathrm{AUC}(1-\mathrm{AUC})}{3}.$   (5.5)

Equation (5.5) supports the view of Stein (2007), ie, the uncertainty in the AUC is mainly determined by the minority class. The sample variance σAUC^2 is high at a low default rate, in which case the number of defaults is low. Equation (5.5) can be written in the following form:

  $\sigma^2_{\widehat{\mathrm{AUC}}} \approx \left\{\dfrac{1}{N_d}+\dfrac{1}{N_{nd}}\right\}\dfrac{\mathrm{AUC}(1-\mathrm{AUC})}{3} < \dfrac{\mathrm{AUC}(1-\mathrm{AUC})}{\min[N_d, N_{nd}]},$   (5.6)

which aligns with the upper bound in (2.7). The sample variance σAUC^2 is also high at a high default rate, in which case the number of nondefaults is low.

We can express (5.5) in probabilities using (2.4) and (5.4):

  $\sigma^2_{\widehat{\mathrm{AUC}}} = \dfrac{N_{nd}+N_d+1}{N_d N_{nd}}\{P[S_d < S_{nd1}, S_{nd2}] - P[S_d < S_{nd1}]\,P[S_d < S_{nd2}]\}.$   (5.7)

The AUC is known as a proper metric of discriminatory power, but it only varies between 0 and 1. The AR has a wider range than the AUC, as it varies from -1 for an inverted model, through 0 for a nondiscriminatory model, to 1 for a perfectly discriminatory model. Therefore, we prefer an equation for the sample variance in the AR. Using $\mathrm{AR} = 2\,\mathrm{AUC} - 1$ and $\sigma^2_{\widehat{\mathrm{AR}}} = 4\sigma^2_{\widehat{\mathrm{AUC}}}$ gives an equation for the sample variance in the observed $\widehat{\mathrm{AR}}$:

  $\sigma^2_{\widehat{\mathrm{AR}}} = \dfrac{N_{nd}+N_d+1}{N_d N_{nd}}\,\dfrac{1-\mathrm{AR}^2}{3}.$   (5.8)

Since (5.8) is derived without any assumption regarding the underlying score distributions, we refer to this equation as the distribution-free (DF) method. The AUC and AR in (5.5) and (5.8) represent the true AUC and AR and not those observed in the data. The observations $\widehat{\mathrm{AUC}}$ and $\widehat{\mathrm{AR}}$ converge to their true values AUC and AR in the case of an infinite number of observations, since $N_d \to \infty$ and $N_{nd} \to \infty$ lead to

  $\sigma^2_{\widehat{\mathrm{AUC}}} \to 0 \quad\text{and}\quad \sigma^2_{\widehat{\mathrm{AR}}} \to 0.$

The DF method is free from the assumptions of the underlying score distributions, but this comes at the cost of ignoring higher-order terms. This becomes clear if we apply Taylor expansions to (4.4), giving

  $\sigma^2_{\widehat{\mathrm{AR}}} = \dfrac{1-\mathrm{AR}^2}{N_d N_{nd}}\left\{1 + (N_d-1)\left(\tfrac{1}{3} + c_1\mathrm{AR} + c_2\mathrm{AR}^2 + \cdots\right) + (N_{nd}-1)\left(\tfrac{1}{3} - c_1\mathrm{AR} + c_2\mathrm{AR}^2 + \cdots\right)\right\}$   (5.9)

with $c_1 = \tfrac{2}{9}$, $c_2 = -\tfrac{2}{27}, \ldots$. When we ignore these coefficients, (5.9) reduces to the DF method. This gives equal probabilities ($Q_1 = Q_2$) and an equation for $\sigma^2_{\widehat{\mathrm{AR}}}$ that is symmetric around $\mathrm{AR} = 0$. By excluding the higher-order terms in AR, the DF method does not account for possible nonlinear or asymmetric effects. These terms require assumptions on the underlying score distributions, but they become small for $|\mathrm{AR}| < 1$. The next section demonstrates which method gives the most appropriate results.
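For reference, a minimal R sketch of the DF formulas (5.5) and (5.8); the argument names are illustrative.

```r
# Sketch (R): sample variance of the AR and the AUC under the DF method,
# ie (5.8) and (5.5).

df_var_ar <- function(ar, nd, nnd) {
  (nnd + nd + 1) / (nd * nnd) * (1 - ar^2) / 3
}

df_var_auc <- function(auc, nd, nnd) {
  (nnd + nd + 1) / (nd * nnd) * auc * (1 - auc) / 3
}

# Example: standard error of an AR of 0.6 for 100 defaults and 9900 nondefaults
# sqrt(df_var_ar(0.6, 100, 9900))
```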

6 Demonstration by simulations

We performed simulations in R/RStudio version 1.1.423 to investigate the distribution of the AR and to reveal which of the four methods (NI, HM, BN or DF) agrees best with the sample variance in the AR. In the simulations, we assume two situations.

Case 1 The scores of the counterparties in the credit portfolio vary between $-\infty$ and $\infty$. We use the Cornish–Fisher expansion to simulate the score distribution:

  $s = \mu + \sigma\left\{z + \dfrac{z^2-1}{3!}SK + \dfrac{z^3-3z}{4!}EK - \dfrac{2z^3-5z}{36}SK^2\right\},$   (6.1)

in which z is a standard normally distributed random number. The score distribution has a mean score μ=50, a standard deviation σ=28, a skewness SK=0.5 and an excess kurtosis EK=0.8. We consider a distribution with some skewness and kurtosis to be more realistic than a normal distribution. The conditional default probability Pc(s) is the default probability of a counterparty conditional on its score s. We assume a logit function for the conditional probability of default Pc(s).

Case 2 The score varies uniformly between a minimum score Smin=0 and a maximum score Smax=100. Further, the conditional probability of default Pc(s) is an exponential function of the score s.

In both cases, the default/nondefault state of a counterparty i with score s is simulated by a Bernoulli variable B[i;Pc(s)], which equals 1 (default) with probability Pc(s) and 0 (nondefault) with probability 1-Pc(s). We define N as the total number of counterparties, U as the Mann–Whitney U-statistics and P as the overall default rate of the portfolio. The variance σAUC2 in the AUC and the variance σU2 in the Mann–Whitney U-statistics are calculated from the simulations. We used 1000 simulations of a portfolio of 100 000 counterparties for cases 1 and 2.
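A minimal R sketch of one simulation run for case 1 under the assumptions stated above: scores follow the Cornish–Fisher expansion (6.1) and the default state is drawn from a Bernoulli variable with a logit conditional default probability. The logit coefficients below are illustrative choices, not the values used for the results in this paper.

```r
# Sketch (R): one simulated portfolio for case 1.
# 'a' and 'b' are assumed logit coefficients; b < 0 so that a higher score
# implies a lower conditional default probability.

simulate_case1 <- function(n = 100000, mu = 50, sigma = 28, sk = 0.5, ek = 0.8,
                           a = -2.0, b = -0.05) {
  z <- rnorm(n)                                              # standard normal draws
  s <- mu + sigma * (z + (z^2 - 1) / factorial(3) * sk +
                       (z^3 - 3 * z) / factorial(4) * ek -
                       (2 * z^3 - 5 * z) / 36 * sk^2)        # Cornish-Fisher (6.1)
  pc <- 1 / (1 + exp(-(a + b * s)))                          # logit conditional PD
  d  <- rbinom(n, size = 1, prob = pc)                       # 1 = default, 0 = nondefault
  data.frame(score = s, default = d)
}

# One portfolio; the observed AR could then be computed with, eg, the NI sketch above.
set.seed(42)
pf <- simulate_case1()
```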

The methods described above are based on the connection between the Mann–Whitney U-statistics and the AUC, given by (European Central Bank 2019)

  $U = \mathrm{AUC}\,N_d N_{nd} = \mathrm{AUC}\,N^2 P(1-P).$   (6.2)

Thus, we first compare the variance $\sigma^2_{\mathrm{AUC}}$ with the variance $\sigma^2_U/\{N^2\hat{P}(1-\hat{P})\}^2$, in which $\hat{P}$ represents the portfolio default rate, averaged over all simulations. However, the portfolio default rate P varies around $\bar{P} = 1\%$ in the simulations, introducing an extra variance $\sigma^2_P$ in $\sigma^2_U$:

  $\sigma^2_U = \left\{\dfrac{\partial U}{\partial \mathrm{AUC}}\right\}^2\sigma^2_{\mathrm{AUC}} + \left\{\dfrac{\partial U}{\partial P}\right\}^2\sigma^2_P + 2\rho_{\mathrm{AUC},P}\left\{\dfrac{\partial U}{\partial \mathrm{AUC}}\,\sigma_{\mathrm{AUC}}\right\}\left\{\dfrac{\partial U}{\partial P}\,\sigma_P\right\},$   (6.3)

in which

  $\dfrac{\partial U}{\partial \mathrm{AUC}} = N^2 P(1-P)$

and

  $\dfrac{\partial U}{\partial P} = (1-2P)\,N^2\,\mathrm{AUC} = \dfrac{(1-2P)\,U}{P(1-P)}.$

The correlation $\rho_{\mathrm{AUC},P}$ generally varies between -0.05 and 0.05 in the simulations, and the standard deviation $\sigma_P$ is 0.003. Although these quantities are quite small, we correct the variance $\sigma^2_U$ by subtracting the two extra terms in (6.3). Figures 2 and 3 show that, after correction, the variance $\sigma^2_U/\{N^2\hat{P}(1-\hat{P})\}^2$ resembles the variance $\sigma^2_{\mathrm{AUC}}$ for different areas under the curve for both case 1 and case 2.

Next, we show by simulation whether the probabilities $Q_1$ and $Q_2$ align with those of the HM, BN and DF methods. The simulations confirm (3.1) and (3.2); numerical integration leads to the same probabilities as counting the events $[S_{d1}, S_{d2} < S_{nd}]$ and $[S_d < S_{nd1}, S_{nd2}]$, but it requires less computation time. Therefore, we used the integrals in (3.1) and (3.2) to derive the probabilities from the simulations. These integrals were calculated with the trapezium rule.

Figure 2: Comparison of the variance $\sigma^2_{\mathrm{AUC}}$ with the variance $\sigma^2_U/\{N^2 P(1-P)\}^2$ for case 1.
Figure 3: Comparison of the variance $\sigma^2_{\mathrm{AUC}}$ with the variance $\sigma^2_U/\{N^2 P(1-P)\}^2$ for case 2.
Figure 4: Probabilities $Q_1$ as a function of the AUC, based on simulations and on the Hanley–McNeil (HM), binormal (BN) and distribution-free (DF) methods for case 1.
Figure 5: Probabilities $Q_2$ as a function of the AUC, based on simulations and on the Hanley–McNeil (HM), binormal (BN) and distribution-free (DF) methods for case 1.

Figures 4 and 5 compare the probabilities from simulations with their theoretical counterparts. The dots in Figure 4 are the probabilities $Q_{1,\mathrm{sim}}$, derived from the simulations, whereas the solid, dashed and dotted lines represent the probabilities as calculated by the HM method using (4.2), the BN method using (4.8) and the DF method using (5.4). Similarly, Figure 5 compares the probabilities $Q_{2,\mathrm{sim}}$, resulting from the simulations, with the probabilities as calculated by the HM, BN and DF methods. The figures reveal that the probabilities $Q_{1,\mathrm{sim}}$ and $Q_{2,\mathrm{sim}}$ are close to $Q_1$ and $Q_2$ as calculated by the DF, HM and BN methods. To make this more quantitative, we calculated the deviation between the simulated probabilities and the probabilities calculated by each method as the sum of their squared differences. Table 2 compares the deviations $\Delta Q_1$ and $\Delta Q_2$ for probabilities $Q_1$ and $Q_2$, respectively, for the HM, BN and DF methods and for cases 1 and 2. The table shows that the DF method gives the smallest deviations, except for probability $Q_2$ in case 2, where the BN method gives the smallest deviation. We conclude that the DF and BN methods give more accurate estimations of probabilities $Q_1$ and $Q_2$ than the HM method.

Table 2: Deviation between probabilities $Q_{1,\mathrm{sim}}$ and $Q_{2,\mathrm{sim}}$, based on simulations, and the corresponding probabilities based on the DF, BN and HM methods.

(a) Deviation $\Delta Q_1 = \sum(Q_{1,m}-Q_{1,\mathrm{sim}})^2$

  Method                      Case 1   Case 2
  Distribution-free (m=DF)    0.0009   0.0007
  Binormal (m=BN)             0.0011   0.0014
  Hanley–McNeil (m=HM)        0.0031   0.0056

(b) Deviation $\Delta Q_2 = \sum(Q_{2,m}-Q_{2,\mathrm{sim}})^2$

  Method                      Case 1   Case 2
  Distribution-free (m=DF)    0.0005   0.0029
  Binormal (m=BN)             0.0015   0.0007
  Hanley–McNeil (m=HM)        0.0028   0.0057
Figure 6: Comparison of the variance resulting from simulations with the variance calculated by the numerical integration (NI), Hanley–McNeil (HM), binormal (BN) and distribution-free (DF) methods for case 1.
Figure 7: Comparison of the variance resulting from simulations with the variance calculated by the numerical integration (NI), Hanley–McNeil (HM), binormal (BN) and distribution-free (DF) methods for case 2.

We verified the NI, HM, BN and DF methods by performing 1000 simulations for a portfolio of 100 000 counterparties with a portfolio default rate P = 1%. In each simulation, we construct an ROC and calculate $\widehat{\mathrm{AUC}}$ by the trapezium rule for integration. $\widehat{\mathrm{AR}}$ follows from $\widehat{\mathrm{AUC}}$ by $\widehat{\mathrm{AR}} = 2\,\widehat{\mathrm{AUC}} - 1$. We obtain the true AR by averaging $\widehat{\mathrm{AR}}$ over all 1000 simulations. Since the variance of the mean decreases with 1/n, the average $\widehat{\mathrm{AR}}$ closely approximates the true AR for n = 1000 simulations. We performed these simulations for several accuracy ratios, ranging from -1 for an inverted credit scoring model to 1 for a credit scoring model with perfect discriminatory power.

Figure 6 presents the sample variances, resulting from the simulations and the different methods for case 1, as a function of the true AR: the black diamonds represent the sample variance in the AR resulting from the 1000 simulations; the white dots represent the sample variance of the NI method, calculated by (2.8), (3.1) and (3.2); the solid line represents the sample variance of the HM method as calculated by (4.4); the gray dashed line represents the sample variance of the BN method calculated by (4.11); the dotted line represents the sample variance of the DF method calculated by (5.8).

Figure 7 shows the same results for case 2. These figures show that the NI method closely resembles the sample variance σAR^2 as calculated from the simulations. The HM and BN methods result in a sample variance σAR^2 that is an asymmetric function of the real AR, whereas the DF method gives a symmetric function. We use the R2 as a measure of the extent to which the HM, BN or DF method fits the sample variance σAR^2, resulting from the simulations. Table 3 shows the R2 for the NI, HM, BN and DF methods. For both cases 1 and 2, the NI method gives the highest R2 and the HM method gives the lowest R2. Considering the parametric methods, the DF method gives the most accurate sample variance for case 1, whereas the BN method is most accurate for case 2. The DF method overestimates the sample variance and is therefore conservative in case 2.

Table 3: Values of R2 (in percent) for cases 1 and 2, based on the NI, DF, BN and HM methods.
Methodology Case 1 Case 2
Numerical integration 99.3 99.5
Distribution-free 96.2 83.2
Binormal 89.2 96.3
Hanley–McNeil 82.0 65.8

We consider case 1 to be more realistic than case 2, since it includes effects such as skewness and kurtosis. Case 2 assumes homogeneously distributed scores, which will not often occur in practice. Since the DF method shows the best performance in case 1, we decided to investigate this method further. We regressed

  $\dfrac{3N_d N_{nd}}{N_{nd}+N_d+1}\,\sigma^2_{\widehat{\mathrm{AR}}}$   (6.4)

against $a_0 + a_1\mathrm{AR} + a_2\mathrm{AR}^2$, using the results for case 1. According to (5.8), $a_0 = 1$, $a_1 = 0$ and $a_2 = -1$.

Table 4: Regression coefficients of (6.4) against AR and AR², with their 95% confidence intervals and standard errors.

  Coefficient             Estimate        Standard error   t-value   p-value (%)
  Intercept ($a_0$)       0.97 ± 0.03     0.02             60.1      0.00
  Coefficient AR ($a_1$)  -0.04 ± 0.03    0.02             2.7       0.02
  Coefficient AR² ($a_2$) -1.03 ± 0.06    0.03             34.3      0.00
  R² of regression (%)    98.6
Figure 8: Comparison of the probability density functions of the AR, based on 1000 simulations, with the normal distribution. The normal distribution is represented by the dashed line. (a) AR=-0.99. (b) AR=-0.36. (c) AR=-0.16. (d) AR=0.15. (e) AR=0.41. (f) AR=0.61. (g) AR=0.76. (h) AR=0.85. (i) AR=0.95.
Figure 9: Jarque–Bera statistics of several AR distributions, based on 1000 simulations of a portfolio that consists of 100 000 counterparties with a portfolio default rate of 1%, as a function of the mean AR.

Table 4 presents the regression results: a0 and a2 agree with these values within the 95% confidence interval, but a1 gives a small but significant nonzero contribution. This comes from the higher-order term as explained by (5.9) and ignored in the DF method. Including this term improves the R2 from 96.2% to 98.6%. We conclude that the sample variance σAR^2 relates to 1-AR2 to a large extent, but the DF method may overestimate the sample variance by ignoring asymmetric and nonlinear effects. In addition, Figure 7 shows that the sample variance deviates from a parabolic function of AR as a result of excluding higher-order terms.

We constructed probability density functions of the AR from the simulations for case 1. Figure 8 presents the probability density functions for several AR values. The dashed line represents a normal distribution with the same mean and standard deviation as the AR density. The AR densities resemble a normal distribution, except when the AR is close to 1 or -1. In these cases, the AR densities become skewed and the sample accuracy ratios are not normally distributed. We used the Jarque–Bera statistics to test the normality of the AR densities. The Jarque–Bera statistics are calculated by

  $JB = N_{\mathrm{sim}}\left(\dfrac{SK^2}{6} + \dfrac{EK^2}{24}\right),$   (6.5)

with $N_{\mathrm{sim}}$ the number of simulations and SK and EK denoting the skewness and excess kurtosis of the AR density, respectively. If the probability density functions are normal, the Jarque–Bera statistics are chi-squared distributed with two degrees of freedom. Based on a significance level of 5%, the null hypothesis of normality is rejected when the Jarque–Bera statistic exceeds the threshold $\chi^2_2(0.95) = 5.99$. Figure 9 shows the Jarque–Bera statistics for the different densities versus the mean AR. The dashed line represents the threshold above which the null hypothesis of normality is rejected. The Jarque–Bera test rejects normality for AR < -0.95 and AR > 0.95. We conclude that the probability density function of $\widehat{\mathrm{AR}}$ follows a normal distribution when |AR| < 0.95.
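A minimal R sketch of the Jarque–Bera statistic (6.5) applied to a vector of simulated accuracy ratios; the function and variable names are illustrative.

```r
# Sketch (R): Jarque-Bera normality check for a vector 'ar_sim' of observed
# accuracy ratios from repeated simulations.

jarque_bera <- function(ar_sim) {
  n  <- length(ar_sim)
  m  <- mean(ar_sim)
  s  <- sqrt(mean((ar_sim - m)^2))              # population standard deviation
  sk <- mean((ar_sim - m)^3) / s^3              # skewness
  ek <- mean((ar_sim - m)^4) / s^4 - 3          # excess kurtosis
  jb <- n * (sk^2 / 6 + ek^2 / 24)              # (6.5)
  list(JB = jb, reject_normality = jb > qchisq(0.95, df = 2))  # threshold 5.99
}
```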

7 Conclusions

Metrics such as the AR and the AUC measure the discriminatory power of a credit scoring model. However, these are often estimated for a sample of counterparties, and different samples may give different values. Firm conclusions on model performance can only be drawn if the sample distribution, or at least the sample variance, is known. Traditional equations for the sample variance are based on the equivalence with the Mann–Whitney statistics, but these calculations require a long computation time.

We derived equations for the sample variance in the AR and AUC using four methods: the NI method is based on numerical integration; the HM and BN methods assume specific score distributions; and the DF method does not depend on such an assumption, but excludes higher-order terms. Simulations show that the AR is normally distributed and the NI method gives the best estimation of the sample variance in the AUC or AR. However, the DF and BN methods provide closed-form equations, which fit the sample variance of the simulations quite well. The HM method performs worst in all cases. We conclude that the NI method is the preferred method to estimate the sample variance. The probabilities $Q_1$ and $Q_2$ are easily calculated from the areas under the curves $(x, y^2)$ and $(y, (1-x)^2)$, respectively, in the same way as the AUC is estimated from the ROC.

An advantage of the BN and DF methods is that they provide closed-form equations for the sample variance. It is hard to conclude which method is most appropriate. The DF method performs better than the BN method in case 1, which is considered more realistic than case 2. In case 2, the BN method is the most accurate and the DF method overestimates the variance.

We can apply the NI, DF or BN methods in hypothesis testing, since the observed AR^ follows a normal distribution. For example, we can define the following statistics with the DF method:

  $z_{\mathrm{AR}} = \dfrac{|\widehat{\mathrm{AR}} - \mathrm{AR}_0|}{\sqrt{(N_{nd}+N_d+1)(1-\mathrm{AR}_0^2)/(3N_d N_{nd})}}$   (7.1)

to test a null hypothesis such as the following.

(H0) The credit rating or scoring model has an accuracy ratio of AR0, ie, AR=AR0.

Based on the normality of the AR distribution, the p-value is calculated by $p = 1 - \Phi[z_{\mathrm{AR}}]$ and the null hypothesis is rejected when p < 5%.
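A minimal R sketch of this test based on the DF variance (5.8); the function name and example values are illustrative.

```r
# Sketch (R): test statistic (7.1) with the DF variance, testing
# H0: AR = AR0 for an observed accuracy ratio 'ar_hat'.

z_ar_test <- function(ar_hat, ar0, nd, nnd) {
  se0 <- sqrt((nnd + nd + 1) * (1 - ar0^2) / (3 * nd * nnd))  # DF standard error under H0
  z   <- abs(ar_hat - ar0) / se0                              # (7.1)
  p   <- 1 - pnorm(z)                                         # p-value as defined in the text
  list(z = z, p_value = p, reject = p < 0.05)
}

# Example: is an observed AR of 0.55 consistent with AR0 = 0.60
# for 100 defaults and 9900 nondefaults?
# z_ar_test(0.55, 0.60, 100, 9900)
```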

Recently, the ECB introduced a test for comparing the initial AUC with the current AUC (European Central Bank 2019), in which the sample variance is derived from the Mann–Whitney statistics. Alternatively, we suggest the NI, DF or BN method to calculate the sample variance in the AUC. We assumed a credit scoring model that generates continuous scores, so that ties rarely occur, whereas the ECB introduces this test for credit ratings rather than credit scores. Credit ratings are discrete values and ties do occur. Often, however, rating models are derived from scoring models with a continuous outcome, such as logistic regression or discriminant analysis. As such, the test described here can be applied to the credit scores before they are mapped to credit ratings. Recent research reveals that the information loss resulting from mapping credit scores to credit ratings is low (van der Burgt 2019).

Declaration of interest

The author reports no conflict of interest. The author alone is responsible for the content and writing of the paper. The views expressed in this article are those of the author and do not necessarily reflect those of Nationale Nederlanden Group.

References

  • Basel Committee on Banking Supervision (2005). Studies on the validation of internal rating systems. Working Paper 14, Basel Committee on Banking Supervision, Bank for International Settlements. URL: http://www.bis.org/publ/bcbs_wp14.pdf.
  • Cortes, C., and Mohri, M. (2004). Confidence intervals for the area under the ROC curve. In Proceedings of the 17th International Conference on Neural Information Processing Systems. Advances in Neural Information Processing Systems, Volume 17, pp. 305–312. MIT Press, Cambridge, MA.
  • Engelmann, B., Hayden, E., and Tasche, D. (2003). Measuring the discriminative power of rating systems. Discussion Paper 01/2003, Series 2: Banking and Financial Supervision, Deutsche Bundesbank.
  • European Central Bank (2019). Instructions for reporting the validation results of internal models: IRB Pillar I models for credit risk. Report, February, European Central Bank, Frankfurt. URL: https://bit.ly/3gTxMnM.
  • Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (https://doi.org/10.1148/radiology.143.1.7063747).
  • Iyer, R., Khwaja, A. I., Luttmer, E. F., and Shue, K. (2015). Screening peers softly: inferring the quality of small borrowers. Management Science 62(6), 1554–1577 (https://doi.org/10.1287/mnsc.2015.2181).
  • Krzanowski, W., and Hand, D. J. (2009). ROC Curves for Continuous Data. Monographs on Statistics and Applied Probability, Volume 111. Chapman & Hall/CRC Press, Boca Raton, FL.
  • Patefield, M., and Tandy, D. (2000). Fast and accurate calculation of Owen’s T function. Journal of Statistical Software 5(5), 1–25 (https://doi.org/10.18637/jss.v005.i05).
  • Russell, H., Tang, Q. K., and Dwyer, D. W. (2012). The effect of imperfect data on default prediction validation tests. The Journal of Risk Model Validation 6(1), 77–96 (https://doi.org/10.21314/JRMV.2012.085).
  • Stein, R. M. (2007). Benchmarking default prediction models: pitfalls and remedies in model validation. The Journal of Risk Model Validation 1(1), 77–113 (https://doi.org/10.21314/JRMV.2007.002).
  • Stein, R. M. (2016). Evaluating discrete choice prediction models when the evaluation data is corrupted: analytic results and bias corrections for the area under the ROC. Data Mining and Knowledge Discovery 30(4), 763–796 (https://doi.org/10.1007/s10618-015-0437-7).
  • Tasche, D. (2008). Validation of internal rating systems and PD estimates. In The Analytics of Risk Model Validation, Christodoulakis, G., and Satchell, S. (eds), pp. 169–196. Academic Press (https://doi.org/10.1016/b978-075068158-2.50014-7).
  • Tasche, D. (2010). Estimating discriminatory power and PD curves when the number of defaults is small. Preprint (arXiv:0905.3928v2). URL: http://arxiv.org/pdf/0905.3928.pdf.
  • van der Burgt, M. J. (2008). Calibrating low-default portfolios, using the cumulative accuracy profile. The Journal of Risk Model Validation 1(4), 17–33 (https://doi.org/10.21314/JRMV.2008.016).
  • van der Burgt, M. J. (2019). Calibration and mapping of credit scores by riding the cumulative accuracy profile. The Journal of Credit Risk 15(1), 1–25 (https://doi.org/10.21314/JCR.2018.240).
  • Wu, J. C., Martin, A. F., and Kacker, R. N. (2016). Validation of nonparametric two-sample bootstrap in ROC analysis on large datasets. Communications in Statistics: Simulation and Computation 45(5), 1689–1703 (https://doi.org/10.1080/03610918.2015.1065327).
