Journal of Credit Risk


Calibration and mapping of credit scores by riding the cumulative accuracy profile

Marco van der Burgt

  • This paper introduces a new algorithm for mapping credit scores to ratings, based on step-wise partitioning of the cumulative accuracy profile.
  • Testing the algorithm by simulating two different scoring models reveals that the algorithm generates significantly different rating grades and a monotonic default probability scale.
  • The algorithm also shows that rating calibration is a trade-off between optimizing discriminatory power and avoiding concentrations in rating grades.
  • The algorithm produces a number of rating classes that behaves as a power law of the accuracy ratio of the credit scores and the unconditional probability of default.

Much literature exists on credit risk scoring techniques, but less research is available on the mapping of credit scores to ratings and the calibration of ratings. This paper introduces an algorithm for mapping credit scores to credit ratings and estimating a probability of default (PD) per rating grade. The algorithm is based on stepwise partitioning of the cumulative accuracy profile, so that requirements such as stable ratings and a monotonic PD scale, as stated by the European Banking Association’s regulatory technical standards, are fulfilled. We test the algorithm by simulating different PD models and score distributions. These tests reveal that the algorithm maps credit scores to significantly different rating grades. Each rating corresponds to a PD, which is a monotonic function of the rating grade. The tests also show that the total number of rating grades resulting from the mapping algorithm strongly depends on the ability of the scoring model to discriminate between defaulting and nondefaulting counterparties.

1 Introduction

Financial institutions that qualify for the internal ratings-based (IRB) approach are allowed to develop their own internal rating models to assess the credit quality of their customers (Basel Committee on Banking Supervision 2004). The Basel Committee on Banking Supervision (BCBS) recently proposed to restrict the development of internal rating models for financial institutions (Basel Committee on Banking Supervision 2016). The portfolio under consideration should be suitable for modelling, and enough historical data should be available. The proposal only allows well-known statistical methods such as logistic regression (Hayden and Porath 2006). These models use borrower data to generate a credit score, which is mapped to a rating.

The rating represents the creditworthiness of the borrower. Before applying ratings in credit processes, financial institutions attach a probability of default (PD) to each rating. This process is referred to as calibration. In our view, there is not much literature on calibration techniques. Stein (2007) remarks that calibration relates to the discriminatory power of a model. Discriminatory power is the ability to discriminate ex ante between defaulting and nondefaulting borrowers (Tasche 2005). Earlier approaches use the relation between discriminatory power and calibration (Falkenstein et al 2000; van der Burgt 2008; Tasche 2010). They demonstrate how the calibration of a rating system can be derived from the cumulative accuracy profile (CAP), which visualizes discriminatory power and is therefore also called a power curve.

The BCBS proposal also states that ratings should remain stable over time. Having unstable ratings leads to large volatility in the PDs, which causes volatility in loan pricing, economic capital and regulatory capital. The rating stability closely relates to the rating philosophy, ie, the behavior of credit ratings during macroeconomic cycles (Araten 2007). In general, two rating philosophies exist: point-in-time (PIT) ratings adjust quickly to changing macroeconomic conditions, whereas through-the-cycle (TTC) ratings tend to remain stable over the length of the macroeconomic business cycle. The rating philosophy depends to a large extent on the input variables of the credit scoring model. PIT rating models use current financial ratios and trends as input, whereas TTC rating models use structural factors such as technology dependence, supplier risk and peer industry scores as input.

This paper assumes a TTC credit scoring model based on widely used statistical techniques and introduces an algorithm for mapping credit scores to ratings and calibrating these ratings. Our algorithm, which builds on the approaches described above, uses an iterative stepwise partitioning of the x-axis of the CAP. This partitioning depends on the shape of the CAP; therefore, we call this method “riding the curve”. The algorithm results in stable ratings, as required by the BCBS proposal, and addresses other regulatory requirements regarding the number of rating grades and the concentration of exposures per grade. For example, the European Banking Association (EBA) recently published draft regulatory technical standards, which state that “the number of rating grades and pools is adequate to ensure a meaningful risk differentiation” (European Banking Association 2016). Further, “the concentration of numbers of exposures or obligors is not excessive in any grade or pool, unless supported by convincing empirical evidence of homogeneity of risk of those exposures or obligors”. The algorithm also complies with other requirements, such as a monotonic increase of the PD with deteriorating credit rating.

This paper is organized as follows. Section 2 describes the CAP and how it relates to the conditional PD given a credit score. Section 3 introduces requirements for rating mapping and calibration. Section 4 shows how the CAP curve is applied in mapping and calibration by addressing these requirements. Section 5 demonstrates the CAP-based calibration for two credit scoring models. The last section concludes. Appendix A (available online) provides the mathematical background of the mapping algorithm, and Appendix B (also available online) presents closed-form equations for the CAP.

Table 1 provides an overview of the mathematical symbols used in this paper.

Table 1: Summary of mathematical symbols.
Symbol Description
i Counterparty index
Pu Unconditional PD
Pc(s) PD, conditional on the credit score s
F[s] Distribution of counterparties over score s
FD[s] Distribution of defaulting counterparties over score s
x Observed cumulative distribution of counterparties over score s
y Observed cumulative distribution of defaulting counterparties over score s
xr Cumulative distribution of counterparties with a rating worse than r
yr Cumulative distribution of defaulting counterparties with a rating worse than r
AS Area under the CAP of the credit scoring system
AR Area under the CAP of the rating system
ARS Accuracy ratio of the credit scoring system
ARR Accuracy ratio of the rating system
IL Information loss due to mapping scores to ratings
NT Total number of counterparties in the loan portfolio
DT Total number of defaulting counterparties in the loan portfolio
Nr Total number of counterparties in rating grade r
Dr Total number of defaulting counterparties in rating grade r
ϕ[] Cumulative standard normal distribution
ϕ^{-1}[] Inverse of the cumulative standard normal distribution
LD Minimum limit for significantly different rating grades
Rtot Total number of rating grades, resulting from the calibration
λ Parameter, related to the number of counterparties, the unconditional PD and the accuracy ratio
a1, a2, β Parameters of the credit scoring model
ε Estimation error

2 Description of the cumulative accuracy profile

This section briefly introduces the CAP, which is central in our mapping and calibration algorithm. An extensive discussion of the CAP is given elsewhere (Falkenstein et al 2000; Engelmann et al 2003; Tasche 2010). We assume a credit scoring model that assigns a credit score s between a minimum score Smin and a maximum score Smax to each counterparty in the credit portfolio. A high credit score corresponds to a low credit risk.

A sound credit scoring system exhibits high discriminatory power. Two curves exist to visualize discriminatory power: the CAP and the receiver operating curve (ROC). Both power curves are constructed by sorting the debtors from low scores to high scores, ie, by decreasing credit risk. The ROC presents the cumulative percentage of defaulting counterparties versus the cumulative percentage of nondefaulting counterparties (Tasche 2005, 2010; Irwin and Irwin 2012). By construction, every point on the ROC gives a measure of type I and type II error (Stein 2007), which makes the ROC eligible for optimal decision thresholds (Irwin and Irwin 2012). Further, models exist for fitting ROCs, based on the underlying score distributions, and the area under the ROC can be interpreted as an unbiased percentage of correct classifications (Irwin and Callaghan 2006; Irwin and Irwin 2012).

The CAP represents the cumulative percentage of defaults versus the cumulative percentage of counterparties. We prefer the CAP for calibration because it allows a derivation of the PD (Stein 2007). To see this, we define x as the cumulative distribution of counterparties with scores lower than s,

  x = F[s] = P[S ≤ s],   (2.1)

and y as the cumulative distribution of defaulting counterparties with scores lower than s,

  y = F_D[s] = P[S ≤ s | D].   (2.2)

Combining (2.1) and (2.2) gives y = F_D[F^{-1}[x]] for the CAP, which has the following property (Falkenstein et al 2000; Tasche 2010):

  P_c(s) = P_u [dy/dx]_{x=x(s)}.   (2.3)

This property does not hold for the ROC. In (2.3), P_c(s) denotes the conditional PD given the credit score s,

  P[D | S = s] = P_c(s),   (2.4)

and P_u denotes the unconditional PD, obtained by integrating P_c(s) over the score distribution density dF[s]/ds:

  P_u = P[D] = ∫_{S_min}^{S_max} P_c(s) (dF[s]/ds) ds = ∫_0^1 P_c(s) dF[s] = ∫_0^1 P_c(x) dx.   (2.5)

In practice, P_u is a sample default rate, which is estimated as the total number of defaults D_T divided by the total number of counterparties N_T in the portfolio. When the sample default rate deviates significantly from the “true” population default rate, (2.3) leads to an erroneous calibration. Since the sample default rate converges to the population default rate in the limit N_T → ∞, this risk is limited when (2.3) is applied only to large portfolios, such as retail portfolios. We will consider the consequences of population versus sample default rate in the algorithm description.

Figure 1: CAP for a credit scoring model with perfect, high, moderate or no discriminatory power.

Figures 1 and 2 demonstrate the CAP and (2.3) for several credit scoring models. A credit scoring model with moderate discriminatory power has a CAP with a concave shape, and the conditional PD decreases gradually with increasing x. For highly discriminatory models, defaults occur only at the worst credit scores. In this case, the CAP rises steeply for small x, and the PD curve decreases fast with increasing x, as shown by the black curves in the figures. The black dashed lines in the figures correspond to a model with no discriminatory power. In this case, the cumulative percentage of defaults increases proportionally with the number of counterparties: the CAP resembles the line y = x, and its derivative is independent of x. The corresponding PD curve is flat and equals P_u for all counterparties, as shown in Figure 2.

Figure 2: PD, derived from the CAP with perfect, high, moderate or no discriminatory power.

Since we investigate the mapping of credit scores in relation to their discriminatory power, we also introduce the accuracy ratio for credit scoring (ARS). This is defined as (Tasche 2005, 2010):

  AR_S = (2A_S - 1)/(1 - P_u),   (2.6)

where AS is the area under the CAP:

  A_S = Σ_{i=1}^{N_T} (1/2) [x(i) - x(i-1)][y(i) + y(i-1)],   (2.7)

with x(0) = 0 and y(0) = 0. The AR_S varies between 0 for nondiscriminatory scoring models and 1 for perfectly discriminatory scoring models.
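As an illustration, (2.7) and (2.6) translate into a few lines of numpy. This is a sketch, not the author's code; the array names `scores` and `defaults` are assumptions:

```python
import numpy as np

def cap_curve(scores, defaults):
    """CAP coordinates: sort counterparties from low (worst) to high (best)
    score; x is the cumulative share of counterparties, y the cumulative
    share of defaults. Both arrays start at the origin (0, 0)."""
    order = np.argsort(scores)
    d = np.asarray(defaults, dtype=float)[order]
    x = np.arange(0, len(d) + 1) / len(d)
    y = np.concatenate(([0.0], np.cumsum(d) / d.sum()))
    return x, y

def accuracy_ratio(x, y, p_u):
    """Trapezoidal area under the CAP, eq. (2.7), plugged into eq. (2.6)."""
    a_s = 0.5 * np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]))
    return (2.0 * a_s - 1.0) / (1.0 - p_u)
```

For a perfectly discriminating model the ratio is 1; when defaults are spread evenly over the scores it is 0.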

After mapping scores to ratings, a counterparty has a rating r when their credit score S is larger than sr-1 and smaller than or equal to sr. In this paper, the rating r is an integer, which increases with decreasing credit risk. We define xr as the cumulative distribution of counterparties with a rating worse than r,

  x_r = P[R ≤ r] = P[S ≤ s_r],   (2.8)

and yr as the cumulative distribution of defaults with a rating worse than r,

  y_r = P[R ≤ r | D] = P[S ≤ s_r | D].   (2.9)

We derive the PD per rating grade from the CAP in a similar way to (2.3) for credit scores. Bayes’s law gives the following equation for the PD, conditional on rating r:

  P_r = P[D | R = r]
      = P[D | r-1 < R ≤ r]
      = P[D ∩ {r-1 < R ≤ r}] / P[r-1 < R ≤ r]
      = P[r-1 < R ≤ r | D] P[D] / P[r-1 < R ≤ r].   (2.10)

Using the definitions x_r = P[R ≤ r] and y_r = P[R ≤ r | D] gives

  P_r = (P[R ≤ r | D] - P[R ≤ r-1 | D]) / (P[R ≤ r] - P[R ≤ r-1]) × P[D] = ((y_r - y_{r-1})/(x_r - x_{r-1})) P_u.   (2.11)

Equation (2.11) expresses the PD in the CAP coordinates x and y. It is a discrete version of (2.3): credit scores vary continuously in the interval [S_min, S_max], while ratings take only the discrete values r = 1, 2, ..., R_tot, where R_tot is the total number of rating grades.
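Equation (2.11) is a one-liner in code; a minimal sketch, assuming boundary arrays that include the origin:

```python
import numpy as np

def pd_per_grade(x_bounds, y_bounds, p_u):
    """Eq. (2.11): PD per rating grade from CAP coordinates.
    x_bounds, y_bounds are the grade boundaries (x_0, ..., x_Rtot) and
    (y_0, ..., y_Rtot), starting at x_0 = y_0 = 0."""
    x = np.asarray(x_bounds, dtype=float)
    y = np.asarray(y_bounds, dtype=float)
    return np.diff(y) / np.diff(x) * p_u
```

With a concave CAP the slope Δy/Δx decreases in x, so the grade PDs come out monotonically decreasing: rating 1 (worst) carries the highest PD.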

The mapping of credit scores to ratings reduces information, because ratings have a lower granularity than credit scores. To investigate this information loss, we calculate the area under the CAP in terms of ratings as

  A_R = Σ_{r=1}^{R_tot} (1/2) [x_r - x_{r-1}][y_r + y_{r-1}],   (2.12)

with x_0 = 0 and y_0 = 0. We use this result to calculate an accuracy ratio AR_R for ratings, analogously to (2.6), and define the information loss as

  IL = (AR_S - AR_R)/AR_S = (A_S - A_R)/(A_S - A_0),   (2.13)

with A_0 = 1/2 as the area under the CAP for a nondiscriminatory scoring system. The IL compares the information loss A_S - A_R with the maximum possible information loss A_S - A_0. Two extreme cases illustrate this.

  • When AR_S → 0, the credit scoring model has barely any discriminatory power: no significantly different default risk exists between the scores, which are all mapped to a single rating grade. In this case, R_tot = 1, leading to A_R = 1/2 by (2.12) and IL = 1 by (2.13).

  • When the credit scoring model has high discriminatory power, the credit scores are mapped to a high number of ratings and the information loss converges to zero.

We will use the information loss in testing the mapping algorithm.
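The information loss (2.13) follows from the two areas; a sketch that recomputes (2.7) and (2.12) by the trapezoidal rule, with assumed array arguments:

```python
import numpy as np

def trapezoid_area(x, y):
    """Area under a piecewise-linear CAP, as in (2.7) and (2.12)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 0.5 * np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]))

def information_loss(x_scores, y_scores, x_grades, y_grades):
    """Eq. (2.13): IL = (A_S - A_R) / (A_S - A_0) with A_0 = 1/2."""
    a_s = trapezoid_area(x_scores, y_scores)
    a_r = trapezoid_area(x_grades, y_grades)
    return (a_s - a_r) / (a_s - 0.5)
```

Collapsing all scores into one grade ([0, 1] boundaries) gives IL = 1; keeping every score as its own grade gives IL = 0.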

3 Calibration requirements of rating systems

The draft regulatory technical standards of the EBA state the following (European Banking Association 2016):

  (1) the number of rating grades and pools is adequate to ensure a meaningful risk differentiation; and

  (2) the concentration of numbers of exposures or obligors is not excessive in any grade or pool, unless supported by convincing empirical evidence of homogeneity of risk of those exposures or obligors.

Based on this, we introduce two calibration requirements.

Requirement 1: monotonicity and stability of the PD

This requirement means that the PD P_r increases monotonically with increasing credit risk, and that the mapping of scores to ratings should be stable. We assume a TTC scoring model. To avoid unstable ratings and to enforce monotonicity, the default risk in rating grade r-1 should be significantly larger than the default risk in the adjacent rating grade r. This means P_{r-1} > P_r and rejection of the following null hypothesis:

  H_0: PD of rating grade r = PD of rating grade r-1.
Table 2: Contingency table for testing difference in default rates of adjacent rating grades.
                   Nondefaults                      Defaults        Total
Rating class r-1   N_{r-1} - D_{r-1}                D_{r-1}         N_{r-1}
Rating class r     N_r - D_r                        D_r             N_r
Total              N_{r-1} + N_r - D_{r-1} - D_r    D_{r-1} + D_r   N_{r-1} + N_r

We use a contingency table (see Table 2) to test the null hypothesis. Table 2 shows the number of counterparties (N_r), the number of nondefaulting counterparties (N_r - D_r) and the number of defaulting counterparties (D_r) per rating grade r. The table presents similar quantities for rating r-1. The following test statistic is derived from Table 2 (Fleiss et al 2003; Sheskin 2000):

  T^2 = (N_{r-1} + N_r) [D_r(N_{r-1} - D_{r-1}) - D_{r-1}(N_r - D_r)]^2 / [(N_r + N_{r-1} - D_r - D_{r-1})(D_r + D_{r-1}) N_r N_{r-1}].   (3.1)

This statistic is chi-squared distributed with one degree of freedom: T^2 ∼ χ_1^2. T itself is standard normally distributed. For a significance level α, the null hypothesis is rejected at confidence level 1 - α when T > ϕ^{-1}[1 - α], where ϕ^{-1}[] represents the inverse of the cumulative standard normal distribution. We call this the adjacent rating grade (ARG) test.
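The ARG test is a standard 2×2 chi-square test; a self-contained sketch of (3.1), with hypothetical grade counts:

```python
def arg_test_statistic(n_prev, d_prev, n_cur, d_cur):
    """T^2 of eq. (3.1) for adjacent grades r-1 (n_prev counterparties,
    d_prev defaults) and r (n_cur, d_cur); chi-squared with one degree
    of freedom under the null hypothesis of equal default rates."""
    num = (n_prev + n_cur) * (d_cur * (n_prev - d_prev)
                              - d_prev * (n_cur - d_cur)) ** 2
    den = (n_prev + n_cur - d_prev - d_cur) * (d_prev + d_cur) * n_prev * n_cur
    return num / den
```

With α = 5%, the grades differ significantly when T = √(T^2) exceeds ϕ^{-1}[0.95] ≈ 1.65, ie when T^2 exceeds roughly 2.71.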

The ARG test can be translated into CAP terminology. Substituting N_r = N_T(x_r - x_{r-1}) and D_r = D_T(y_r - y_{r-1}) into (3.1), the test statistic T becomes a function T(x_r | x_{r-1}, x_{r-2}) of x_r given x_{r-1} and x_{r-2}:

  T(x_r | x_{r-1}, x_{r-2}) = (P_{r-1} - P_r)/√(P_{r,r-1}(1 - P_{r,r-1})) × √(N_T (x_r - x_{r-1})(x_{r-1} - x_{r-2})/(x_r - x_{r-2})).   (3.2)

The probabilities Pr-1 and Pr are calculated by (2.11), and Pr,r-1 is a weighted average of Pr-1 and Pr:

  P_{r,r-1} = ((y_r - y_{r-2})/(x_r - x_{r-2})) P_u = ((x_r - x_{r-1})/(x_r - x_{r-2})) P_r + ((x_{r-1} - x_{r-2})/(x_r - x_{r-2})) P_{r-1}.   (3.3)

Defining the limit L_D = ϕ^{-1}[1 - α], the null hypothesis is rejected when

  T(x_r | x_{r-1}, x_{r-2}) ≥ L_D.   (3.4)

Since α is small, the limit L_D is positive, and (3.4) means that the test statistic must also be positive, which is the case when P_{r-1} > P_r. Therefore, (3.4) also implies monotonicity of the PD scale. Appendix A online shows that the test statistic T^2(x_r | x_{r-1}, x_{r-2}) can be approximated by a parabolic function of x_r,

  T^2(x_r | x_{r-1}, x_{r-2}) ≈ λ(x_r - x_{r-1})(x_{r-1} - x_{r-2})(x_r - x_{r-2}),   (3.5)

with the parameter λ based on a closed-form expression y = F_D[F^{-1}[x]] = C(x) for the CAP,

  λ = P_u N_T {C''(x_{r-1})}^2 / (4C'(x_{r-2})),   (3.6)

in which C' and C'' denote the first and second derivatives of the function C(x), respectively. Figure 3 shows T^2(x_r | x_{r-1}, x_{r-2}) versus x_r. The test statistic is a monotonically increasing function of x_r on the interval [x_{r-1}, ∞): when x_r increases, starting from x_{r-1}, the test statistic starts at zero and increases as well. The approximation in (3.5) shows that T^2(x_r | x_{r-1}, x_{r-2}) = L_D^2 when x_r reaches a critical value x_c:

  x_c = x_{r-1} + ((x_{r-1} - x_{r-2})/2) [√(1 + 4L_D^2/(λ(x_{r-1} - x_{r-2})^3)) - 1].   (3.7)

Therefore, the requirement of stable ratings means that x_r should at least equal this critical value: x_r ≥ x_c.
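The critical value (3.7) is straightforward to compute; a sketch in which `lam` stands for the λ of (3.6) and the numbers in the usage below are arbitrary:

```python
import math

def critical_x(x_prev, x_prev2, lam, l_d=2.0):
    """Eq. (3.7): the x_r at which the approximated T^2 of (3.5) reaches
    L_D^2, given the previous boundaries x_{r-1} (x_prev), x_{r-2} (x_prev2)."""
    d = x_prev - x_prev2
    return x_prev + 0.5 * d * (math.sqrt(1.0 + 4.0 * l_d ** 2 / (lam * d ** 3)) - 1.0)
```

By construction, substituting x_c back into the right-hand side of (3.5) returns exactly L_D^2.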

Figure 3: The test statistic T^2(x_r | x_{r-1}, x_{r-2}) as a function of x_r. When x_r equals the critical value x_c, the test statistic equals L_D^2.

Requirement 2: no excessive concentrations in a rating grade

The EBA guidelines prescribe no excessive concentration in a rating grade (European Banking Association 2016). A rating system does not make sense when most counterparties are assigned to a single rating, leading to a loss in discriminatory power as discussed in Section 2. A proper rating system distinguishes sufficiently different risk levels. Specifically, when the counterparties in the portfolio are homogeneously distributed over the rating grades, we have

  x_r - x_{r-1} ≈ Δx   for all 1 ≤ r ≤ R_tot.   (3.8)

Equations (3.5) and (3.6) reveal that these equidistant intervals occur when

  x_r - x_{r-1} = x_{r-1} - x_{r-2} = (L_D^2/(2λ))^{1/3}.   (3.9)

This result is also derived using (A.7) in Appendix A online. Equation (3.9) aligns with intuition: when the limit LD in the ARG test increases, larger intervals xr-xr-1 are required to obtain significantly different default rates per grade r. On the other hand, λ is large for a highly discriminatory scoring system, resulting in small intervals.

Equation (3.9) is based on the assumption that λ is constant. In practice, λ varies with x_{r-1} and x_{r-2}, especially for highly discriminatory scoring models. In these cases, defaults occur only at the worst credit scores, which correspond to small x values. In the region where x approaches 1, the slope of the CAP, and therefore the default rate, is almost zero, and one cannot define rating grades with significantly different default rates. This may cause high concentrations in low-risk ratings. Since the focus is on stable ratings and the monotonicity of the PD curve, we let the first requirement prevail over the second. This also aligns with the EBA guidelines (European Banking Association 2016, Article 36.1b).

4 Mapping scores to ratings by “riding the curve”

This section describes how to find a score interval [sr-1,sr] for every rating r by using the CAP, such that the requirements of the previous section are fulfilled. Equation (3.7) allows a mapping of credit scores to ratings by an iterative procedure, which starts at x=0 and proceeds stepwise toward x=1, depending on how fast the CAP increases with increasing x. Therefore, we call this algorithm “riding the curve”. Setting xr equal to xc in every iteration step results in an interval [xr-1,xr] for every rating grade. Since every xr corresponds uniquely to a score sr by (2.8), we can translate the interval [xr-1,xr] into a score interval [sr-1,sr] per rating grade. However, we need to consider three issues.

First, the procedure requires a function y=C(x) for the CAP to calculate λ. Several authors have introduced functions for the CAP (van der Burgt 2008; Tasche 2010). If we know the score distributions of the counterparties and defaulting counterparties, we can derive a function y=FD[F-1[x]] from these distributions (Tasche 2010). Since these distributions are unknown in general, we use a regression approach by fitting the CAP to a multi-exponential function C(x):

  y = C(x) + ε = Σ_{j=1}^{m} B_j (1 - exp(-k_j x))/(1 - exp(-k_j)) + ε,   (4.1)

with Σ_{j=1}^{m} B_j = 1 and

  Σ_{j=1}^{m} B_j k_j/(1 - exp(-k_j)) ≤ 1/P_u.

Due to the first constraint, the number of parameters is 2m-1. The second constraint guarantees that Pc(x) cannot be larger than 1, as shown in Appendix B online. Equation (4.1) is based on the following assumptions.

  • The error terms ε have zero mean and are independent: E[ε_i] = 0 and E[ε_i ε_j] = 0 for i ≠ j. Dependency among the error terms does not impact the parameter estimates themselves, but it may increase the standard errors of the parameters.

  • The probability Pc(x) is a multi-exponential function, as shown in Appendix B online.

The error terms are kept small by selecting enough exponential terms (m) in C(x). In our experience, m = 2 already gives a good fit to most CAPs. We use C(x) in (3.6) and (3.7) to find x_r = x_c. As shown in Appendix A online, (3.7) is based on second-order Taylor approximations of C(x). Therefore, we also use (3.2) for testing T(x_r | x_{r-1}, x_{r-2}) ≥ L_D.
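As a minimal sketch of the regression step for m = 1 in (4.1), a brute-force grid search over the single shape parameter k₁ can stand in for a full nonlinear least-squares routine (the function names and grid are assumptions, not the author's implementation):

```python
import numpy as np

def cap_m1(x, k):
    """Single-exponential CAP, m = 1 in (4.1): C(x) = (1 - e^{-kx})/(1 - e^{-k})."""
    return (1.0 - np.exp(-k * x)) / (1.0 - np.exp(-k))

def fit_cap_m1(x, y, k_grid=None):
    """Least-squares fit of k1 by exhaustive search over a grid of candidates."""
    if k_grid is None:
        k_grid = np.linspace(0.1, 50.0, 500)
    sse = [np.sum((y - cap_m1(x, k)) ** 2) for k in k_grid]
    return k_grid[int(np.argmin(sse))]
```

In practice a proper nonlinear regression (eg, Levenberg-Marquardt) would be used, which also yields standard errors for the parameters.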

A second issue is the determination of the first interval [x_0, x_1]. We set x_0 = 0, but (3.7) cannot be applied here to find x_1. The choice of the interval [x_0, x_1] is also critical: when this interval is small, the next interval [x_1, x_2], as calculated by (3.7), will be large. This may give rise to periodic partitioning: Appendix A online shows that the parabolic behavior of the test statistic T(x_r | x_{r-1}, x_{r-2}) around x_r may result in alternating small and large intervals of x. To avoid this, we set x_0 = 0 and apply the equidistant interval of (3.9) for r = 1:

  x_1 = (L_D^2/(2λ))^{1/3},   (4.2)

with λ calculated as λ = P_u N_T {C''(0)}^2/(4C'(0)).

The third aspect is the dependence on the sample default probability P_u. When the sample default probability is 0.75% while the “true” population default probability is 1.25%, the parameter λ is too small, and the intervals [x_{r-1}, x_r] widen according to (3.7) and (3.9). The rating grades still differ significantly in default risk, but the probability P_r per rating r is underestimated by (2.11). This reverses when the sample default rate is higher than the population default rate: the default probabilities are overestimated and the intervals [x_{r-1}, x_r] are too small, resulting in rating grades that do not differ significantly in default risk. The “true” population default rate is not known, but we can construct a confidence interval P_u ± 1.96√(P_u(1 - P_u)/N_T) that contains the population default rate at a 95% confidence level, assuming independence in default behavior. When N_T is large enough, the population default rate will not differ substantially from P_u. This makes the calibration algorithm suitable for retail portfolios, which consist of a high number of counterparties.

Based on the above considerations, the algorithm for mapping and calibration proceeds as follows.

  (1) Construct the CAP curve as described in Section 2 and fit the CAP to a closed-form function C(x).

  (2) Find the first interval [x_0, x_1] for rating r = 1, which is the worst credit rating grade, by setting x_0 = 0 and calculating x_1 by (4.2).

  (3) Given the interval [x_{r-2}, x_{r-1}], set x_r = x_c with x_c as calculated by (3.7). If necessary, increase x_r until T(x_r | x_{r-1}, x_{r-2}) ≥ L_D. If x_r is larger than 1, set x_r = 1.

  (4) Repeat step (3) until x_r = 1. Then, the total number of rating grades R_tot equals r. Because x_r is capped at 1 in step (3), the highest rating grade and the second-highest rating grade may not exhibit a significant difference in default risk. In that case, these rating grades are merged into one rating grade.

  (5) Given x_r, find s_r for every r = 1, ..., R_tot, so that every interval [x_{r-1}, x_r] is translated into a score interval [s_{r-1}, s_r]. The PD is P_r = ((y_r - y_{r-1})/(x_r - x_{r-1})) P_u for every r = 1, ..., R_tot.
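The steps above can be sketched end to end. This illustrative implementation assumes the fitted CAP is supplied through its first and second derivatives (`c1`, `c2`); the significance recheck of step (3), the grade merge of step (4) and the score translation of step (5) are omitted for brevity:

```python
import math

def ride_the_curve(c1, c2, p_u, n_t, l_d=2.0):
    """Riding the curve: return grade boundaries 0 = x_0 < x_1 < ... = 1.
    c1 and c2 are the derivatives C'(x) and C''(x) of the fitted CAP;
    lambda follows eq. (3.6)."""
    def lam(x_prev, x_prev2):
        return p_u * n_t * c2(x_prev) ** 2 / (4.0 * c1(x_prev2))
    # Step (2): first boundary by the equidistant rule (4.2).
    x = [0.0, (l_d ** 2 / (2.0 * lam(0.0, 0.0))) ** (1.0 / 3.0)]
    # Step (3): keep setting x_r = x_c of (3.7) until the axis is covered.
    while x[-1] < 1.0:
        d = x[-1] - x[-2]
        x_c = x[-1] + 0.5 * d * (
            math.sqrt(1.0 + 4.0 * l_d ** 2 / (lam(x[-1], x[-2]) * d ** 3)) - 1.0)
        x.append(min(x_c, 1.0))  # cap the last boundary at 1, as in step (3)
    return x
```

A full implementation would follow with an ARG test on the last interval and, where it fails, the merge of the two best grades.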

5 Testing the algorithm by simulation

We investigated whether the algorithm maps scores to ratings with the two requirements fulfilled, and how many rating grades (R_tot) result. We set the significance level to α = 5%, giving ϕ^{-1}[1 - α] = 1.65 in requirement 1. To build in a margin, we rounded this value up to 2, giving L_D = 2 in (3.4). We use two credit scoring models to test the algorithm.

  (1) In scoring model 1, the credit scores vary uniformly between S_min = 0 and S_max = 100, and the conditional PD P_c(s) depends exponentially on the score s. As shown in Appendix B online, this gives the following conditional PD:

    P_c(s) = β P_u exp{-β(s - S_min)} / (1 - exp{-β(S_max - S_min)}).   (5.1)

  The parameter β is positive, ie, a high score means low credit risk. Under these conditions, we can derive a closed-form equation for the CAP that corresponds to m = 1 in (4.1).

  (2) Scoring model 2 assumes a logit function for the conditional PD P_c(s):

    P_c(s) = a_1 / (1 + exp{β(s - a_2)}),   (5.2)

  with a_2 = 80. The credit scores s are normally distributed,

    s ∼ N[S_0, σ^2],   (5.3)

  with S_0 = 100 and σ = 10. As in scoring model 1, β > 0 and a high score means low credit risk. We cannot derive a closed-form equation for the CAP under these circumstances, but a bi-exponential form for the CAP, corresponding to m = 2 in (4.1), results in a good fit.

In both cases, the parameter β is varied to change the AR_S. The default/nondefault state of a counterparty i with score s is simulated by a Bernoulli variable B[i; P_c(s)], which equals 1 (default) with probability P_c(s) and 0 (nondefault) with probability 1 - P_c(s). After running the simulations for N_T counterparties, the CAP of scoring model 1 was regressed to (4.1) with m = 1, and the CAP of scoring model 2 was regressed to (4.1) with m = 2. The regression quality is high: in all cases, the adjusted R^2 is larger than 99.8%.¹

¹ The adjusted R^2 is calculated as R^2_adjusted = 1 - ((N_T - 1)/(N_T - (2m - 1)))(1 - R^2), since the number of parameters is 2m - 1.
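The Bernoulli simulation step can be sketched as follows. For brevity this assumed variant works directly in the CAP coordinate x ∼ U(0, 1) (the score percentile, worst first) with P_c(x) = P_u C'(x) for the m = 1 CAP, so the expected default rate is exactly P_u:

```python
import numpy as np

def simulate_defaults(n_t=100_000, p_u=0.01, k=5.0, seed=7):
    """Draw uniform score percentiles and Bernoulli default indicators
    with P_c(x) = p_u * k * exp(-k x) / (1 - exp(-k)), ie eq. (2.3)
    applied to the single-exponential CAP of (4.1)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n_t)
    p_c = p_u * k * np.exp(-k * x) / (1.0 - np.exp(-k))
    defaults = rng.random(n_t) < p_c  # Bernoulli draw per counterparty
    return x, defaults
```

The simulated portfolio then feeds the CAP regression and the mapping algorithm of Section 4.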

Figures 4-6 present the mapping and calibration results of scoring model 1 with AR_S = 18%, AR_S = 56% and AR_S = 91%, respectively. The number of counterparties is N_T = 100 000 and the unconditional PD is P_u = 1% in all three cases.

Table 3: Intervals (x_{r-1}, x_r), the critical value x_c, score intervals (s_{r-1}, s_r), percentage of counterparties (CPs) and default rate (DR) per rating grade r, and p-value of the ARG test, resulting from the mapping algorithm for scoring model 1 with P_u = 1% and N_T = 100 000.

(a) AR_S = 18%
 r  x_{r-1}  x_r     x_c       s_{r-1}  s_r     CPs (%)  DR (%)  p-value (%)
 1  0.0000   0.1531  0.1531      0.00    15.30   15.31    1.61
 2  0.1531   0.3457  0.3457     15.30    34.63   19.26    1.26    0.53
 3  0.3457   0.5293  0.5293     34.63    53.03   18.36    0.92    0.17
 4  0.5293   1.0000  0.7488     53.03   100.00   47.07    0.75    2.96

(b) AR_S = 56%
 r  x_{r-1}  x_r     x_c       s_{r-1}  s_r     CPs (%)  DR (%)  p-value (%)
 1  0.0000   0.0478  0.0478      0.00     4.74    4.78    3.77
 2  0.0478   0.1096  0.1096      4.74    10.93    6.18    3.02    3.19
 3  0.1096   0.1731  0.1688     10.93    17.37    6.35    2.44    4.55
 4  0.1731   0.2462  0.2421     17.37    24.71    7.31    1.94    4.54
 5  0.2462   0.3215  0.3215     24.71    32.03    7.53    1.34    0.40
 6  0.3215   0.4118  0.4118     32.03    41.05    9.03    0.97    2.68
 7  0.4118   0.5127  0.5127     41.05    51.05   10.09    0.70    3.99
 8  0.5127   0.6352  0.6352     51.05    63.29   12.26    0.38    0.10
 9  0.6352   0.7854  0.7854     63.29    78.34   15.01    0.21    0.92
10  0.7854   1.0000  0.9848     78.34   100.00   21.46    0.07    0.03

(c) AR_S = 91%
 r  x_{r-1}  x_r     x_c       s_{r-1}  s_r     CPs (%)  DR (%)  p-value (%)
 1  0.0000   0.0100  0.0100      0.00     1.01    1.00   18.33
 2  0.0100   0.0230  0.0230      1.01     2.34    1.30   14.92    2.87
 3  0.0230   0.0355  0.0355      2.34     3.68    1.25   10.26    0.04
 4  0.0355   0.0573  0.0507      3.68     5.93    2.17    8.23    4.55
 5  0.0573   0.0701  0.0701      5.93     7.22    1.29    5.83    0.87
 6  0.0701   0.0929  0.0929      7.22     9.43    2.28    4.16    2.50
 7  0.0929   0.1133  0.1133      9.43    11.50    2.04    2.41    0.13
 8  0.1133   0.1503  0.1420     11.50    15.22    3.70    1.65    4.55
 9  0.1503   0.1827  0.1827     15.22    18.43    3.24    0.68    0.02
10  0.1827   0.2366  0.2351     18.43    23.86    5.40    0.37    4.55
11  0.2366   0.3222  0.3222     23.86    32.25    8.55    0.09    0.04
12  0.3222   1.0000  0.5599     32.25   100.00   67.78    0.00    0.00

The dashed lines split the CAP into bands that correspond to ratings: a small (large) bandwidth means a low (high) concentration in the rating grade. Each interval [x_{r-1}, x_r] corresponds to a score interval [s_{r-1}, s_r], which is tabulated in Table 3. The table demonstrates that the default rate increases monotonically with increasing credit risk. Table 3 also presents the critical value x_c as calculated by (3.7). In most cases, x_r = x_c; x_r exceeds the critical value in a few cases for the following reasons.

  (1) The critical value x_c is based on an approximation of T(x_r | x_{r-1}, x_{r-2}).

  (2) In the highest rating, x_r > x_c due to step (4) of the algorithm: when x_r is larger than 1, the rating grade r is merged with rating r-1 into a single rating grade.

The last column presents the p-value of the ARG tests. Since the p-values are all below 5%, the rating grades differ significantly in default rate.

Figure 4: CAP (part (a)) and default probabilities per rating (part (b)), constructed from 100 000 counterparties (scoring model 1 with AR_S = 18% and P_u = 1%). The dashed lines in the CAP represent the partitioning of the x-axis into distinct rating grades, resulting from the mapping algorithm. The regression of the CAP gives an R^2_adjusted of 99.95%.

Figure 4 shows that the algorithm maps credit scores to only four ratings for a credit scoring system with low discriminatory power. The PDs in each rating grade are small and close to the unconditional PD of 1%. Rating grade 4 exhibits a large concentration of 47%.

Figure 5: CAP (part (a)) and default probabilities per rating (part (b)), constructed from 100 000 counterparties (scoring model 1 with AR_S = 56% and P_u = 1%). The dashed lines in the CAP represent the partitioning of the x-axis into distinct rating grades, resulting from the mapping algorithm. The regression of the CAP gives an R^2_adjusted of 99.92%.

Figure 5 presents the ratings and corresponding PDs for a credit scoring system with considerable discriminatory power. The algorithm produces ten ratings and there is a substantial difference between the PDs per grade. The CAP in Figure 5 also shows no excessive concentrations: the largest concentration is 21%, occurring in the rating grade with the lowest risk.

Figure 6: CAP (part (a)) and default probabilities per rating (part (b)), constructed from 100 000 counterparties (scoring model 1 with AR_S = 91% and P_u = 1%). The dashed lines in the CAP represent the partitioning of the x-axis into distinct rating grades, resulting from the mapping algorithm. The regression of the CAP gives an adjusted R² of 99.97%.

Figure 6 demonstrates the results of the algorithm for a credit scoring system with an ARS close to 1, ie, the discriminatory power is extremely high. As expected, the algorithm only distinguishes significantly different rating grades at small x-values. The algorithm produces narrow bands in this range and the rating corresponding to the lowest default risk has an excessive concentration of 68%. As explained in Section 2, the mapping of a highly discriminatory credit scoring model leads to a trade-off between avoiding excessive concentrations and identifying significantly different rating grades.

An important question is how many rating grades result from the algorithm. Equation (3.6) shows that λ relates to the concave shape of the CAP, the number of counterparties and the unconditional PD. When the CAP is more concave, the AR_S is higher and more significantly different rating grades result. Figures 4–6 also support this. To identify the relation between R_tot and AR_S, we examined the following cases.

First, we use a theoretical case in which C(x) equals (4.1) with m=1, giving the following λ:

  $$\lambda \;=\; \frac{P_u N_T}{4}\,\frac{\{C''(x_{r-1})\}^{2}}{C'(x_{r-2})} \;=\; \frac{P_u N_T}{4}\,\frac{k_1^{3}\exp[-k_1(2x_{r-1}-x_{r-2})]}{1-\exp[-k_1]}. \qquad (5.4)$$

We calculate the number of rating grades (R_tot) theoretically by iterating the following recursion until x_r reaches 1:

  $$x_r \;=\; x_{r-1} + \frac{x_{r-1}-x_{r-2}}{2}\left[\frac{4L_D^{2}}{\lambda\,(x_{r-1}-x_{r-2})^{3}} - 1\right] \qquad (5.5)$$

for several values of k1. The accuracy ratio follows from

  $$AR_S \;=\; \frac{1}{1-P_u}\left\{2\int_0^1 C(x)\,\mathrm{d}x - 1\right\} \;=\; \frac{1}{1-P_u}\left\{2\left(\frac{1}{1-\exp(-k_1)}-\frac{1}{k_1}\right)-1\right\}. \qquad (5.6)$$
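This theoretical case can be sketched in Python (a minimal illustration, not code from the paper). It implements the CAP of (4.1) with m = 1, λ per (5.4), the recursion (5.5) and the closed-form AR_S of (5.6). The seed boundaries and the value of L_D are illustrative assumptions, since (3.8), which fixes them, lies outside this section.

```python
import math

def cap(x, k1):
    """Theoretical CAP of (4.1) with m = 1: C(x) = (1 - e^(-k1 x)) / (1 - e^(-k1))."""
    return (1 - math.exp(-k1 * x)) / (1 - math.exp(-k1))

def lam_closed(x1, x2, k1, p_u, n_t):
    """Lambda via the closed form in (5.4), with x1 = x_{r-1} and x2 = x_{r-2}."""
    return (p_u * n_t / 4) * k1**3 * math.exp(-k1 * (2 * x1 - x2)) / (1 - math.exp(-k1))

def lam_general(x1, x2, k1, p_u, n_t):
    """Lambda via the general form in (5.4): (P_u N_T / 4) * C''(x1)^2 / C'(x2)."""
    c_prime = k1 * math.exp(-k1 * x2) / (1 - math.exp(-k1))          # C'(x2)
    c_double_prime = -k1**2 * math.exp(-k1 * x1) / (1 - math.exp(-k1))  # C''(x1)
    return (p_u * n_t / 4) * c_double_prime**2 / c_prime

def accuracy_ratio(k1, p_u):
    """AR_S via the closed form in (5.6)."""
    return (2 * (1 / (1 - math.exp(-k1)) - 1 / k1) - 1) / (1 - p_u)

def count_grades(k1, p_u=0.01, n_t=100_000, l_d=1.64, x0=0.0, x1=0.02, max_iter=1000):
    """Iterate the recursion (5.5) until x_r reaches 1 and count the grades.
    l_d and the seed boundaries (x0, x1) are illustrative assumptions."""
    grades = 1
    for _ in range(max_iter):
        step = x1 - x0
        x_next = x1 + 0.5 * step * (4 * l_d**2 / (lam_closed(x1, x0, k1, p_u, n_t) * step**3) - 1)
        if x_next <= x1:      # step collapsed: stop rather than oscillate
            break
        grades += 1
        if x_next >= 1.0:
            break
        x0, x1 = x1, x_next
    return grades
```

As a consistency check on the reconstruction of (5.4) and (5.6), the two forms of λ agree numerically, and the closed-form AR_S matches direct numerical integration of 2∫₀¹ C(x) dx − 1.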

Second, we investigated the relation between R_tot and the AR_S by simulating scoring model 1 and scoring model 2 for 100 000 counterparties with different parameters β and a_1, such that the AR_S varied between 7% and 96%, while the unconditional PD P_u was kept at 1%.

Figure 7: Number of rating grades (Rtot) versus the ARS for scoring model 1, scoring model 2 and the theoretical model, based on iterating (5.5).

Figure 7 presents the number of rating grades R_tot versus AR_S for scoring model 1 and scoring model 2. As expected, the figure shows that the number of rating grades increases with the accuracy ratio. The dashed line in Figure 7 presents the R_tot, derived by iterating (5.5), versus the AR_S as calculated by (5.6). The observations for scoring model 1 and scoring model 2 closely follow the theoretically calculated dashed line.

A direct relation between ARS and Rtot follows from the relation between the ARS and the concave shape of the CAP: C′′(x) is more negative when the ARS is high. Therefore, λ depends on the ARS by

  $$\lambda \;\propto\; P_u N_T\,\{C''(x)\}^{2} \;\propto\; P_u N_T\,AR_S^{2}.$$

Substituting this dependency in (3.8) and (3.9) gives

  $$R_{\mathrm{tot}} \;\propto\; \frac{1}{\Delta x} \;=\; \left(\frac{2\lambda}{L_D^{2}}\right)^{1/3} \;\propto\; P_u^{1/3}\,N_T^{1/3}\,AR_S^{2/3}. \qquad (5.7)$$

A regression shows that R_tot ∝ AR_S^0.71 for scoring model 1, R_tot ∝ AR_S^0.71 for scoring model 2 and R_tot ∝ AR_S^0.63 for the theoretical curve. This means that all graphs closely follow (5.7), although Figure 6 shows that the mapping algorithm does not always result in a homogeneous distribution over the ratings. Therefore, (5.7) is only a rule of thumb for predicting how many rating grades can be distinguished for a given accuracy ratio.
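Such exponents follow from an ordinary least-squares fit in log-log space. A minimal sketch (the data below are synthetic and exactly power-law, not the paper's simulation results; the prefactor 12 is arbitrary):

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    # Slope of the log-log regression line is the power-law exponent b
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic check: R_tot generated exactly as 12 * AR_S**(2/3),
# mimicking the AR_S^(2/3) dependence predicted by (5.7).
ars = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
rtot = [12 * a ** (2 / 3) for a in ars]
prefactor, exponent = fit_power_law(ars, rtot)
```

On noiseless power-law data the fit recovers the generating exponent exactly; on the simulated R_tot values it yields estimates such as the 0.71 and 0.63 reported above.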

Figure 8: Number of rating grades (Rtot) versus the unconditional probability of default (Pu) for scoring model 1 and scoring model 2, based on simulating 100 000 counterparties.
Figure 9: Information loss as a function of ARS for scoring model 1 and scoring model 2, based on simulating 100 000 counterparties and an unconditional PD of 1%.

We investigated (5.7) further by performing simulations of both scoring models with different unconditional PDs. Figure 8 presents R_tot versus P_u. A regression shows that R_tot ∝ P_u^0.40 for scoring model 1 and R_tot ∝ P_u^0.38 for scoring model 2. The exponents are close to 0.33, as predicted by (5.7). We conclude that this equation explains to a large extent the power law behavior of R_tot as a function of AR_S and P_u.

We also investigated how much discriminatory power is lost by mapping scores to ratings, using the information loss measure as defined in (2.13). Figure 9 reveals that the information loss decreases exponentially with increasing ARS. Therefore, the information loss for both scoring models is regressed against the following equation:

  $$\ln IL \;=\; c_1\,AR_S + c_2\,AR_S^{2}. \qquad (5.8)$$

Equation (5.8) has no intercept since the information loss converges to 1 when AR_S → 0, as explained in Section 2, so ln IL → 0 and the regression line passes through the origin.

Table 4: Coefficients with their 95% confidence intervals, standard errors (SE) and p-values, estimated by regressing ln(IL) against the accuracy ratio AR_S.

                              Scoring model 1               Scoring model 2
                           Coeff.       SE   p-value (%)  Coeff.       SE   p-value (%)
Coefficient of AR_S (c1)   -11.1±1.1   0.54     0.00      -10.6±1.7   0.81     0.00
Coefficient of AR_S² (c2)   -5.3±1.3   0.65     0.00       -5.5±2.3   1.09     0.01
Adjusted R²                  92.1%                          80.5%

Table 4 presents the regression coefficients and the adjusted R². The coefficients of the two scoring models agree within their 95% confidence intervals. Therefore, we conclude that the information loss as a function of AR_S is similar for both scoring models, although the adjusted R² shows that the quality of the fit for scoring model 1 is considerably better than for scoring model 2. Further, the information loss is low for credit scoring models with moderate to high discriminatory power: when AR_S exceeds 60%, the information loss is below 1%.
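The no-intercept regression of (5.8) amounts to solving 2×2 normal equations for c1 and c2. A sketch on synthetic data (generated from coefficients near Table 4's scoring model 1 estimates; these are not the paper's simulated losses):

```python
def fit_no_intercept(ars, ln_il):
    """Least-squares fit of ln IL = c1*AR_S + c2*AR_S^2 without intercept,
    via the 2x2 normal equations for the design matrix [AR_S, AR_S^2]."""
    s11 = sum(a ** 2 for a in ars)  # sum of AR_S^2
    s12 = sum(a ** 3 for a in ars)  # sum of AR_S^3
    s22 = sum(a ** 4 for a in ars)  # sum of AR_S^4
    t1 = sum(a * y for a, y in zip(ars, ln_il))
    t2 = sum(a ** 2 * y for a, y in zip(ars, ln_il))
    det = s11 * s22 - s12 ** 2
    c1 = (t1 * s22 - t2 * s12) / det   # Cramer's rule
    c2 = (s11 * t2 - s12 * t1) / det
    return c1, c2

# Synthetic data built from c1 = -11.1, c2 = -5.3 (near Table 4, scoring model 1);
# an exact fit should recover these coefficients.
ars = [0.1 * i for i in range(1, 10)]
ln_il = [-11.1 * a - 5.3 * a ** 2 for a in ars]
c1, c2 = fit_no_intercept(ars, ln_il)
```

On noiseless data the normal equations recover the generating coefficients; on the simulated losses they yield estimates with the confidence intervals reported in Table 4.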

Since the algorithm depends on the population default rate and discriminatory power, we conclude with two considerations. First, both quantities are inferred from the data. Obviously, data flaws, such as assigning the information of a defaulted counterparty to a nondefault status, directly impact the estimation of Pu and therefore Pr, but they also lead to an underestimated accuracy ratio (Stein 2016; Russell et al 2012). This causes a widening of the intervals xr-xr-1. The rating grades still represent significantly different default risks, but (5.7) shows that the possible number of rating grades is underestimated, and the information loss increases according to (5.8) and Figure 9. Stein (2016) suggests procedures for estimating the number of mislabeled defaults and correcting the power statistics for these errors.

Second, the population default rate and accuracy ratio are related via the default definition. For example, when the default definition changes from payments ninety days in arrears to payments ten days in arrears, relatively solvent counterparties also count as default. The population default rate increases but the accuracy ratio may decrease. Equation (5.7) shows that the accuracy ratio has more of an impact than the Pu, and the number of rating grades will be lower in case of a stricter default definition. The decrease in the accuracy ratio also leads to an increase in information loss according to (5.8) and Figure 9. Therefore, successful application of the algorithm also depends on the data quality and a proper default definition.

6 Conclusion

This paper presents a method for mapping credit scores to ratings and calibrating these ratings by assigning a PD per rating grade. The algorithm is based on two requirements: (1) a monotonic and stable PD per rating grade, enforced by requiring significantly different default risks between adjacent rating grades; and (2) the avoidance of excessive concentrations. Our algorithm uses stepwise partitioning of the x-axis of the CAP.

We tested the algorithm by simulating two scoring models. The algorithm maps credit scores to significantly different ratings. It also shows that rating calibration is a trade-off among optimizing discriminatory power, avoiding concentrations in rating grades and achieving calibration quality in terms of significantly different rating grades. Highly discriminatory credit scoring models lead to an excessive concentration in the lowest-risk rating grade. High concentrations can be avoided by accepting a lower significance level when defining distinct ratings, or by designing credit scoring models with moderate rather than high discriminatory power. The algorithm maps credit scores to ratings, but a small amount of information is lost, as ratings are less granular than credit scores.

The simulations of the two different scoring models reveal that the number of rating grades resulting from the algorithm exhibits power law behavior as a function of the accuracy ratio of the credit scores and the unconditional PD. The information loss decreases with an increasing accuracy ratio of the credit scoring model. This power law behavior and the information loss are similar for both scoring models, although the models are based on different score distributions and conditional PDs. Our overall conclusion is that the algorithm is most sensitive to the discriminatory power of the scoring model and less dependent on the underlying score distribution. Further, the outcomes of the algorithm are sensitive to data quality, by the "garbage in, garbage out" principle, as well as to the default definition.

Declaration of interest

The views expressed in this paper are those of the author and do not necessarily reflect those of Nationale Nederlanden Group. The author would like to thank the anonymous referees for their suggestions.

References

  • Araten, M. (2007). Development and validation of key estimates for capital models. In The Basel Handbook, Ong, M. (ed), 2nd edn. Risk Books, London.
  • Basel Committee on Banking Supervision (2004). International convergence of capital measurement and capital standards (a revised framework). Report, Bank for International Settlements.
  • Basel Committee on Banking Supervision (2016). Reducing variation in credit risk-weighted assets – constraints on the use of internal model approaches. Consultative Document, Bank for International Settlements.
  • Engelmann, B., Hayden, E., and Tasche, D. (2003). Measuring the discriminative power of rating systems. Discussion Paper Series 2: Banking and Financial Studies 2003.01, Deutsche Bundesbank.
  • European Banking Association (2016). Final Draft Regulatory Technical Standards on the specification of the assessment methodology for competent authorities regarding compliance of an institution with the requirements to use the IRB approach in accordance with Articles 144(2), 173(3) and 180(3)(b) of Regulation (EU) No 575/2013. EBA/RTS/2016/03, July 21, EBA.
  • Falkenstein, E., Boral, A., and Carty, L. (2000). RiskCalc for private companies: Moody’s default model. Global Credit Research, Moody’s Investors Service.
  • Fleiss, J. L., Levin, B., and Paik, M. C. (2003). Statistical Methods for Rates and Proportions, 2nd edn. Wiley (https://doi.org/10.1002/0471445428).
  • Hayden, E., and Porath, D. (2006). Statistical methods to develop rating models. In The Basel II Risk Parameters: Estimation, Validation, Stress Testing – With Applications to Loan Risk Management, Engelmann, B., and Rauhmeier, R. (eds), 2nd edn. Springer (https://doi.org/10.1007/3-540-33087-9_1).
  • Irwin, R. J., and Callaghan, K. S. N. (2006). Ejection decisions by strike pilots: an extreme value interpretation. Aviation, Space, and Environmental Medicine 77, 62–64.
  • Irwin, R. J., and Irwin, T. C. (2012). Appraising credit ratings: does the CAP fit better than the ROC? Working Paper WP/12/122, International Monetary Fund.
  • Russell, H., Tang, Q. K., and Dwyer, D. W. (2012). The effect of imperfect data on default prediction validation tests. The Journal of Risk Model Validation 6(1), 1–20 (https://doi.org/10.21314/JRMV.2012.085).
  • Sheskin, D. J. (2000). Handbook of Parametric and Non-Parametric Statistical Procedures, 2nd edn. Chapman & Hall/CRC.
  • Stein, R. M. (2007). Benchmarking default prediction models: pitfalls and remedies in model validation. The Journal of Risk Model Validation 1(1), 77–113 (https://doi.org/10.21314/JRMV.2007.002).
  • Stein, R. M. (2016). Evaluating discrete choice prediction models when the evaluation data is corrupted: analytic results and bias corrections for the area under the ROC. Data Mining and Knowledge Discovery 30(4), 763–796 (https://doi.org/10.1007/s10618-015-0437-7).
  • Tasche, D. (2005). Rating and probability of default validation. Working Paper 14, Studies on the Validation of Internal Rating Systems, Basel Committee on Banking Supervision, Bank for International Settlements.
  • Tasche, D. (2010). Estimating discriminatory power and PD curves when the number of defaults is small. Working Paper, Lloyds Banking Group.
  • van der Burgt, M. J. (2008). Calibrating low-default portfolios, using the cumulative accuracy profile. The Journal of Risk Model Validation 1(4), 17–33 (https://doi.org/10.21314/JRMV.2008.016).
