Journal of Credit Risk
ISSN: 1744-6619 (print), 1755-9723 (online)
Editors-in-chief: Linda Allen and Jens Hilscher
Calibration and mapping of credit scores by riding the cumulative accuracy profile
Need to know
- This paper introduces a new algorithm for mapping credit scores to ratings, based on step-wise partitioning of the cumulative accuracy profile.
- Testing the algorithm by simulating two different scoring models reveals that it generates significantly different rating grades and a monotonic default probability scale.
- The algorithm also shows that rating calibration is a trade-off between optimizing discriminatory power and avoiding concentrations in rating grades.
- The algorithm produces a number of rating classes that behaves as a power law of the accuracy ratio of the credit scores and the unconditional probability of default.
Abstract
A large literature on credit risk scoring techniques exists, but much less research is available on the mapping of credit scores to ratings and the calibration of ratings. This paper introduces an algorithm for mapping credit scores to credit ratings and estimating a probability of default (PD) per rating grade. The algorithm is based on stepwise partitioning of the cumulative accuracy profile, such that requirements like stable ratings and a monotonic PD scale, as stated by the European Banking Association’s regulatory technical standards, are fulfilled. We test the algorithm by simulating different PD models and score distributions. These tests reveal that the algorithm maps credit scores to significantly different rating grades. Each rating corresponds to a PD, which is a monotonic function of the rating grade. The tests also show that the total number of rating grades resulting from the mapping algorithm strongly depends on the ability of the scoring model to discriminate between defaulting and nondefaulting counterparties.
1 Introduction
Financial institutions that qualify for the internal ratings-based (IRB) approach are allowed to develop their own internal rating models to assess the credit quality of their customers (Basel Committee on Banking Supervision 2004). The Basel Committee on Banking Supervision (BCBS) recently proposed to restrict the development of internal rating models for financial institutions (Basel Committee on Banking Supervision 2016). The portfolio under consideration should be suitable for modelling, and enough historical data should be available. The proposal only allows well-known statistical methods such as logistic regression (Hayden and Porath 2006). These models use borrower data to generate a credit score, which is mapped to a rating.
The rating represents the creditworthiness of the borrower. Before applying ratings in credit processes, financial institutions attach a probability of default (PD) to each rating. This process is referred to as calibration. Relatively little literature exists on calibration techniques. Stein (2007) remarks that calibration relates to the discriminatory power of a model. Discriminatory power is the ability to discriminate ex ante between defaulting and nondefaulting borrowers (Tasche 2005). Earlier approaches use the relation between discriminatory power and calibration (Falkenstein et al 2000; van der Burgt 2008; Tasche 2010). They demonstrate how the calibration of a rating system can be derived from the cumulative accuracy profile (CAP), which visualizes discriminatory power and is therefore also called a power curve.
The BCBS proposal also states that ratings should remain stable over time. Having unstable ratings leads to large volatility in the PDs, which causes volatility in loan pricing, economic capital and regulatory capital. The rating stability closely relates to the rating philosophy, ie, the behavior of credit ratings during macroeconomic cycles (Araten 2007). In general, two rating philosophies exist: point-in-time (PIT) ratings adjust quickly to changing macroeconomic conditions, whereas through-the-cycle (TTC) ratings tend to remain stable over the length of the macroeconomic business cycle. The rating philosophy depends to a large extent on the input variables of the credit scoring model. PIT rating models use current financial ratios and trends as input, whereas TTC rating models use structural factors such as technology dependence, supplier risk and peer industry scores as input.
This paper assumes a TTC credit scoring model based on widely used statistical techniques and introduces an algorithm for mapping credit scores to ratings and calibrating these ratings. Our algorithm, which builds on the approaches described above, uses an iterative stepwise partitioning of the x-axis of the CAP. This partitioning depends on the shape of the CAP; therefore, we call this method “riding the curve”. The algorithm results in stable ratings, as required by the BCBS proposal, and addresses other regulatory requirements regarding the number of rating grades and the concentration of exposures per grade. For example, the European Banking Association (EBA) recently published draft regulatory technical standards, which state that “the number of rating grades and pools is adequate to ensure a meaningful risk differentiation” (European Banking Association 2016). Further, “the concentration of numbers of exposures or obligors is not excessive in any grade or pool, unless supported by convincing empirical evidence of homogeneity of risk of those exposures or obligors”. The algorithm also complies with other requirements, such as the monotonic increase in PD with deteriorating credit rating.
This paper is organized as follows. Section 2 describes the CAP and how it relates to the conditional PD given a credit score. Section 3 introduces requirements for rating mapping and calibration. Section 4 shows how the CAP curve is applied in mapping and calibration by addressing these requirements. Section 5 demonstrates the CAP-based calibration for two credit scoring models. The last section concludes. Appendix A (available online) provides the mathematical background of the mapping algorithm, and Appendix B (also available online) presents closed-form equations for the CAP.
Table 1 provides an overview of the mathematical symbols used in this paper.

| Symbol | Description |
|---|---|
| i | Counterparty index |
| p̄ | Unconditional PD |
| PD(s) | PD, conditional on the credit score s |
| f(s) | Distribution of counterparties over score s |
| f_D(s) | Distribution of defaulting counterparties over score s |
| F(s) | Observed cumulative distribution of counterparties over score s |
| F_D(s) | Observed cumulative distribution of defaulting counterparties over score s |
| x_R | Cumulative distribution of counterparties with a rating worse than R |
| y_R | Cumulative distribution of defaulting counterparties with a rating worse than R |
| A_CS | Area under the CAP of the credit scoring system |
| A_R | Area under the CAP of the rating system |
| AR_CS | Accuracy ratio of the credit scoring system |
| AR_R | Accuracy ratio of the rating system |
| IL | Information loss due to mapping scores to ratings |
| N | Total number of counterparties in the loan portfolio |
| N_D | Total number of defaulting counterparties in the loan portfolio |
| N_R | Total number of counterparties in rating grade R |
| N_{D,R} | Total number of defaulting counterparties in rating grade R |
| Φ | Cumulative standard normal distribution |
| Φ⁻¹ | Inverse of the cumulative standard normal distribution |
| T_min | Minimum limit for significantly different rating grades |
| n_R | Total number of rating grades, resulting from the calibration |
| k | Parameter, related to the number of counterparties, the unconditional PD and the accuracy ratio |
| α, β, γ | Parameters of the credit scoring model |
| ε | Estimation error |
2 Description of the cumulative accuracy profile
This section briefly introduces the CAP, which is central in our mapping and calibration algorithm. An extensive discussion of the CAP is given elsewhere (Falkenstein et al 2000; Engelmann et al 2003; Tasche 2010). We assume a credit scoring model that assigns a credit score s between a minimum score s_min and a maximum score s_max to each counterparty in the credit portfolio. A high credit score corresponds to a low credit risk.
A sound credit scoring system exhibits high discriminatory power. Two curves exist to visualize discriminatory power: the CAP and the receiver operating characteristic (ROC) curve. Both power curves are constructed by sorting the debtors from low scores to high scores, ie, by decreasing credit risk. The ROC presents the cumulative percentage of defaulting counterparties versus the cumulative percentage of nondefaulting counterparties (Tasche 2005, 2010; Irwin and Irwin 2012). By construction, every point on the ROC gives a measure of type I and type II error (Stein 2007), which makes the ROC suitable for setting optimal decision thresholds (Irwin and Irwin 2012). Further, models exist for fitting ROCs, based on the underlying score distributions, and the area under the ROC can be interpreted as an unbiased percentage of correct classifications (Irwin and Callaghan 2006; Irwin and Irwin 2012).
The CAP represents the cumulative percentage of defaults versus the cumulative percentage of counterparties. We prefer the CAP for calibration because it allows a derivation of the PD (Stein 2007). To see this, we define x = F(s) as the cumulative distribution of counterparties with scores lower than s,

x = F(s) = ∫_{s_min}^{s} f(u) du,   (2.1)

and y = F_D(s) as the cumulative distribution of defaulting counterparties with scores lower than s,

y = F_D(s) = ∫_{s_min}^{s} f_D(u) du.   (2.2)
Combining (2.1) and (2.2) gives y = F_D(F⁻¹(x)) for the CAP, which has the following property (Falkenstein et al 2000; Tasche 2010):

dy/dx = PD(s)/p̄.   (2.3)
This property does not hold for the ROC. In (2.3), PD(s) presents the conditional PD given the credit score s,

PD(s) = P(default | score = s) = p̄ f_D(s)/f(s),   (2.4)
and p̄ presents the unconditional PD, obtained by integrating PD(s) over the score distribution density f(s):

p̄ = ∫_{s_min}^{s_max} PD(s) f(s) ds.   (2.5)
In practice, p̄ is a sample default rate, which is estimated as the total number of defaults divided by the total number of counterparties in the portfolio. When the sample default rate deviates significantly from the “true” population default rate, (2.3) leads to an erroneous calibration. Since the sample default rate converges to the population default rate in the limit N → ∞, this risk is limited when (2.3) is applied only to large portfolios such as retail portfolios. We will consider the consequences of population versus sample default rate in the algorithm description.
Figures 1 and 2 demonstrate the CAP and (2.3) for several credit scoring models. A credit scoring model with moderate discriminatory power has a CAP with a concave shape, and the conditional PD decreases gradually with increasing x. For highly discriminatory models, defaults only occur at the worst credit scores. In this case, the CAP curve rises steeply for small x and the PD curve decreases fast with increasing x, as shown by the black curves in the figures. The black dashed lines in the figures correspond to a model with no discriminatory power. In this case, the cumulative percentage of defaults increases proportionally with the number of counterparties, and the CAP resembles the line y = x. The CAP is an almost linear function, and its derivative is independent of x. The corresponding PD curve is flat and equals p̄ for all counterparties, as shown in Figure 2.
Since we investigate the mapping of credit scores in relation to their discriminatory power, we also introduce the accuracy ratio for credit scoring (AR_CS). This is defined as (Tasche 2005, 2010):

AR_CS = (A_CS − 1/2) / ((1 − p̄)/2),   (2.6)

where A_CS is the area under the CAP:

A_CS = ∫₀¹ y dx,   (2.7)

with y(0) = 0 and y(1) = 1. The AR_CS varies between 0 for nondiscriminatory scoring models and 1 for perfectly discriminating scoring models.
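To make the construction above concrete, the following sketch builds the empirical CAP and the accuracy ratio of (2.6)–(2.7) from a vector of scores and default indicators. It is a minimal illustration, not the paper's own code; the function names are ours, and the trapezoidal integration and the normalization (1 − p̄)/2 are the standard choices.

```python
import numpy as np

def cap_curve(scores, defaults):
    """Empirical CAP: sort counterparties from worst (low) to best (high)
    score, then accumulate the share of counterparties (x) and the share
    of defaults (y)."""
    order = np.argsort(scores)                 # increasing score = decreasing risk
    d = np.asarray(defaults, dtype=float)[order]
    x = np.arange(1, d.size + 1) / d.size      # cumulative % of counterparties
    y = np.cumsum(d) / d.sum()                 # cumulative % of defaults
    return x, y

def accuracy_ratio(scores, defaults):
    """AR_CS = (A_CS - 1/2) / ((1 - p_bar)/2), with A_CS the trapezoidal
    area under the CAP and (1 - p_bar)/2 the excess area of a perfectly
    discriminating model."""
    x, y = cap_curve(scores, defaults)
    p_bar = float(np.mean(defaults))
    xs = np.concatenate(([0.0], x))            # the CAP starts in (0, 0)
    ys = np.concatenate(([0.0], y))
    a_cs = float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))
    return (a_cs - 0.5) / ((1.0 - p_bar) / 2.0)
```

For a scoring model that places all defaults at the lowest scores, this returns an accuracy ratio of 1; for randomly assigned scores it returns a value near 0.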
After mapping scores to ratings, a counterparty has a rating R when their credit score is larger than s_{R−1} and smaller than or equal to s_R. In this paper, the rating R is an integer, which increases with decreasing credit risk. We define x_R as the cumulative distribution of counterparties with a rating worse than R,

x_R = F(s_R),   (2.8)

and y_R as the cumulative distribution of defaults with a rating worse than R,

y_R = F_D(s_R).   (2.9)
We derive the PD per rating grade from the CAP in a similar way to (2.3) for credit scores. Bayes’s law gives the following equation for the PD, conditional on rating R:

PD_R = P(R | default) · p̄ / P(R).   (2.10)

Using the definitions P(R | default) = y_R − y_{R−1} and P(R) = x_R − x_{R−1} gives

PD_R = p̄ (y_R − y_{R−1}) / (x_R − x_{R−1}).   (2.11)

Equation (2.11) expresses the PD in CAP coordinates x_R and y_R. It is a discrete version of (2.3), since credit scores vary continuously in the interval [s_min, s_max], while ratings only have the discrete values R = 1, 2, …, n_R, where n_R is the total number of rating grades.
The mapping of credit scores to ratings reduces information, because ratings have a lower granularity than credit scores. To investigate this information loss, we calculate the area under the CAP in terms of ratings as

A_R = Σ_{R=1}^{n_R} ½ (y_R + y_{R−1})(x_R − x_{R−1}),   (2.12)

with x_0 = y_0 = 0 and x_{n_R} = y_{n_R} = 1. We use this result to calculate an accuracy ratio AR_R for ratings, similar to AR_CS in (2.6), and define the information loss as

IL = (A_CS − A_R) / (A_CS − ½),   (2.13)

with ½ as the area under the CAP for a nondiscriminatory scoring system. The IL compares the information loss A_CS − A_R with the maximum possible information loss A_CS − ½. This gives two extreme cases.
- IL = 1 when all credit scores are mapped to a single rating grade, so that A_R = ½ and all discriminatory power is lost.
- IL = 0 when the mapping preserves the full discriminatory power of the credit scores, ie, A_R = A_CS.
When the credit scoring model has high discriminatory power, the credit scores are mapped to a high number of ratings and the information loss converges to zero.
We will use the information loss in testing the mapping algorithm.
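As a small illustration of (2.12)–(2.13), the sketch below computes the rating-level CAP area by the trapezoidal rule over the grade boundary points and compares it with the score-level area; the function name is ours, and the nondiscriminatory area is taken as ½.

```python
import numpy as np

def information_loss(x_pts, y_pts, a_cs):
    """IL = (A_CS - A_R) / (A_CS - 1/2): the share of discriminatory power
    lost when the score-level CAP (area a_cs) is replaced by the piecewise
    linear rating-level CAP through the grade boundaries (x_R, y_R).
    x_pts and y_pts must end in the point (1, 1)."""
    xs = np.concatenate(([0.0], x_pts))   # prepend (x_0, y_0) = (0, 0)
    ys = np.concatenate(([0.0], y_pts))
    a_r = float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))
    return (a_cs - a_r) / (a_cs - 0.5)
```

Mapping everything to one grade gives boundary points (1, 1) only, a rating-level area of ½ and hence IL = 1; a partition whose trapezoidal area equals a_cs gives IL = 0.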
3 Calibration requirements of rating systems
The draft regulatory technical standards of the EBA state the following (European Banking Association 2016):
- (1)
the number of rating grades and pools is adequate to ensure a meaningful risk differentiation; and
- (2)
the concentration of numbers of exposures or obligors is not excessive in any grade or pool, unless supported by convincing empirical evidence of homogeneity of risk of those exposures or obligors.
Based on this, we introduce two calibration requirements.
Requirement 1: monotonicity and stability of the PD
This requirement means that the PD increases monotonically with increasing credit risk, and that the mapping of scores to ratings should be stable. We assume a TTC scoring model. To avoid unstable ratings and enforce monotonicity, the default risk in rating grade R should be significantly larger than the default risk in the adjacent rating grade R + 1. This means PD_R > PD_{R+1} and rejection of the following null hypothesis:

H_0: PD_R = PD_{R+1}.
| | Nondefaults | Defaults | Total |
|---|---|---|---|
| Rating class R | N_R − N_{D,R} | N_{D,R} | N_R |
| Rating class R + 1 | N_{R+1} − N_{D,R+1} | N_{D,R+1} | N_{R+1} |
| Total | N_R + N_{R+1} − N_{D,R} − N_{D,R+1} | N_{D,R} + N_{D,R+1} | N_R + N_{R+1} |
We use a contingency table (see Table 2) to test the null hypothesis. Table 2 shows the number of counterparties (N_R), the number of nondefaulting counterparties and the number of defaulting counterparties (N_{D,R}) per rating grade R. The table presents similar quantities for rating R + 1. The following test statistic is derived from Table 2 (Fleiss et al 2003; Sheskin 2000):
T² = (N_R + N_{R+1}) (N_{D,R} N_{R+1} − N_{D,R+1} N_R)² / [N_R N_{R+1} (N_{D,R} + N_{D,R+1}) (N_R + N_{R+1} − N_{D,R} − N_{D,R+1})].   (3.1)
This statistic is chi-squared distributed with one degree of freedom: T² ~ χ²(1). T itself is standard normally distributed. Defining a significance level α, the null hypothesis is rejected at confidence level 1 − α when T > Φ⁻¹(1 − α), where Φ⁻¹ represents the inverse of the cumulative standard normal distribution. We call this the adjacent rating grade (ARG) test.
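The ARG test can be sketched directly from the counts in Table 2. The helper below computes the standard 2×2 chi-square statistic without continuity correction and returns T = √χ², which is approximately standard normal under H_0, so H_0 is rejected at the 95% level when T > Φ⁻¹(0.95) ≈ 1.645. The function name and this exact algebraic arrangement are ours.

```python
from math import sqrt

def arg_test(n_r, d_r, n_r1, d_r1):
    """Adjacent-rating-grade (ARG) statistic from the 2x2 table of
    (non)defaults in grades R and R+1: T = sqrt(chi-square), approximately
    standard normal under H0: PD_R = PD_{R+1}."""
    n = n_r + n_r1                                  # total counterparties
    d = d_r + d_r1                                  # total defaults
    # chi-square for a 2x2 contingency table, no continuity correction
    num = n * (d_r * (n_r1 - d_r1) - d_r1 * (n_r - d_r)) ** 2
    den = n_r * n_r1 * d * (n - d)
    return sqrt(num / den)
```

Equal default rates in the two grades give T = 0; for example, 20 defaults out of 100 in grade R against 10 out of 100 in grade R + 1 gives T ≈ 1.98, a significant difference at the 95% level.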
The ARG test can be translated into CAP terminology. Using N_R = N (x_R − x_{R−1}) and N_{D,R} = N p̄ (y_R − y_{R−1}) in (3.1), the test statistic becomes a function of Δx = x_{R+1} − x_R, given x_{R−1} and x_R:

(3.2)
The probabilities PD_R and PD_{R+1} are calculated by (2.11), and PD is a weighted average of PD_R and PD_{R+1}:

PD = (N_R PD_R + N_{R+1} PD_{R+1}) / (N_R + N_{R+1}).   (3.3)
Defining the limit T_min = Φ⁻¹(1 − α), the null hypothesis is rejected when

T ≥ T_min.   (3.4)
Since α is small, the limit T_min is positive, and (3.4) means that the test statistic should also be positive, which is the case when PD_R > PD_{R+1}. Therefore, (3.4) also implies monotonicity of the PD scale. Appendix A online shows that the test statistic can be approximated by a parabolic function of Δx,

(3.5)

with the parameter k based on a closed-form expression G(x) for the CAP,

(3.6)

in which G′(x) and G″(x) denote the first and second derivatives of the function G(x), respectively. Figure 3 shows T versus Δx. The test statistic is a monotonically increasing function of Δx on the interval (x_R, 1]: when x_{R+1} increases, starting from x_R, the test statistic starts at zero and increases as well. The approximation in (3.5) shows that T = T_min when Δx reaches a critical value Δx_crit:

(3.7)

Therefore, the requirement of stable ratings means that Δx should exceed this critical value: Δx ≥ Δx_crit.
Requirement 2: no excessive concentrations in a rating grade
The EBA guidelines prescribe no excessive concentration in a rating grade (European Banking Association 2016). A rating system does not make sense when most counterparties are assigned to a single rating, as this leads to a loss in discriminatory power, as discussed in Section 2. A proper rating system distinguishes sufficiently many different risk levels. Specifically, when the counterparties in the portfolio are homogeneously distributed over the rating grades, we have

x_R − x_{R−1} = 1/n_R.   (3.8)
This result is also derived using (A.7) in Appendix A online. Equation (3.9) aligns with intuition: when the limit T_min in the ARG test increases, larger intervals Δx are required to obtain significantly different default rates per grade. On the other hand, k is large for a highly discriminatory scoring system, resulting in small intervals.
Equation (3.9) is based on the assumption that Δx is constant. In practice, Δx_crit varies with x_R and Δx, especially for highly discriminatory scoring models. In these cases, defaults only occur at the worst credit scores, which correspond to small x-values. In the region where x approaches 1, the slope of the CAP – and therefore the default rate – is almost zero, and one cannot define rating grades with significantly different default rates. This may cause high concentrations in low-risk ratings. Since the focus is on stable ratings and the monotonicity of the PD curve, we let the first requirement prevail over the second. This also aligns with the EBA guidelines (European Banking Association 2016, Article 36.1b).
4 Mapping scores to ratings by “riding the curve”
This section describes how to find a score interval for every rating by using the CAP, such that the requirements of the previous section are fulfilled. Equation (3.7) allows a mapping of credit scores to ratings by an iterative procedure, which starts at x = 0 and proceeds stepwise toward x = 1, depending on how fast the CAP increases with increasing x. Therefore, we call this algorithm “riding the curve”. Setting Δx equal to Δx_crit in every iteration step results in an interval (x_{R−1}, x_R] for every rating grade. Since every x corresponds uniquely to a score s by (2.8), we can translate this interval into a score interval (s_{R−1}, s_R] per rating grade. However, we need to consider three issues.
First, the procedure requires a function for the CAP to calculate Δx_crit. Several authors have introduced functions for the CAP (van der Burgt 2008; Tasche 2010). If we know the score distributions of the counterparties and defaulting counterparties, we can derive a function from these distributions (Tasche 2010). Since these distributions are unknown in general, we use a regression approach by fitting the CAP to a multi-exponential function G(x):

y = G(x) + ε, with G(x) = Σ_{j=1}^{m} a_j (1 − e^{−b_j x}),   (4.1)

with Σ_{j=1}^{m} a_j (1 − e^{−b_j}) = 1 and a_j ≥ 0 for all j. Due to the first constraint, the number of parameters is 2m − 1. The second constraint guarantees that G(x) cannot be larger than 1, as shown in Appendix B online. Equation (4.1) is based on the following assumptions.
- The error terms ε_i have means equal to zero and are independent: E[ε_i] = 0 and E[ε_i ε_j] = 0 for i ≠ j. Dependency among the error terms does not impact the parameters themselves, but it may increase the standard errors in the parameters.
- The probability PD(s) is a multi-exponential function, as shown in Appendix B online.
The error terms are kept small by selecting enough exponential functions (m) in G(x). Based on experience, m = 2 already gives a good fit to most CAPs. We use G(x) in (3.6) and (3.7) to find Δx_crit. As shown in Appendix A online, (3.7) is based on second-order Taylor approximations of G(x). Therefore, we also use (3.2) for testing T ≥ T_min.
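A fit of this kind can be sketched with a standard least-squares routine. The bi-exponential candidate below (m = 2) enforces G(1) = 1 by solving for the second weight, leaving 2m − 1 = 3 free parameters; the parameterization is our assumption, since the exact form of (4.1) is derived in the paper's online Appendix B.

```python
import numpy as np
from scipy.optimize import curve_fit

def cap_biexp(x, a1, b1, b2):
    """Bi-exponential CAP candidate (m = 2). The constraint G(1) = 1 is
    enforced by solving for the second weight a2, so only 2m - 1 = 3 free
    parameters remain."""
    a2 = (1.0 - a1 * (1.0 - np.exp(-b1))) / (1.0 - np.exp(-b2))
    return a1 * (1.0 - np.exp(-b1 * x)) + a2 * (1.0 - np.exp(-b2 * x))

# given observed CAP coordinates x_obs, y_obs constructed as in Section 2:
# popt, _ = curve_fit(cap_biexp, x_obs, y_obs, p0=[0.5, 5.0, 1.0])
```

Because the constraint is built into the function, any fitted parameter vector automatically satisfies G(1) = 1, and the fitted G(x) can be differentiated analytically for use in (3.6) and (3.7).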
A second issue is the determination of the first interval (0, x_1]. We set x_0 = 0, but (3.7) cannot be applied here to find x_1. The setting of the interval (0, x_1] is also critical: when this interval is small, the next interval Δx, as calculated by (3.7), will be large. This may give rise to periodic partitioning. Appendix A online shows that the parabolic behavior of the test statistic around x_1 may result in alternating small and large intervals Δx. To avoid this, we set x_0 = 0 and apply the equidistant interval of (3.9) for x_1:

(4.2)

with n_R calculated by (3.9).
The third aspect is the dependence on the sample default probability p̄. When the sample default probability is 0.75% while the “true” population default probability is 1.25%, the parameter k is “too small”, which results in a widening of the intervals Δx according to (3.7) and (3.9). The rating grades still represent significantly different default risks, but the default probability per rating is underestimated by (2.11). This reverses when the sample default rate is higher than the population default rate: the default probabilities are overestimated and the intervals are too small, resulting in rating grades that are not significantly different in default risk. The “true” population default rate is not known, but we can construct a confidence interval that contains the population default rate at a 95% confidence level, assuming independence in default behavior. When N is large enough, the population default rate will not differ substantially from p̄. This makes the calibration algorithm eligible for retail portfolios, which consist of a high number of counterparties.
Based on the above considerations, the algorithm for mapping and calibration proceeds as follows.
- (1)
Construct the CAP curve as described in Section 2 and fit the CAP to a closed-form function .
- (2)
Find the first interval (0, x_1] for rating R = 1, which is the worst credit rating grade, by setting x_0 = 0 and calculating x_1 by (4.2).
- (3)
Given the interval (x_{R−1}, x_R], set x_{R+1} = x_R + Δx_crit with Δx_crit as calculated by (3.7). If necessary, increase x_{R+1} until T ≥ T_min. If x_{R+1} is larger than 1, set x_{R+1} = 1.
- (4)
Step (3) is repeated until x_R = 1. Then, the total number of rating grades equals n_R. Because x_{n_R} is capped at 1 in Step (3), the highest rating grade and the second-to-highest rating grade may not exhibit a significant difference in default risk. In that case, these rating grades are merged into one rating grade.
- (5)
Given x_R, find the score s_R for every rating R by inverting (2.8), so that every interval (x_{R−1}, x_R] is translated into a score interval (s_{R−1}, s_R]. The PD per rating grade is PD_R, as given by (2.11).
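The five steps can be sketched as follows. Instead of the closed-form critical interval of (3.7), this illustration simply widens each candidate grade in small steps until the ARG statistic against the previous grade reaches T_min, which is the same stopping rule in brute-force form; the first interval, the search step and all names are our choices, and the merge of the two best grades in Step (4) is omitted for brevity.

```python
import numpy as np

def ride_the_curve(G, n, p_bar, t_min=2.0, x1=0.05, step=1e-3):
    """Brute-force sketch of 'riding the curve'. G: fitted CAP function,
    n: number of counterparties, p_bar: unconditional PD. Each grade is
    widened step by step until the ARG statistic against the previous
    grade reaches t_min."""
    def t_stat(x0, xm, x2):
        # counts implied by the CAP: N_R = n*dx, N_{D,R} = n*p_bar*dG
        n1, n2 = n * (xm - x0), n * (x2 - xm)
        d1 = n * p_bar * (G(xm) - G(x0))
        d2 = n * p_bar * (G(x2) - G(xm))
        num = (n1 + n2) * (d1 * (n2 - d2) - d2 * (n1 - d1)) ** 2
        den = n1 * n2 * (d1 + d2) * (n1 + n2 - d1 - d2)
        return np.sqrt(num / den) if den > 0 else 0.0
    bounds = [0.0, x1]                      # the first interval is set exogenously
    while bounds[-1] < 1.0:
        x_new = bounds[-1] + step
        while x_new < 1.0 and t_stat(bounds[-2], bounds[-1], x_new) < t_min:
            x_new += step
        bounds.append(min(x_new, 1.0))      # cap the last grade at x = 1
    # PD per grade from the discrete CAP slope, as in (2.11)
    pds = [p_bar * (G(b) - G(a)) / (b - a) for a, b in zip(bounds, bounds[1:])]
    return bounds, pds
```

For a concave CAP the resulting PD scale is automatically monotonically decreasing in the rating, because the average slope of a concave function falls over successive intervals.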
5 Testing the algorithm by simulation
We investigated whether the algorithm maps scores to ratings with the two requirements fulfilled, and how many rating grades (n_R) result. We set the significance level at α = 5%, giving Φ⁻¹(0.95) = 1.64 in requirement 1. Using a margin, we rounded this value up to 2, giving T_min = 2 in (3.4). We use two credit scoring models to test the algorithm.
- (1)
In scoring model 1, the credit scores vary uniformly between the minimum score s_min and the maximum score s_max, and the conditional PD depends exponentially on the score s. As shown in Appendix B online, this gives the following conditional PD:

(5.1)

The exponential decay parameter is positive, ie, a high score means low credit risk. Under these conditions, we can derive a closed-form equation for the CAP that corresponds to m = 1 in (4.1).
- (2)
Scoring model 2 assumes a logit function for the conditional PD:

(5.2)

The credit scores are normally distributed:

(5.3)

Similar to scoring model 1, a high score means low credit risk. We cannot derive a closed-form equation for the CAP under these circumstances, but a bi-exponential form, corresponding to m = 2 in (4.1), results in a good fit.
In both cases, parameters are changed to vary the AR_CS. The default/nondefault state of a counterparty with score s is simulated by a Bernoulli variable, which equals 1 (default) with probability PD(s) and 0 (nondefault) with probability 1 − PD(s). After running the simulations, the CAP of scoring model 1 was regressed to (4.1) with m = 1, and the CAP of scoring model 2 was regressed to (4.1) with m = 2. The regression quality is high: in all cases, the adjusted R² is larger than 99.8%.¹

¹ The adjusted R² corrects for the number of fitted parameters, 2m − 1.
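The simulation step can be sketched as follows for scoring model 1, under illustrative assumptions: uniform scores on [0, 1] and an exponentially decaying conditional PD scaled so that it integrates to an unconditional PD of 1%. The decay constant and the score range are our choices, not the paper's parameter values.

```python
import numpy as np

rng = np.random.default_rng(42)

# illustrative setup for scoring model 1: uniform scores on [0, 1] and an
# exponentially decaying conditional PD, scaled so that the unconditional
# PD equals p_bar (the decay constant gamma is our assumption)
n, p_bar, gamma = 100_000, 0.01, 5.0
s = rng.uniform(0.0, 1.0, n)
pd_s = p_bar * gamma * np.exp(-gamma * s) / (1.0 - np.exp(-gamma))
defaults = rng.binomial(1, pd_s)   # Bernoulli default indicator per counterparty
```

The scaling factor gamma/(1 − e^{−gamma}) ensures that pd_s integrates to p_bar over [0, 1], so the realized default rate fluctuates around 1%.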
Figures 4–6 present the mapping and calibration results of scoring model 1 with a low, an intermediate and a high accuracy ratio, respectively. The number of counterparties N and the unconditional PD of 1% are the same in all three cases.
Table 3: Mapping and calibration results for scoring model 1. Panel (a): low accuracy ratio; panel (b): intermediate accuracy ratio; panel (c): high accuracy ratio.

(a)

| R | x_{R−1} | x_R | x_{R−1} + Δx_crit | s_{R−1} (%) | s_R (%) | CPs (%) | PD (%) | p-value (%) |
|---|---|---|---|---|---|---|---|---|
1 | 0.0000 | 0.1531 | 0.1531 | 0 | 15.30 | 15.31 | 1.61 | — |
2 | 0.1531 | 0.3457 | 0.3457 | 15.30 | 34.63 | 19.26 | 1.26 | 0.53 |
3 | 0.3457 | 0.5293 | 0.5293 | 34.63 | 53.03 | 18.36 | 0.92 | 0.17 |
4 | 0.5293 | 1.0000 | 0.7488 | 53.03 | 100.00 | 47.07 | 0.75 | 2.96 |
(b)

| R | x_{R−1} | x_R | x_{R−1} + Δx_crit | s_{R−1} (%) | s_R (%) | CPs (%) | PD (%) | p-value (%) |
|---|---|---|---|---|---|---|---|---|
1 | 0.0000 | 0.0478 | 0.0478 | 0 | 4.74 | 4.78 | 3.77 | — |
2 | 0.0478 | 0.1096 | 0.1096 | 4.74 | 10.93 | 6.18 | 3.02 | 3.19 |
3 | 0.1096 | 0.1731 | 0.1688 | 10.93 | 17.37 | 6.35 | 2.44 | 4.55 |
4 | 0.1731 | 0.2462 | 0.2421 | 17.37 | 24.71 | 7.31 | 1.94 | 4.54 |
5 | 0.2462 | 0.3215 | 0.3215 | 24.71 | 32.03 | 7.53 | 1.34 | 0.40 |
6 | 0.3215 | 0.4118 | 0.4118 | 32.03 | 41.05 | 9.03 | 0.97 | 2.68 |
7 | 0.4118 | 0.5127 | 0.5127 | 41.05 | 51.05 | 10.09 | 0.70 | 3.99 |
8 | 0.5127 | 0.6352 | 0.6352 | 51.05 | 63.29 | 12.26 | 0.38 | 0.10 |
9 | 0.6352 | 0.7854 | 0.7854 | 63.29 | 78.34 | 15.01 | 0.21 | 0.92 |
10 | 0.7854 | 1.0000 | 0.9848 | 78.34 | 100.00 | 21.46 | 0.07 | 0.03 |
(c)

| R | x_{R−1} | x_R | x_{R−1} + Δx_crit | s_{R−1} (%) | s_R (%) | CPs (%) | PD (%) | p-value (%) |
|---|---|---|---|---|---|---|---|---|
1 | 0.0000 | 0.0100 | 0.0100 | 0 | 1.01 | 1.00 | 18.33 | — |
2 | 0.0100 | 0.0230 | 0.0230 | 1.01 | 2.34 | 1.30 | 14.92 | 2.87 |
3 | 0.0230 | 0.0355 | 0.0355 | 2.34 | 3.68 | 1.25 | 10.26 | 0.04 |
4 | 0.0355 | 0.0573 | 0.0507 | 3.68 | 5.93 | 2.17 | 8.23 | 4.55 |
5 | 0.0573 | 0.0701 | 0.0701 | 5.93 | 7.22 | 1.29 | 5.83 | 0.87 |
6 | 0.0701 | 0.0929 | 0.0929 | 7.22 | 9.43 | 2.28 | 4.16 | 2.50 |
7 | 0.0929 | 0.1133 | 0.1133 | 9.43 | 11.50 | 2.04 | 2.41 | 0.13 |
8 | 0.1133 | 0.1503 | 0.1420 | 11.50 | 15.22 | 3.70 | 1.65 | 4.55 |
9 | 0.1503 | 0.1827 | 0.1827 | 15.22 | 18.43 | 3.24 | 0.68 | 0.02 |
10 | 0.1827 | 0.2366 | 0.2351 | 18.43 | 23.86 | 5.40 | 0.37 | 4.55 |
11 | 0.2366 | 0.3222 | 0.3222 | 23.86 | 32.25 | 8.55 | 0.09 | 0.04 |
12 | 0.3222 | 1.0000 | 0.5599 | 32.25 | 100.00 | 67.78 | 0.00 | 0.00 |
The dashed lines split the CAP into bands that correspond to a rating: a small (large) bandwidth means low (high) concentration in the rating grade. Each interval (x_{R−1}, x_R] corresponds to a score interval (s_{R−1}, s_R], which is tabulated in Table 3. The table demonstrates that the default rate increases monotonically with the rating. Table 3 also presents the critical value x_{R−1} + Δx_crit as calculated by (3.7). In most cases, x_R equals this critical value; x_R exceeds the critical value in a few cases due to the following reasons.
- (1)
The critical value is based on an approximation of the test statistic T.
- (2)
In the highest rating grade, x_{n_R} is capped at 1 due to Step (4) in the algorithm: when x_{R+1} is larger than 1, the rating grade R + 1 is merged with rating R into a single rating grade.
The last column presents the p-value of the ARG tests. Since the p-values are all below 5%, the rating grades differ significantly in default rate.
Figure 4 shows that the algorithm maps credit scores to only four ratings for a credit scoring system with low discriminatory power. The PDs in each rating grade are small and close to the unconditional PD of 1%. Rating grade 4 exhibits a large concentration of 47%.
Figure 5 presents the ratings and corresponding PDs for a credit scoring system with considerable discriminatory power. The algorithm produces ten ratings and there is a substantial difference between the PDs per grade. The CAP in Figure 5 also shows no excessive concentrations: the largest concentration is 21%, occurring in the rating grade with the lowest risk.
Figure 6 demonstrates the results of the algorithm for a credit scoring system with an AR_CS close to 1, ie, extremely high discriminatory power. As expected, the algorithm only distinguishes significantly different rating grades at small x-values. The algorithm produces narrow bands in this range, and the rating corresponding to the lowest default risk has an excessive concentration of 68%. As explained in Section 2, the mapping of a highly discriminatory credit scoring model leads to a trade-off between avoiding excessive concentrations and identifying significantly different rating grades.
An important question is how many rating grades result from the algorithm. Equation (3.6) shows that k relates to the concave shape of the CAP, the number of counterparties and the unconditional PD. When the CAP is more concave, k is higher, and more significantly different rating grades result. Figures 4–6 also support this. To identify the relation between n_R and the AR_CS, we examined the following cases.
First, we use a theoretical case in which the CAP equals (4.1) with m = 1, giving the following G(x):

G(x) = (1 − e^{−bx}) / (1 − e^{−b}).   (5.4)

We calculate the number of rating grades (n_R) theoretically by iterating

x_{R+1} = x_R + Δx_crit   (5.5)

until x_R ≥ 1, for several values of b. The accuracy ratio follows from (2.6) with the closed-form area under the CAP:

A_CS = 1/(1 − e^{−b}) − 1/b.   (5.6)
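For the single-exponential CAP, G(x) = (1 − e^{−bx})/(1 − e^{−b}) as in van der Burgt (2008), the area under the CAP integrates in closed form to 1/(1 − e^{−b}) − 1/b, so the accuracy ratio follows directly from (2.6). The helper below assumes this m = 1 form; the function name is ours.

```python
import numpy as np

def ar_from_b(b, p_bar):
    """Accuracy ratio of the single-exponential CAP
    G(x) = (1 - exp(-b*x)) / (1 - exp(-b)):
    A = integral of G over [0, 1] = 1/(1 - exp(-b)) - 1/b,
    AR = (A - 1/2) / ((1 - p_bar)/2), as in (2.6)."""
    area = 1.0 / (1.0 - np.exp(-b)) - 1.0 / b
    return (area - 0.5) / ((1.0 - p_bar) / 2.0)
```

As b approaches 0 the CAP approaches the diagonal y = x and the accuracy ratio vanishes; the accuracy ratio increases monotonically with the concavity parameter b.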
Second, we investigated the relation between n_R and the AR_CS by simulating scoring model 1 and scoring model 2 for 100 000 counterparties with different model parameters, such that the AR_CS varied between 7% and 96%, while the unconditional PD equals 1%.
Figure 7 presents the number of rating grades versus the AR_CS for scoring model 1 and scoring model 2. As expected, the figure shows that the number of rating grades increases with the accuracy ratio. The dashed line in Figure 7 presents n_R, derived by iterating (5.5), versus the AR_CS as calculated by (5.6). The observations of scoring model 1 and scoring model 2 resemble the theoretically calculated dashed line.
A direct relation between n_R and the AR_CS follows from the relation between the AR_CS and the concave shape of the CAP: the second derivative G″(x) is more negative when the AR_CS is high, so the parameter k increases with the AR_CS. Substituting this dependency in (3.8) and (3.9) gives

(5.7)

A regression shows that the fitted exponents for scoring model 1, scoring model 2 and the theoretical curve are close to the predicted values. This means that all graphs closely resemble (5.7), although Figure 6 shows that the mapping algorithm does not always result in a homogeneous distribution over the ratings. Therefore, (5.7) is only a rule of thumb to predict how many rating grades can be distinguished for a given accuracy ratio.
We investigated (5.7) further by performing simulations of both scoring models with different unconditional PDs. Figure 8 presents n_R versus p̄. A regression shows that the fitted exponents for both scoring models are close to 0.33, as predicted by (5.7). We conclude that this equation explains to a large extent the power law behavior of n_R as a function of the AR_CS and p̄.
We also investigated how much discriminatory power is lost by mapping scores to ratings, using the information loss measure defined in (2.13). Figure 9 reveals that the information loss decreases exponentially with an increasing AR_CS. Therefore, the information loss for both scoring models is regressed against the following equation:

IL = a e^{−b·AR_CS}.   (5.8)

Equation (5.8) has no intercept, since the information loss converges to zero when the discriminatory power is high, as explained in Section 2.
| | Scoring model 1 | | Scoring model 2 | |
|---|---|---|---|---|
| | Coeff. | p-value (%) | Coeff. | p-value (%) |
| a | 0.54 | 0.00 | 0.81 | 0.00 |
| b | 0.65 | 0.00 | 1.09 | 0.01 |
| R² | 92.1% | | 80.5% | |
Table 4 presents the regression coefficients and the R². The coefficients of both scoring models agree within a 95% confidence interval. Therefore, we conclude that the information loss as a function of the AR_CS is similar for both scoring models, although the R² shows that the quality of the fit for scoring model 1 is considerably better than for scoring model 2. Further, the information loss is low for credit scoring models with moderate to high discriminatory power: when the AR_CS is higher than 60%, the information loss is lower than 1%.
Since the algorithm depends on the population default rate and discriminatory power, we conclude with two considerations. First, both quantities are inferred from the data. Obviously, data flaws, such as assigning the information of a defaulted counterparty to a nondefault status, directly impact the estimation of p̄, and therefore of k, but they also lead to an underestimated accuracy ratio (Stein 2016; Russell et al 2012). This causes a widening of the intervals Δx. The rating grades still represent significantly different default risks, but (5.7) shows that the possible number of rating grades is underestimated, and the information loss increases according to (5.8) and Figure 9. Stein (2016) suggests procedures for estimating the number of mislabeled defaults and correcting the power statistics for these errors.
Second, the population default rate and the accuracy ratio are related via the default definition. For example, when the default definition changes from payments ninety days in arrears to payments ten days in arrears, relatively solvent counterparties also count as defaults. The population default rate then increases, but the accuracy ratio may decrease. Equation (5.7) shows that the accuracy ratio has more impact than the population default rate, so the number of rating grades will be lower under a stricter default definition. The decrease in the accuracy ratio also leads to an increase in information loss according to (5.8) and Figure 9. Therefore, successful application of the algorithm also depends on the data quality and a proper default definition.
6 Conclusion
This paper presents a method for mapping credit scores to ratings and calibrating these ratings by assigning a PD to each rating grade. The algorithm is based on two requirements: (1) monotonicity and stability of the PD per rating grade, enforced by requiring significantly different default risks between adjacent rating grades; and (2) avoidance of excessive concentrations. Our algorithm uses stepwise partitioning of the CAP.
We tested the algorithm by simulating two scoring models. The algorithm maps credit scores to significantly different ratings. It also shows that rating calibration is a trade-off between optimizing discriminatory power, avoiding concentrations in rating grades, and maintaining calibration quality in terms of significantly different rating grades. Highly discriminatory credit scoring models lead to an excessive concentration in the low-risk rating grade. High concentrations can be avoided by accepting a lower significance level when defining different ratings, or by designing credit scoring models with moderate rather than high discriminatory power. The algorithm maps credit scores to ratings, but a small amount of information is lost, as ratings are less granular than credit scores.
The simulations of the two different scoring models reveal that the number of rating grades resulting from the algorithm exhibits power-law behavior as a function of the accuracy ratio of the credit scores and the unconditional PD. The information loss decreases with an increasing accuracy ratio of the credit scoring model. This power-law behavior and the information loss are similar for both scoring models, although the models are based on different score distributions and conditional PDs. Our overall conclusion is that the algorithm is most sensitive to the discriminatory power of the scoring model and less dependent on the underlying score distribution. Further, the algorithm's outcomes are sensitive to data quality, by the "garbage in, garbage out" principle, as well as to the default definition.
Declaration of interest
The views expressed in this paper are those of the author and do not necessarily reflect those of Nationale Nederlanden Group. The author would like to thank the anonymous referees for their suggestions.
References
- Araten, M. (2007). Development and validation of key estimates for capital models. In The Basel Handbook, Ong, M. (ed), 2nd edn. Risk Books, London.
- Basel Committee on Banking Supervision (2004). International convergence of capital measurement and capital standards (a revised framework). Report, Bank for International Settlements.
- Basel Committee on Banking Supervision (2016). Reducing variation in credit risk-weighted assets – constraints on the use of internal model approaches. Consultative Document, Bank for International Settlements.
- Engelmann, B., Hayden, E., and Tasche, D. (2003). Measuring the discriminative power of rating systems. Discussion Paper Series 2: Banking and Financial Studies 2003.01, Deutsche Bundesbank.
- European Banking Authority (2016). Final Draft Regulatory Technical Standards on the specification of the assessment methodology for competent authorities regarding compliance of an institution with the requirements to use the IRB approach in accordance with Articles 144(2), 173(3) and 180(3)(b) of Regulation (EU) No 575/2013. EBA/RTS/2016/03, July 21, EBA.
- Falkenstein, E., Boral, A., and Carty, L. (2000). RiskCalc for private companies: Moody's default model. Global Credit Research, Moody's Investors Service.
- Fleiss, J. L., Levin, B., and Paik, M. C. (2003). Statistical Methods for Rates and Proportions, 2nd edn. Wiley (https://doi.org/10.1002/0471445428).
- Hayden, E., and Porath, D. (2006). Statistical methods to develop rating models. In The Basel II Risk Parameters: Estimation, Validation and Stress Testing – With Applications to Loan Risk Management, Engelmann, B., and Rauhmeier, R. (eds), 2nd edn. Springer (https://doi.org/10.1007/3-540-33087-9_1).
- Irwin, R. J., and Callaghan, K. S. N. (2006). Ejection decisions by strike pilots: an extreme value interpretation. Aviation, Space, and Environmental Medicine 77, 62–64.
- Irwin, R. J., and Irwin, T. C. (2012). Appraising credit ratings: does the CAP fit better than the ROC? Working Paper WP/12/122, International Monetary Fund.
- Russell, H., Tang, Q. K., and Dwyer, D. W. (2012). The effect of imperfect data on default prediction validation tests. The Journal of Risk Model Validation 6(1), 1–20 (https://doi.org/10.21314/JRMV.2012.085).
- Sheskin, D. J. (2000). Handbook of Parametric and Non-Parametric Statistical Procedures, 2nd edn. Chapman & Hall/CRC.
- Stein, R. M. (2007). Benchmarking default prediction models: pitfalls and remedies in model validation. The Journal of Risk Model Validation 1(1), 77–113 (https://doi.org/10.21314/JRMV.2007.002).
- Stein, R. M. (2016). Evaluating discrete choice prediction models when the evaluation data is corrupted: analytic results and bias corrections for the area under the ROC. Data Mining and Knowledge Discovery 30(4), 763–796 (https://doi.org/10.1007/s10618-015-0437-7).
- Tasche, D. (2005). Rating and probability of default validation. Working Paper 14, Studies on the Validation of Internal Rating Systems, Basel Committee on Banking Supervision, Bank for International Settlements.
- Tasche, D. (2010). Estimating discriminatory power and PD curves when the number of defaults is small. Working Paper, Lloyds Banking Group.
- van der Burgt, M. J. (2008). Calibrating low-default portfolios, using the cumulative accuracy profile. The Journal of Risk Model Validation 1(4), 17–33 (https://doi.org/10.21314/JRMV.2008.016).