Journal of Risk Model Validation

Consensus information and consensus rating: a simulation study on rating aggregation

Christoph Lehmann and Daniel Tillich

  • The aggregation of single ratings in order to obtain higher precision appears questionable, at least in the case of the logit model.
  • Only the consensus rating provides an advantage with respect to precision.
  • To date, there is no preferable strategy for obtaining a more precise rating by aggregating single ratings.

ABSTRACT

The aggregation of different single ratings into a so-called consensus rating, in order to obtain a more precise estimate of a debtor's default probability, is an idea that is hardly discussed in the literature. In their 2013 paper "Deriving consensus ratings of the big three rating agencies", Grün et al proposed a method for rating aggregation, whereby the term "consensus rating" was introduced. To sharpen the whole issue of rating aggregation from a theoretical perspective, in their 2016 paper "Consensus information and consensus rating: a note on methodological problems of rating aggregation", Lehmann and Tillich developed a framework in which the terms "consensus rating" and "consensus information" are clearly defined. The present paper connects the two aforementioned contributions and applies the theoretical framework of Lehmann and Tillich in combination with some of the practical ideas of Grün et al. In contrast to Grün et al, a simulation approach is chosen in order to have a clear benchmark for assessing the rating aggregation outcomes. In doing so, the following questions are addressed. Does rating aggregation really lead to more precise estimates of default probabilities? Is there a preferable aggregation method? Does the consensus rating, as defined by Lehmann and Tillich, outperform other aggregation methods? The simulation results show that rating aggregation can be a puzzling issue.

1 Introduction

The problem of rating aggregation has not been discussed much in the literature of the past few years. One of the first papers in this field is Grün et al (2013), in which an approach is developed that combines different ratings into a so-called consensus rating. In contrast, Lehmann and Tillich (2016) discuss the issue from a slightly more basic perspective. They ask what a “consensus rating” is and in which cases this concept makes sense. In Grün et al (2013), the suggested model for rating aggregation is applied to a real data set. The main problem there is that the true default probabilities are unknown and the chosen benchmark is only data driven. The approach of the contribution at hand is a simulation, in which an artificial world with known default probabilities (based on a logit model) is constructed, ie, a fixed benchmark is used, in contrast to Grün et al (2013). As already mentioned in Lehmann and Tillich (2016), there are many aspects leading to different estimates of the same default probability, eg, different models, estimation methods, data sets and time horizons. Further, the concept of consensus rating seems to be a theoretical issue, because rating agencies strive to obtain more precise information than their competitors, and the interchange and recombination of information in particular is not desired. Using a simulation, as is done in this paper, makes it possible to control these manifold aspects. As a consequence, it becomes possible to gain insights into the performance of different aggregation approaches.

This paper is organized as follows. In Section 2, some notation and theoretical background is introduced in brief. Section 3 contains the settings of the simulation and the methods of rating aggregation. Section 4 presents the results. Finally, the paper ends with some conclusions in Section 5.

2 Information set and consensus information

For the following notation and assumptions, see Lehmann and Tillich (2016), Sections 1 and 2. The credit default of debtor i is modeled by a random variable Y_i, i = 1, ..., n. It takes the value 1 if debtor i defaults, and 0 otherwise. Thus, P(Y_i = 1) is the unconditional default probability of debtor i. In order to estimate individual default probabilities, typically several rating characteristics are taken into account. These rating characteristics are modeled by a subject-specific real random vector

  X_i := (X_i1, X_i2, ..., X_iK)

with realization

  x_i := (x_i1, x_i2, ..., x_iK) ∈ ℝ^K,   i = 1, ..., n.

Then, the probability of interest in a rating process is the conditional default probability P(Y_i = 1 | X_i = x_i).

In the following, an institution that assigns ratings is called a rating agency. This could also be a bank that evaluates the debtors. In the style of Lehmann and Tillich (2016), it is assumed that all rating agencies use the same vector X_i. This assumption is needed in the following for reasonable set operations. In this framework, the differences do not lie in the rating characteristics X_i themselves, but in the information about them.

The situation in which all the values of debtor i are known is called complete information. The corresponding information set is I_i = {x_i} = {x_i1} × {x_i2} × ... × {x_iK}. It is a singleton. Typically, rating agencies do not have complete information. They know and use only subvectors of x_i. The subvector belonging to rating agency r = 1, ..., R is denoted by x_ri. Its corresponding information set I_ri differs from the complete information set I_i in that the unknown values are replaced by the set ℝ of real numbers. It is assumed that a rating agency either knows the exact value of a rating characteristic or knows nothing about it. It follows that I_i = {x_i} ⊆ I_ri. The more information is available, the more precise the information set. Note that, within our framework, higher precision of an information set is connected with lower cardinality.

Example 2.1.

This example corresponds to the information set of rating agency 1 in the simulation scenario B2 below (see Section 3 and Table 3). Rating agency r = 1 has information only about the rating characteristics 1, 3, 5 and 7, ie, x_1i = (x_i1, x_i3, x_i5, x_i7). Assuming the vector x_i consists of K = 8 components, the resulting information set is I_1i = {x_i1} × ℝ × {x_i3} × ℝ × {x_i5} × ℝ × {x_i7} × ℝ.

Because of their different information sets, rating agencies estimate different conditional default probabilities P(Y_i = 1 | X_i ∈ I_ri). In order to estimate the same probability, and to get as close to complete information as possible, the information sets of the rating agencies should be merged in this framework. This leads to a consensus in information. Since it is assumed that there is no contradictory information, the consensus information set I_i^* of the ith debtor can be defined as (see Lehmann and Tillich (2016), Definition 1)

  I_i^* := ⋂_{r=1}^R I_ri.   (2.1)

Thus, the consensus information set I_i^* is at least as precise as every single agency-specific information set I_ri. As already noted above, higher precision of an information set is connected with lower cardinality. It holds that I_i = {x_i} ⊆ I_i^* ⊆ I_ri for all r = 1, ..., R, since x_i is included in every single agency-specific information set I_ri. Based on the consensus information set I_i^*, the subvector x_i^* can be constructed. It contains all values of the rating characteristics that are known to at least one rating agency.

Example 2.2.

Let

  I_1i = {x_i1} × ℝ × {x_i3} × ℝ × {x_i5} × ℝ × {x_i7} × ℝ

as in Example 2.1, and

  I_2i = {x_i1} × {x_i2} × ℝ × ℝ × {x_i5} × ℝ × ℝ × {x_i8}.

Then the consensus information set corresponding to the information sets of the two rating agencies is

  I_i^* = I_1i ∩ I_2i = {x_i1} × {x_i2} × {x_i3} × ℝ × {x_i5} × ℝ × {x_i7} × {x_i8},

and x_i^* = (x_i1, x_i2, x_i3, x_i5, x_i7, x_i8) is the resulting subvector of the rating characteristics.
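The set operations in Example 2.2 can be mirrored in code. A minimal sketch (not from the paper): represent each agency's information set by the set of indices of the rating characteristics it knows; intersecting the Cartesian-product sets then corresponds to taking the union of the known index sets.

```python
# Each information set is encoded by the indices of the known characteristics;
# unknown coordinates are implicitly the whole real line.
K = 8
agency_1 = {1, 3, 5, 7}   # knows x_i1, x_i3, x_i5, x_i7 (Example 2.1)
agency_2 = {1, 2, 5, 8}   # knows x_i1, x_i2, x_i5, x_i8

# Intersecting the Cartesian products I_1i and I_2i amounts to taking
# the union of the known index sets.
consensus = agency_1 | agency_2
print(sorted(consensus))  # indices of the subvector x_i^*: [1, 2, 3, 5, 7, 8]
```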

3 Simulation

Every information set leads to a conditional default probability, eg, P(Y_i = 1 | X_i ∈ I_ri) ∈ ]0, 1[. A rating agency calculates an estimate for this unknown default probability. Rating aggregation typically means aggregating such different estimates. To get a consensus rating in the sense of estimating the same target, rating agencies would have to use the same information set, eg, the consensus information set. In this section, several aggregation variants as well as the concept of consensus rating are simulated and compared. Keeping in mind the theoretical considerations from above, it should be clarified whether a real consensus rating performs better than other forms of rating aggregation.

The basic idea for the simulation is as follows. First, an artificial world is constructed by a completely specified logit model, in which all true default probabilities are known. Second, defaults are simulated based on these known probabilities. Third, on this data set, the default probabilities are estimated, given different subsets of the information set used at the beginning. Fourth and finally, the estimated probabilities are aggregated in different ways. Thus, a comparison between the different estimated probabilities and the “true” probabilities is possible. The second to fourth steps are replicated in a Monte Carlo simulation. As a basis of the simulation, the logit function is needed. It is defined as

  G(z) := 1 / (1 + e^(-z))   for all z ∈ ℝ.

Its inverse is

  G^(-1)(π) = ln(π / (1 - π))   for all 0 < π < 1.

The logit model is chosen here because it is the most-used standard model for scoring purposes (see Anderson (2007, p. 42) or Mays (2004, p. 65)).
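For reference, the logit function and its inverse can be written down directly; a minimal sketch in Python:

```python
import math

def G(z: float) -> float:
    """Logit (logistic) function G(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def G_inv(p: float) -> float:
    """Inverse logit G^-1(p) = ln(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))

print(G(0.0))         # 0.5: the logit function maps 0 to one half
print(G_inv(G(1.7)))  # recovers 1.7 up to floating-point error
```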

In the following, all steps are described in detail.

Step 1. At first, the rating characteristics and the corresponding default probability for all n debtors are needed. To this end, n realizations x_i = (x_i1, ..., x_iK) of independent and identically distributed (iid) random vectors

  X_i = (X_i1, ..., X_iK) ~iid F_X

are generated with an arbitrary distribution F_X. Thus, I_i = {x_i}. Next, the true default probability for debtor i is calculated by

  π(I_i) := P(Y_i = 1 | X_i ∈ I_i) := G(β_0 + β′x_i),   i = 1, ..., n,   (3.1)

where β_0 ∈ ℝ and β := (β_1, ..., β_K)′ ∈ ℝ^K denote the known parameters.

Step 2. Based on the true default probability π(I_i), one realization y_i of the default variable Y_i is simulated for every debtor i = 1, ..., n.
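Steps 1 and 2 can be sketched as follows. The distributions echo the A scenarios below, but the seed and the coefficient values here are illustrative assumptions, not the paper's GAUSS setup:

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed, not the paper's
n, K = 4000, 8

# Step 1: rating characteristics in the style of the A scenarios
X = np.column_stack([
    rng.lognormal(0.25, 0.35, n),    # x_i1 ~ LN(0.25, 0.35)
    rng.poisson(0.4, (n, 3)),        # x_i2..x_i4 iid Poi(0.4)
    rng.binomial(1, 0.5, (n, 4)),    # x_i5..x_i8 iid Ber(0.5)
])

beta0 = -2.0                                             # intercept (assumed)
beta = np.array([-1.0, -2.0, -3.0, -1.0, 2.0, 2.0, 1.0, 1.0])  # assumed values

# True default probabilities via the logit model (3.1)
pi_true = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))

# Step 2: one simulated default indicator per debtor
y = rng.binomial(1, pi_true)
```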

Step 3.

  (a) Based on the default data from Step 2, the parameters β_0, β_1, ..., β_K of (3.1) are estimated. With the estimates β^_0 and β^ = (β^_1, ..., β^_K), the default probability π(I_i) is estimated by

      π^(I_i) := G(β^_0 + β^′x_i).
  (b) Additionally, some modifications of (3.1) are estimated. The modifications refer to the different rating-agency-specific information sets I_ri. In detail, the rating agencies know and/or use only subvectors of x_i. These subvectors x_ri have dimension 0 < K_r ≤ K. From these subvectors and the default data, a reduced number of parameters is estimated, namely β_r0 and β_r := (β_r1, ..., β_rKr). With the estimates β^_r0 and β^_r = (β^_r1, ..., β^_rKr), the agency-specific default probabilities

      π(I_ri) = P(Y_i = 1 | X_i ∈ I_ri),   i = 1, ..., n,  r = 1, ..., R,   (3.2)

    are estimated by

      π^(I_ri) := G(β^_r0 + β^_r′x_ri).   (3.3)
  (c) Last, the consensus information set I_i^* from (2.1) is used for estimation. From the corresponding subvectors x_i^* and the default data, the default probabilities π(I_i^*) are estimated by π^(I_i^*), analogously to Step 3(b).

All estimations are done by maximum likelihood estimation.
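The maximum likelihood step can be sketched with a small Newton-Raphson routine. This is an illustrative stand-in, not the MAXLIK implementation used in the paper:

```python
import numpy as np

def fit_logit(X, y, iters=100, tol=1e-10):
    """Maximum likelihood estimation of a logit model by Newton-Raphson.

    X: n x (1 + K_r) design matrix including the intercept column;
    y: length-n 0/1 default indicators. Returns (beta_0, ..., beta_Kr).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (y - p)                  # gradient of the log-likelihood
        W = p * (1.0 - p)
        info = (X * W[:, None]).T @ X          # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

On a well-conditioned design this recovers the generating coefficients up to sampling error; the agency-specific fits of Step 3(b) simply drop the unknown columns of X.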

Step 4. For all debtors i = 1, ..., n, aggregated default probabilities are derived from the rating agencies’ estimates π^(I_ri), r = 1, ..., R, in (3.3). This is the investor’s or external perspective, which means no information about the rating characteristics is used.

  (a) An aggregated default probability is calculated as an arithmetic mean:

      π¯_i := (1/R) Σ_{r=1}^R π^(I_ri),   i = 1, ..., n.
  (b) Another form of aggregation is the geometric mean

      π¯_Gi := (∏_{r=1}^R π^(I_ri))^(1/R),   i = 1, ..., n.
  (c) Taking into account the benchmark idea of Grün et al (2013, p. 82), a third aggregation method based on so-called Z-scores is used. Formally, the Z-scores are simply the estimated linear predictors of the logit model in (3.3), ie, z_ri := G^(-1)(π^(I_ri)) = β^_r0 + β^_r′x_ri. Calculating their arithmetic mean

      z¯_i := (1/R) Σ_{r=1}^R z_ri,   i = 1, ..., n,

    finally leads to the corresponding aggregated default probability π¯_Si := G(z¯_i).
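The three aggregation rules of Step 4 can be sketched compactly; `pi_hat` is a hypothetical R x n array of agency-specific estimates:

```python
import numpy as np

def G(z):
    return 1.0 / (1.0 + np.exp(-z))

def G_inv(p):
    return np.log(p / (1.0 - p))

def aggregate(pi_hat):
    """pi_hat: R x n array of agency-specific default probability estimates."""
    arith = pi_hat.mean(axis=0)                 # Step 4(a): arithmetic mean
    geom = np.exp(np.log(pi_hat).mean(axis=0))  # Step 4(b): geometric mean
    zbar = G(G_inv(pi_hat).mean(axis=0))        # Step 4(c): mean of Z-scores
    return arith, geom, zbar
```

The geometric mean is computed on the log scale for numerical stability; by the AM-GM inequality it never exceeds the arithmetic mean.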

Steps 2–4 are replicated m times, ie, m defaults are simulated for each individual (based on the default probability from Step 1). In contrast to Grün et al (2013), no time dynamics are considered; the simulation covers only one period of time and is replicated independently m times. Typically, one needs to distinguish between point-in-time (PIT) and through-the-cycle (TTC) default probabilities. Basically, the whole aggregation issue can be applied to PIT ratings as well as TTC ratings. The only requirement is that all single ratings being aggregated refer to the same time horizon. The resulting rating aggregation then refers to the same time horizon as all the single ratings. From Step 1 in the simulation setting, it follows that all debtors are independent. Based on a typical assumption in credit risk, this can be interpreted as a PIT framework, where the defaults are stochastically independent conditional on an unobservable systematic risk factor. Actually, the distinction between PIT and TTC is an issue of dependency between different debtors, not between different single ratings referring to one debtor. These single ratings are, of course, highly dependent.

In the sense of Lehmann and Tillich (2016, Section 3), the estimated probabilities π^(I_i^*) from Step 3(c) constitute “consensus ratings”, because they are based on the consensus information set. In contrast, the aggregated estimates π¯_i, π¯_Gi and π¯_Si are based on generally different information sets I_ri. Hence, they should be denoted as “compromise ratings”, as mentioned in Lehmann and Tillich (2016, Section 4).

There are three main advantages of the simulation framework above.

  (1) The true default probability is known as a fixed benchmark.

  (2) In the artificial world, real consensus information in the sense of (2.1) is available, and therewith a real consensus rating exists.

  (3) The aggregated ratings are based on the same model, estimation method, data set and time horizon. Only the information sets differ. The main interest of this contribution lies in the role of the information set for aggregated ratings. Therefore, the simulation scenarios differ only with regard to the information set used, while the whole framework around them is fixed. As already mentioned in Lehmann and Tillich (2016, pp. 361–362), different models, estimation methods and data sets lead to different estimates of default probabilities. These effects may interfere with the influence of the information set on the rating aggregation.

Simulations as described above are performed with settings as follows.

  • Number of rating agencies: R=3.

  • Number of debtors: n=4000.

  • Number of Monte Carlo replications: m=1000.

All simulations were implemented with GAUSS, Version 16.0.1 (seed for random numbers: 1 664 525) using the MAXLIK package, Version 5.0.9. The graphics were generated with R, Version 3.1.2 (R Core Team 2014), and the package ggplot2, Version 1.0.1.

Inspired by Grün et al (2013), who modeled rating aggregation of the three big rating agencies (Standard & Poor’s (S&P), Moody’s and Fitch), we consider three rating agencies as well. Two basic scenarios, A and B, of simulations are investigated. Their specifications are given in Tables 2 and 3. Both scenarios contain the same types of distributions for the regressor variables, namely lognormal, Poisson and Bernoulli distributions. The lognormal distribution, producing only positive real numbers, could indicate income, for example. The Poisson distribution produces nonnegative integers and could stand for the number of loans or the number of credit cards a person already has. Finally, the Bernoulli distribution with values 0 and 1 indicates some dichotomous characteristic, like sex. The coefficients β_0, β_1, ..., β_K are set so as to obtain two very different scenarios for the generated default probabilities. Scenario A contains much higher default probabilities than scenario B. Thus, scenario B is the more realistic scenario, especially in the case of credit defaults. Nonetheless, in the face of a crisis, scenario A does not seem absurd. Histograms of the true default probabilities π(I_i) for A and B, as calculated in Step 1, are illustrated in Figure 1. Additionally, Table 1 provides some descriptive statistics.

For both scenarios A and B of default probabilities, four subscenarios are considered. The subscenarios 1–4 differ in the information sets of the rating agencies and therefore in the resulting consensus information set. The rating characteristics used for estimation can be read off the choice matrices in Tables 2 and 3, as follows. Every column of a choice matrix stands for one of the rating characteristics used, ie, every column symbolizes one regressor variable of the logit model in (3.1). Thereby, the first column represents the intercept (with coefficient β_0), and the remaining columns represent the rating characteristics x_i1, ..., x_iK (with coefficients β_1, ..., β_K). Every row of the matrix stands for one of the model modifications in (3.1) and (3.2). More precisely, the first row represents the “true-world” model; the second, third and fourth rows represent the R = 3 different rating agencies; and the fifth row contains the case of the consensus information (see Step 3(c)). The choice matrix contains only ones and zeros, which indicate whether the kth regressor variable is used (1) or not (0) within the considered model modification. Within the framework of Section 2, for every regressor variable not used by rating agency r, only X_ik ∈ ℝ is known; therefore, the regressor variable X_ik can be omitted in the estimation. In the case of used regressor variables, it holds that X_ik = x_ik, ie, the rating agency knows the correct value of debtor i.

Figure 1: Histograms of the true default probabilities (log scale) for simulation scenarios A and B.
Table 1: Descriptive statistics of true default probabilities for scenarios A and B.

  Statistic      Scenario A   Scenario B
  25% quantile   1.069E-02    5.472E-06
  Median         7.789E-02    1.215E-04
  75% quantile   3.273E-01    1.567E-03
  90% quantile   6.764E-01    9.175E-03

4 Results

In the following, comparisons are made between the results of the different approaches from above. At first, the plausibility of the estimated values is assessed by error rates. The error rates describe the proportion of debtors for which the true default probability does not lie between the minimum and maximum of the estimated default probabilities over all m = 1000 Monte Carlo replications. For formal notation, let |A| denote the number of elements of an arbitrary set A. Then, the error rates are defined as follows, for r = 1, 2, 3:

  e := 1 - |A|/n,   e_r := 1 - |A_r|/n,   e^* := 1 - |A^*|/n,
  e¯ := 1 - |A¯|/n,   e¯_G := 1 - |A¯_G|/n,   e¯_S := 1 - |A¯_S|/n,

with

  A := {i : min_{j=1,...,m} π^(j)(I_i) ≤ π(I_i) ≤ max_{j=1,...,m} π^(j)(I_i)}   (complete information),
  A_r := {i : min_{j=1,...,m} π^(j)(I_ri) ≤ π(I_i) ≤ max_{j=1,...,m} π^(j)(I_ri)}   (information of agency r),
  A^* := {i : min_{j=1,...,m} π^(j)(I_i^*) ≤ π(I_i) ≤ max_{j=1,...,m} π^(j)(I_i^*)}   (consensus information),
  A¯ := {i : min_{j=1,...,m} π¯_i^(j) ≤ π(I_i) ≤ max_{j=1,...,m} π¯_i^(j)}   (arithmetic mean approach),
  A¯_G := {i : min_{j=1,...,m} π¯_Gi^(j) ≤ π(I_i) ≤ max_{j=1,...,m} π¯_Gi^(j)}   (geometric mean approach),
  A¯_S := {i : min_{j=1,...,m} π¯_Si^(j) ≤ π(I_i) ≤ max_{j=1,...,m} π¯_Si^(j)}   (Z-score approach),

where the superscript (j) indicates the corresponding value of the jth Monte Carlo replication. For instance, π^(j)(I_i^*) denotes the estimated default probability of debtor i in Monte Carlo replication j using the consensus information set I_i^*.
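The error rates above reduce to a few lines of array code; a sketch assuming a hypothetical m x n array `estimates` of Monte Carlo estimates:

```python
import numpy as np

def error_rate(estimates, pi_true):
    """Share of debtors whose true default probability lies outside the
    min-max range of their m Monte Carlo estimates (rows of `estimates`)."""
    lo = estimates.min(axis=0)
    hi = estimates.max(axis=0)
    covered = (lo <= pi_true) & (pi_true <= hi)
    return 1.0 - covered.mean()
```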

Table 2: Simulation scenarios A1–A4 with r = 1, 2, 3, i = 1, ..., 4000, j = 1, ..., 1000 and corresponding error rates.

Rating characteristics for all A scenarios: X_i1 ~ LN(0.25, 0.35); X_i2, X_i3, X_i4 iid Poi(0.4); X_i5, ..., X_i8 iid Ber(0.5). All regressor variables are mutually independent. Coefficients: (β_0, β_1, ..., β_8) = (-2, -1, -2, -3, -1, 2, 2, 1, 1).

In every choice matrix, the rows correspond (from top to bottom) to the true-world model, rating agencies r = 1, 2, 3 and the consensus information.

Scenario A1. Choice matrix:

  1 1 1 1 1 1 1 1 1
  1 1 1 1 1 0 0 0 0
  1 1 1 1 0 1 0 0 0
  1 1 1 1 0 0 1 0 0
  1 1 1 1 1 1 1 0 0

Error rates: e = 0.00, e_1 = 0.85, e_2 = 0.83, e_3 = 0.83, e^* = 0.43; e¯ = 0.85, e¯_G = 0.86, e¯_S = 0.86.

Scenario A2. Choice matrix:

  1 1 1 1 1 1 1 1 1
  1 1 1 1 1 0 0 0 1
  1 0 1 1 0 1 1 1 1
  1 1 1 1 0 0 1 1 0
  1 1 1 1 1 1 1 1 1

Error rates: e = 0.00, e_1 = 0.85, e_2 = 0.53, e_3 = 0.82, e^* = 0.00; e¯ = 0.79, e¯_G = 0.76, e¯_S = 0.78.

Scenario A3. Choice matrix:

  1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 0 1
  1 0 1 1 1 1 1 1 1
  1 1 1 0 1 1 1 1 1
  1 1 1 1 1 1 1 1 1

Error rates: e = 0.00, e_1 = 0.46, e_2 = 0.25, e_3 = 0.69, e^* = 0.00; e¯ = 0.53, e¯_G = 0.48, e¯_S = 0.49.

Scenario A4. Choice matrix:

  1 1 1 1 1 1 1 1 1
  1 1 0 1 0 0 0 0 1
  1 1 0 1 0 1 0 0 1
  1 0 1 1 0 0 0 0 0
  1 1 1 1 0 1 0 0 1

Error rates: e = 0.00, e_1 = 0.88, e_2 = 0.86, e_3 = 0.90, e^* = 0.81; e¯ = 0.91, e¯_G = 0.90, e¯_S = 0.91.
Table 3: Simulation scenarios B1–B4 with r = 1, 2, 3, i = 1, ..., 4000, j = 1, ..., 1000 and corresponding error rates.

Rating characteristics for all B scenarios: X_i1, X_i2 iid LN(0.25, 0.75); X_i3, X_i4 iid Poi(0.8); X_i5, ..., X_i8 iid Ber(0.1). All regressor variables are mutually independent. Coefficients: (β_0, β_1, ..., β_8) = (-2, -1, -2, -3, -1, 2, 2, 1, 1).

As in Table 2, the rows of every choice matrix correspond (from top to bottom) to the true-world model, rating agencies r = 1, 2, 3 and the consensus information.

Scenario B1. Choice matrix:

  1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 0 1
  1 0 1 1 1 1 1 1 1
  1 1 1 0 1 1 1 1 1
  1 1 1 1 1 1 1 1 1

Error rates: e = 0.000, e_1 = 0.000, e_2 = 0.012, e_3 = 0.085, e^* = 0.000; e¯ = 0.083, e¯_G = 0.000, e¯_S = 0.000.

Scenario B2. Choice matrix:

  1 1 1 1 1 1 1 1 1
  1 1 0 1 0 1 0 1 0
  1 0 1 0 1 0 1 0 1
  1 1 1 0 0 1 0 0 1
  1 1 1 1 1 1 1 1 1

Error rates: e = 0.000, e_1 = 0.175, e_2 = 0.351, e_3 = 0.273, e^* = 0.000; e¯ = 0.456, e¯_G = 0.040, e¯_S = 0.021.

Scenario B3. Choice matrix:

  1 1 1 1 1 1 1 1 1
  1 1 1 0 0 0 0 0 0
  1 0 0 1 1 0 0 0 0
  1 0 1 0 0 1 0 1 1
  1 1 1 1 1 1 1 1 1

Error rates: e = 0.000, e_1 = 0.341, e_2 = 0.290, e_3 = 0.422, e^* = 0.000; e¯ = 0.624, e¯_G = 0.054, e¯_S = 0.051.

Scenario B4. Choice matrix:

  1 1 1 1 1 1 1 1 1
  1 1 0 1 0 0 0 0 1
  1 1 0 1 0 1 0 0 1
  1 0 1 1 0 0 0 0 0
  1 1 1 1 0 1 0 0 1

Error rates: e = 0.000, e_1 = 0.213, e_2 = 0.170, e_3 = 0.176, e^* = 0.033; e¯ = 0.285, e¯_G = 0.181, e¯_S = 0.203.
Remark 4.1.

The measures of scoring quality that are used most often include the accuracy ratio, area under the curve, Kolmogorov–Smirnov statistic or conditional information entropy ratio. But these measures focus on discriminatory aspects rather than on the precision of default probabilities (see Tasche 2008), which is the main target of this contribution. Thus, we use measures that aim at precision.

In the following, some important results from and remarks on Tables 2 and 3 are reported.

  (1) Basically, all the results of aggregated ratings are quite unsatisfactory, given the obtained error rates. Error rates of about 80% are very high, and such outcomes do not seem appropriate for practical use. Additionally, note that an error rate as defined above is a quite rough measure of quality.

  (2) Regarding the aggregation method, neither the arithmetic mean, the geometric mean nor the Z-score mean seems preferable in scenarios A1–A4.

  In A4, the smallest amount of information about the rating characteristics is used; in particular, the consensus information (six out of nine regressor variables) is not identical to the complete information, as it is in scenarios A2 and A3. The consensus information also differs from the complete information in scenario A1. But, interestingly, in A1 the error rate e^* is substantially lower than in A4. This illustrates the influence of single regressor variables on the linear predictor in the logit model. The influence of single regressor variables on the linear predictor and the outcome can also be seen in A3, where only one variable is missing for every rating agency. These small differences lead to very different error rates, especially with regard to e_2. These varying outcomes depend strongly on the type of regressor variables as well as on their interaction with the linear predictor.

  (3) The error rates of the B scenarios are remarkably lower than those of the A scenarios. In all B scenarios (except for e_3 in B1), the single ratings perform better than the aggregation with the arithmetic mean. Because of the complex structure of the logit model, it is very difficult to explain this phenomenon theoretically from the model background. Further, the extent of the simulations above is too small to derive any essential tendency from the empirical outcome.

  Using the geometric mean or the Z-score for aggregation leads to lower error rates than the arithmetic mean, but it does not lead to an improvement over the single ratings in every case (see B4).

Calculating only the error rates does not take the precision of the results into account. Having small error rates is not enough: the calculated default probability should also lie at a small distance from the true default probability. This is assessed by the mean distance between the true probability π(I_i) and the corresponding estimated or aggregated probability:

  s_i = (1/m) Σ_{j=1}^m |π^(j)(I_i) - π(I_i)|   (complete information),   (4.1)
  s_i^* = (1/m) Σ_{j=1}^m |π^(j)(I_i^*) - π(I_i)|   (consensus information),   (4.2)
  s¯_i = (1/m) Σ_{j=1}^m |π¯_i^(j) - π(I_i)|   (arithmetic mean approach),   (4.3)
  s¯_Gi = (1/m) Σ_{j=1}^m |π¯_Gi^(j) - π(I_i)|   (geometric mean approach),   (4.4)
  s¯_Si = (1/m) Σ_{j=1}^m |π¯_Si^(j) - π(I_i)|   (Z-score approach),   (4.5)

where the superscript (j) again indicates the jth Monte Carlo replication. Because this contribution focuses on aggregation issues, the corresponding quantities for single rating agencies are omitted. Tables 4 and 5 contain an overview of the descriptive statistics for the mean absolute deviations (4.1)–(4.5) for all simulation scenarios.
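In array form, the quantities (4.1)-(4.5) differ only in which estimate enters the absolute difference; a sketch with the same hypothetical m x n layout as before:

```python
import numpy as np

def mean_abs_deviation(estimates, pi_true):
    """Per-debtor mean absolute distance between the m Monte Carlo
    estimates (rows) and the true default probabilities, as in (4.1)-(4.5)."""
    return np.abs(estimates - pi_true).mean(axis=0)
```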

Table 4: Mean absolute deviations for A scenarios.

       Statistic      s_i        s_i^*      s¯_i       s¯_Gi      s¯_Si
  A1   1st quartile   1.892E-03  6.410E-03  2.356E-02  2.143E-02  2.955E-02
       Median         8.944E-03  1.942E-02  6.356E-02  6.199E-02  7.553E-02
       3rd quartile   2.073E-02  6.856E-02  1.890E-01  1.638E-01  1.977E-01
  A2   1st quartile   1.892E-03  1.892E-03  1.601E-02  1.352E-02  1.764E-02
       Median         8.944E-03  8.944E-03  5.150E-02  4.267E-02  5.404E-02
       3rd quartile   2.073E-02  2.073E-02  1.384E-01  1.112E-01  1.383E-01
  A3   1st quartile   1.892E-03  1.892E-03  1.126E-02  6.522E-03  1.136E-02
       Median         8.944E-03  8.944E-03  2.996E-02  2.164E-02  2.545E-02
       3rd quartile   2.073E-02  2.073E-02  7.388E-02  5.596E-02  5.092E-02
  A4   1st quartile   1.892E-03  1.778E-02  3.453E-02  2.932E-02  3.860E-02
       Median         8.944E-03  6.087E-02  1.208E-01  1.101E-01  1.401E-01
       3rd quartile   2.073E-02  1.905E-01  2.194E-01  2.012E-01  2.403E-01
Table 5: Mean absolute deviations for B scenarios.

       Statistic      s_i        s_i^*      s¯_i       s¯_Gi      s¯_Si
  B1   1st quartile   1.114E-05  1.114E-05  9.257E-05  2.957E-05  4.300E-05
       Median         1.385E-04  1.385E-04  4.958E-04  2.248E-04  3.180E-04
       3rd quartile   1.090E-03  1.090E-03  2.085E-03  1.245E-03  1.645E-03
  B2   1st quartile   1.114E-05  1.114E-05  7.891E-04  2.079E-04  3.000E-04
       Median         1.385E-04  1.385E-04  1.867E-03  7.203E-04  1.103E-03
       3rd quartile   1.090E-03  1.090E-03  4.267E-03  1.628E-03  2.450E-03
  B3   1st quartile   1.114E-05  1.114E-05  1.191E-03  3.337E-04  4.820E-04
       Median         1.385E-04  1.385E-04  2.734E-03  1.103E-03  1.616E-03
       3rd quartile   1.090E-03  1.090E-03  5.521E-03  2.295E-03  3.419E-03
  B4   1st quartile   1.114E-05  2.960E-05  2.695E-04  1.325E-04  1.910E-04
       Median         1.385E-04  3.218E-04  1.074E-03  7.159E-04  1.013E-03
       3rd quartile   1.090E-03  2.084E-03  6.023E-03  3.656E-03  5.493E-03

In the following, some important results from and remarks on Tables 4 and 5 are reported.

  (1) The values in the s_i column of Tables 4 and 5 are identical across all subscenarios within A or B, because the basis here is the full model with all regressor variables.

  (2) The mean absolute deviations in the A scenarios are larger than in the B scenarios, which is probably mainly caused by the different magnitudes of the true default probabilities (see Figure 1). In comparison with the estimation based on complete information, the mean deviations (referring to the corresponding quantiles of the descriptive statistics) are higher by up to a factor of about 12 in the A scenarios, and up to a factor of about 7 in the B scenarios. In relation to the very small default probabilities, this is not necessarily a problem if the estimated default probabilities are transformed onto an ordinal scale with rating classes. Such small default probabilities typically constitute the best rating class.

  (3) Out of all the A scenarios, A4 contains the biggest deviations, regardless of the aggregation method. In A4, the smallest amount of information is used; in particular, the consensus information (six out of nine regressor variables) is not identical to the complete information, unlike in scenarios A2 and A3. In scenario A1, too, the consensus information does not equal the complete information.

  (4) Referring to the aggregation, the A scenarios as well as the B scenarios show higher absolute deviations for the arithmetic mean and the Z-score aggregation in comparison with the geometric mean. The Z-score aggregation shows higher mean absolute deviations than the arithmetic mean in the A scenarios, and smaller ones than the arithmetic mean in the B scenarios.

The absolute values of deviation still hide the direction of the error that is made. In order to gain some insight into this issue, the rates of underestimated default probabilities are investigated. This underestimation rate is defined as

  u_i := |A_i| / m,

with A_i := {j : π^(j)(I_i) < π(I_i)}. Analogously, the underestimation rates u_i^*, u¯_i, u¯_Gi and u¯_Si are defined with the sets

  A_i^* := {j : π^(j)(I_i^*) < π(I_i)},   A¯_i := {j : π¯_i^(j) < π(I_i)},
  A¯_Gi := {j : π¯_Gi^(j) < π(I_i)},   A¯_Si := {j : π¯_Si^(j) < π(I_i)}.
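The underestimation rates are again one line per definition; a sketch with the same hypothetical m x n layout as before:

```python
import numpy as np

def underestimation_rate(estimates, pi_true):
    """Fraction of the m replications (rows) in which the estimate falls
    below the true default probability, computed per debtor (column)."""
    return (estimates < pi_true).mean(axis=0)
```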

Figure 2 illustrates all these underestimation rates by means of hexbinplots. Hexbinplots are a kind of aggregated scatterplot. Where there is no data, there is no hexbin. The more data there is in a hexbin, the darker it is. Since the main target here is insight into aggregation issues, the single rating agencies are not considered for the problem of underestimation. Apart from that, the graphs for the single rating agencies do not provide any additional information on or new insights into the basic problem of aggregation.

Figure 2: Hexbinplots of the rates of underestimated default probabilities for all A and B scenarios.

For the hexbinplots in Figure 2, the following observations can be made.

  1. (1)

    Note the difference between the empirical distributions of the true default probabilities in the A and B scenarios for the following remarks. A large number of low default probabilities leads to high absolute frequencies near zero along the horizontal axes. The main target of the hexbinplots is the illustration of the underestimation rates depending on the true default probabilities. Therefore, every hexbinplot has to be read mainly in a “vertical” direction.

  2. (2)

    If there is no tendency of overestimation or underestimation, the empirical distributions of the underestimation rates are mainly concentrated around 0.5. This can be seen if the models are estimated with complete information (ui).

  3. (3)

    In the A and B scenarios, there is a tendency to overestimate the low default probabilities (underestimation rates near zero) and underestimate the higher default probabilities (underestimation rates near one). When default probabilities are near the bounds of the interval [0,1], there is a larger amount of possible probability estimates that overestimate low probabilities and underestimate high probabilities.

  4. (4)

    From the aggregation perspective, there seems to be no preferable kind of aggregation regarding under- or overestimation. The plots for u¯i,u¯Gi and u¯Si are almost equal for all scenarios. If consensus information and complete information are not identical, ie, IiIi (see scenarios A1, A4 and B4), the plot for ui is also quite similar to them.

  5. (5)

    As mentioned above, nearly all scenarios show higher absolute deviations and higher error rates for the arithmetic mean than the geometric mean. This fact becomes clear when we take a look at the underestimation rates. For positive values, the arithmetic mean is an upper bound for the geometric mean (see Hardy et al 1988, Theorem 9, p. 17). If there is a tendency to overestimate the low default probabilities, which form the majority in the scenarios here (see Figure 1), an aggregation by the geometric mean acts like a correction under such conditions. As a consequence, this is one possible explanation for the corresponding lower error rates in Tables 2 and 3.

  (6) The outcome of the aggregation based on the Z-score depends on the properties of the logit function G in (3.1). Because of its S-shaped form, the logit function has both a convex part and a concave part. By Jensen's inequality (see Hardy et al 1988, Theorem 90, p. 74), aggregation by the arithmetic mean is an upper bound for the Z-score aggregation where the logit function is convex, and a lower bound where it is concave. The logit function is strictly increasing and has an inflection point at zero; consequently, all estimated default probabilities smaller than 0.5 are transformed within its convex part. A vast majority of low default probabilities therefore implies that the Z-scores are mainly transformed within the convex part of the logit function. This is a possible explanation for the similar outcomes of the geometric mean and the Z-score aggregation, because the Z-score aggregation works as a correction quite similar to the geometric mean described above, especially in the B scenarios. The A scenarios contain a bigger portion of default probabilities (see Table 1) that are transformed within the concave part, which could explain why nearly all A scenarios show higher mean absolute deviations for the Z-score aggregation than for the arithmetic mean.

Finally, it should be mentioned that all of the above simulations were also performed with 100 replications. Essentially, all results remain unchanged in comparison with the setting of 1000 Monte Carlo replications reported in detail above.
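The underestimation rate used throughout the remarks above can be sketched as follows. This is a hypothetical illustration, not the study's simulation design: a single estimate is modelled as the logistic transform of the true z-score plus Gaussian noise, and the underestimation rate is the share of replications in which the estimate falls below the true default probability.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def logistic(z):
    return 1 / (1 + math.exp(-z))

def underestimation_rate(p_true, sigma=1.0, reps=10_000, seed=42):
    """Share of replications in which a noisy PD estimate falls below the
    true default probability p_true; also returns the mean estimate."""
    rng = random.Random(seed)
    estimates = [logistic(logit(p_true) + rng.gauss(0.0, sigma))
                 for _ in range(reps)]
    rate = sum(e < p_true for e in estimates) / reps
    mean_est = sum(estimates) / reps
    return rate, mean_est

rate, mean_est = underestimation_rate(0.01)
# The noise is symmetric on the z-scale, so the rate stays close to 0.5,
# as in remark (2) for estimation with complete information. The mean
# estimate nevertheless exceeds the low true probability, because the
# logistic function is convex on this part of its domain (Jensen).
```

Under this toy model the rate itself shows no tendency in either direction, while the average estimate still overshoots low default probabilities, which mirrors the asymmetry discussed in remark (3).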

5 Conclusion

From the theoretical considerations and simulations above, there is no preferable strategy for getting a more precise rating by aggregating single ratings. Neither the single ratings nor their aggregated forms necessarily lead to improvements. Only the consensus rating provides an advantage here, as expected from the theoretical considerations. But the implementation of a consensus rating remains a theoretical issue, as already described. Comparing only the "real" aggregated ratings, aggregation by the geometric mean performs slightly better than the arithmetic mean, but this may be due to the overestimation of low default probabilities in the simulation framework used here. Therefore, a generalization does not seem appropriate. With respect to the error rates, aggregating the ratings via the Z-score offers a performance similar to that of the geometric mean. But in the case of default probabilities above 0.5, its mean absolute deviations exceed those of the arithmetic mean as well as those of the geometric mean. Such effects are probably driven by the relations between the different mean concepts in combination with the properties of the logit function. For practical purposes, default probabilities above 0.5 are not so important; therefore, we would recommend aggregating with the geometric mean. But, as already said, a generalization is not warranted given the outcomes above.

Every combination of a model, an estimation method and an arbitrary aggregation method causes different outcomes and shows its own interplay. Further, the quality measures used here (mean absolute deviation, error rate, underestimation rate) do not indicate any practical use of the aggregated estimated default probabilities. Taking into account any time dynamics, as considered in Grün et al (2013), the whole issue becomes more complicated. Based on this one-period simulation study, better results in terms of precision cannot at the moment be expected in a multi-period case. As a consequence, from a theoretical point of view and looking at the simulation results, the aggregation of single ratings to get a higher precision seems to be a questionable issue, at least in the case of the logit model.

Further research could concentrate on a formal proof of the idea that achieving a higher precision by means of rating aggregation is impossible; conversely, researchers could try to prove that it is possible. In doing so, different levels of consideration could be taken into account. First, does probability theory provide any approach to (dis)prove that different conditional probabilities resulting from different models can reasonably be combined? The second level is the estimation problem: does estimation theory provide any approach to (dis)prove that different estimates from different estimation methods can reasonably be merged?

As long as there is no proof that achieving a higher precision by means of rating aggregation is impossible, further research could concentrate on studying other models and methods for aggregation. For instance, machine learning techniques such as artificial neural networks, or some reasonably weighted mean, could come into consideration. Additionally, it could be investigated whether the aggregation methods based on the arithmetic mean, the geometric mean or the Z-score, as used in this contribution, can be improved, or whether one of them outperforms locally. Here, the question is whether there are situations in which one aggregation method generally works best.

Further, the development of alternative measures to evaluate the aggregation would be another interesting aspect to study, because the common measures from credit scoring are not appropriate with respect to the precision of aggregated ratings. The bottom line is this: besides the theoretical perspective, from which rating aggregation seems questionable, finding an appropriate heuristic for rating aggregation that provides outcomes with sufficient improvement for practical use has not been precluded.

Declaration of interest

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

Acknowledgements

We thank Stefan Huschens and two anonymous referees for valuable comments.

References

Anderson, R. (2007). The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation. Oxford University Press, New York.

Grün, B., Hofmarcher, P., Hornik, K., Leitner, C., and Pichler, S. (2013). Deriving consensus ratings of the big three rating agencies. The Journal of Credit Risk 9(1), 75–98 (http://doi.org/brnk).

Hardy, G., Littlewood, J. E., and Pólya, G. (1988). Inequalities, 2nd edn. Cambridge University Press.

Lehmann, C., and Tillich, D. (2016). Consensus information and consensus rating: a note on methodological problems of rating aggregation. In Operations Research Proceedings 2014: Selected Papers of the Annual International Conference of the German Operations Research Society (GOR), RWTH Aachen University, Germany, September 2–5, 2014, Lübbecke, M., Koster, A., Letmathe, P., Madlener, R., Peis, B., and Walther, G. (eds), pp. 357–362. Springer (http://doi.org/brnm).

Mays, E. (2004). Scorecard development. In Credit Scoring for Risk Managers: The Handbook for Lenders, Mays, E. (ed), Chapter 4, pp. 63–89. Thomson South-Western, Mason, OH.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.

Tasche, D. (2008). Validation of internal rating systems and PD estimates. In The Analytics of Risk Model Validation, Christodoulakis, G., and Satchell, S. (eds), Chapter 11. Academic Press, London.
