Consensus information and consensus rating: a simulation study on rating aggregation
Need to know
- The aggregation of single ratings to get a higher precision seems to be a questionable issue, at least in the case of the logit model.
- Only the consensus rating provides an advantage with respect to precision.
- Up to now, there is no preferable strategy for obtaining a more precise rating by aggregating single ratings.
Abstract
The aggregation of different single ratings into a so-called consensus rating in order to obtain a more precise estimate of a debtor's default probability is an idea that is hardly discussed in the literature. In their 2013 paper "Deriving consensus ratings of the big three rating agencies", Grün et al proposed a method for rating aggregation and thereby introduced the term "consensus rating". To sharpen the whole issue of rating aggregation from a theoretical perspective, in their 2016 paper "Consensus information and consensus rating: a note on methodological problems of rating aggregation", Lehmann and Tillich developed a framework in which the terms "consensus rating" and "consensus information" are clearly defined. The paper at hand connects the two aforementioned contributions and applies the theoretical framework of Lehmann and Tillich in combination with some of the practical ideas of Grün et al. In contrast to Grün et al, a simulation approach is chosen in order to have a clear benchmark for assessing the rating aggregation outcomes. Thereby, the following questions should be clarified. Does rating aggregation really lead to more precise estimates of the default probabilities? Is there a preferable aggregation method? Does the consensus rating, as defined by Lehmann and Tillich, outperform other aggregation methods? The simulation results show that rating aggregation can be a puzzling issue.
1 Introduction
The problem of rating aggregation seems not to have been discussed much in the literature of the past few years. One of the first papers in this field is Grün et al (2013), in which an approach is developed that combines different ratings into a so-called consensus rating. In contrast, Lehmann and Tillich (2016) discuss the issue from a slightly more basic perspective. They ask what a "consensus rating" is and in which cases this concept makes sense. In Grün et al (2013), the suggested model for rating aggregation is applied to a real data set. The main problem there is that the true default probabilities are unknown and the chosen benchmark is only data driven. The approach of the contribution at hand is a simulation, whereby an artificial world with known default probabilities (based on a logit model) is constructed, ie, a fixed benchmark is used, in contrast to Grün et al (2013). As already mentioned in Lehmann and Tillich (2016), there are many aspects leading to different estimates of the same default probability, eg, different models, estimation methods, data sets and time horizons. Further, the concept of a consensus rating seems to be a theoretical issue, because rating agencies strive to obtain more precise information than their competitors, and the interchange and recombination of information in particular is not desired. Using a simulation, as is done in this paper, offers the possibility to control these manifold aspects. As a consequence, it becomes possible to gain insights into the performance of different aggregation approaches.
This paper is organized as follows. In Section 2, some notation and theoretical background is introduced in brief. Section 3 contains the settings of the simulation and the methods of rating aggregation. Section 4 presents the results. Finally, the paper ends with some conclusions in Section 5.
2 Information set and consensus information
For the following notation and assumptions, see Lehmann and Tillich (2016), Sections 1 and 2. The credit default of debtor $i$ is modeled by a random variable $Y_i$, $i = 1, \dots, n$. It takes the value 1 if debtor $i$ defaults, and 0 otherwise. Thus, $P(Y_i = 1)$ is the unconditional default probability of debtor $i$. In order to estimate individual default probabilities, typically several rating characteristics are taken into account. These rating characteristics are modeled by a subject-specific real random vector
$$X_i = (X_{i1}, \dots, X_{iK})$$
with realization
$$x_i = (x_{i1}, \dots, x_{iK}).$$
Then, the probability of interest in a rating process is the conditional default probability $P(Y_i = 1 \mid X_i = x_i)$.
In the following, an institution that assigns ratings is called a rating agency. This could also be a bank that evaluates the debtors. In the style of Lehmann and Tillich (2016), it is assumed that all rating agencies $j = 1, \dots, J$ use the same vector $X_i$ of rating characteristics. This assumption is needed in the following for reasonable set operations. In this framework, the differences do not lie in the rating characteristics themselves but in the information about them.
The situation in which all the values $x_{i1}, \dots, x_{iK}$ of debtor $i$ are known is called complete information. The corresponding information set is $\{x_{i1}\} \times \cdots \times \{x_{iK}\} = \{x_i\}$. It is a singleton. Typically, rating agencies do not have complete information. They know and use only subvectors of $x_i$. The subvector belonging to rating agency $j$ is denoted by $x_i^{(j)}$. Its corresponding information set, denoted by $A_i^{(j)}$, differs from the complete information set in that the unknown values are replaced by the set of real numbers $\mathbb{R}$. It is assumed that a rating agency either knows the exact value of a rating characteristic or knows nothing about it. It follows that $\{x_i\} \subseteq A_i^{(j)}$ for every rating agency $j$. The more information that is available, the more precise the information set. Note that, within our framework, higher precision in an information set is connected with lower cardinality.
Example 2.1. Let debtor $i$ have $K = 3$ rating characteristics with realization $x_i = (x_{i1}, x_{i2}, x_{i3})$. If rating agency 1 knows only the values $x_{i1}$ and $x_{i2}$, its information set is $A_i^{(1)} = \{x_{i1}\} \times \{x_{i2}\} \times \mathbb{R}$.
Because of their different information sets, rating agencies estimate different conditional default probabilities $P(Y_i = 1 \mid X_i \in A_i^{(j)})$. In order to estimate the same probability, and to get as close to complete information as possible, the information sets of the rating agencies should be merged in this framework. This leads to a consensus in information. Since it is assumed that there is no contradictory information, the consensus information set of the $i$th debtor can be defined as (see Lehmann and Tillich (2016), Definition 1)
$$A_i^{\mathrm{con}} := \bigcap_{j=1}^{J} A_i^{(j)}. \qquad (2.1)$$
Thus, the consensus information set $A_i^{\mathrm{con}}$ is at least as precise as every single agency-specific information set $A_i^{(j)}$. As already noted above, higher precision in an information set is connected with lower cardinality. It holds that the cardinality of $A_i^{\mathrm{con}}$ does not exceed that of $A_i^{(j)}$ for all $j = 1, \dots, J$, since $A_i^{\mathrm{con}}$ is included in every single agency-specific information set $A_i^{(j)}$. Based on the consensus information set $A_i^{\mathrm{con}}$, the subvector $x_i^{\mathrm{con}}$ can be constructed. It contains all values of the rating characteristics that are known to at least one rating agency.
Example 2.2. Let $A_i^{(1)} = \{x_{i1}\} \times \{x_{i2}\} \times \mathbb{R}$ as in Example 2.1, and let a second rating agency know only the values $x_{i1}$ and $x_{i3}$, ie, $A_i^{(2)} = \{x_{i1}\} \times \mathbb{R} \times \{x_{i3}\}$. Then the consensus information set corresponding to the information sets of the two rating agencies is
$$A_i^{\mathrm{con}} = A_i^{(1)} \cap A_i^{(2)} = \{x_{i1}\} \times \{x_{i2}\} \times \{x_{i3}\},$$
and $x_i^{\mathrm{con}} = (x_{i1}, x_{i2}, x_{i3})$ is the resulting subvector of the rating characteristics.
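To make the set operation concrete, the following minimal Python sketch (not part of the original paper) represents each agency-specific information set by the characteristic values it actually pins down; components left unknown implicitly correspond to $\mathbb{R}$, so merging the known values is equivalent to intersecting the information sets as in (2.1). All names and numbers below are illustrative assumptions.

```python
def consensus_information(*agency_sets):
    """Merge agency-specific information sets in the spirit of (2.1).

    Each information set is a dict mapping the index of a *known* rating
    characteristic to its value; characteristics not present are unknown
    (their component equals the whole real line). Assumes there is no
    contradictory information.
    """
    merged = {}
    for info in agency_sets:
        for k, value in info.items():
            if k in merged and merged[k] != value:
                raise ValueError(f"contradictory information for characteristic {k}")
            merged[k] = value
    return merged

# Two hypothetical agencies observing different parts of x_i = (1.7, 3, 0)
agency_1 = {0: 1.7, 1: 3}   # knows x_i1 and x_i2
agency_2 = {0: 1.7, 2: 0}   # knows x_i1 and x_i3
print(consensus_information(agency_1, agency_2))   # {0: 1.7, 1: 3, 2: 0}
```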
3 Simulation
Every information set leads to a conditional default probability, eg, $P(Y_i = 1 \mid X_i \in A_i^{(j)})$ for rating agency $j$. A rating agency calculates an estimate for this unknown default probability. Rating aggregation typically means aggregating such different estimates. To get a consensus rating in the sense of estimating the same target, the rating agencies would have to use the same information set, eg, the consensus information set. In this section, different aggregation variants as well as the concept of consensus rating are simulated and compared. Keeping in mind the theoretical considerations from above, it should be clarified whether a real consensus rating performs better than other forms of rating aggregation.
The basic idea for the simulation is as follows. First, an artificial world is constructed by a completely specified logit model, in which all real default probabilities are known. Second, defaults are simulated based on these known probabilities. Third, on this data set, the default probabilities are estimated, given different subsets of the information set that is used at the beginning. Fourth, and finally, the estimated probabilities are aggregated in different ways. Thus, a comparison between the different estimated probabilities and the "true" probabilities is possible. The second to fourth steps are replicated in a Monte Carlo simulation. As a basis of the simulation, the logit function is needed. It is defined as
$$\operatorname{logit}(p) = \ln\frac{p}{1-p}, \qquad p \in (0,1).$$
Its inverse is
$$\operatorname{logit}^{-1}(z) = \frac{\exp(z)}{1+\exp(z)} = \frac{1}{1+\exp(-z)}, \qquad z \in \mathbb{R}.$$
The logit model is chosen here because it is the most-used standard model for scoring purposes (see Anderson (2007, p. 42) or Mays (2004, p. 65)).
In the following, all steps are described in detail.
Step 1. At first, the rating characteristics and the corresponding default probability for all of the debtors are needed. To this end, realizations $x_i$ of independent and identically distributed (iid) random vectors $X_1, \dots, X_n$ are generated from an arbitrary distribution. Next, the true default probability for debtor $i$ is calculated by
$$p_i = \operatorname{logit}^{-1}(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_K x_{iK}), \qquad (3.1)$$
where $\beta_0$ and $\beta_1, \dots, \beta_K$ denote the known parameters.
Step 2. Based on the true default probability $p_i$, one realization $y_i$ of the default variable $Y_i$ is simulated for every debtor $i = 1, \dots, n$.
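A minimal Python sketch of Steps 1 and 2 may help to fix ideas. The number of debtors, the distribution parameters and the coefficients below are assumptions for illustration only; the actual scenario specifications are those reported in Tables 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n = 10_000                                        # number of debtors (assumption)
# Rating characteristics: lognormal, Poisson and Bernoulli regressors
X = np.column_stack([
    rng.lognormal(mean=0.0, sigma=1.0, size=n),   # eg income
    rng.poisson(lam=2.0, size=n),                 # eg number of loans
    rng.binomial(n=1, p=0.5, size=n),             # dichotomous characteristic
])

beta0 = -6.0                                      # intercept (assumption)
beta = np.array([0.4, 0.3, 0.5])                  # coefficients (assumption)

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

p_true = inv_logit(beta0 + X @ beta)              # Step 1: true default probabilities, cf (3.1)
y = rng.binomial(n=1, p=p_true)                   # Step 2: one simulated default per debtor
```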
Step 3.

- (a) Based on the default data from Step 2, the parameters of (3.1) are estimated. With the estimates $\hat\beta_0$ and $\hat\beta_1, \dots, \hat\beta_K$, the default probability is estimated by
$$\hat p_i = \operatorname{logit}^{-1}(\hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_K x_{iK}).$$

- (b) Additionally, some modifications of (3.1) are estimated. The modifications refer to the different rating agency-specific information sets $A_i^{(j)}$, $j = 1, \dots, J$. In detail, rating agency $j$ knows and/or uses only a subvector $x_i^{(j)}$ of $x_i$ with dimension $K_j \le K$. From these subvectors and the default data, a reduced number of parameters is estimated, namely $\beta_0^{(j)}$ and $\beta_1^{(j)}, \dots, \beta_{K_j}^{(j)}$. With the estimates $\hat\beta_0^{(j)}$ and $\hat\beta_1^{(j)}, \dots, \hat\beta_{K_j}^{(j)}$, the agency-specific default probabilities
$$p_i^{(j)} = \operatorname{logit}^{-1}\bigl(\beta_0^{(j)} + \beta_1^{(j)} x_{i1}^{(j)} + \cdots + \beta_{K_j}^{(j)} x_{iK_j}^{(j)}\bigr) \qquad (3.2)$$
are estimated by
$$\hat p_i^{(j)} = \operatorname{logit}^{-1}\bigl(\hat\beta_0^{(j)} + \hat\beta_1^{(j)} x_{i1}^{(j)} + \cdots + \hat\beta_{K_j}^{(j)} x_{iK_j}^{(j)}\bigr). \qquad (3.3)$$

- (c) Last, the consensus information set $A_i^{\mathrm{con}}$ from (2.1) is used for estimation. From the corresponding subvectors $x_i^{\mathrm{con}}$ and the default data, the default probabilities $\hat p_i^{\mathrm{con}}$ are estimated analogously to Step 3(b).
All estimations are done by maximum likelihood estimation.
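The following sketch illustrates Step 3 using the maximum likelihood logit fit of statsmodels as a stand-in for the paper's GAUSS/MAXLIK implementation; it continues the hypothetical setup of the Step 1–2 sketch, and the column subsets standing in for the agency information sets are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

def fit_logit_pd(y, X, cols):
    """ML logit fit on the regressor columns in `cols`; returns estimated PDs, cf (3.3)."""
    X_sub = sm.add_constant(X[:, cols])
    result = sm.Logit(y, X_sub).fit(disp=0)
    return result.predict(X_sub)

# X, y as generated in the Step 1-2 sketch above; the column subsets are hypothetical
agency_cols = [[0, 1], [0, 2], [1, 2]]          # agency-specific subvectors, Step 3(b)
consensus_cols = [0, 1, 2]                      # all characteristics known to at least one agency

p_hat_complete = fit_logit_pd(y, X, [0, 1, 2])                      # Step 3(a)
p_hat_agency = [fit_logit_pd(y, X, cols) for cols in agency_cols]   # Step 3(b)
p_hat_consensus = fit_logit_pd(y, X, consensus_cols)                # Step 3(c)
```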
Step 4. For all debtors $i$, aggregated default probabilities are derived from the rating agencies' estimates $\hat p_i^{(j)}$, $j = 1, \dots, J$, in (3.3). This is the investor's or external perspective, which means no information about the rating characteristics is used.

- (a) An aggregated default probability is calculated as an arithmetic mean:
$$\hat p_i^{\mathrm{am}} = \frac{1}{J}\sum_{j=1}^{J}\hat p_i^{(j)}.$$

- (b) Another form of aggregation is the geometric mean
$$\hat p_i^{\mathrm{gm}} = \Bigl(\prod_{j=1}^{J}\hat p_i^{(j)}\Bigr)^{1/J}.$$

- (c) Taking into account the benchmark idea of Grün et al (2013, p. 82), a third aggregation method based on the so-called Z-scores is used. Formally, the Z-scores are simply the estimated linear predictors of the logit model in (3.3), ie, $\hat z_i^{(j)} = \operatorname{logit}\bigl(\hat p_i^{(j)}\bigr)$. Calculating their arithmetic mean
$$\bar z_i = \frac{1}{J}\sum_{j=1}^{J}\hat z_i^{(j)}$$
finally leads to the corresponding aggregated default probability $\hat p_i^{\mathrm{zs}} = \operatorname{logit}^{-1}(\bar z_i)$.
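The three aggregation rules of Step 4 can be written compactly as follows; the sketch continues the hypothetical arrays from the previous sketches and is not taken from the paper's implementation.

```python
import numpy as np
from scipy.special import logit, expit          # logit and its inverse

# p_hat_agency as computed in the Step 3 sketch above
P = np.column_stack(p_hat_agency)               # shape (n, J): one column per rating agency

p_am = P.mean(axis=1)                           # (a) arithmetic mean
p_gm = np.exp(np.log(P).mean(axis=1))           # (b) geometric mean
p_zs = expit(logit(P).mean(axis=1))             # (c) arithmetic mean of Z-scores, mapped back
```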
Steps 2–4 are replicated $R = 1000$ times, ie, 1000 defaults are simulated for each individual (based on the default probability from Step 1). In contrast to Grün et al (2013), no time dynamics are considered; the simulation covers only one period of time, and this period is replicated independently $R$ times. Typically, one needs to distinguish between point-in-time (PIT) and through-the-cycle (TTC) default probabilities. Basically, the whole aggregation issue can be applied to PIT ratings as well as to TTC ratings. The only requirement is that all single ratings being aggregated refer to the same time horizon. The resulting aggregated rating then refers to the same time horizon as all the single ratings. From Step 1 of the simulation setting, it follows that all debtors are independent. Based on a typical assumption in credit risk, this can be interpreted as a PIT framework, in which the defaults are stochastically independent conditional on an unobservable, systematic risk factor. Actually, the distinction between PIT and TTC is an issue of dependency between different debtors, not between different single ratings referring to one debtor. These single ratings are, of course, highly dependent.
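A rough sketch of the Monte Carlo loop replicating Steps 2–4 (again reusing the hypothetical objects from the sketches above) could look as follows; it collects, for each variant, an array of estimated default probabilities with one row per replication, which is the shape assumed in the later sketches of the quality measures.

```python
import numpy as np
from scipy.special import logit, expit

# rng, p_true, X, fit_logit_pd, agency_cols and consensus_cols as in the sketches above
R = 1000
results = {"con": [], "am": [], "gm": [], "zs": []}

for r in range(R):
    y_r = rng.binomial(n=1, p=p_true)                                          # Step 2
    P = np.column_stack([fit_logit_pd(y_r, X, cols) for cols in agency_cols])  # Step 3(b)
    results["con"].append(fit_logit_pd(y_r, X, consensus_cols))                # Step 3(c)
    results["am"].append(P.mean(axis=1))                                       # Step 4(a)
    results["gm"].append(np.exp(np.log(P).mean(axis=1)))                       # Step 4(b)
    results["zs"].append(expit(logit(P).mean(axis=1)))                         # Step 4(c)

p_hat_mc = {key: np.vstack(vals) for key, vals in results.items()}            # arrays of shape (R, n)
```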
In the sense of Lehmann and Tillich (2016, Section 3), the estimated probabilities $\hat p_i^{\mathrm{con}}$ from Step 3(c) constitute "consensus ratings", because they are based on the consensus information set. In contrast, the aggregated estimates $\hat p_i^{\mathrm{am}}$, $\hat p_i^{\mathrm{gm}}$ and $\hat p_i^{\mathrm{zs}}$ are based on generally different information sets $A_i^{(j)}$. Hence, they should be denoted as "compromise ratings", as mentioned in Lehmann and Tillich (2016, Section 4).
There are three main advantages of the simulation framework above.
- (1)
The true default probability $p_i$ is known and serves as a fixed benchmark.
- (2)
In the artificial world, a real consensus information in the sense of (2.1) is possible, and therewith a real consensus rating exists.
- (3)
The aggregated ratings are based on the same model, estimation method, data set and time horizon. Only the information sets differ. The main interest within this contribution lies in the role of the information set for aggregated ratings. Therefore, the simulation scenarios only differ with regard to the used information set, while the whole framework around them is fixed. As already mentioned in Lehmann and Tillich (2016, pp. 361–362), different models, estimation methods and data sets lead to different estimates of default probabilities. These effects may interfere with the influence of the information set on the rating aggregation.
Simulations as described above are performed with settings as follows.
- Number of rating agencies: $J = 3$.
- Number of debtors: $n$.
- Number of Monte Carlo replications: $R = 1000$.
All simulations were implemented with GAUSS, Version 16.0.1 (seed for random numbers: 1 664 525) using the MAXLIK package, Version 5.0.9. The graphics were generated with R, Version 3.1.2 (R Core Team 2014), and the package ggplot2, Version 1.0.1.
Inspired by Grün et al (2013), who modeled rating aggregation of the three big rating agencies (Standard & Poor's (S&P), Moody's and Fitch), we consider three rating agencies as well. Two basic scenarios of simulations, A and B, are investigated. Their specifications are given in columns 2 and 3 of Tables 2 and 3. Both scenarios contain the same types of distributions for the regressor variables, namely lognormal, Poisson and Bernoulli distributions. The lognormal distribution, producing only positive real numbers, could represent income, for example. The Poisson distribution produces nonnegative integers and could stand for the number of loans or the number of credit cards a person already has. Finally, the Bernoulli distribution, with values 0 and 1, indicates some dichotomous characteristic, such as sex. The coefficients are chosen so as to produce two very different scenarios for the generated default probabilities: scenario A contains much higher default probabilities than scenario B. Thus, scenario B is the more realistic scenario, especially in the case of credit defaults. Nonetheless, in the face of a crisis, scenario A does not seem absurd. Histograms of the true default probabilities for A and B, as calculated in Step 1, are shown in Figure 1. Additionally, Table 1 provides some descriptive statistics.
For both scenarios A and B of default probabilities, four subscenarios are considered. The subscenarios 1–4 differ in the information sets of the rating agencies and therefore in the resulting consensus information set. The rating characteristics used for estimating the default probabilities can be read from the "Choice matrix" column of Tables 2 and 3, as follows. Every column of the choice matrix stands for one of the rating characteristics that are used, ie, every column symbolizes one regressor variable of the logit model in (3.1). Thereby, the first column represents the intercept (ie, the coefficient $\beta_0$), and the remaining columns represent the rating characteristics (with their corresponding coefficients). Every row of the matrix stands for one of the model modifications in (3.1) and (3.2). More precisely, the first row represents the "true-world" model; the second, third and fourth rows represent the different rating agencies; and the fifth row contains the case of the consensus information (see Step 3(c)). The choice matrix contains only ones and zeros, which indicate whether the $k$th regressor variable is used (1) or not (0) within the considered model modification. Within the framework of Section 2, every regressor variable not used by rating agency $j$ corresponds to a component $\mathbb{R}$ in the information set $A_i^{(j)}$ and can therefore be omitted in the estimation. For every regressor variable that is used, the corresponding component is the singleton $\{x_{ik}\}$, ie, the rating agency knows the correct value of debtor $i$. A hypothetical illustration is given in the sketch below.
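As an illustration only (the actual choice matrices are those reported in Tables 2 and 3), a choice matrix and the resulting column selections might look like this:

```python
import numpy as np

# Rows: true-world model (3.1), three rating agencies, consensus information;
# columns: intercept plus three rating characteristics (purely illustrative).
choice = np.array([
    [1, 1, 1, 1],   # "true-world" model
    [1, 1, 1, 0],   # rating agency 1
    [1, 1, 0, 1],   # rating agency 2
    [1, 0, 1, 1],   # rating agency 3
    [1, 1, 1, 1],   # consensus information: a regressor is used if any agency uses it
])
# Regressor columns of X used by each agency (column 0 of the choice matrix is the intercept)
agency_cols = [list(np.flatnonzero(row[1:])) for row in choice[1:4]]
consensus_cols = list(np.flatnonzero(choice[4, 1:]))
```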
Table 1: Descriptive statistics of the true default probabilities.

| Descriptive statistics of true default probabilities | Scenario A | Scenario B |
|---|---|---|
| 25% quantile | 1.069E-02 | 5.472E-06 |
| Median | 7.789E-02 | 1.215E-04 |
| 75% quantile | 3.273E-01 | 1.567E-03 |
| 90% quantile | 6.764E-01 | 9.175E-03 |
4 Results
In the following, comparisons are made between the results of the different approaches from above. At first, the plausibility of the estimated values is assessed by error rates. The error rates describe the portion of debtors for which the true default probability does not lie between the minimum and maximum of the estimated default probabilities over all Monte Carlo replications. For formal notation, let $\#M$ denote the number of elements of an arbitrary set $M$. Then, for each estimation and aggregation variant, the error rate is defined as
$$\mathrm{ER} = \frac{\#\{i : p_i < m_i \ \text{or} \ p_i > M_i\}}{n},$$
with
$$m_i = \min_{r = 1, \dots, R} \hat p_i[r], \qquad M_i = \max_{r = 1, \dots, R} \hat p_i[r],$$
where the term $[r]$ indicates the corresponding value of the $r$th Monte Carlo replication. For instance, $\hat p_i^{\mathrm{con}}[r]$ denotes the estimated default probability of debtor $i$ in Monte Carlo replication $r$ using the consensus information set $A_i^{\mathrm{con}}$.
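A small sketch of how such an error rate could be computed from a matrix of Monte Carlo estimates (one row per replication, one column per debtor); the array names are assumptions carried over from the sketches above, not the paper's code.

```python
import numpy as np

def error_rate(p_true, p_hat_mc):
    """Share of debtors whose true PD lies outside the range of the R estimates.

    p_true: array of shape (n,); p_hat_mc: array of shape (R, n).
    """
    lower = p_hat_mc.min(axis=0)
    upper = p_hat_mc.max(axis=0)
    return ((p_true < lower) | (p_true > upper)).mean()

# eg error_rate(p_true, p_hat_mc["con"]) for the consensus rating
```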
Table 2: Rating characteristics, choice matrices and the resulting error rates for scenarios A1–A4. In all A scenarios, the regressor variables are mutually independent.
Table 3: Rating characteristics, choice matrices and the resulting error rates for scenarios B1–B4. In all B scenarios, the regressor variables are mutually independent.
Remark 4.1.
The most frequently used measures of scoring quality include the accuracy ratio, the area under the curve, the Kolmogorov–Smirnov statistic and the conditional information entropy ratio. However, these measures focus on discriminatory aspects rather than on the precision of default probabilities (see Tasche 2008), which is the main target of this contribution. Thus, we use measures that aim at precision.
In the following, some important results from and remarks on Tables 2 and 3 are reported.
- (1)
Basically, all the results of the aggregated ratings are quite unsatisfactory, judging by the obtained error rates. Error rates of about 80% are quite high, and such outcomes do not seem appropriate for practical use. Additionally, note that the error rate as defined above is a rather rough measure of quality.
- (2)
Regarding the aggregation method, neither the arithmetic mean, the geometric mean nor the Z-score mean seems to be preferable in scenarios A1–A4.
In A4, the smallest amount of information about the rating characteristics is used; in particular, the consensus information (six out of nine regressor variables) is not identical to the complete information, unlike in scenarios A2 and A3. The same holds in scenario A1. But, interestingly, the error rate in A1 is substantially lower than in A4. This illustrates the influence of single regressor variables on the linear predictor of the logit model. This influence on the linear predictor and the outcome can also be seen in A3, where only one variable is missing for every rating agency. These small differences lead to very different error rates. The varying outcomes depend strongly on the type of the regressor variables as well as on their interplay with the linear predictor.
- (3)
The error rates of the B scenarios are markedly lower than those of the A scenarios. In all B scenarios (with one exception in B1), the single ratings perform better than the aggregation with the arithmetic mean. Because of the complex structure of the logit model, it is very difficult to explain this phenomenon theoretically from the model background. Further, the extent of the simulations above is too small to derive any clear tendency from the empirical outcome.

Using the geometric mean or the Z-score for aggregation leads to lower error rates than the arithmetic mean, but it does not improve on the single ratings in every case (see B4).
Calculating only the error rates does not take into account the precision of the results. Having small error rates is not enough: the estimated default probability should also have a small distance to the true default probability. This is assessed by the mean distance between the true probability and the corresponding estimated or aggregated probability:
$$\mathrm{MAD}_i = \frac{1}{R}\sum_{r=1}^{R}\bigl|\hat p_i[r]-p_i\bigr|, \qquad (4.1)$$
$$\mathrm{MAD}_i^{\mathrm{con}} = \frac{1}{R}\sum_{r=1}^{R}\bigl|\hat p_i^{\mathrm{con}}[r]-p_i\bigr|, \qquad (4.2)$$
$$\mathrm{MAD}_i^{\mathrm{am}} = \frac{1}{R}\sum_{r=1}^{R}\bigl|\hat p_i^{\mathrm{am}}[r]-p_i\bigr|, \qquad (4.3)$$
$$\mathrm{MAD}_i^{\mathrm{gm}} = \frac{1}{R}\sum_{r=1}^{R}\bigl|\hat p_i^{\mathrm{gm}}[r]-p_i\bigr|, \qquad (4.4)$$
$$\mathrm{MAD}_i^{\mathrm{zs}} = \frac{1}{R}\sum_{r=1}^{R}\bigl|\hat p_i^{\mathrm{zs}}[r]-p_i\bigr|, \qquad (4.5)$$
where the term $[r]$ again indicates the $r$th Monte Carlo replication, and (4.1) refers to the estimation under complete information. Because this contribution is on aggregation issues, the corresponding quantities for the single rating agencies are omitted. Tables 4 and 5 contain an overview of the descriptive statistics of the mean absolute deviations (4.1)–(4.5) for all simulation scenarios.
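The mean absolute deviations per debtor can be computed from the same kind of replication-by-debtor array as before; again, this is only an illustrative sketch using the assumed names from the earlier sketches.

```python
import numpy as np

def mean_absolute_deviation(p_true, p_hat_mc):
    """One MAD value per debtor: average of |estimate - true PD| over the R replications."""
    return np.abs(p_hat_mc - p_true).mean(axis=0)

# Descriptive statistics as reported in Tables 4 and 5, eg for the geometric mean:
# np.percentile(mean_absolute_deviation(p_true, p_hat_mc["gm"]), [25, 50, 75])
```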
Table 4: Descriptive statistics of the mean absolute deviations (4.1)–(4.5) for scenarios A1–A4.

| Scenario | Descriptive statistics | Complete information (4.1) | Consensus (4.2) | Arithmetic mean (4.3) | Geometric mean (4.4) | Z-score (4.5) |
|---|---|---|---|---|---|---|
| A1 | 1st quartile | 1.892E-03 | 6.410E-03 | 2.356E-02 | 2.143E-02 | 2.955E-02 |
| | Median | 8.944E-03 | 1.942E-02 | 6.356E-02 | 6.199E-02 | 7.553E-02 |
| | 3rd quartile | 2.073E-02 | 6.856E-02 | 1.890E-01 | 1.638E-01 | 1.977E-01 |
| A2 | 1st quartile | 1.892E-03 | 1.892E-03 | 1.601E-02 | 1.352E-02 | 1.764E-02 |
| | Median | 8.944E-03 | 8.944E-03 | 5.150E-02 | 4.267E-02 | 5.404E-02 |
| | 3rd quartile | 2.073E-02 | 2.073E-02 | 1.384E-01 | 1.112E-01 | 1.383E-01 |
| A3 | 1st quartile | 1.892E-03 | 1.892E-03 | 1.126E-02 | 6.522E-03 | 1.136E-02 |
| | Median | 8.944E-03 | 8.944E-03 | 2.996E-02 | 2.164E-02 | 2.545E-02 |
| | 3rd quartile | 2.073E-02 | 2.073E-02 | 7.388E-02 | 5.596E-02 | 5.092E-02 |
| A4 | 1st quartile | 1.892E-03 | 1.778E-02 | 3.453E-02 | 2.932E-02 | 3.860E-02 |
| | Median | 8.944E-03 | 6.087E-02 | 1.208E-01 | 1.101E-01 | 1.401E-01 |
| | 3rd quartile | 2.073E-02 | 1.905E-01 | 2.194E-01 | 2.012E-01 | 2.403E-01 |
Table 5: Descriptive statistics of the mean absolute deviations (4.1)–(4.5) for scenarios B1–B4.

| Scenario | Descriptive statistics | Complete information (4.1) | Consensus (4.2) | Arithmetic mean (4.3) | Geometric mean (4.4) | Z-score (4.5) |
|---|---|---|---|---|---|---|
| B1 | 1st quartile | 1.114E-05 | 1.114E-05 | 9.257E-05 | 2.957E-05 | 4.300E-05 |
| | Median | 1.385E-04 | 1.385E-04 | 4.958E-04 | 2.248E-04 | 3.180E-04 |
| | 3rd quartile | 1.090E-03 | 1.090E-03 | 2.085E-03 | 1.245E-03 | 1.645E-03 |
| B2 | 1st quartile | 1.114E-05 | 1.114E-05 | 7.891E-04 | 2.079E-04 | 3.000E-04 |
| | Median | 1.385E-04 | 1.385E-04 | 1.867E-03 | 7.203E-04 | 1.103E-03 |
| | 3rd quartile | 1.090E-03 | 1.090E-03 | 4.267E-03 | 1.628E-03 | 2.450E-03 |
| B3 | 1st quartile | 1.114E-05 | 1.114E-05 | 1.191E-03 | 3.337E-04 | 4.820E-04 |
| | Median | 1.385E-04 | 1.385E-04 | 2.734E-03 | 1.103E-03 | 1.616E-03 |
| | 3rd quartile | 1.090E-03 | 1.090E-03 | 5.521E-03 | 2.295E-03 | 3.419E-03 |
| B4 | 1st quartile | 1.114E-05 | 2.960E-05 | 2.695E-04 | 1.325E-04 | 1.910E-04 |
| | Median | 1.385E-04 | 3.218E-04 | 1.074E-03 | 7.159E-04 | 1.013E-03 |
| | 3rd quartile | 1.090E-03 | 2.084E-03 | 6.023E-03 | 3.656E-03 | 5.493E-03 |
In the following, some important results from and remarks on Tables 4 and 5 are reported.
- (1)
- (2)
The mean absolute deviations in the A scenarios are bigger than in the B scenarios, which is probably mainly caused by the different magnitudes of the true default probabilities (see Figure 1). In comparison with the estimation based on complete information, the mean deviations (referring to the corresponding quantiles of the descriptive statistics) are mostly higher by up to a factor of 12 in the A scenarios, and by up to a factor of 7 in the B scenarios. In relation to the very small default probabilities, this is not necessarily a problem if the estimated default probabilities are transformed onto an ordinal scale with rating classes; such small default probabilities typically fall into the best rating class.
- (3)
Out of all the A scenarios, A4 contains the biggest deviations, regardless of the aggregation method. In A4, the smallest amount of information is used; in particular, the consensus information (six out of nine regressor variables) is not identical to the complete information, unlike in scenarios A2 and A3. In scenario A1, too, the consensus information does not equal the complete information.
- (4)
Regarding the aggregation, the A scenarios as well as the B scenarios show higher absolute deviations for the arithmetic mean and the Z-score aggregation than for the geometric mean. The Z-score aggregation shows higher mean absolute deviations than the arithmetic mean in the A scenarios, and smaller ones in the B scenarios.
The absolute values of the deviations still hide the direction of the error that is made. In order to get some insight into this issue, the rates of underestimated default probabilities are investigated. For the estimation under complete information, this underestimation rate is defined as
$$\mathrm{UR}_i = \frac{\# U_i}{R}, \qquad \text{with } U_i = \{r : \hat p_i[r] < p_i\}.$$
Analogously, the underestimation rates $\mathrm{UR}_i^{\mathrm{con}}$, $\mathrm{UR}_i^{\mathrm{am}}$, $\mathrm{UR}_i^{\mathrm{gm}}$ and $\mathrm{UR}_i^{\mathrm{zs}}$ are defined with the sets
$$U_i^{\mathrm{con}} = \{r : \hat p_i^{\mathrm{con}}[r] < p_i\}, \quad U_i^{\mathrm{am}} = \{r : \hat p_i^{\mathrm{am}}[r] < p_i\}, \quad U_i^{\mathrm{gm}} = \{r : \hat p_i^{\mathrm{gm}}[r] < p_i\}, \quad U_i^{\mathrm{zs}} = \{r : \hat p_i^{\mathrm{zs}}[r] < p_i\}.$$
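An illustrative sketch of the underestimation rate per debtor (the share of replications with an estimate below the true probability), again using the assumed replication-by-debtor arrays from the sketches above:

```python
import numpy as np

def underestimation_rate(p_true, p_hat_mc):
    """One value per debtor: share of the R replications with an estimate below the true PD."""
    return (p_hat_mc < p_true).mean(axis=0)
```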
Figure 2 illustrates all these underestimation rates by means of hexbinplots. Hexbinplots are a kind of aggregated scatterplot. Where there is no data, there is no hexbin. The more data there is in a hexbin, the darker it is. Since the main target here is insight into aggregation issues, the single rating agencies are not considered for the problem of underestimation. Apart from that, the graphs for the single rating agencies do not provide any additional information on or new insights into the basic problem of aggregation.
For the hexbinplots in Figure 2, the following observations can be made.
- (1)
Note the difference between the empirical distributions of the true default probabilities in the A and B scenarios for the following remarks. A large number of low default probabilities leads to high absolute frequencies near zero along the horizontal axes. The main target of the hexbinplots is the illustration of the underestimation rates depending on the true default probabilities. Therefore, every hexbinplot has to be read mainly in a “vertical” direction.
- (2)
If there is no tendency toward overestimation or underestimation, the empirical distributions of the underestimation rates are mainly concentrated around 0.5. This can be seen when the models are estimated with complete information.
- (3)
In the A and B scenarios, there is a tendency to overestimate the low default probabilities (underestimation rates near zero) and to underestimate the higher default probabilities (underestimation rates near one). When default probabilities lie near the bounds of the interval $(0,1)$, there is simply more room for probability estimates that overestimate low probabilities and underestimate high probabilities.
- (4)
From the aggregation perspective, there seems to be no preferable kind of aggregation regarding under- or overestimation. The plots for the different aggregation methods are almost equal for all scenarios. If consensus information and complete information are not identical (see scenarios A1, A4 and B4), the plot for the consensus rating is also quite similar to them.
- (5)
As mentioned above, nearly all scenarios show higher absolute deviations and higher error rates for the arithmetic mean than the geometric mean. This fact becomes clear when we take a look at the underestimation rates. For positive values, the arithmetic mean is an upper bound for the geometric mean (see Hardy et al 1988, Theorem 9, p. 17). If there is a tendency to overestimate the low default probabilities, which form the majority in the scenarios here (see Figure 1), an aggregation by the geometric mean acts like a correction under such conditions. As a consequence, this is one possible explanation for the corresponding lower error rates in Tables 2 and 3.
- (6)
The outcome of the aggregation based on the Z-score depends on the properties of the inverse logit function in (3.1), which maps Z-scores to probabilities. Because of its S-shaped form, the inverse logit function has a convex part as well as a concave part. By Jensen's inequality (see Hardy et al 1988, Theorem 90, p. 74), the aggregation by the arithmetic mean is an upper bound for the Z-score aggregation where the inverse logit function is convex, and a lower bound where it is concave. The inverse logit function is strictly increasing and has an inflection point at zero. As a consequence, all estimated default probabilities smaller than 0.5 are transformed within the convex part. Having a vast majority of low default probabilities implies that the Z-scores are transformed mainly within the convex part. This is a possible explanation for the similarity of outcomes between the geometric mean and the Z-score aggregation, because the Z-score aggregation works as a correction quite similar to the geometric mean described above, especially in the B scenarios. The A scenarios contain a bigger portion of default probabilities above 0.5 (see Table 1) that are transformed within the concave part. This could explain why nearly all A scenarios show higher mean absolute deviations with the Z-score aggregation than with the arithmetic mean. A short numerical sketch of these relations follows this list.
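The following illustrative sketch (with made-up probability values) reproduces the orderings discussed in remarks (5) and (6): below 0.5 both the geometric mean and the Z-score mean fall below the arithmetic mean, while above 0.5 the Z-score mean exceeds it.

```python
import numpy as np
from scipy.special import logit, expit

for p in (np.array([0.01, 0.03, 0.10]),      # low PDs: convex part of the inverse logit
          np.array([0.60, 0.75, 0.90])):     # high PDs: concave part
    am = p.mean()                            # arithmetic mean
    gm = np.exp(np.log(p).mean())            # geometric mean (always <= am)
    zs = expit(logit(p).mean())              # Z-score mean mapped back to a probability
    print(f"AM={am:.4f}  GM={gm:.4f}  ZS={zs:.4f}")
```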
Finally, it should be mentioned that all of the above simulations were also performed with 100 replications. The results remain essentially unchanged compared with the setting of 1000 Monte Carlo replications reported in detail above.
5 Conclusion
From the theoretical considerations and simulations above, there is no preferable strategy for obtaining a more precise rating by aggregating single ratings. Neither the single ratings nor their aggregated forms necessarily lead to improvements. Only the consensus rating provides an advantage here, as expected from the theoretical considerations. But the implementation of a consensus rating remains a theoretical issue, as already described. Comparing only the "real" aggregated ratings, the aggregation by the geometric mean performs slightly better than the arithmetic mean, but this may be due to the overestimation of low default probabilities in the simulation framework used here. Therefore, a generalization does not seem appropriate. With respect to the error rates, aggregating the ratings via the Z-score offers a performance similar to that of the geometric mean. But in the case of default probabilities above 0.5, its mean absolute deviations exceed those of the arithmetic mean as well as those of the geometric mean. Such effects are probably driven by the relations between the different mean concepts in combination with the properties of the inverse logit function. For practical purposes, default probabilities above 0.5 are not so important. Therefore, we would recommend aggregating with the geometric mean. But, as already said, a generalization is not indicated by the outcomes above.
Every combination of a model, an estimation method and an aggregation method causes different outcomes and shows its own interplay. Further, the quality measures used here (mean absolute deviation, error rate, underestimation rate) do not indicate any practical use of the aggregated estimated default probabilities. Taking into account time dynamics, as considered in Grün et al (2013), the whole issue becomes even more complicated. Based on this one-period simulation study, better results in terms of precision cannot currently be expected in a multiperiod case. As a consequence, from a theoretical point of view and in light of the simulation results, the aggregation of single ratings to obtain a higher precision seems to be a questionable issue, at least in the case of the logit model.
Further research could concentrate on a formal proof of the conjecture that achieving a higher precision by means of rating aggregation is impossible, or, conversely, on proving that it is possible. In doing so, different levels of consideration could be taken into account. The first level is probability theory: does it provide any approach to proving or disproving that different conditional probabilities resulting from different models can be combined in a reasonable way? The second level is the estimation problem: does estimation theory provide any approach to proving or disproving that different estimates from different estimation methods can be merged in a reasonable way?
As long as there is no proof that achieving a higher precision by means of rating aggregation is impossible, further research could concentrate on studying other models and methods for aggregation. For instance, machine learning techniques such as artificial neural networks, or some reasonably weighted mean, could come into consideration. Additionally, it could be investigated whether the aggregation methods based on the arithmetic mean, the geometric mean or the Z-score, as used in this contribution, can be improved, or whether one of them outperforms locally, ie, whether there are situations in which one particular aggregation method generally works best.
Further, the development of alternative measures to evaluate the aggregation would be another interesting aspect to study, because the common measures from credit scoring are not appropriate with respect to the precision of aggregated ratings. The bottom line is this: besides the theoretical perspective, from which rating aggregation seems questionable, finding an appropriate heuristic for rating aggregation that provides outcomes with sufficient improvement for practical use has not been precluded.
Declaration of interest
The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.
Acknowledgements
We thank Stefan Huschens and two anonymous referees for valuable comments.
References
Anderson, R. (2007). The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation. Oxford University Press, New York.
Grün, B., Hofmarcher, P., Hornik, K., Leitner, C., and Pichler, S. (2013). Deriving consensus ratings of the big three rating agencies. The Journal of Credit Risk 9(1), 75–98 (http://doi.org/brnk).
Hardy, G., Littlewood, J. E., and Pólya, G. (1988). Inequalities, 2nd edn. Cambridge University Press.
Lehmann, C., and Tillich, D. (2016). Consensus information and consensus rating: a note on methodological problems of rating aggregation. In Operations Research Proceedings 2014: Selected Papers of the Annual International Conference of the German Operations Research Society (GOR), RWTH Aachen University, Germany, September 2–5, 2014, Lübbecke, M., Koster, A., Letmathe, P., Madlener, R., Peis, B., and Walther, G. (eds), pp. 357–362. Springer (http://doi.org/brnm).
Mays, E. (2004). Scorecard development. In Credit Scoring for Risk Managers: The Handbook for Lenders, Mays, E. (ed), Chapter 4, pp. 63–89. Thomson South-Western, Mason, OH.
R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.
Tasche, D. (2008). Validation of internal rating systems and PD estimates. In The Analytics of Risk Model Validation, Christodoulakis, G., and Satchell, S. (eds), Chapter 11. Academic Press, London.