On modeling zero-inflated insurance data

J. M. Pérez Sánchez; E. Gómez-Déniz

In many applications involving discrete data, the observed frequency of zeros is significantly higher than that predicted by the assumed model. The problem of observing a high proportion of zeros has been of interest in data analysis and modeling in many fields, such as medicine, engineering, manufacturing, economics, public health, road safety, epidemiology and, in particular, actuarial science. Models that accommodate such an excess of zeros are known as zero-inflated models. Poisson regression models provide a standard framework for the analysis of count data. However, count data is often overdispersed relative to the Poisson distribution. One frequent source of overdispersion is that the incidence of zero counts is greater than expected under the Poisson distribution; this is of interest because zero counts frequently have a special status. For example, in counting claims from policyholders, a policyholder may have no claims either because they are a good driver or simply because no risk events have occurred “near” their driving. This is the distinction between structural zeros, which are (almost) inevitable, and sampling zeros, which occur by chance. On the other hand, as has been pointed out by Denuit et al (2009), overdispersion leads to underestimates of standard errors and overestimates of chi-squared statistics. This could have serious consequences. For example, some explanatory variables may cease to be significant after overdispersion has been accounted for.

Over the last few decades, there has been considerable interest in models for count data that allow for excess zeros, particularly in the econometric literature. Mullahy (1986) explores the specification and testing of some modified count data models. Lambert (1992) provides a manufacturing defects application of these models and discusses the case of zero-inflated Poisson (ZIP) models. Gupta et al (1996) provide a general analysis of zero-inflated models. Gurmu (1997) develops a semi-parametric estimation method for hurdle (two-part) count regression models. Ridout et al (1998) consider the problem of modeling count data with excess zeros and review some possible models. Hall (2000) adapts Lambert’s methodology to an upper-bounded count situation, thereby obtaining a zero-inflated binomial (ZIB) model. Ghosh et al (2006) use a Bayesian estimation method to introduce a flexible class of zero-inflated models that includes other familiar models, such as ZIP models, as special cases. An overview of count data in econometrics, including zero-inflated models, is provided in Cameron and Trivedi (1998). In insurance, Yip and Yau (2005) obtain a better fit to their insurance data by using zero-inflated count models. Boucher et al (2007) review zero-inflated and hurdle models with applications to data from a Spanish insurance company. More recently, Mouatassim and Ezzahid (2012) analyzed zero-inflated models with an application to private health insurance data.

In this paper, we use power series distributions to develop a novel and flexible zero-inflated Bayesian methodology. We employ sampling-based methods in order to model an automobile insurance data set. This model allows us to accommodate both an excessive number of zero counts and overdispersion. The Bayesian approach allows model validation to be conducted with the posterior distribution, ie, by taking prior beliefs and the sample data into account. Recently, this methodology has been used as an alternative to traditional approaches to validating models in credit risk portfolios (see Jacobs et al 2015; Parnes 2015). The validation process of a model includes a comparison with other standard and Bayesian models by analyzing the significant factors and some information criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the deviance information criterion (DIC). The last of these is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation, as occurs in our model (see Spiegelhalter et al (2002, 2014) for further details).

The structure of this paper is as follows. Section 2 provides the zero-inflated power series distributions and the new Bayesian model proposed here. Section 3 looks at an automobile insurance application, and Section 4 briefly concludes.

2 Modeling zero-inflated data

It is known that power series distributions form a useful subclass of one-parameter discrete exponential families suitable for modeling count data. Since the original works of Kosambi (1949) and Noack (1950), the power series distribution has been very popular in the statistical literature dealing with discrete distributions that belong to this simple class. Two references concerning these features are Patil (1962a, 1962b). A review of the power series distribution can be found in Johnson et al (2005), Chapter 2.

The probability function of the power series distribution is

\Pr(X=x)=\frac{b(x)\theta^{x}}{f(\theta)},\quad x=0,1,\dots,

(2.1)

where $b(x)\theta^{x}\geq 0$ , $b(x)$ is a function of $x$ or constant, $f(\theta)=\sum_{x=0}^{\infty}b(x)\theta^{x}$ is convergent and $\theta>0$ is the power parameter of the distribution. The family of discrete distributions defined in (2.1) includes a broad class of known distributions, including the Poisson, binomial, negative binomial, logarithmic series and Conway–Maxwell–Poisson distributions. After computing the probability generating function, which is given by $G_{X}(z)=f(z\theta)/f(\theta)$ , $|z|\leq 1$ , it is easy to see that the mean and variance of the power series distribution are as follows:

E(X)=\mu=\frac{\theta f^{\prime}(\theta)}{f(\theta)},

(2.2)

\operatorname{var}(X)=\sigma^{2}=\frac{\theta^{2}f^{\prime\prime}(\theta)}{f(\theta)}+\mu(1-\mu).

(2.3)

Thus, the index of dispersion,

\mathrm{ID}=\frac{\sigma^{2}}{\mu}=1+\frac{\theta f^{\prime\prime}(\theta)}{f^{\prime}(\theta)}-\mu,

(2.4)

allows for overdispersion when $\theta f^{\prime\prime}(\theta)/f^{\prime}(\theta)-\mu>0$. For example, when the Poisson distribution is considered, we have that $f(\theta)=\exp(\theta)$, so that $\theta f^{\prime\prime}(\theta)/f^{\prime}(\theta)=\mu=\theta$ and $\mathrm{ID}=1$, ie, we get equidispersion. If $f(\theta)=(1-\theta)^{-r}$, $r>0$, $0<\theta<1$, the distribution in (2.1) reduces to the negative binomial distribution; from (2.4), we get that $\mathrm{ID}=1+\theta/(1-\theta)>1$, and the overdispersion phenomenon is obtained. Observe that for the binomial and negative binomial cases the corresponding additional parameters, usually called $n>0$ and $r>0$, are considered to be nuisance parameters.
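As a quick numerical check of (2.2)–(2.4), the following minimal sketch evaluates the mean, variance and index of dispersion from the series function $f$ and its first two derivatives. The helper name `ps_moments` and the parameter values are illustrative assumptions, and the negative binomial case is taken with the usual series function $f(\theta)=(1-\theta)^{-r}$.

```python
import math

def ps_moments(theta, f, f1, f2):
    """Mean, variance and index of dispersion from (2.2)-(2.4).

    f, f1 and f2 are the series function f(theta) and its first two
    derivatives (hypothetical helper, not part of the paper).
    """
    mu = theta * f1(theta) / f(theta)
    var = theta**2 * f2(theta) / f(theta) + mu * (1.0 - mu)
    return mu, var, var / mu

# Poisson: f(theta) = exp(theta), so f = f' = f''.
poisson = ps_moments(0.5, math.exp, math.exp, math.exp)

# Negative binomial: f(theta) = (1 - theta)^(-r), 0 < theta < 1.
r = 3.0  # illustrative value
nb = ps_moments(
    0.4,
    lambda t: (1 - t) ** (-r),
    lambda t: r * (1 - t) ** (-r - 1),
    lambda t: r * (r + 1) * (1 - t) ** (-r - 2),
)

print(poisson)  # index of dispersion equal to 1 (equidispersion)
print(nb)       # index of dispersion equal to 1 / (1 - theta) > 1
```

The Poisson case returns an index of dispersion of 1, while the negative binomial case returns $1+\theta/(1-\theta)$, in line with the discussion around (2.4).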

Starting with a distribution belonging to the power series family, a more flexible model can be constructed by adding a parameter that inflates the probability of zero when the empirical data shows such inflation. Thus, the zero-inflated power series distribution contains two parameters. The first parameter, $\omega$, indicates an inflation of zeros, and the other parameter, $\theta$, is that of the power series distribution. A zero-inflated power series distribution is a mixture of a power series distribution and a degenerate distribution at zero, with a mixing probability $\omega$ for the degenerate distribution. As Johnson et al (2005) point out, a very simple alternative for modeling this setting is to add an arbitrary proportion of zeros, decreasing the remaining frequencies in an appropriate manner. In conclusion, zero-inflated models deal with the problem that the data displays a higher number of zeros (nonclaims in our case), and they are therefore appropriate for modeling counts that encounter disproportionately large frequencies of zeros.

If we start with a discrete distribution $\Pr(Y=y)$ , we can build a zero-inflated distribution in a simple form (see Cohen 1966) by assuming

\Pr(Y=y;\omega)=\begin{cases}\omega+(1-\omega)\Pr(Y=0),&y=0,\\ (1-\omega)\Pr(Y=y),&y\neq 0,\end{cases}

(2.5)

where $\Pr(Y=y)$, $y=0,1,\dots$, is the parent distribution, and

-\frac{\Pr(Y=0)}{1-\Pr(Y=0)}<\omega<1.

(2.6)

This last inequality allows the distribution to be well defined for certain negative values of $\omega$ . The downside to this representation of the support of $\omega$ instead of the usual $0\leq\omega\leq 1$ is that the mixing interpretation is lost; however, in practice, the $\omega$ parameter can incorporate negative values into the support given in (2.6), and therefore (2.5) is genuine (see, for example, Bhattacharya et al 2008). Later, we will see that this is the case for the data considered here.
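The construction in (2.5) and the admissible range in (2.6) can be sketched as follows. The function names and the Poisson parent with $\theta=0.5$ are illustrative assumptions; any parent probability function could be passed in.

```python
import math

def omega_lower_bound(p0):
    """Lower limit of omega in (2.6) for a parent with Pr(Y=0) = p0 < 1."""
    return -p0 / (1.0 - p0)

def zero_inflated_pmf(y, omega, parent_pmf):
    """Zero-inflated probability (2.5) built from an arbitrary parent pmf."""
    p0 = parent_pmf(0)
    if not omega_lower_bound(p0) < omega < 1.0:
        raise ValueError("omega outside the range allowed by (2.6)")
    base = (1.0 - omega) * parent_pmf(y)
    return omega + base if y == 0 else base

def poisson_pmf(y, theta=0.5):
    """Illustrative parent distribution: Poisson with mean theta."""
    return math.exp(-theta) * theta**y / math.factorial(y)

lower = omega_lower_bound(poisson_pmf(0))
for omega in (0.3, -0.5):  # one inflating and one deflating value
    probs = [zero_inflated_pmf(y, omega, poisson_pmf) for y in range(60)]
    print(omega, round(probs[0], 4), round(sum(probs), 6))
```

With a negative $\omega$ inside the range (2.6), the probabilities remain nonnegative and still sum to one, although the mixture interpretation no longer applies.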

So, the probability mass function of the zero-inflated power series distribution, $\mathrm{ZIPS}(\omega,\theta)$, results in

\Pr(Y_{i}=y_{i};\omega)=\begin{cases}\omega+(1-\omega)\dfrac{b(0)}{f(\theta)},&y_{i}=0,\\ (1-\omega)\dfrac{b(k)\theta^{k}}{f(\theta)},&y_{i}=k\neq 0,\end{cases}

(2.7)

where $f(\theta)=\sum_{k=0}^{\infty}b(k)\theta^{k}$ , $0\leq\omega<1$ and $\theta>0$ . The mean and variance are

E(y_{i};\omega)=(1-\omega)\mu,

(2.8)

\operatorname{var}(y_{i};\omega)=(1-\omega)(\sigma^{2}+\omega\mu^{2}),

(2.9)

where $\mu$ and $\sigma^{2}$ denote the mean and variance of the power series distributions given in (2.2) and (2.3), respectively.
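The moment formulas (2.8) and (2.9) can be verified numerically by summing over the probability mass function in (2.7). The sketch below does this for a Poisson parent; the values of $\omega$ and $\theta$ and the truncation point of the support are arbitrary choices made for illustration.

```python
import math

def zip_pmf(y, omega, theta):
    """ZIPS pmf (2.7) with a Poisson parent: b(k) = 1/k!, f(theta) = exp(theta)."""
    base = (1.0 - omega) * math.exp(-theta) * theta**y / math.factorial(y)
    return omega + base if y == 0 else base

omega, theta = 0.25, 1.8   # illustrative values
support = range(100)       # truncation; the tail mass is negligible here

mean = sum(y * zip_pmf(y, omega, theta) for y in support)
var = sum((y - mean) ** 2 * zip_pmf(y, omega, theta) for y in support)

mu, sigma2 = theta, theta  # mean and variance of the Poisson parent
print(mean, (1 - omega) * mu)                       # both sides of (2.8)
print(var, (1 - omega) * (sigma2 + omega * mu**2))  # both sides of (2.9)
```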

Now, zero-inflated forms assuming different count distributions belonging to the power series distribution can be defined easily. Gupta et al (1996) and Ghosh et al (2006), for example, investigated the zero-inflated form of the generalized Poisson distribution.

Maximum-likelihood estimators of $\omega$ and $\theta$ can be obtained by maximizing $\log\ell(\omega,\theta;y_{i})$, $i=1,\dots,n$, with respect to $\omega$ and $\theta$, where

\ell(\omega,\theta;y_{i})=\bigg[\omega+(1-\omega)\frac{b(0)}{f(\theta)}\bigg]^{n_{0}}\prod_{i\colon y_{i}>0}(1-\omega)\frac{b(y_{i})\theta^{y_{i}}}{f(\theta)}.

(2.10)

Here, $n$ is the sample size and $\smash{n_{0}}$ is the number of zero counts in the sample. Following Ghosh et al (2006), and using binomial expansion, the likelihood function in (2.10) can be written as

\ell(\omega,\theta;y_{i})\propto\theta^{n\bar{x}}\sum_{j=0}^{n_{0}}\binom{n_{0}}{j}\omega^{j}(1-\omega)^{n-j}\bigg[\frac{b(0)}{f(\theta)}\bigg]^{n-j}.

(2.11)

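After obtaining the normal equations, we have to solve

\bar{x}_{+}-\frac{\theta f^{\prime}(\theta)}{f(\theta)-b(0)}=0

to get the maximum-likelihood estimate of $\theta$, where $\bar{x}_{+}=n\bar{x}/(n-n_{0})$ is the sample mean of the nonzero observations and $\bar{x}$ is the overall sample mean. Once $\theta$ is obtained, the parameter $\omega$ follows from matching the fitted probability of zero to the observed proportion $n_{0}/n$, which gives

\omega=1-\frac{(n-n_{0})f(\theta)}{n[f(\theta)-b(0)]}.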

Therefore, the maximum-likelihood estimation of the parameters under the power series distribution is simple. In a similar manner, the regression coefficients when covariates are implemented in the model can also be obtained in a simple way.
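To make the two estimation steps concrete, here is a minimal sketch for the Poisson case, where $f(\theta)=\exp(\theta)$ and $b(0)=1$. It reads the sample mean in the estimating equation as the mean of the nonzero counts and sets $\omega$ so that the fitted probability of zero equals the observed proportion $n_{0}/n$; both readings, the function name `zip_mle` and the toy counts are assumptions of this sketch.

```python
import math

def zip_mle(counts):
    """Maximum-likelihood fit of an intercept-only ZIP model (illustrative)."""
    n = len(counts)
    n0 = sum(1 for y in counts if y == 0)
    pos_mean = sum(counts) / (n - n0)  # mean of the nonzero counts

    # Solve pos_mean = theta / (1 - exp(-theta)) by bisection. The right-hand
    # side increases from 1 (theta -> 0), so a root needs pos_mean > 1.
    def g(t):
        return t / (-math.expm1(-t)) - pos_mean

    lo, hi = 1e-10, pos_mean
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    theta = 0.5 * (lo + hi)

    # Match the fitted zero probability to the observed proportion n0 / n.
    omega = 1.0 - (n - n0) / (n * (-math.expm1(-theta)))
    return omega, theta

# Hypothetical claim counts, not taken from the data set analyzed in the paper.
counts = [0] * 80 + [1] * 12 + [2] * 6 + [3] * 2
omega_hat, theta_hat = zip_mle(counts)
print(round(omega_hat, 4), round(theta_hat, 4))
```

By construction, the fitted mean $(1-\hat{\omega})\hat{\theta}$ reproduces the sample mean and the fitted probability of zero reproduces $n_{0}/n$. Note that $\hat{\omega}$ can come out negative when the sample has fewer zeros than the fitted parent distribution implies.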

2.1 Including covariates

In practice, practitioners usually use a data set with commonly available exogenous covariates in order to explain the variable $Y_{i}$ , known in this case as the endogenous variate. That is, suppose that for the $i$ th observation the covariates $x$ and $z$ are available. In order to adapt the model to this framework, we need to relate these covariates with endogenous variables via the parameters $\theta$ and $\omega$ . This can be done through the following links:

\theta_{i}=\exp(x^{\mathrm{T}}\vec{\beta}),

\log\bigg(\frac{\omega_{i}}{1-\omega_{i}}\bigg)=z^{\mathrm{T}}\vec{\gamma},

with $\vec{\beta}^{\mathrm{T}}=(\beta_{1},\dots,\beta_{k})$ and $\vec{\gamma}^{\mathrm{T}}=(\gamma_{1},\dots,\gamma_{k})$ vectors of unknown regression parameters associated with the covariates. Of course, in practice it is common to suppose that the design matrices $X$ and $Z$ are the same.
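The two links can be sketched as below: a log link maps the linear predictor to $\theta_{i}>0$, and a logit link maps it to $\omega_{i}\in(0,1)$. The covariate vector and the coefficient values are hypothetical and are not estimates from the paper.

```python
import math

def theta_link(x, beta):
    """Log link: theta_i = exp(x' beta), always positive."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))

def omega_link(z, gamma):
    """Logit link: omega_i = 1 / (1 + exp(-z' gamma)), always in (0, 1)."""
    eta = sum(zj * gj for zj, gj in zip(z, gamma))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical policy: intercept, vehicle value, gender dummy.
x = z = [1.0, 1.8, 1.0]
beta = [-2.0, 0.05, -0.02]   # illustrative coefficients
gamma = [-0.5, 0.4, 0.0]     # illustrative coefficients

theta_i = theta_link(x, beta)
omega_i = omega_link(z, gamma)
print(round(theta_i, 4), round(omega_i, 4))
```

Note that the logit link restricts $\omega_{i}$ to $(0,1)$, so the negative values permitted by (2.6) are not reachable under this parameterization.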

A nice reformulation of the zero-inflated model above was proposed by Ghosh et al (2006), who considered that the zero-inflated model can be represented as $Y=V(1-B)$, where $B$ is a Bernoulli($\omega$) random variable and $V$, independent of $B$, has a discrete distribution in the power series family, $\mathrm{PS}(\theta)$. Under this representation, the mean $E(Y)$ and variance $\operatorname{var}(Y)$ can be rewritten as

E(y)=(1-\omega)E(V),

(2.12)

\operatorname{var}(y)=\frac{\omega}{1-\omega}E(y)^{2}+\delta E(y),

(2.13)

where $\delta=\operatorname{var}(V)/E(V)$ denotes the coefficient of dispersion of the latent random variable, $V$. If $V$ does not have an underdispersed distribution (ie, $\delta\geq 1$), then the distribution of $Y$ is overdispersed. On the other hand, if $V$ does have an underdispersed distribution (ie, $\delta<1$), then $Y$ has an underdispersed distribution if and only if $E(V)<(1-\delta)/\omega$. In their paper, Ghosh et al (2006) suppose that $V$ follows the power series distribution.
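The representation $Y=V(1-B)$ lends itself to a simple simulation check of (2.12) and (2.13). The sketch below takes the Bernoulli success probability to be $\omega$ (which is what (2.12) requires) and uses a Poisson latent variable, for which $\delta=1$; the sampler, seed and parameter values are illustrative assumptions.

```python
import math
import random

def rpoisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_zi(n, omega, theta, seed=0):
    """Draw n values of Y = V(1 - B), B ~ Bernoulli(omega), V ~ Poisson(theta)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        b = 1 if rng.random() < omega else 0
        v = rpoisson(rng, theta)
        draws.append(v * (1 - b))
    return draws

omega, theta = 0.3, 2.0  # illustrative values
y = simulate_zi(100_000, omega, theta)
mean = sum(y) / len(y)
var = sum((v - mean) ** 2 for v in y) / (len(y) - 1)

delta = 1.0                  # var(V) / E(V) for a Poisson V
ey = (1 - omega) * theta     # right-hand side of (2.12)
vy = omega / (1 - omega) * ey**2 + delta * ey  # right-hand side of (2.13)
print(mean, ey)
print(var, vy)
```

The simulated mean and variance should be close to 1.4 and 2.24, the values implied by (2.12) and (2.13) for these parameters, up to Monte Carlo error.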

In our paper, two discrete distributions belonging to the power series distributions will be considered. They are the Poisson distribution with parameter $\theta>0$ and the geometric distribution with parameter $1/(1+\theta)$ , $\theta>0$ . These models will be denoted as the ZIP model and the zero-inflated geometric (ZIG) model.

2.2 The Bayesian model

In this section, a Bayesian methodology is carried out; this allows us to estimate the model above in a simple way, facilitating the process of incorporating covariates and providing exact posterior inference up to a Monte Carlo error. This model can easily accommodate multiple continuous and categorical predictors.

From a Bayesian point of view, prior distributions for $\omega$ and $\theta$ will be required. In this sense, and looking at the likelihood in (2.11), it is appropriate to assume a Beta prior distribution for $\omega$ and the natural conjugate prior (Ghosh et al 2006) for the power series distribution in the following way:

\omega\sim\mathrm{Be}(b_{1},b_{2}),

\theta\sim\pi(\theta)=\frac{\theta^{a_{1}}}{[f(\theta)]^{a_{2}}}.

Both assumptions establish a congruent model, present important computational advantages and, in addition, have a tradition in Bayesian statistical literature. However, in this paper, the covariates that affect $\omega$ and $\mu$ are fixed. So, we specify independent prior distributions for the parameters of the regression models, ie, $\vec{\beta}$ and $\vec{\gamma}$ , as follows:

\vec{\beta}\sim U(a_{\beta},b_{\beta}),

\vec{\gamma}\sim U(a_{\gamma},b_{\gamma}),

where the constants $\smash{a_{\beta}}$ , $\smash{b_{\beta}}$ , $\smash{a_{\gamma}}$ and $\smash{b_{\gamma}}$ are assumed to be known. In particular, $a_{\beta}=a_{\gamma}=-10^{5}$ and $b_{\beta}=b_{\gamma}=10^{5}$ , which expresses our lack of knowledge about the regression parameters. These noninformative uniform distributions are appropriate if no prior knowledge is available about the likely range of values of the parameters (see Lempers (1971) and Mitchell and Beauchamp (1988), among others).

Given that the prior distributions for parameters have been assessed, the next procedure is to combine the likelihood function in (2.11) with priors in order to make a Bayesian inference. Since no closed forms are available for marginal posterior distributions, numerical approaches have to be used to generate them. The numerical approaches used are based on simulations from the posterior distributions, which are proper, since we consider proper priors, although with somewhat large variances. The simulation approach used is MCMC, which can be conducted using the WinBUGS software. MCMC is in this setting a powerful tool that allows us to get estimates of the parameters involved. As is well known, MCMC is a method of posterior simulation and leads us to compute the posterior density function for arbitrary points in the parameter space. With MCMC, it is possible to generate samples from an arbitrary posterior density and use these samples to approximate expectations of quantities of interest, such as the mean or second-order moment. Several other aspects of the MCMC also contribute to its success. The methodology is very simple and consists of generating simulated samples from that posterior density, even though the density corresponds to unknown distributions. In this context, Gibbs sampling is a natural estimation method. Reasonable choices for starting values of $\vec{\beta}$ and $\vec{\gamma}$ for the MCMC simulation can be obtained by standard Poisson and negative binomial regression models using any statistical software package, such as Stata. In this work, all simulations were done using WinBUGS (Spiegelhalter et al 1999). We run three parallel chains and a single long chain for diagnostic assessment (checked using Coda software). A total of 100 000 iterations were carried out (after another 100 000 iterations of burn-in). A complete Gibbs sampling algorithm is outlined in Ghosh et al (2006). 
If Bayesian estimation is considered for the ZIP and ZIG models, these models will be denoted by BZIP and BZIG, respectively.
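To illustrate how sampling-based inference works in the simplest setting, the following sketch runs a data-augmentation Gibbs sampler for an intercept-only ZIP model with $\omega\sim\mathrm{Be}(b_{1},b_{2})$ and the conjugate prior $\pi(\theta)\propto\theta^{a_{1}}\exp(-a_{2}\theta)$, ie, a Gamma distribution. This is not the WinBUGS model used in this paper, nor necessarily the algorithm of Ghosh et al (2006): it omits covariates, restricts $\omega$ to $(0,1)$, and the hyperparameters, seed and toy counts are assumptions made for illustration.

```python
import math
import random

def zip_gibbs(counts, n_iter=2000, burn_in=500,
              b1=1.0, b2=1.0, a1=0.0, a2=0.01, seed=1):
    """Gibbs sampler for (omega, theta) in an intercept-only ZIP model."""
    rng = random.Random(seed)
    n = len(counts)
    n0 = sum(1 for y in counts if y == 0)
    total = sum(counts)
    omega, theta = 0.5, max(total / n, 0.1)  # crude starting values
    draws = []
    for it in range(n_iter):
        # 1. Impute how many of the observed zeros are structural.
        p_struct = omega / (omega + (1.0 - omega) * math.exp(-theta))
        s = sum(1 for _ in range(n0) if rng.random() < p_struct)
        # 2. omega | s ~ Beta(b1 + s, b2 + n - s).
        omega = rng.betavariate(b1 + s, b2 + n - s)
        # 3. theta | s ~ Gamma(shape = a1 + 1 + sum(y), rate = a2 + n - s);
        #    gammavariate takes a scale, hence the reciprocal of the rate.
        theta = rng.gammavariate(a1 + 1.0 + total, 1.0 / (a2 + n - s))
        if it >= burn_in:
            draws.append((omega, theta))
    return draws

# Hypothetical claim counts, not the data set analyzed in Section 3.
counts = [0] * 800 + [1] * 120 + [2] * 60 + [3] * 20
draws = zip_gibbs(counts)
omega_mean = sum(d[0] for d in draws) / len(draws)
theta_mean = sum(d[1] for d in draws) / len(draws)
print(round(omega_mean, 3), round(theta_mean, 3))
```

Each sweep alternates between imputing the latent number of structural zeros and drawing the two parameters from their conjugate full conditionals; the retained draws approximate the joint posterior, and their averages approximate the posterior means.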

3 Experiments with insurance data

In this section, the models considered in the previous sections are applied to real data in order to see how the proposed Bayesian method works. The data set was taken from the webpage of Macquarie University in Sydney, Australia, where the data sets used by Jong and Heller (2008) are available. This page contains numerous data sets that can be used in an actuarial setting.

3.1 The data

In particular, the data studied contains information from the policyholders of an Australian insurance company between 2004 and 2005. It describes certain characteristics related to the vehicles and the policyholders. The database contains 67 856 policies, of which 63 232 (93.18 $\%$ ) have no claims. This is high-frequency data, although the methodology proposed here is also valid for models with low-frequency data. Table 1 shows a descriptive summary of the dependent and independent variables.

Table 1: Descriptive summary of variables.

Variables	Mean	Variance	Maximum
Number of claims	0.07275	0.07739	4
Vehicle value	1.77702	1.45258	34.56
Gender	0.43110	0.24525	1
Young age	0.27436	0.19908	1
Medium age	0.57492	0.24439	1
Old age	0.46081	0.24846	1
Vehicle age	0.57492	0.24439	1

Table 2: Some measures of inflation.

	Sample	Poisson	Geometric	ZIP	ZIG
$p_{0}$	0.93180	0.92983	0.93218	0.93180	0.93186
$\kappa_{3}$	0.08757	0.07276	0.08941	0.08573	0.08705
$z_{i}$	0.03000	0.00000	0.03470	0.03000	0.03000
$\kappa$	0.20367	0.00000	0.22886	0.17832	0.19640

Table 2 shows some measures that lead us to consider departures from the Poisson distribution. These measures are the proportion of zeros, $p_{0}$, the third cumulant, $\kappa_{3}$, the zero-inflation index, $z_{i}=1+\log(p_{0})/\mu$, and the third central moment inflation index, $\kappa=\kappa_{3}/\mu-1$ (see Puig and Valero (2006) for details). The table reports the sample values of these measures together with the values implied by the Poisson, geometric, ZIP and ZIG distributions after their parameters have been estimated by maximum likelihood. We can see that the geometric distribution (in both its zero-inflated and non-inflated versions) outperforms the Poisson one. In order to test whether there are too many zeros for the Poisson distribution, the score test of Van den Broek (1995) has been conducted. This statistic tests the null hypothesis $\mathrm{H}_{0}\colon\omega=0$ and is defined by

S(\hat{\beta})=\frac{\{\sum_{i=1}^{n}(1_{(y_{i}=0)}-\hat{p}_{0i})/\hat{p}_{0i}\}^{2}}{\{\sum_{i=1}^{n}(1-\hat{p}_{0i})/\hat{p}_{0i}\}-n\bar{y}},

where $\hat{p}_{0i}=\widehat{\text{Prob}}(Y_{i}=0)$ is the estimated probability of zero in the Poisson regression, and $\bar{y}$ is the average of the count observations. Under the null hypothesis, this statistic has an asymptotic chi-squared distribution with 1 degree of freedom. The score statistic is given by 121 392.61 ($p\text{-value}\ll 0.0001$), which provides evidence that the observed zeros exceed the number expected under the Poisson distribution.
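The Poisson and geometric columns of Table 2 can be approximately reproduced from the sample mean alone, and the score statistic is straightforward to code. The sketch below assumes that both distributions are fitted by matching the sample mean of 0.07275 from Table 1, uses the standard third cumulants of the Poisson and geometric distributions, and evaluates the score statistic on hypothetical counts with an intercept-only Poisson fit; it does not attempt to reproduce the reported value of the test.

```python
import math

def inflation_measures(p0, mu, kappa3):
    """Zero-inflation index z_i and third-cumulant inflation index kappa."""
    return 1.0 + math.log(p0) / mu, kappa3 / mu - 1.0

mu = 0.07275  # sample mean of the number of claims (Table 1)

# Poisson with mean mu: p0 = exp(-mu), third cumulant = mu.
z_pois, k_pois = inflation_measures(math.exp(-mu), mu, mu)
# Geometric with mean mu: p0 = 1/(1+mu), third cumulant = mu(1+mu)(1+2mu).
z_geo, k_geo = inflation_measures(1 / (1 + mu), mu, mu * (1 + mu) * (1 + 2 * mu))
print(round(z_pois, 5), round(k_pois, 5))
print(round(z_geo, 5), round(k_geo, 5))

def score_statistic(y, p0_hat):
    """Score statistic for zero inflation, given fitted zero probabilities."""
    num = sum(((yi == 0) - p) / p for yi, p in zip(y, p0_hat)) ** 2
    den = sum((1.0 - p) / p for p in p0_hat) - sum(y)  # n * ybar = sum(y)
    return num / den

# Hypothetical counts; intercept-only Poisson fit, so p0_hat = exp(-ybar).
y = [0] * 80 + [1] * 12 + [2] * 6 + [3] * 2
ybar = sum(y) / len(y)
s_stat = score_statistic(y, [math.exp(-ybar)] * len(y))
print(round(s_stat, 2))
```

For the toy sample, the statistic exceeds 3.84, the 5% critical value of a chi-squared distribution with 1 degree of freedom, so the hypothesis of no zero inflation would be rejected.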

For each policy, the initial information for the period considered and the existence (or otherwise) of at least one claim within this yearly period are reported. In total, four explanatory variables are considered, together with a dependent variable representing the number of claims. Vehicle value is expressed in units of 10 000 Australian dollars. Vehicle age is equal to 1 if the vehicle is relatively young (seven years old or less). Gender is equal to 1 if the policyholder is a man. This variable is included in the model for didactic purposes, but, as expected, it is not relevant in any of the models considered. Finally, a categorical variable is considered to represent the age of the policyholder by dividing this feature into three dummies: young, medium and old ages. In this way, we try to identify whether there are age groups with a higher propensity to make claims. Several authors have previously used the age variable in a dichotomous way (see Boucher et al (2007), Bermúdez (2009) and Pérez et al (2014), among others).

3.2 Results, diagnostic and validation

Table 3 shows results under the Bayesian estimation of the ZIP model. As we can see, vehicle value and older policyholders (in relation to medium-aged policyholders) are relevant for the chance of being in the zero state. The positive value of the first coefficient indicates that this chance increases with the value of the vehicle (with significance at $1\%$). The negative sign of the old-age coefficient indicates that the chance of being in the zero state decreases for older policyholders in relation to medium-aged policyholders, at $5\%$ significance. However, an intercept of $-1.883$ (with significance at $1\%$) indicates that the average number of claims is lower for medium-aged policyholders. Further, the higher the vehicle value, the lower the average number of claims expected, again at $1\%$ significance. These results are consistent with the $\hat{\gamma}_{i}$ zero-state coefficients.

Table 3: Estimation of BZIP model.

	$\hat{\gamma}_{i}$	$\hat{\beta}_{i}$
Intercept	$-0.482$	$-1.883$***
	(0.295)	(0.122)
Vehicle value	0.543***	$-0.102$***
	(0.096)	(0.026)
Gender	0.005	$-0.028$
	(0.181)	(0.077)
Young age	$-0.0008$	0.095
	(0.237)	(0.093)
Old age	$-0.497$**	0.018
	(0.212)	(0.101)
Vehicle age	0.070	$-0.035$
	(0.236)	(0.097)

Table 4 shows results under the Bayesian estimation of the ZIG model. Now, being in the medium-aged class increases the chance of being in the zero state (the zero-state intercept is relevant at $10\%$). The average number of claims increases if the vehicle value is high (at $1\%$ significance) and decreases in the medium- and old-aged classes (more for older policyholders). For the youngest people, the average number of claims is expected to be greater than for the other two groups of policyholders (with $1\%$ significance). Finally, the average number of claims decreases if the age of the vehicle increases.

Table 4: Estimation of BZIG model.

	$\hat{\gamma}_{i}$	$\hat{\beta}_{i}$
Intercept	5.142*	$-2.637$***
	(3.199)	(0.051)
Vehicle value	3.097	0.046***
	(3.328)	(0.022)
Gender	2.055	$-0.020$
	(3.962)	(0.032)
Young age	0.311	0.096***
	(4.628)	(0.040)
Old age	2.421	$-0.211$***
	(4.705)	(0.043)
Vehicle age	2.672	$-0.054$
	(4.441)	(0.037)*

Finally, it is interesting to compare the results from a Bayesian perspective with those obtained using standard methodology. Table 5 shows the results under the standard Poisson and negative binomial models, along with their zero-inflated versions. We observe that they detect the same relevant factors as the BZIG model, but these models do not distinguish between the zero-state and average number of claims factors.

Although there are a variety of methodologies to validate several models for a given data set, the DIC, which is a generalization of the AIC and BIC, is useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by MCMC simulation. Recall that the DIC is only valid when the posterior distribution is approximately multivariate normal, which is the case considered here. For each criterion, the model that best fits a data set is the one with the smallest value. As we can see in Table 6, frequentist models are a worse fit than the BZIP and BZIG models. The deviance of BZIP is the smallest, but the confidence interval with $5\%$ significance is $[9336,9889]$; this includes the deviance of the BZIG model, so there is no significant difference in terms of fit between these two models. By comparing Tables 4 and 5, we can observe that the estimated coefficients differ considerably, although the signs and the relevant factors remain the same. So, we can say that the results from the BZIG model are more consistent (than those from the BZIP model) with respect to the results from the classical models in Table 5, while providing a much better fit.

Table 5: Frequentist estimation of the models.

	Poisson	Negative binomial	ZIP	ZINB
Intercept	$-2.6282$***	$-2.6305$***	$-2.0540$***	$-2.6306$***
	(0.0407)	(0.0412)	(0.0696)	(0.0412)
Vehicle value	0.0381***	0.0392***	0.0392***	0.0393***
	(0.0114)	(0.0117)	(0.0118)	(0.0117)
Gender	$-0.0159$	$-0.0161$	$-0.0164$	$-0.0162$
	(0.0300)	(0.0300)	(0.0301)	(0.0301)
Young age	0.0958***	0.0955***	0.0949***	0.0956***
	(0.0337)	(0.0337)	(0.0337)	(0.0337)
Old age	$-0.2081$***	$-0.2083$***	$-0.2082$***	$-0.2083$***
	(0.0385)	(0.0385)	(0.0385)	(0.0385)
Vehicle age	$-0.0617$*	$-0.0606$*	$-0.0604$*	$-0.0606$*
	(0.0334)	(0.0335)	(0.0335)	(0.0335)
Inflation constant			$-0.2849$*	$-14.1262$***
			(0.1283)	(0.7096)

Table 6: Measures for model validation.

	DIC	AIC	BIC
Poisson	36 118.353	36 130.353	36 185.105
Negative binomial	36 019.507	36 033.507	36 097.383
ZIP	36 026.881	36 040.881	36 104.757
ZINB	36 019.507	36 035.507	36 108.508
BZIP	9 611.000	9 623.000	9 744.502
BZIG	9 871.000	9 893.000	10 004.501
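As an arithmetic aid for reading Table 6, the AIC and BIC can be computed from a deviance-type quantity $D$ (minus twice the maximized loglikelihood) as $D+2k$ and $D+k\log n$, where $k$ is the number of parameters. The sketch below applies these standard formulas to the Poisson row, assuming that its first column plays the role of $D$ and that $k=6$ (one coefficient per row of Table 5); both assumptions are inferred from the figures and are not stated in the text.

```python
import math

def aic(deviance, k):
    """Akaike information criterion from -2 * loglikelihood and k parameters."""
    return deviance + 2 * k

def bic(deviance, k, n):
    """Bayesian information criterion with sample size n."""
    return deviance + k * math.log(n)

n = 67856               # number of policies
d_poisson = 36118.353   # first column of the Poisson row in Table 6
k_poisson = 6           # assumed: intercept plus five covariates

aic_poisson = aic(d_poisson, k_poisson)
bic_poisson = bic(d_poisson, k_poisson, n)
print(round(aic_poisson, 3), round(bic_poisson, 3))
```

Under these assumptions, the computed values agree with the AIC and BIC reported for the Poisson model up to rounding.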

4 Final remarks

In this paper, we develop a Bayesian methodology using sampling-based methods in order to model an automobile insurance data set using discrete distributions belonging to the power series distributions. As a consequence, we get a new and flexible model when overdispersion and an inflation of zeros are present in the data set.

In order to validate this new Bayesian model, we present the results of an experiment using real data, collected between 2004 and 2005 from an Australian insurance company, and the MCMC method, which is implemented using the WinBUGS package. The validation process includes comparisons with standard and Bayesian ZIP and ZIG models, in terms of parameter estimations and information criteria such as DIC, AIC and BIC. The results obtained here show that the new Bayesian method outperforms the previous standard and Bayesian models.

It should be of interest for future research to modify the model in order to take into account truncated and censored data, which is often seen in insurance claim data.

Declaration of interest

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

Acknowledgements

The authors would like to thank the editor, associate editor and the anonymous referees for their relevant and useful comments. The authors also thank the Ministerio de Economía y Competitividad (project ECO2013–47092, Ministerio de Economía y Competitividad, Spain) for partial support of this work.

References

Bermúdez, L. (2009). A priori ratemaking using bivariate Poisson regression models. Insurance: Mathematics and Economics 44(1), 135–141 (http://doi.org/fvhphv).

Bhattacharya, A., Clarke, B. S., and Datta, G. (2008). A Bayesian test for excess zeros in a zero-inflated power series distribution. Institute of Mathematical Statistics 1, 89–104 (http://doi.org/cmhd6b).

Boucher, J., Denuit, M., and Guillén, M. (2007). Risk classification for claim counts: a comparative analysis of various zero-inflated mixed Poisson and hurdle models. North American Actuarial Journal 11(4), 110–131 (http://doi.org/bqxr).

Cameron, C., and Trivedi, P. (1998). Regression Analysis of Count Data. Cambridge University Press (http://doi.org/bqxs).

Cohen, A. C. (1966). A note on certain discrete mixed distributions. Biometrics 22(3), 566–572 (http://doi.org/fm6p24).

Denuit, M., Maréchal, X., Pitrebois, S., and Walhin, J.-F. (2009). Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems. Wiley (http://doi.org/ftd6rt).

Ghosh, S., Mukhopadhyay, P., and Lu, J.-C. (2006). Bayesian analysis of zero-inflated regression models. Journal of Statistical Planning and Inference 136(4), 1360–1375 (http://doi.org/bw9fb2).

Gupta, P., Gupta, R., and Tripathi, R. (1996). Analysis of zero-adjusted count data. Computational Statistics and Data Analysis 23(2), 207–218 (http://doi.org/bhn54p).

Gurmu, S. (1997). Semi-parametric estimation of hurdle regression models with an application to Medicaid utilization. Journal of Applied Econometrics 12, 225–242 (http://doi.org/ckj98m).

Hall, D. (2000). Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics 56(4), 1030–1039 (http://doi.org/b7g7s7).

Jacobs, M., Karagozoglu, A., and Sensenbrenner, F. (2015). Stress testing and model validation: application of the Bayesian approach to a credit risk portfolio. The Journal of Risk Model Validation 9(3), 41–70 (http://doi.org/bq5q).

Johnson, N., Kemp, A., and Kotz, S. (2005). Univariate Discrete Distributions. Wiley (http://doi.org/ct345v).

Jong, P. D., and Heller, G. (2008). Generalized Linear Models for Insurance Data. Cambridge University Press (http://doi.org/df2fxf).

Kosambi, D. (1949). Characteristic properties of series distributions. Proceedings of the National Institute for Science, India 15, 109–113.

Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34(1), 1–14 (http://doi.org/fp557w).

Lempers, F. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam University Press.

Mitchell, T., and Beauchamp, J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association 83(404), 1023–1032 (http://doi.org/bqxt).

Mouatassim, Y., and Ezzahid, E. (2012). Poisson regression and zero-inflated Poisson regression: application to private health insurance data. European Actuarial Journal 2(2), 187–204 (http://doi.org/bqxv).

Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of Econometrics 33(3), 341–365 (http://doi.org/b6ff6h).

Noack, A. (1950). A class of random variables with discrete distributions. Annals of Mathematical Statistics 21(1), 127–132 (http://doi.org/bcf2b7).

Parnes, D. (2015). Bayesian synthesis of portfolio credit risk with missing ratings. The Journal of Risk 18(1), 1–25 (http://doi.org/bq5r).

Patil, G. (1962a). Estimation by two-moments method for generalized power series distribution and certain applications. Sankhya: The Indian Journal of Statistics B 24(3/4), 201–214.

Patil, G. (1962b). Maximum likelihood estimation for generalized power series distributions and its application to a truncated binomial distribution. Biometrika 49(1/2), 227–237 (http://doi.org/dhr26j).

Pérez, J., Negrín, M., García, C., and Gómez-Déniz, E. (2014). Bayesian asymmetric logit model for detecting risk factors in motor ratemaking. ASTIN Bulletin 44(2), 445–457 (http://doi.org/bqxw).

Puig, P., and Valero, J. (2006). Count data distributions: some characterizations with applications. Journal of the American Statistical Association 101(473), 332–340 (http://doi.org/dxbdpf).

Ridout, M., Demétrio, C., and Hinde, J. (1998). Models for count data with many zeros. Working Paper, December, International Biometric Conference, Cape Town, South Africa.

Spiegelhalter, D., Thomas, A., and Best, N. (1999). WinBUGS (Version 1.2). MRC Biostatistics Unit, Cambridge.

Spiegelhalter, D., Best, N., Carlin, B., and van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society B 64(4), 583–639 (http://doi.org/dfrgt6).

Spiegelhalter, D., Best, N., Carlin, B., and van der Linde, A. (2014). The deviance information criterion: 12 years on (with discussion). Journal of the Royal Statistical Society B 76(3), 485–493 (http://doi.org/bqxx).

Van den Broek, J. (1995). A score test for zero inflation in a Poisson distribution. Biometrics 51, 738–743 (http://doi.org/bpkr4h).

Yip, K., and Yau, K. (2005). On modeling claim frequency data in general insurance with extra zeros. Insurance: Mathematics and Economics 36(2), 153–163 (http://doi.org/fbhjpk).

Journal of Risk Model Validation
