Journal of Computational Finance
ISSN: 1460-1559 (print); 1755-2850 (online)
Editor-in-chief: Christoph Reisinger
Need to know
- Develop kriging/Gaussian process models for simulation-based optimal stopping.
- Investigate impact of different experimental designs.
- Advocate the use of batched simulations to improve statistical modeling.
- Benchmark the results on 3 different sets of Bermudan contracts.
Abstract
We investigate two new strategies for the numerical solution of optimal stopping problems in the regression Monte Carlo (RMC) framework proposed by Longstaff and Schwartz in 2001. First, we propose using stochastic kriging (Gaussian process) metamodels for fitting the continuation value. Kriging offers a flexible, non-parametric regression approach that quantifies approximation quality. Second, we connect the choice of stochastic grids used in RMC with the design of experiments (DoE) paradigm. We examine space-filling and adaptive experimental designs; we also investigate the use of batching with replicated simulations at design sites to improve the signal-to-noise ratio. Numerical case studies for valuing Bermudan puts and max calls under a variety of asset dynamics illustrate that our methods offer a significant reduction in simulation budgets compared with existing approaches.
1 Introduction
The problem of the efficient valuation and optimal exercise of American and Bermudan options remains one of intense interest in computational finance. The underlying timing flexibility, which is mathematically mapped to an optimal stopping objective, is ubiquitous in financial contexts, showing up both directly in various derivatives and as a building block in more complicated contracts, such as multiple-exercise opportunities or sequential decision making. Since timing optionality is often embedded in addition to other features, one would ideally have access to a flexible pricing architecture, into which one might easily import optimal stopping subroutines. While various alternatives have been proposed and investigated, most existing methods are not fully scalable across a full range of applications, keeping the holy grail of such a “black box” optimal stopping algorithm out of reach.¹ (¹ By “black box”, we mean a “smart” framework that automatically adjusts low-level algorithm parameters based on the problem setting. It might still require some high-level user input, but it avoids extensive fine-tuning or, at the other extreme, a hard-coded, nonadaptive method.) This is due either to a reliance on approaches that work for a limited subset of models only (this applies to one-dimensional integral equations as well as various semianalytic formulas) or to severe curses of dimensionality that cause a rapid deterioration in performance as model complexity increases.
Within this landscape, the regression Monte Carlo (RMC) framework – often called the least squares Monte Carlo (LSMC), as discussed in Section 2.2 – has emerged as perhaps the most popular generic method for tackling optimal stopping. The intrinsic scalability of the Monte Carlo approach implies that if the problem becomes more complex, one may simply run more simulations, with the underlying implementation essentially independent of dimensionality, model dynamics, etc. However, the comparatively slow convergence rate of the RMC places emphasis on obtaining more accurate results with fewer simulated paths, spurring an active area of research (see, for example, Agarwal et al 2016; Belomestny et al 2015; Gobet and Turkedjiev 2017; Hepperger 2013; Jain and Oosterlee 2015; Létourneau and Stentoft 2014; Tompaidis and Yang 2014).
In this paper, we propose a novel marriage of modern statistical frameworks and the optimal stopping problem, making two methodological contributions. First, we examine the use of kriging metamodels as part of a simulation-based routine for approximating optimal stopping policies. Kriging offers a flexible nonparametric method (with several additional convenient features discussed below) of approximating the continuation value. As we demonstrate, it is competitive with existing basis expansion setups. To our knowledge, this is the first paper to use kriging models in optimal stopping. Second, we investigate the experimental design aspect of the RMC approach. We propose several alternative methods of generating and customizing the respective stochastic grids, drawing the reader’s attention to the accompanying performance gains. Among others, we investigate the adaptive generation of the simulation designs, extending the ideas of Gramacy and Ludkovski (2015), who originally proposed the use of sequential design for optimal stopping. Rather than purely targeting speed savings, our strategies explore opportunities to reduce the simulation budget of RMC through “smart” regression and design approaches. In particular, the combination of kriging and improved experimental design brings savings in simulation costs of up to an order of magnitude, which can be roughly equated to memory requirements. Despite the considerable added overhead, these gains indicate the potential of these techniques for optimizing the speed–memory trade-off. They also show the benefits of approaching RMC as an iterated metamodeling/response surface modeling (RSM) problem.
Our framework is an outgrowth of a recent paper by Gramacy and Ludkovski (2015). In relation to that paper, the present one makes several modifications that are attractive in the derivative valuation context. First, while Gramacy and Ludkovski (2015) suggested the use of piecewise-defined dynamic trees regression, the kriging metamodels are intrinsically continuous. As such, they are better suited to a typical financial application where the value function is smooth. Second, to reduce the computational overhead, we introduce a new strategy of batching by generating multiple scenarios for a given initial condition. Third, in contrast with Gramacy and Ludkovski (2015) and its focus on sequential designs, we also provide a detailed examination and comparison of various static experimental designs. Our experiments indicate that this achieves significant simulation savings with a much smaller overhead, making it one of the “sweet spots” in terms of overall performance.
The rest of the paper is organized as follows. Section 2 sets the notation we use for a generic optimal stopping problem as well as the RMC paradigm for its numerical approximation. Along the way, we recast RMC as a sequence of metamodeling tasks. Section 3 introduces and discusses the stochastic kriging metamodels that we propose to use for the empirical fitting of the continuation value. Section 4 then switches to the second main thrust of this paper, namely the issue of experimental designs for RMC. We investigate space-filling designs as well as the idea of batching or replication. Section 5 marries the kriging methodology with the framework of sequential design to obtain an efficient approach for generating designs adapted to the loss criterion of RMC. In Section 6, we present a variety of case studies, including a classical bivariate geometric Brownian motion (GBM) model, several stochastic volatility setups, and multivariate GBM models with multiple stopping regions. Finally, Section 7 summarizes our proposals and findings.
2 Model
We consider a discrete-time optimal stopping problem on a finite horizon. Let $X = (X_t)$, $t = 0, 1, \ldots, T$, be a Markov state process on a stochastic basis $(\Omega, \mathcal{F}, \mathbb{P})$, taking values in some (usually uncountable) subset $\mathcal{X} \subseteq \mathbb{R}^d$. The financial contract in question has a maturity date of $T < \infty$ and can be exercised at any earlier date $t \le T$ with payoff $h_t(X_t)$. Note that in the financial applications this corresponds to a Bermudan contract with a unit of time corresponding to the underlying discretization of early exercise opportunities. The dependence of $h_t$ on $t$ is to incorporate interest rate discounting.
Let $(\mathcal{F}_t)$ be the information filtration generated by $X$, while $\mathcal{S}_t$ is the collection of all stopping times taking values in $\{t, t+1, \ldots, T\}$. The Bermudan contract valuation problem consists of maximizing the expected reward $h_\tau(X_\tau)$ over all $\tau \in \mathcal{S}_t$. More precisely, define for any $(t, x)$
$$V(t, x) := \sup_{\tau \in \mathcal{S}_t} \mathbb{E}_{t,x}\big[ h_\tau(X_\tau) \big], \tag{2.1}$$
where $\mathbb{E}_{t,x}[\,\cdot\,] := \mathbb{E}[\,\cdot \mid X_t = x]$ denotes the $\mathbb{P}$-expectation given the initial condition $X_t = x$. We assume that $h_t$ is such that $V(t, x) < \infty$ for any $(t, x)$; for instance, $h_t$ is bounded.
Using the tower property of conditional expectations and one-step analysis,
$$V(t, x) = \max\big( h_t(x), \; C(t, x) \big), \tag{2.2}$$
where we define the continuation value $C(t, x)$ and the timing value $T(t, x)$ via
$$C(t, x) := \mathbb{E}_{t,x}\big[ V(t+1, X_{t+1}) \big], \tag{2.3}$$
$$T(t, x) := C(t, x) - h_t(x). \tag{2.4}$$
Since $V(t, x) \ge h_t(x)$ for all $t$ and $x$, the smallest optimal stopping time $\tau^*(t, x)$ for (2.1) satisfies
$$\tau^*(t, x) = t \quad \Longleftrightarrow \quad T(t, x) \le 0. \tag{2.5}$$
Thus, it is optimal to stop immediately if and only if the conditional expectation of tomorrow’s reward-to-go is less than the immediate reward. To rephrase, figuring out the decision to stop at $t$ is equivalent to finding the zero level set (or contour) of the timing value $x \mapsto T(t, x)$ or classifying the state space $\mathcal{X}$ into the stopping region $\mathfrak{S}_t := \{x : T(t, x) \le 0\}$ and its complement, the continuation region. By induction, an optimal stopping time is
$$\tau^*(t, x) = \inf\{ s \ge t : X_s \in \mathfrak{S}_s \} \wedge T. \tag{2.6}$$
From a financial perspective, obtaining the exercise strategy as defined by the stopping regions $(\mathfrak{S}_t)$ is often even more important than finding the present value of the contract.
The representation (2.6) suggests a recursive procedure for estimating the optimal policy as defined by $(\mathfrak{S}_t)$. To wit, given a collection of subsets $\hat{\mathfrak{S}}_{t+1}, \ldots, \hat{\mathfrak{S}}_T$ of $\mathcal{X}$, which are stand-ins for the stopping regions, we may define the corresponding exercise strategy pathwise for a scenario $\omega$ as follows:
$$\hat{\tau}(t, \omega) := \inf\{ s > t : X_s(\omega) \in \hat{\mathfrak{S}}_s \} \wedge T. \tag{2.7}$$
Equipped with $\hat{\tau} \equiv \hat{\tau}(t, \omega)$, we obtain the pathwise payoff
$$H_t(x; \omega) := h_{\hat{\tau}(t,\omega)}\big( X_{\hat{\tau}(t,\omega)}(\omega) \big), \qquad X_t(\omega) = x. \tag{2.8}$$
Taking expectations, $\mathbb{E}_{t,x}[H_t(x; \cdot)]$ is the expected payoff starting at $X_t = x$, not exercising immediately at step $t$ and then following the exercise strategy specified by the future stopping regions, denoted as $\hat{\mathfrak{S}}_{t+1:T}$. Our notation highlights the latter (path-)dependence of $H_t$ on $\hat{\mathfrak{S}}_{t+1:T}$. If $\hat{\mathfrak{S}}_s = \mathfrak{S}_s$ for all $s > t$, we can use (2.8) to recover $C(t, x)$. Indeed, the expected value of $H_t(x; \cdot)$ in this case is precisely the continuation value $C(t, x)$. Consequently, finding $\mathfrak{S}_t$ (and, hence, the value function via (2.2)) is reduced to computing conditional expectations of the form $x \mapsto \mathbb{E}_{t,x}[H_t(x; \cdot)]$. Because analytic evaluation of this map is not typically tractable, this is the starting point for Monte Carlo approximations that require only the ability to simulate trajectories of $X$. In particular, simulating $N$ independent scenarios $\omega^1, \ldots, \omega^N$ emanating from $X_t = x$ yields the estimator $\hat{C}(t, x)$ via
$$\hat{C}(t, x) := \frac{1}{N} \sum_{n=1}^{N} H_t(x; \omega^n). \tag{2.9}$$
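To make (2.7)–(2.9) concrete, the following R sketch (our own illustration, not the paper's code) estimates the continuation value at a single site for a one-dimensional GBM put; the exercise rule `stop_now` is a hypothetical placeholder for the forward strategy induced by the stopping sets, and all parameter values are purely illustrative.

```r
# Minimal sketch of the pointwise Monte Carlo estimator (2.9), assuming a 1D GBM
# model and a placeholder exercise rule; not the paper's implementation.
mc_continuation <- function(x, t, Tmax, N = 1000, dt = 0.04,
                            r = 0.06, sigma = 0.2, K = 40,
                            stop_now = function(s, xs) FALSE) {  # hypothetical S_s indicator
  payoff <- function(s, xs) exp(-r * dt * (s - t)) * max(K - xs, 0)
  H <- numeric(N)
  for (n in 1:N) {
    xs <- x
    for (s in (t + 1):Tmax) {                    # one path forward from X_t = x
      xs <- xs * exp((r - 0.5 * sigma^2) * dt + sigma * sqrt(dt) * rnorm(1))
      if (s == Tmax || stop_now(s, xs)) {        # first entry into a stopping set
        H[n] <- payoff(s, xs)                    # pathwise payoff (2.8)
        break
      }
    }
  }
  mean(H)                                        # plain Monte Carlo average (2.9)
}
```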
A key observation is that the approximation quality of (2.9) is driven by the accuracy of the stopping sets $\hat{\mathfrak{S}}_{t+1:T}$ that determine the exercise strategy and control the resulting errors (formally, we should thus use a double-hat notation to contrast the estimate in (2.9) with the true expectation $\mathbb{E}_{t,x}[H_t(x; \cdot)]$). Thus, for the subproblem of approximating the conditional expectation of $H_t$, there is a thresholding property; this means that regression errors are tolerated as long as they do not affect the ordering of $C(t, x)$ and $h_t(x)$. This turns out to be a major advantage compared with the value function approximation techniques outlined in Tsitsiklis and Van Roy (2001), which use (2.2) directly, without making explicit use of pathwise payoffs. See also Stentoft (2014) for a recent comparison of value function and stopping time approximations.
2.1 Metamodeling
In (2.9) the estimate of $C(t, x)$ was obtained pointwise for a given location $x$ through a plain Monte Carlo approach. Alternatively, given paths emanating from $N$ different initial conditions $x^1, \ldots, x^N$, one may borrow information cross-sectionally by employing a statistical regression framework. A regression model specifies the link
$$H_t(x; \omega) = C(t, x) + \epsilon(x), \tag{2.10}$$
where $\epsilon(x)$ is the mean-zero noise term with variance $\sigma^2(x)$. This noise arises from the random component of the pathwise payoff due to the internal fluctuations in $X_{t+1:T}$ and, hence, $\hat{\tau}(t, \cdot)$. Letting $\boldsymbol{y} := (y^1, \ldots, y^N)$, with $y^n := H_t(x^n; \omega^n)$, be the vector of obtained pathwise payoffs, one regresses $\boldsymbol{y}$ against the inputs $\boldsymbol{x} := (x^1, \ldots, x^N)$ to fit an approximation $\hat{C}(t, \cdot)$. Evaluating $\hat{C}(t, \cdot)$ at any $x$ then yields the predicted continuation value from the exercise strategy conditional on $\hat{\mathfrak{S}}_{t+1:T}$.
The advantage of regression is that it replaces pointwise estimation (which requires rerunning additional scenarios for each location $x$) with a single step that fuses all the information contained in $(\boldsymbol{x}, \boldsymbol{y})$ to fit $\hat{C}(t, \cdot)$. Armed with the estimated $\hat{C}(t, \cdot)$, we take the natural estimator of the stopping region at step $t$,
$$\hat{\mathfrak{S}}_t := \big\{ x : \hat{C}(t, x) \le h_t(x) \big\}, \tag{2.11}$$
which yields an inductive procedure for approximating the full exercise strategy via (2.7). Note that (2.11) is a double approximation of $\mathfrak{S}_t$. This is due to the discrepancy between $\hat{C}(t, \cdot)$ and the true expectation $\mathbb{E}_{t,\cdot}[H_t]$ from (2.8). An additional approximation error is caused by partially propagating any previous errors in $\hat{\mathfrak{S}}_{t+1:T}$, relative to the true $\mathfrak{S}_{t+1:T}$.
In the machine learning and simulation optimization literatures, the problem of obtaining $\hat{C}(t, \cdot)$ based on (2.10) is known as metamodeling or emulation (Kleijnen 2015; Ankenman et al 2010; Santner et al 2013). It consists of two fundamental steps: first, a design (ie, a grid) $\boldsymbol{x}$ is constructed and the corresponding pathwise payoffs $\boldsymbol{y}$ are realized via simulation. Second, a regression model (2.10) is trained to link $\boldsymbol{y}$ with $\boldsymbol{x}$ via an estimated $\hat{C}(t, \cdot)$. In the context of (2.3), emulation can be traced to the seminal works of Tsitsiklis and Van Roy (2001) and Longstaff and Schwartz (2001) (although the idea of applying regression originated earlier: see, for example, Carrière (1996)). The above references pioneered classical least squares regression (also known by the acronyms LSM (least squares method), LSMC and, more generally, RMC and regression-based schemes) for (2.10), that is, the use of projection onto a finite family of basis functions $B_1(\cdot), \ldots, B_r(\cdot)$, with $\hat{C}(t, x) = \sum_{i=1}^{r} \beta_i B_i(x)$.
The projection was carried out by minimizing the $L^2$ distance between $H_t$ and the manifold $\mathcal{H}_r := \operatorname{span}(B_1, \ldots, B_r)$ spanned by the basis functions (akin to the definition of conditional expectation):
$$\hat{C}(t, \cdot) := \arg\min_{f \in \mathcal{H}_r} \mathcal{L}_t(f), \tag{2.12}$$
where the loss function $\mathcal{L}_t$ is a weighted $L^2$ norm based on the distribution of $X_t$,
$$\mathcal{L}_t(f) := \mathbb{E}\big[ \, | f(X_t) - H_t(X_t; \cdot) |^2 \, \big] = \int_{\mathcal{X}} \mathbb{E}\big[ \, | f(x) - H_t(x; \cdot) |^2 \, \big] \, p_t(x) \, \mathrm{d}x, \tag{2.13}$$
with $p_t(\cdot)$ the density of $X_t$ given $X_0$.
Motivated by the weights $p_t$ in (2.13), the accompanying design $\boldsymbol{x}$ is commonly generated by independent and identically distributed (iid) sampling from $p_t$. As suggested in the original papers of Longstaff and Schwartz (2001) and Tsitsiklis and Van Roy (2001), this can be efficiently implemented by a single global simulation of $X$-paths at the cost of introducing further $t$-dependencies among the designs $\boldsymbol{x}$.
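As a point of reference for the kriging alternative developed below, here is a minimal R sketch of the classical LSMC cross-sectional step (2.12) with a polynomial basis; the function names and the cubic basis are our own illustrative choices.

```r
# Classical LSMC fit (2.12) via ordinary least squares on a polynomial basis,
# shown for a one-dimensional state; x = design sites, y = pathwise payoffs H_t.
lsmc_fit <- function(x, y, degree = 3) {
  lm(y ~ poly(x, degree, raw = TRUE))       # projection onto span{1, x, ..., x^degree}
}

# Induced stopping-set estimate (2.11): exercise where the payoff h(x) is at
# least the predicted continuation value.
lsmc_stop <- function(fit, x, h) {
  Chat <- predict(fit, newdata = data.frame(x = x))
  h(x) >= Chat
}
```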
Returning to the more abstract view, least squares regression (2.12) is an instance of a metamodeling technique. Moreover, the standard presentation obscures the twin issue of generating the design $\boldsymbol{x}$. Indeed, in contrast to standard statistical setups, in metamodeling there is no a priori data per se; instead, the solver is responsible both for generating the data and for training the model. These two tasks are intrinsically intertwined. The subject of design of experiments (DoE) has developed around this issue, but it has been largely absent in RMC approaches. In the next section, we reexamine existing RMC methods and propose novel RMC algorithms that build on the DoE/metamodeling paradigm by targeting efficient designs $\boldsymbol{x}$.
2.2 A closer look at the RMC problem
Figure 1 shows the distribution of the pathwise payoffs $H_t(x; \cdot)$ in a one-dimensional geometric Brownian motion Bermudan put option problem (see Section 4.4), an archetypal financial application. Figure 1(a) shows a scatter plot of $H_t(x; \omega)$ as $x$ varies, while 1(b) gives a histogram of $H_t(x; \cdot)$ for a fixed initial value $x$. Two features become apparent from these plots. First, the noise variance $\sigma^2(x)$ is extremely large, swamping the actual shape of $C(t, \cdot)$. Second, the noise distribution is highly nonstandard. It involves a potent brew of (i) heavy tails, (ii) significant skewness and (iii) multimodality. In fact, $H_t(x; \cdot)$ does not even have a smooth density, as the nonnegativity of the payoff implies that it has a point mass at zero.
Moreover, as can be observed in Figure 1(a), the distribution of $H_t(x; \cdot)$ is heteroscedastic, with a state-dependent conditional skew and state-dependent point mass at zero. These phenomena violate the assumptions of ordinary least squares (OLS) regression, making the sampling distribution of the fitted $\hat{C}(t, \cdot)$ ill behaved, that is, far from Gaussian. More robust versions of (2.12), including regularized (eg, Lasso) or localized (eg, Loess or piecewise linear models) least squares frameworks, have therefore been proposed (see, for instance, Kohler 2010; Kohler and Krzyżak 2012). Further structured regression proposals may be found in Kan and Reesor (2012) and Létourneau and Stentoft (2014). Alternatively, a range of variance reduction methods, such as control variates, have been suggested to ameliorate the estimation of $\hat{C}(t, \cdot)$ in (2.12) (see, for example, Agarwal et al 2016; Belomestny et al 2015; Hepperger 2013; Jain and Oosterlee 2015; Juneja and Kalra 2009).
The parametric constraints placed by the LSMC approach (2.12) on the shape of the continuation value make its performance highly sensitive to the distance between the manifold $\mathcal{H}_r$ and the true $C(t, \cdot)$ (Stentoft 2004). Moreover, when using (2.12), a delicate balance must be observed between the number of scenarios $N$ and the number of basis functions $r$. These concerns highlight the inflexible nature of parametric regression and become extremely challenging in higher-dimensional settings with an unknown geometry of $C(t, \cdot)$. One work-around is to employ hierarchical representations, for example, by partitioning $\mathcal{X}$ into disjoint bins and employing independent linear hyperplane fits in each bin (see Bouchard and Warin 2011). Nevertheless, ongoing challenges remain in the adaptive construction of the approximation space $\mathcal{H}_r$.
Another limitation of the standard LSMC approach is its objective function. As discussed, least squares regression can be interpreted as minimizing a global weighted mean-squared error loss function. However, as shown in (2.5), the principal aim in optimal stopping is not to learn $C(t, \cdot)$ globally in the sense of its mean-squared error but to rank it vis-à-vis $h_t(\cdot)$. Indeed, recalling the definition of the timing function, the correct loss function (see Gramacy and Ludkovski 2015) is
$$\mathcal{L}_t(\hat{C}) := \int_{\mathcal{X}} \big| C(t, x) - h_t(x) \big| \, \mathbf{1}_{\{\operatorname{sgn}(\hat{C}(t,x) - h_t(x)) \ne \operatorname{sgn}(C(t,x) - h_t(x))\}} \, p_t(x) \, \mathrm{d}x. \tag{2.14}$$
Consequently, the loss criterion is localized around the contour $\partial\mathfrak{S}_t = \{ x : T(t, x) = 0 \}$. Indeed, a good intuition is that optimal stopping is effectively a classification problem; in some regions of $\mathcal{X}$, the sign of $T(t, x)$ is easy to detect and the stopping decision is “obvious”, while in other regions $T(t, x)$ is close to zero and resolving the optimal decision requires more statistical discrimination. For example, in Bermudan options, it is a priori clear that out-of-the-money (OTM) scenarios satisfy $h_t(x) = 0 < C(t, x)$, so it is optimal to continue at such $x$. This observation was made by Longstaff and Schwartz (2001), who proposed regressing in-the-money (ITM) trajectories only. However, it also reveals an apparent inefficiency in the corresponding design; that is, the widely used LSMC implementation first “blindly” generates a design that covers the full domain of $X_t$, only to discard all OTM scenarios. This dramatically shrinks the effective design size, by up to 80% in the case of an OTM option. The mismatch between (2.14) and (2.13) makes the terminology of LSMC, with its exclusive emphasis on the least squares criterion, unfortunate.
While one could directly use (2.14) during the cross-sectional regression (to estimate the coefficients $\beta_i$ in (2.12), for example, though this would require implementing an entirely new optimization problem), a much more effective approach is to take this criterion into account during the experimental design. Intuitively, the local accuracy of any approximation of $C(t, \cdot)$ at $x$ is limited by the amount of observations in the neighborhood, that is, the density of $\boldsymbol{x}$ around $x$. The form of (2.14), therefore, suggests that the ideal approach would be to place $\boldsymbol{x}$ along the stopping boundary $\partial\mathfrak{S}_t$ in analogy to importance sampling. Of course, the obvious challenge is that knowing $\partial\mathfrak{S}_t$ is equivalent to solving the stopping problem. Nevertheless, this perspective suggests multiple improvements to the baseline approach, which is entirely oblivious to (2.14) when picking $\boldsymbol{x}$.
2.3 Outline of proposed approach
Motivated by the above discussion, we seek efficient metamodels for obtaining $\hat{C}(t, \cdot)$ that can exploit the local nature of (2.14). This leads us to consider nonparametric, kernel-based frameworks, specifically Gaussian process regression, also known as kriging (Ankenman et al 2010; Kleijnen 2015; Williams and Rasmussen 2006). We then explore a range of DoE approaches for constructing $\boldsymbol{x}$. Proposed design families include (i) uniform gridding, (ii) random and deterministic space-filling designs, and (iii) adaptive sequential designs based on expected improvement criteria. While only the last choice is truly tailored to (2.14), we explain that reasonably well-chosen versions of (i) and (ii) can already be a significant improvement on the baseline.
To formalize the construction of $\boldsymbol{x}$, we rephrase (2.10) as an RSM (or metamodeling) problem. RSM is concerned with the task of estimating an unknown function $f$ that is noisily observed through a stochastic sampler,
$$Y(x) = f(x) + \epsilon(x), \qquad \mathbb{E}[\epsilon(x)] = 0, \quad \operatorname{Var}(\epsilon(x)) = \sigma^2(x), \tag{2.15}$$
where we remind the reader to substitute in their mind $f(\cdot) \leftrightarrow C(t, \cdot)$ and $Y(x) \leftrightarrow H_t(x; \cdot)$; $t$ is treated as a parameter. Two basic requirements in RSM are a flexible nonparametric approximation architecture that is used to search for the outputted fit $\hat{f}$, and global consistency: specifically, convergence of $\hat{f}$ to the ground truth $f$ as the number of simulations increases without bound. The experimental design problem then aims to maximize the information obtained from the $N$ samples toward the end of minimizing the specified loss function $\mathcal{L}$.
The pathwise realizations of $Y(x)$ involve simulation noise that must be smoothed out and, as we saw, is often highly non-Gaussian. To alleviate this issue, we mimic the plain Monte Carlo strategy in (2.9) that uses the classical law of large numbers to obtain a local estimate of $C(t, x)$ by constructing batched or replicated designs. Batching generates multiple independent samples $y^{(1)}(x), \ldots, y^{(M)}(x)$ at the same $x$; these samples are then used to compute the empirical average $\bar{y}(x) := M^{-1} \sum_{m=1}^{M} y^{(m)}(x)$. Clearly, $\bar{y}(x)$ follows the same statistical model (2.15) but with a signal-to-noise ratio improved by a factor of $M$, since $\operatorname{Var}(\bar{y}(x)) = \sigma^2(x)/M$. This also reduces the post-averaged design size to $N' = N/M$ distinct sites, which speeds up and improves the training of the metamodel.
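A minimal R sketch of this batching step is given below; the `sampler` argument is a hypothetical stand-in for the stochastic sampler (2.8), and the helper name is ours.

```r
# Batch M replicated simulations per design site and return the averaged
# responses together with the empirical noise variances of the means.
batch_design <- function(sites, M, sampler) {
  ybar <- s2 <- numeric(length(sites))
  for (i in seq_along(sites)) {
    yrep    <- replicate(M, sampler(sites[i]))   # M iid replications of Y(x_i)
    ybar[i] <- mean(yrep)                        # averaged response, cf. (2.15) with noise/M
    s2[i]   <- var(yrep)                         # empirical sigma^2(x_i), cf. (3.7) below
  }
  data.frame(x = sites, ybar = ybar, noise.var = s2 / M)
}
```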
To recap, kriging gives a flexible nonparametric representation for $\hat{C}(t, \cdot)$. The fitting method automatically detects the spatial structure of the continuation value, obviating the need to specify the approximation function class. While the user must pick the kernel family, extensive experiments suggest that this choice has a second-order effect. Also, the probabilistic structure of kriging offers convenient goodness-of-fit diagnostics after $\hat{C}(t, \cdot)$ is obtained, yielding a tractable quantification of posterior uncertainty. This can be exploited for DoE (see Section 5 for details of how).
3 Kriging metamodels
Kriging models have emerged as perhaps the most popular framework for DoE over continuous input spaces $\mathcal{X} \subseteq \mathbb{R}^d$. In kriging, the cross-sectional interpolation is driven by the spatial smoothness of $f$ and essentially consists of aggregating the observed values of $Y$ in the neighborhood of $x$. A book-length treatment of kriging can be found in Williams and Rasmussen (2006) (see also Ankenman et al 2010; Kleijnen 2015, Chapter 5). To be precise, in this paper, we deal with stochastic kriging under heterogeneous sampling noise.
Stochastic kriging metamodels follow a Bayesian paradigm, treating $f$ in (2.15) as a random object belonging to a function space $\mathcal{H}$. Thus, for each $x$, $f(x)$ is a random variable whose posterior distribution is obtained based on the information collected from the samples and its prior distribution. Let $\mathcal{Z} := \{(x^n, y^n)\}_{n=1}^{N}$ be the information generated by the design. We then define the posterior distribution of $f(x)$ given $\mathcal{Z}$. The global map $x \mapsto f(x) \mid \mathcal{Z}$ is called the surrogate surface (a misnomer, since it is in fact measure valued). Its first two moments are the surrogate mean and variance, respectively:
$$m(x) := \mathbb{E}\big[ f(x) \mid \mathcal{Z} \big], \tag{3.1}$$
$$s^2(x) := \operatorname{Var}\big( f(x) \mid \mathcal{Z} \big). \tag{3.2}$$
Note that in this function-view framework, a point estimate of the unknown $f(x)$ is replaced by a full posterior distribution over the space $\mathcal{H}$; tractability is achieved by imposing and maintaining a Gaussian structure. The surrogate mean surface $m(\cdot)$ then serves as the natural proxy (being the maximum a posteriori probability estimator) for $f$, while the surrogate variance $s^2(\cdot)$ is the proxy for the (squared) standard error of $m(\cdot)$.
To model the relationship between $f(x)$ and $f(x')$ at different locations $x, x'$, kriging uses the reproducing kernel Hilbert space (RKHS) approach, treating the full $f(\cdot)$ as a realization of a mean-zero Gaussian process. If necessary, the response is first “de-trended” to justify the mean-zero assumption. The Gaussian process generating $f$ is based on a covariance kernel $c(\cdot, \cdot)$, with $c(x, x') := \operatorname{Cov}(f(x), f(x'))$. The resulting function class $\mathcal{H}_c$ forms a Hilbert space. The RKHS structure implies that both the prior and posterior distributions of $f(\cdot)$ given $\mathcal{Z}$ are multivariate Gaussian.
By specifying the correlation behavior, the kernel $c$ encodes the smoothness of the response surfaces drawn from the Gaussian process, which is measured in terms of the RKHS norm $\| f \|_{\mathcal{H}_c}$. The two main examples that we use are the squared exponential kernel,
$$c(x, x') := \sigma_f^2 \exp\Big( -\tfrac{1}{2} \| x - x' \|_{\boldsymbol{\theta}}^2 \Big), \tag{3.3}$$
and the Matérn 5/2 kernel,
$$c(x, x') := \sigma_f^2 \Big( 1 + \sqrt{5}\, \| x - x' \|_{\boldsymbol{\theta}} + \tfrac{5}{3} \| x - x' \|_{\boldsymbol{\theta}}^2 \Big) \exp\big( -\sqrt{5}\, \| x - x' \|_{\boldsymbol{\theta}} \big), \tag{3.4}$$
which both use the weighted Euclidean norm
$$\| x - x' \|_{\boldsymbol{\theta}}^2 := \sum_{i=1}^{d} \frac{(x_i - x'_i)^2}{\theta_i^2}.$$
The length-scale vector $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_d)$ controls the smoothness of the members of $\mathcal{H}_c$, meaning the smaller its value, the rougher the members. The variance parameter $\sigma_f^2$ determines the amplitude of fluctuations in the response. In the limit of large length scales, elements of $\mathcal{H}_c$ concentrate on nearly linear functions, effectively embedding linear regression; in the extreme case elements of $\mathcal{H}_c$ are constants. For both of the above cases, members of the function space $\mathcal{H}_c$ can uniformly approximate any continuous function on any compact subset of $\mathbb{R}^d$.
The use of different length scales $\theta_i$ for different coordinates of $x$ allows anisotropic kernels that reflect the varying smoothness of $C(t, \cdot)$ in terms of its different input coordinates. For example, in a Bermudan option valuation, continuation values would typically be more sensitive to the asset price than to a stochastic volatility factor, which would be reflected in the relative magnitudes of the fitted $\theta_i$.
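For concreteness, here are direct R implementations of the two kernels (3.3)–(3.4) with anisotropic length scales; the function names are ours.

```r
# Squared exponential and Matern 5/2 covariance kernels with anisotropic
# length-scale vector theta and process variance sigma2.
sqexp_kernel <- function(x, xp, theta, sigma2 = 1) {
  r2 <- sum(((x - xp) / theta)^2)                  # weighted squared distance
  sigma2 * exp(-r2 / 2)
}
matern52_kernel <- function(x, xp, theta, sigma2 = 1) {
  r <- sqrt(sum(((x - xp) / theta)^2))             # weighted distance
  sigma2 * (1 + sqrt(5) * r + 5 * r^2 / 3) * exp(-sqrt(5) * r)
}
```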
Let $\boldsymbol{y} = (y^1, \ldots, y^N)$ denote the observed noisy samples at locations $\boldsymbol{x} = (x^1, \ldots, x^N)$. Given the data $\mathcal{Z} = (\boldsymbol{x}, \boldsymbol{y})$ and the kernel $c$, the posterior of $f$ again forms a Gaussian process; in other words, any collection $f(x_*^1), \ldots, f(x_*^k)$ is multivariate Gaussian with means $m(x_*)$, covariances and variances $s^2(x_*)$, specified by the following matrix equations (Williams and Rasmussen 2006, Section 2.7):
$$m(x_*) = \boldsymbol{c}(x_*)^{\top} (\mathbf{C} + \boldsymbol{\Sigma})^{-1} \boldsymbol{y}, \tag{3.5}$$
$$s^2(x_*) = c(x_*, x_*) - \boldsymbol{c}(x_*)^{\top} (\mathbf{C} + \boldsymbol{\Sigma})^{-1} \boldsymbol{c}(x_*), \tag{3.6}$$
with $\boldsymbol{c}(x_*) := \big( c(x_*, x^1), \ldots, c(x_*, x^N) \big)^{\top}$, $\boldsymbol{\Sigma} := \operatorname{diag}\big( \sigma^2(x^1), \ldots, \sigma^2(x^N) \big)$ and the positive definite matrix $\mathbf{C}$, $\mathbf{C}_{ij} := c(x^i, x^j)$. The diagonal structure of $\boldsymbol{\Sigma}$ arises from the assumption that independent simulations have uncorrelated noise, that is, they use independent random numbers. Note that the uncertainty $s^2(x_*)$ associated with the prediction at $x_*$ has no direct dependence on the simulation outputs $\boldsymbol{y}$; all response points contribute to the estimation of the local error through their influence on the induced matrix $(\mathbf{C} + \boldsymbol{\Sigma})^{-1}$. Well-explored regions will have lower $s^2(\cdot)$, while regions with few or no observations (in particular, regions beyond the training design) will have high $s^2(\cdot)$.
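The matrix equations (3.5)–(3.6) translate directly into code; the following R sketch evaluates them by brute force (no numerical safeguards), reusing the kernel helpers above.

```r
# Direct evaluation of the stochastic kriging predictor (3.5)-(3.6).
# X: N x d matrix of design sites; y: (averaged) responses; noise: vector of
# noise variances sigma^2(x^i); xstar: a single d-vector prediction location.
krig_predict <- function(xstar, X, y, noise, theta, sigma2,
                         kern = matern52_kernel) {
  N <- nrow(X)
  C <- matrix(0, N, N)
  for (i in 1:N) for (j in 1:N) C[i, j] <- kern(X[i, ], X[j, ], theta, sigma2)
  cs <- sapply(1:N, function(i) kern(xstar, X[i, ], theta, sigma2))
  K  <- C + diag(noise, N)                    # C + Sigma
  m  <- sum(cs * solve(K, y))                 # surrogate mean (3.5)
  s2 <- sigma2 - sum(cs * solve(K, cs))       # surrogate variance (3.6); c(x*,x*) = sigma2
  list(mean = m, sd = sqrt(max(s2, 0)))
}
```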
3.1 Training a kriging metamodel
To fit a kriging metamodel requires specifying the kernel $c$, that is, fixing the kernel family and then estimating the respective kernel hyperparameters (such as $\sigma_f^2$ and $\theta_{1:d}$ in (3.4)) that are plugged into (3.5)–(3.6). To do this, one would typically use a maximum likelihood approach; this would entail solving a nonlinear optimization problem to find the maximum likelihood estimators (MLEs) defining $\hat{c}$. Alternatively, cross-validation techniques or expectation–maximization (EM) algorithms are also available.
While the choice of kernel family is akin to the choice of orthonormal basis in LSMC, in our experience, it plays only a marginal role in the performance of the overall pricing method. As a rule of thumb, we suggest the use of either the Matérn 5/2 kernel or the squared exponential kernel. In both cases, the respective function spaces are smooth (twice differentiable for the Matérn 5/2 family (3.4) and infinitely differentiable for the squared exponential kernel) and lead to similar fits. The performance is more sensitive to the kernel hyperparameters, which are typically estimated in a frequentist framework but could also be inferred from a Bayesian prior, or even guided by heuristics.
Remark 3.1.
One may combine a kriging model with a classical least squares regression on a set of basis functions $B_1(\cdot), \ldots, B_r(\cdot)$. This is achieved by taking $f(x) = \sum_{i=1}^{r} \beta_i B_i(x) + g(x)$, where the first term is a trend term as in (2.12), the $\beta_i$ are the trend coefficients to be estimated, and $g$ is the “residual” mean-zero Gaussian process. The universal kriging formulas (see Picheny et al 2010; Williams and Rasmussen 2006, Section 2.7) then allow simultaneous computation of the kriging surface and the OLS coefficients $\hat{\beta}_i$. For example, one may use the European option price (if one is explicitly available) as a basis function to capture some of the structure in $C(t, \cdot)$.
Inference on the kriging hyperparameters requires knowledge of the sampling noise $\sigma^2(\cdot)$. Indeed, while it is possible to train the kernel and to infer a constant simulation variance $\sigma^2$ simultaneously (the latter is known as the “nugget” in the machine learning community (Santner et al 2013)), with state-dependent noise, $\sigma^2(\cdot)$ is unidentifiable. Batching allows us to circumvent this challenge by providing empirical estimates of $\sigma^2(x^i)$. For each site $x^i$, we sample $M$ independent replicates $y^{(1)}(x^i), \ldots, y^{(M)}(x^i)$ and estimate the conditional variance as
$$\hat{\sigma}^2(x^i) := \frac{1}{M - 1} \sum_{m=1}^{M} \big( y^{(m)}(x^i) - \bar{y}(x^i) \big)^2, \tag{3.7}$$
where $\bar{y}(x^i) := M^{-1} \sum_{m=1}^{M} y^{(m)}(x^i)$ is the batch mean. We then use $\hat{\sigma}^2(x^i)$ as a proxy for the unknown $\sigma^2(x^i)$ to estimate (3.5)–(3.6). One could further improve the estimation of $\sigma^2(\cdot)$ by fitting an auxiliary kriging metamodel to the $\hat{\sigma}^2(x^i)$ (see Kleijnen (2015), Chapter 5.6 and the references therein). As shown in Picheny and Ginsbourger (2013), Section 4.4.2, after fixing $\hat{\sigma}^2(x^i)$ we can treat the $M$ samples at $x^i$ as a single design entry $(x^i, \bar{y}(x^i))$ with noise variance $\hat{\sigma}^2(x^i)/M$. The end result is that the batched data set has just $N' = N/M$ rows that are fed into the kriging metamodel.
For our examples below, we have used the R package DiceKriging (Roustant et al 2012). The software takes the input locations $x^{1:N'}$, the corresponding simulation outputs $\bar{y}(x^{1:N'})$ and the noise levels $\hat{\sigma}^2(x^{1:N'})/M$, as well as the kernel family (Matérn 5/2 (3.4) by default), and returns a trained kriging model. One then can utilize a predict call to evaluate (3.5)–(3.6) at given values of $x_*$. The package can also implement universal kriging. Finally, in cases where the design is not batched or the batch size is very small (below twenty or so), we resort to assuming a homoscedastic noise level that is estimated as part of training the kriging model (an option available in DiceKriging). The advantage of such plug-and-play functionality is the access it offers to an independently tested, speed-optimized, debugged and community-vetted code that minimizes implementation overheads and future maintenance (while perhaps introducing some speed inefficiencies).
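As an illustration of this workflow, the sketch below fits a batched kriging model with DiceKriging on a toy one-dimensional problem; the km() and predict() argument names follow the package's documented interface, but the toy sampler, domain and batch sizes are our own assumptions rather than the paper's settings.

```r
# Batched stochastic kriging fit with DiceKriging on synthetic 1D data.
library(DiceKriging)

sites <- seq(25, 40, length.out = 50)               # N' = 50 distinct design sites
M     <- 100                                        # replications per site
toy   <- function(x) max(40 - x, 0) + 5 * rnorm(1)  # toy noisy sampler for Y(x)

reps <- sapply(sites, function(x) replicate(M, toy(x)))
ybar <- colMeans(reps)                              # batch means
nv   <- apply(reps, 2, var) / M                     # noise variance of the means, cf. (3.7)

fit  <- km(design = data.frame(x = sites), response = ybar,
           covtype = "matern5_2", noise.var = nv)
pred <- predict(fit, newdata = data.frame(x = seq(25, 40, by = 0.5)), type = "SK")
head(cbind(mean = pred$mean, sd = pred$sd))         # surrogate mean/sd, cf. (3.5)-(3.6)
```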
Remark 3.2.
The kriging framework includes several well-developed generalizations, including treed Gaussian processes (Gramacy 2007), local Gaussian processes (Gramacy and Apley 2015), monotone Gaussian processes (Riihimäki and Vehtari 2010) and particle-based Gaussian processes (Gramacy and Polson 2011), that can be substituted for the vanilla method described above, similar to replacing plain OLS with more sophisticated approaches. All the aforementioned extensions can be used off-the-shelf through the public R packages described in the above references.
3.2 Approximation quality
To quantify the accuracy of the obtained kriging estimator $\hat{C}(t, \cdot) \equiv m(\cdot)$, we derive an empirical estimate of the loss function (2.14). To do so, we integrate the posterior distributions $f(x) \mid \mathcal{Z}$ vis-à-vis the surrogate means $m(x)$ that are proxies for $C(t, x)$ using (2.14) as follows:
$$\ell(x) := s(x)\, \phi\!\left( \frac{m(x) - h_t(x)}{s(x)} \right) - \big| m(x) - h_t(x) \big| \, \Phi\!\left( -\frac{| m(x) - h_t(x) |}{s(x)} \right), \tag{3.8}$$
where $\Phi$ and $\phi$ are the standard Gaussian cumulative/probability density functions. The quantity
$$\Phi\!\left( -\frac{| m(x) - h_t(x) |}{s(x)} \right)$$
gives precisely the Bayesian posterior probability of making the wrong exercise decision at $x$ based on the information from $\mathcal{Z}$; thus, a lower $\ell(x)$ indicates a better performance of the design in terms of stopping optimally. Integrating over $\mathcal{X}$ then gives an estimate for $\mathcal{L}_t$:
$$\hat{\mathcal{L}}_t := \int_{\mathcal{X}} \ell(x)\, p_t(x)\, \mathrm{d}x. \tag{3.9}$$
Practically, we replace the latter integral with a weighted sum over the design sites, $\hat{\mathcal{L}}_t \simeq \sum_{i} w_i\, \ell(x^i)$, with weights $w_i$ reflecting the density $p_t$.
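A minimal R sketch of this self-assessment step, consistent with the reconstruction of (3.8)–(3.9) above, is as follows; the weighting scheme is an illustrative choice.

```r
# Local loss l(x) and its weighted aggregate, given the surrogate mean m,
# surrogate sd s, immediate payoff h and weights w at a set of sites.
local_loss <- function(m, s, h) {
  d <- abs(m - h) / s
  s * dnorm(d) - abs(m - h) * pnorm(-d)     # expected timing-value loss at x
}
integrated_loss <- function(m, s, h, w) sum(w * local_loss(m, s, h)) / sum(w)
```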
3.3 Further reproducing kernel Hilbert space regression methods
From an approximation theory perspective, kriging is an example of a regularized kernel regression. The general formulation looks for the minimizer of the following penalized residual sum of squares (RSS) problem:
$$\hat{f} := \arg\min_{f \in \mathcal{H}} \; \sum_{n=1}^{N} \big( y^n - f(x^n) \big)^2 + \lambda \, \| f \|_{\mathcal{H}}^2, \tag{3.10}$$
where $\lambda$ is a chosen smoothing parameter and $\mathcal{H}$ is an RKHS. The summation in (3.10) is a measure of the closeness of $f$ to the data, while the rightmost term penalizes the fluctuations of $f$ to avoid overfitting. The representer theorem implies that the minimizer of (3.10) has an expansion in terms of the kernel eigenfunctions,
$$\hat{f}(x) = \sum_{n=1}^{N} \alpha_n\, c(x, x^n), \tag{3.11}$$
relating the prediction at $x$ to the kernel functions $c(\cdot, x^n)$ based at the design sites $x^n$; compare with (3.5). In kriging, the function space is $\mathcal{H} = \mathcal{H}_c$, generated by the kernel $c$, and the corresponding norm $\| \cdot \|_{\mathcal{H}_c}$ has a spectral decomposition in terms of differential operators (Williams and Rasmussen 2006, Chapter 6.2).
Another popular version of (3.10) is based on the smoothing splines (also known as thin-plate splines (TPSs)) that take $c(x, x') = \| x - x' \|^2 \log \| x - x' \|$, where $\| \cdot \|$ denotes the Euclidean norm in $\mathbb{R}^d$. In this case, the RKHS is the set of all twice continuously differentiable functions, as per Hastie et al (2009), Chapter 5, and the penalty (displayed for $d = 2$) is
$$\| f \|_{\mathcal{H}}^2 = \int\!\!\int_{\mathbb{R}^2} \left[ \left( \frac{\partial^2 f}{\partial x_1^2} \right)^{\!2} + 2 \left( \frac{\partial^2 f}{\partial x_1\, \partial x_2} \right)^{\!2} + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^{\!2} \right] \mathrm{d}x_1\, \mathrm{d}x_2. \tag{3.12}$$
This generalizes the one-dimensional penalty function $\int f''(x)^2\, \mathrm{d}x$. Note that under $\lambda = \infty$ and (3.12) the optimization in (3.10) reduces to the traditional least squares linear fit, since it introduces the constraint that all second derivatives of $f$ vanish. A common parameterization for the smoothing parameter is through the effective degrees of freedom statistic $\mathrm{df}_\lambda$; one may also select $\lambda$ adaptively via cross-validation or MLE (Hastie et al 2009, Chapter 5). The resulting optimization of (3.10), based on (3.12), gives a smooth response surface, that is, a TPS, and has the explicit form
$$\hat{f}(x) = \beta_0 + \boldsymbol{\beta}^{\top} x + \sum_{n=1}^{N} \alpha_n\, \eta\big( \| x - x^n \| \big), \qquad \eta(r) := r^2 \log r. \tag{3.13}$$
See Kohler (2008) for an implementation of RMC via splines.
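In one dimension, a smoothing-spline fit of this type is available in base R; the sketch below applies it to batch-averaged responses as a drop-in alternative interpolator (toy data, illustrative degrees of freedom).

```r
# Smoothing-spline metamodel for batch-averaged 1D responses (base R stats).
x    <- seq(25, 40, length.out = 50)
ybar <- 0.9 * pmax(40 - x, 0) + rnorm(50, sd = 0.3)   # toy averaged responses
fit  <- smooth.spline(x, ybar, df = 8)                # df = effective degrees of freedom
chat <- predict(fit, x = seq(25, 40, by = 0.5))$y     # fitted continuation values
```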
A further RKHS approach, applied to RMC in Tompaidis and Yang (2014), is furnished by the so-called (Gaussian) radial basis functions. These are based on the already mentioned squared exponential kernel (3.3), but they substitute the penalization in (3.10) with ridge regression, reducing the sum in (3.11) to $k \ll N$ terms by identifying/optimizing the “prototype” sites.
4 Designs for kriging regression Monte Carlo
Fixing the metamodeling framework, we turn to the problem of generating the experimental design $\boldsymbol{x}$. We work in a fixed-budget setting, with a prespecified number of simulations $N$ to run. Optimally constructing the respective experimental design requires solving an optimization problem over the $N$ design locations, which is generally intractable. Accordingly, in this section, we discuss heuristics for generating near-optimal static designs. Section 5 then considers another work-around, namely, sequential methods that utilize a divide-and-conquer approach.
4.1 Space-filling designs
A standard strategy in experimental design is to spread out the locations $x^{1:N'}$ to extract as much information as possible from the samples $y^{1:N'}$. A space-filling design attempts to distribute the sites evenly in the input space $\widetilde{\mathcal{X}}$, here distinguished from the theoretical domain $\mathcal{X}$ of $X$. While this approach does not directly target the objective (2.14), assuming $\widetilde{\mathcal{X}}$ is selected reasonably, it can still be much more efficient than traditional LSMC designs. Examples of space-filling approaches include maximum entropy designs, maximin designs and maximum expected information designs (Kleijnen 2015). Alternatively, one might choose a quasi-Monte Carlo (QMC) method; approaches of this kind use a deterministic space-filling sequence, such as the Sobol sequence, for generating $\boldsymbol{x}$. Gridded designs represent the simplest choice, picking $x^i$ on a prespecified lattice (see Figure 2). Quasi-random/low-discrepancy sequences have already been used in American option pricing but for different purposes. Boyle et al (2013), for instance, used them for approximating one-step values in a stochastic mesh approach, while Chaudhary (2005) used them for generating the trajectories of $X$. Here, we propose using quasi-random sequences directly for constructing the simulation design (see also Yang and Tompaidis (2013), who employ a similar strategy in a stochastic control application).
One may also generate random space-filling designs. This proves advantageous in our context of running multiple response surface models (indexed by $t$), since it avoids any grid-ghosting effects. One flexible procedure is Latin hypercube sampling (LHS) (McKay et al 1979). In contrast with deterministic approaches, which are generally only feasible in hyperrectangular $\widetilde{\mathcal{X}}$, LHS can easily be combined with an acceptance–rejection step to generate space-filling designs on any finite subspace.
Typically, the theoretical domain of $X$ is unbounded (eg, $\mathcal{X} = \mathbb{R}_+^d$), so the success of a space-filling design is driven by the judicious choice of an appropriate $\widetilde{\mathcal{X}}$. This issue is akin to specifying the domain for a numerical partial differential equation (PDE) solver and requires structural knowledge. For example, for a Bermudan put with strike $K$, we might take $\widetilde{\mathcal{X}}$ to be an interval of the form $[\underline{x}, K]$, covering a wide swath of the ITM region while eschewing OTM and very deep ITM scenarios. Ultimately, the problem setting determines the importance of this choice (compare the more straightforward setting of Section 6.3 below with that of Section 6.2, where finding a good $\widetilde{\mathcal{X}}$ is much harder).
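The sketch below generates three such space-filling macro designs of equal size on a rectangular candidate domain; it assumes the lhs and randtoolbox packages, and the domain and sizes are illustrative.

```r
# Gridded, Latin hypercube and Sobol macro designs on [25, 40]^2.
library(lhs)          # randomLHS()
library(randtoolbox)  # sobol()

Np    <- 150
lower <- c(25, 25); upper <- c(40, 40)
scale_to <- function(u) sweep(sweep(u, 2, upper - lower, "*"), 2, lower, "+")

X_lhs   <- scale_to(randomLHS(Np, 2))           # randomized space-filling design
X_sobol <- scale_to(sobol(Np, dim = 2))         # deterministic low-discrepancy design
X_grid  <- as.matrix(expand.grid(seq(25, 40, length.out = 13),
                                 seq(25, 40, length.out = 12)))[1:Np, ]  # lattice
```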
4.2 Probabilistic designs
The original solution proposed by Longstaff and Schwartz (2001) was to construct the design using the conditional density of $X_t$ given $X_0$, that is, to generate independent draws $x^n \sim p_t(\cdot \mid X_0)$, $n = 1, \ldots, N$. This strategy has two advantages. First, it permits implementing the backward recursion for estimating $\hat{C}(t, \cdot)$ on a fixed global set of scenarios that requires just a single (albeit very large) simulation; hence, it saves on the simulation budget. Second, a probabilistic design automatically adapts to the dynamics of $X$, picking out the regions that $X_t$ is likely to visit and removing the aforementioned challenge of specifying $\widetilde{\mathcal{X}}$.
Moreover, the empirical sampling of $x^n$ reflects the density $p_t$, which helps to lower the weighted loss (2.14). Assuming that the boundary $\partial\mathfrak{S}_t$ is not in a region where $p_t$ is very low, this generates a well-adapted design, which partly explains why the experimental design of LSMC is not entirely without its merits. Compared with a space-filling design that is sensitive to $\widetilde{\mathcal{X}}$, a probabilistic design is primarily driven by the initial condition $X_0$. This can be a boon (as discussed) or a limitation; with OTM put contracts, for example, the vast majority of empirically sampled sites would be OTM, dramatically lowering the quality of an empirical $\hat{\mathfrak{S}}_t$. A good compromise is to take $x^n \sim p_t\big( \cdot \mid X_0, \, h_t(X_t) > 0 \big)$, that is, to effectively condition on $X_t$ being ITM.
4.3 Batched designs
As discussed, the design sites $x^i$ need not be distinct, and the macro design can be replicated. Such batched designs offer several benefits in the context of metamodeling. First, the averaging of the simulation outputs dramatically improves the statistical properties of the averaged $\bar{y}(x^i)$ compared with the raw $y(x^i)$. In most cases, once $M$ is on the order of twenty or more, the Gaussian assumption regarding the respective noise becomes excellent. Second, batching raises the signal-to-noise ratio and hence simplifies response surface modeling. Training the kriging kernel with very noisy samples is difficult, as the likelihood function to be optimized possesses multiple local maximums. Third, as mentioned in Section 3.1, batching allows the empirical estimation of the heteroscedastic sampling noise $\sigma^2(\cdot)$ via (3.7). Algorithm 1 summarizes the steps for building a kriging-based RMC solver with a replicated design.
Batching also connects with the general bias–variance trade-off in statistical estimation. Plain (pointwise) Monte Carlo offers a zero-bias/high-variance estimator of $C(t, x)$, which is therefore computationally expensive. A low-complexity model, such as parametric least squares (2.12), offers a low-variance/high-bias estimator (since it requires constraining $\hat{C}(t, \cdot) \in \mathcal{H}_r$). Coupled with kriging, batched designs interpolate these two extremes by reducing bias through flexible approximation classes and reducing variance through the batching and modeling of spatial response correlation.
Remark 4.1.
The batch sizes $M_i$ can be adaptive. By choosing $M_i$ appropriately, one may set the noise variance $\hat{\sigma}^2(x^i)/M_i$ of the averaged observations to any desired level. In particular, one could make all averaged observations homoscedastic with a prespecified intrinsic noise level. The resulting regression model for $\bar{y}(\cdot)$ would then conform very closely to classical homoscedastic Gaussian assumptions regarding the distribution of simulation noise.
ALGORITHM 1 Regression Monte Carlo for optimal stopping using kriging.
Require: number of distinct design sites $N'$, number of replications $M$, number of time steps $T$.
1: Set $\hat{V}(T, \cdot) \leftarrow h_T(\cdot)$
2: Set $\hat{\mathfrak{S}}_T \leftarrow \{ x : h_T(x) > 0 \}$
3: for $t = T-1, T-2, \ldots, 1$ do
4: Generate a macro design $\boldsymbol{x}_t := (x^1, \ldots, x^{N'})$
5: Use the stochastic sampler in (2.8) with $M$ replications per $x^i$
6: Batch the replications according to (3.7) to obtain the averaged $\big( \bar{y}(x^i), \hat{\sigma}^2(x^i) \big)$
7: Fit a kriging metamodel $\hat{C}(t, \cdot)$ based on $\big( x^i, \bar{y}(x^i) \big)_{i=1}^{N'}$
8: Obtain the stopping set $\hat{\mathfrak{S}}_t$ from (2.11)
9: (Or replace steps 4–8 with a sequential design subproblem using Algorithm 2)
10: end for
11: (Generate fresh out-of-sample scenarios emanating from $X_0$ and approximate $\hat{V}(0, X_0)$ using (2.9))
12: return Estimated stopping sets $\hat{\mathfrak{S}}_{1:T-1}$
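To show how the pieces of Algorithm 1 fit together, here is a compact R skeleton for the one-dimensional GBM Bermudan put of Section 4.4 (our own simplified code, not the paper's implementation); it uses DiceKriging for the metamodel, a crude random stand-in for the space-filling macro design, and illustrative values for the design size and batch size.

```r
# Simplified backward recursion of Algorithm 1 for a 1D Bermudan put under GBM.
library(DiceKriging)

K <- 40; r <- 0.06; sigma <- 0.2; dt <- 0.04; Tn <- 25      # 25 exercise dates
payoff <- function(x) pmax(K - x, 0)
step_X <- function(x) x * exp((r - 0.5 * sigma^2) * dt + sigma * sqrt(dt) * rnorm(length(x)))

Np <- 50; M <- 20                          # distinct sites and replications per site
models <- vector("list", Tn)               # fitted surrogates, indexed by time step

# Pathwise payoff H_t(x): step forward and stop at the first s > t where the
# payoff is positive and at least the predicted continuation value (or at maturity).
H <- function(x, t) {
  for (s in (t + 1):Tn) {
    x <- step_X(x); h <- payoff(x)
    if (s == Tn) return(exp(-r * dt * (s - t)) * h)
    Chat <- predict(models[[s]], newdata = data.frame(x = x), type = "SK")$mean
    if (h > 0 && h >= Chat) return(exp(-r * dt * (s - t)) * h)
  }
}

for (t in (Tn - 1):1) {                    # backward over exercise dates
  sites <- 25 + 15 * runif(Np)             # stand-in for a space-filling design on the ITM region
  reps  <- sapply(sites, function(x0) replicate(M, H(x0, t)))
  models[[t]] <- km(design = data.frame(x = sites), response = colMeans(reps),
                    covtype = "matern5_2", noise.var = apply(reps, 2, var) / M,
                    control = list(trace = FALSE))
}
```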
As the number of replications $M$ becomes very large, batching effectively eliminates all sampling noise. Consequently, one can substitute the empirical average $\bar{y}(x^i)$ from (3.7) for the true $C(t, x^i)$, an approach first introduced by Van Beers and Kleijnen (2003). It now remains only to interpolate these responses, reducing the metamodeling problem to the analysis of deterministic experiments. There are a number of highly efficient interpolating metamodels, including classical deterministic kriging (which takes $\boldsymbol{\Sigma} \equiv 0$ in (3.5)–(3.6)) and natural cubic splines. Assuming a smooth underlying response, deterministic metamodels often enjoy very fast convergence. Application of interpolators offers a new perspective (which is also more pleasing visually) on the RMC metamodeling problem. To sum up, batching offers a tunable spectrum of methods that blend the smoothing and interpolation of the simulation outputs.
Figure 2 illustrates the above idea for a one-dimensional Bermudan put in a GBM model (see Section 4.4). We construct a design of macro size $N$ that uses just six distinct sites $x^i$, with $M = N/6$ replications at each site. We then interpolate the obtained values $\bar{y}(x^i)$ using (i) a deterministic kriging model and (ii) a natural cubic spline. We observe that the resulting fit for $\hat{C}(t, \cdot)$, shown in Figure 2, looks excellent and yields an accurate approximation of the exercise boundary.
4.4 Illustration
To illustrate different simulation designs, we revisit the classic example of a one-dimensional Bermudan put option in a geometric Brownian motion model. More extensive experiments are given in Section 6. The lognormal $X$-dynamics are
$$\mathrm{d}X_t = (r - \delta)\, X_t\, \mathrm{d}t + \sigma X_t\, \mathrm{d}W_t, \tag{4.1}$$
and the put payoff is $h_t(x) = \mathrm{e}^{-rt} (K - x)_+$. The option matures at $T = 1$, and we assume twenty-five exercise opportunities spaced out evenly with $\Delta t = 0.04$. This means that, with a slight abuse of notation, the RMC algorithm is executed over $t = 1, \ldots, 25$ time steps. The rest of the parameter values are interest rate $r = 0.06$, dividend yield $\delta = 0$, volatility $\sigma = 0.2$ and at-the-money (ATM) strike $K = X_0 = 40$. In this case, $V(0, X_0) \simeq 2.31$ (see Longstaff and Schwartz 2001). In one dimension, the stopping region for the Bermudan put is an interval; thus, there is a unique exercise boundary.
We implement Algorithm 1 using a batched LHS design, so that at each step there are $N' = N/M$ distinct simulation sites $x^i$. The domain for the experimental designs is the ITM region $\widetilde{\mathcal{X}} \subset (0, K)$. To focus on the effect of the various instances of $\boldsymbol{x}$, we freeze the kriging kernel across all time steps as a Matérn 5/2-type (3.4) with fixed hyperparameters that were close to the MLE estimates obtained. This choice is maintained for the rest of this section as well as for Figures 4 and 5. Our experiments (see, in particular, Table 3) imply that the algorithm is insensitive to both the kernel family and the specific software/approach used for training the kriging kernels.
Figure 3 illustrates the effect of various experimental designs on the kriging metamodels. We compare three different batched designs with a fixed overall size $N$. These designs are (1) an LHS design with a small batch size $M$, as illustrated in panels (a), (d) and (g); (2) an LHS design with a large $M$, as illustrated in panels (b), (e) and (h); and (3) a probabilistic density-based design, also with a large $M$, as illustrated in panels (c), (f) and (i). Recall that the latter generates paths of $X$ and then creates subsimulations initiating at each sampled $X_t$. The resulting fits at the chosen time step are shown in panels (a)–(c) of Figure 3, along with the obtained estimates $\hat{C}(t, \cdot)$. Panels (d)–(f) of Figure 3 show the resulting surrogate standard deviations $s(\cdot)$. The shape of $s(\cdot)$ is driven by the local density of the respective design as well as the simulation noise $\sigma^2(\cdot)$. Here, $\sigma^2(\cdot)$ is highly heteroscedastic, being much larger near the ATM region than deep ITM. This is because deep ITM one is likely to stop sooner, and so the variance of $H_t(x; \cdot)$ is lower. For LHS designs, the roughly uniform density of $x^i$ leads to the posterior variance being roughly proportional to the simulation noise, $s^2(x) \propto \sigma^2(x)$; in contrast, the empirical design reflects the higher density of $X_t$ close to the initial condition, so that $s^2(\cdot)$ is smallest there. Moreover, because the empirical design places very few sites deep ITM (just five in Figure 3), the corresponding surrogate variance has a distinctive hill-and-valley shape. In all cases, the surrogate variance explodes along the edges of $\widetilde{\mathcal{X}}$, reflecting the lack of information about the response there.
Panels (g)–(i) of Figure 3 show the local loss metric $\ell(\cdot)$ from (3.8) for the corresponding designs. We stress that $\ell(\cdot)$ is a function of the sampled responses and hence will vary across algorithm runs. Intuitively, $\ell(x)$ is driven by the respective surrogate variance $s^2(x)$, weighted in terms of the distance to the stopping boundary. Consequently, a large $s^2(x)$ is fine for regions far from the exercise boundary. Overall, we can make the following observations.
- (i)
Replicating simulations does not materially impact the surrogate variance at any given $x$ and hence has little effect on $\ell(\cdot)$. Consequently, there is minimal loss of fidelity relative to using a much larger number of distinct design sites.
- (ii)
The modeled response surfaces tend to be smooth. Consequently, the kernel $c$ should enforce long-range spatial dependence (via a sufficiently large length scale) to ensure that $\hat{C}(t, \cdot)$ is likewise smooth. Undesirable fluctuations in $\hat{C}(t, \cdot)$ can generate spurious stopping/continuation regions (see, for example, panel (a) in Figure 3).
- (iii)
Probabilistic designs tend to increase $\ell(\cdot)$ because they fail to place enough design sites deep ITM, leading to extreme posterior uncertainty there (see panel (i) in Figure 3).
- (iv)
Due to the very low signal-to-noise ratio that is common in financial applications, replication of the simulations aids in seeing the “shape” of $\hat{C}(t, \cdot)$ and hence simplifies the task of metamodeling (see point (ii)).
To give a sense of the aggregate quality of the metamodels, Figure 4 shows the estimated timing value $\hat{T}(t, \cdot) = \hat{C}(t, \cdot) - h_t(\cdot)$ at three different steps as the RMC implements backward recursion. The overall shape of $\hat{T}(t, \cdot)$ remains essentially the same, being strongly positive close to the strike $K$, crossing zero at the exercise boundary and remaining slightly negative deep ITM. As $t$ decreases, the exercise boundary also moves lower. Figure 4 shows how the cross-sectional borrowing of information (modeled by the Gaussian process correlation kernel $c$) reduces the surrogate variance compared with the raw batch standard errors (see the corresponding 95% credibility intervals (CIs)). We also observe some undesirable “wiggling” of $\hat{T}(t, \cdot)$, which can generate spurious exercise boundaries such as in the extreme left of Figure 4(c). However, with an LHS design, this effect is largely mitigated compared with the performance of the empirical design in Figure 3. This confirms the importance of spacing out the design $\boldsymbol{x}$. Indeed, for the LHS designs, the phenomenon of spurious exercising only arises for small $t$ and very small $x$, whereby the event that would lead to the wrong exercise strategy has a negligible probability of occurring.
5 Sequential designs
In this section, we discuss adaptive designs that are generated on the fly as the metamodel learns $\hat{C}(t, \cdot)$. Sequential design captures the intuition of focusing the sampling efforts around the exercise boundary (see the loss function (2.14)). This makes learning the response intrinsic to constructing the experimental design. In particular, Bayesian procedures embed sequential design within a dynamic programming framework that is naturally married to statistical learning. Algorithmically, sequential design is implemented at each time step by introducing an additional loop over $k = N_0, N_0 + 1, \ldots, N'$ that grows the experimental designs $\boldsymbol{x}^{(k)}$, now indexed by their size ($k$ is the number of distinct sites, with batching possibly applied in the inner loop as before). As the designs are augmented, the corresponding surrogate surfaces and stopping regions $\hat{\mathfrak{S}}_t^{(k)}$ are reestimated and refined. The final result is a sequence of adaptive designs and the corresponding exercise policy.
5.1 Augmenting the experimental design
Fixing a time step $t$, the sequential design loop is initialized with a design $\boldsymbol{x}^{(N_0)}$ of size $N_0$, generated, for example, using LHS. New design sites are then iteratively added by greedily optimizing an acquisition function that quantifies information gains via expected improvement (EI) scores. The aim of EI scores is to identify the locations $x$ that are most promising in terms of lowering the global loss function from (2.14). In our context, the EI scores are based on the posterior distributions $f(\cdot) \mid \mathcal{Z}^{(k)}$, which summarize the information learned so far about $C(t, \cdot)$. The seminal works of Cohn (1996) and MacKay (1992) showed that a good heuristic for maximizing learning rates is to sample at sites that have high surrogate variance or high expected surrogate variance reduction, respectively. Because we target (2.14), a second objective is to preferentially explore regions close to the contour $\{ m^{(k)}(\cdot) = h_t(\cdot) \}$. This is achieved by blending the distance to the contour with the above variance metrics, in analogy to the efficient global optimization approach (Jones et al 1998) in simulation optimization.
A greedy strategy augments $\boldsymbol{x}^{(k)}$ with the location $x^{k+1}$ that maximizes the acquisition function $E_k(\cdot)$:
$$x^{k+1} := \arg\max_{x \in \widetilde{\mathcal{X}}} E_k(x). \tag{5.1}$$
Two major concerns for implementing (5.1) are (i) the computational cost of optimizing over $\widetilde{\mathcal{X}}$ and (ii) the danger of myopic strategies. For the first aspect, we note that (5.1) introduces a whole new optimization subproblem just for augmenting the design. This can generate substantial overhead that deteriorates the running time of the entire RMC. Consequently, approximate optimality is a possible pragmatic compromise to maximize performance. For the second aspect, the myopic nature of (5.1) might lead to an undesirable concentration of the design interfering with the convergence of the surrogate as $k$ grows. It is well known that many greedy sequential schemes can get trapped sampling in some subregion of $\widetilde{\mathcal{X}}$, generating poor estimates elsewhere. For example, if there are multiple exercise boundaries, the EI metric might over-prefer established boundaries at the risk of missing other stopping regions that have not yet been found. A common way to circumvent this concern is to randomize (5.1), thus ensuring that $\boldsymbol{x}^{(k)}$ grows dense uniformly on $\widetilde{\mathcal{X}}$. In the examples below we follow Gramacy and Ludkovski (2015) and replace the maximization in (5.1) with a search over $x^{k+1} \in \mathcal{D}^{(k)}$, where $\mathcal{D}^{(k)}$ is a finite candidate set, generated using LHS over some domain. LHS candidates ensure that potential new locations are representative and well spaced out over $\widetilde{\mathcal{X}}$.
Kriging metamodels are especially well suited to sequential design thanks to the availability of simple updating formulas that allow us to efficiently assimilate new data points into an existing fit. If a new sample $\big( x^{k+1}, \bar{y}(x^{k+1}) \big)$ is added to an existing design while keeping the kernel $c$ fixed, the surrogate mean and variance at a location $x$ are updated via
$$m^{(k+1)}(x) = m^{(k)}(x) + \lambda(x; x^{k+1}) \big( \bar{y}(x^{k+1}) - m^{(k)}(x^{k+1}) \big), \tag{5.2}$$
$$s^{(k+1)}(x)^2 = s^{(k)}(x)^2 - \lambda(x; x^{k+1})\, c^{(k)}(x, x^{k+1}), \tag{5.3}$$
where $\lambda(x; x^{k+1})$ is a weight function (that is available analytically in terms of the posterior covariance $c^{(k)}$) specifying the influence of the new sample at $x^{k+1}$ on $m^{(k+1)}(x)$, conditional on the existing design locations $\boldsymbol{x}^{(k)}$. Note that (5.2) and (5.3) require only the knowledge of the latest surrogate mean/variance $m^{(k)}$ and $s^{(k)}$. Previous simulation outputs do not need to be stored. Moreover, the updated surrogate variance is a deterministic function of $x^{k+1}$ that is independent of $\bar{y}(x^{k+1})$. In particular, the local reduction in surrogate standard deviation at $x^{k+1}$ is proportional to the current $s^{(k)}(x^{k+1})$ (Chevalier et al 2014):
$$s^{(k+1)}(x^{k+1})^2 = s^{(k)}(x^{k+1})^2\, \frac{\sigma^2(x^{k+1})}{\sigma^2(x^{k+1}) + s^{(k)}(x^{k+1})^2}. \tag{5.4}$$
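A minimal R sketch of the updating step, consistent with the reconstruction of (5.2)–(5.4) above, is shown below; it assumes that the current posterior covariance between the prediction points and the new site is available from the fitted model.

```r
# Assimilate one new (batched) observation ynew at xnew without refitting.
# m, s2:         current surrogate mean and variance at the prediction points
# cov_k:         current posterior covariance between those points and xnew
# m_new, s2_new: current surrogate mean and variance at xnew
# noise_new:     simulation noise variance of the new averaged observation
update_surrogate <- function(m, s2, cov_k, m_new, s2_new, noise_new, ynew) {
  lambda <- cov_k / (noise_new + s2_new)            # analytic weight, cf. (5.2)
  list(mean = m  + lambda * (ynew - m_new),         # mean update (5.2)
       var  = s2 - lambda * cov_k)                  # variance update (5.3)
}
```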
Remark 5.1.
In principle, the hyperparameters of the kernel $c$ ought to be reestimated as more data is collected, which makes updating nonsequential. However, since we expect the estimated hyperparameters to converge as $k$ grows, it is a feasible strategy to keep them frozen across sequential design iterations, allowing the use of (5.2)–(5.3).
Algorithm 2 summarizes a sequential design for learning $\hat{\mathfrak{S}}_t$ at a given time step $t$; see also Algorithm 1 for the overall kriging-based RMC approach to optimal stopping.
ALGORITHM 2 Sequential design for (2.14) using stochastic kriging.
Require: Initial design size $N_0$, final size $N'$, acquisition function $E(\cdot)$
1: Generate an initial design $\boldsymbol{x}^{(N_0)} := (x^1, \ldots, x^{N_0})$ of size $N_0$ using LHS over $\widetilde{\mathcal{X}}$
2: Sample the corresponding batched $\bar{y}(x^{1:N_0})$ and initialize the response surface model
3: Construct the stopping set $\hat{\mathfrak{S}}_t^{(N_0)}$ using (2.11)
4: $k \leftarrow N_0$
5: while $k < N'$ do
6: Generate a new candidate set $\mathcal{D}^{(k)}$ using LHS
7: Compute the expected improvement $E_k(x)$ for each $x \in \mathcal{D}^{(k)}$
8: Pick a new location $x^{k+1} := \arg\max_{x \in \mathcal{D}^{(k)}} E_k(x)$ and sample the corresponding batched $\bar{y}(x^{k+1})$
9: (Optional) Reestimate the kriging kernel $c$
10: Update the surrogate mean and variance using (5.2)–(5.3)
11: Update the stopping set $\hat{\mathfrak{S}}_t^{(k+1)}$ using (2.11)
12: Save the overall grid $\boldsymbol{x}^{(k+1)} \leftarrow \boldsymbol{x}^{(k)} \cup \{ x^{k+1} \}$
13: $k \leftarrow k + 1$
14: end while
15: return Estimated stopping set $\hat{\mathfrak{S}}_t^{(N')}$ and design $\boldsymbol{x}^{(N')}$
Remark 5.2.
The ability to quickly update the surrogate as simulation outputs are collected lends kriging models to “online” and parallel implementations. This can be useful even without going through a full sequential design framework. For example, one can pick an initial budget, implement RMC with those simulations and – if the results are not sufficiently accurate – add more simulations without having to completely restart from scratch. Similarly, batching simulations is convenient for parallelizing using graphics processing unit (GPU) technology.
5.2 Acquisition functions
In this section, we propose several acquisition functions to guide the sequential design subproblem (5.1). Throughout we are cognizant of the loss function (2.14) that is the main criterion for learning $\hat{\mathfrak{S}}_t$.
One possible proposal for an EI metric is to sample at locations that have a high (weighted) local loss $\ell(\cdot)$ defined in (3.8), that is,
$$E_k^{\mathrm{ZC}}(x) := \ell_k(x), \tag{5.5}$$
where $\ell_k$ denotes the local loss evaluated under the current surrogate $\big( m^{(k)}, s^{(k)} \big)$. The superscript “ZC” stands for zero-contour, since intuitively $\ell_k(x)$ measures the weighted distance between $x$ and the exercise boundary (see Gramacy and Ludkovski 2015). $E_k^{\mathrm{ZC}}$ targets regions with high surrogate variance or regions that are close to the exercise boundary, where $| m^{(k)}(x) - h_t(x) |$ is small.
A more refined version, dubbed ZC–SUR, attempts to identify regions where the loss can be quickly lowered through additional sampling. To do so, ZC–SUR incorporates the expected difference $\ell_k(x) - \mathbb{E}[\ell_{k+1}(x)]$, which can be evaluated using the updating formulas (5.2) and (5.3). This is similar in spirit to the stepwise uncertainty reduction (SUR) criterion in stochastic optimization proposed by Picheny et al (2010). Plugging in (2.14), we obtain
$$E_k^{\mathrm{ZC\text{-}SUR}}(x) := \ell_k(x) - \mathbb{E}\big[ \ell_{k+1}(x) \mid x^{k+1} = x \big]. \tag{5.6}$$
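The sketch below scores a candidate set under both criteria; it is a simplification of (5.5)–(5.6) in that the expected post-sample loss is approximated by keeping the surrogate mean fixed and only shrinking the surrogate standard deviation via (5.4).

```r
# ZC and (approximate) ZC-SUR acquisition scores at candidate sites.
# m, s: surrogate mean and sd; h: immediate payoff; sig2: simulation noise.
loss_fn <- function(m, s, h) {                       # local loss, cf. (3.8)
  d <- abs(m - h) / s
  s * dnorm(d) - abs(m - h) * pnorm(-d)
}
acq_zc     <- function(m, s, h) loss_fn(m, s, h)                      # (5.5)
acq_zc_sur <- function(m, s, h, sig2) {
  s_new <- s * sqrt(sig2 / (sig2 + s^2))             # post-sample sd, cf. (5.4)
  loss_fn(m, s, h) - loss_fn(m, s_new, h)            # approximate loss reduction (5.6)
}
# Greedy step (5.1): xnew <- cand[which.max(acq_zc_sur(m, s, h, sig2))]
```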
Figure 5 illustrates the ZC–SUR algorithm in the two-dimensional max-call setting described in Section 6.2. Panel (a) shows the initial Sobol macro design of $N_0$ sites. Panel (b) shows the final augmented design $\boldsymbol{x}^{(N')}$. At each site, a batch of $M$ replications is used to reduce intrinsic noise. We observe that the algorithm aggressively places most of the new design sites around the zero-contour of the timing value, primarily lowering the surrogate variance (ie, maximizing the accuracy) in that region.
The empirical integrated local loss $\hat{\mathcal{L}}_t$ from (3.9) provides an indicator of algorithm performance by quantifying the accuracy of $\hat{\mathfrak{S}}_t$. This is particularly relevant for the sequential design approach, where one is able to track $\hat{\mathcal{L}}_t$ as the designs are augmented. Figure 6 shows the evolution of $\hat{\mathcal{L}}_t$ as a function of the macro design size $k$ for two different time steps as well as the same two-dimensional max-call contract. We observe that $k \mapsto \hat{\mathcal{L}}_t$ is generally decreasing; the SUR criterion intuitively makes the latter a random walk with negative drift given by the realized EI scores. We also observe that approximation errors are larger for later time steps, that is, $t \mapsto \hat{\mathcal{L}}_t$ is increasing. This is because $\hat{\mathcal{L}}_t$ is based on the underlying distribution of $X_t$; for larger $t$ values, the volume of the effective support of the latter grows, lowering the density of the design sites. This increases surrogate variance and, in turn, raises the local loss $\ell(\cdot)$ in (3.8).
While the values of $\hat{\mathcal{L}}_t$ alone are not sufficient to yield a confidence interval for the option price (due to the complex error propagation), self-assessment allows the possibility of the adaptive termination of Algorithm 1. For example, one could implement Algorithm 1 until $\hat{\mathcal{L}}_t$ falls below a prespecified tolerance level. In view of the above discussion, this would lead to larger experimental designs for $t$ close to $T$, which would counteract error propagation in particular.
6 Numerical experiments
We proceed to implement our RMC methods for three different types of Bermudan options in several dimensions. Our examples are based on previously published parameter sets, allowing direct comparison of the methods’ relative accuracy. To benchmark our approach, we compare against two implementations of the conventional LSMC, namely, global polynomial bases and local piecewise linear bases following the method of Bouchard and Warin (2011). The latter consists of subdividing $\mathcal{X}$ into equiprobable rectangular subdomains and separately regressing the pathwise payoffs against the state in each cell. This strategy does not require user-specified basis function selection for (2.12), although it is sensitive to the chosen number of subdomains. In each case study, to enable an “apples-to-apples” comparison of different RMC methods, all approximations of the exercise policy are evaluated on a fixed out-of-sample set of paths of $X$.
6.1 Benchmarked examples
While our first visualizations were in one dimension, to offer more realistic setups we use two-dimensional and three-dimensional models as benchmarks. In both cases, we work with iid geometric Brownian motion factors. As a first example, we consider a two-dimensional basket put, where each asset follows the GBM dynamics (4.1) under the same parameters as in Section 4.4. The payoff is the put on the average of the two assets, $(K - \tfrac{1}{2}(x^1 + x^2))_+$, with the contract initially at the money. For our second example, we consider a three-dimensional rainbow or max-call payoff, $(\max(x^1, x^2, x^3) - K)_+$. The parameters are from Andersen and Broadie (2004).
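For concreteness, the two benchmark payoffs can be coded as follows; the equal-weight arithmetic average in the basket put is an assumption consistent with the usual formulation, and the strike is left as a parameter.

```python
# The two benchmark payoffs of Section 6.1 in code form. The equal-weight
# average in the basket put is an assumption consistent with the usual
# formulation; the strike K is a parameter.
import numpy as np

def basket_put(x, K):
    """Basket put: (K - average of the assets)^+, x of shape (n, d)."""
    return np.maximum(K - x.mean(axis=1), 0.0)

def max_call(x, K):
    """Rainbow/max call: (max over assets - K)^+, x of shape (n, d)."""
    return np.maximum(x.max(axis=1) - K, 0.0)
```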
As a first step, Tables 1 and 2 show the impact of different experimental designs. We concentrate on the space-filling designs, evaluating the randomized LHS design as well as two low-discrepancy sequences (Halton and Sobol), while varying the batch size. For completeness, we also compare them with the probabilistic density-based design and a sequential (namely, ZC–SUR) DoE strategy. For the basket put (see Table 1), the designs operate on a triangular input space that covers the relevant ITM region of the contract (the QMC sequences were generated in a cube and then restricted to the triangular subset). For the max call (see Table 2), the designs are on a rectangular region. In both examples, we purposely choose very limited simulation budgets to accentuate the differences between the methods.
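A sketch of how such designs can be generated with scipy's quasi-Monte Carlo tools is given below. The particular triangular region {x1 + x2 <= 2K}, the oversampling factor and the bounding rectangle are assumptions illustrating the "generate in a cube, then restrict" step described above.

```python
# Sketch of the space-filling designs compared above: Sobol, Halton or LHS
# points generated in a bounding box and, for the basket put, restricted to a
# triangular ITM region. The region {x1 + x2 <= 2K} and the oversampling factor
# are illustrative assumptions.
import numpy as np
from scipy.stats import qmc

def space_filling(kind, n, lb, ub, seed=0):
    d = len(lb)
    sampler = {"sobol": qmc.Sobol(d, scramble=True, seed=seed),
               "halton": qmc.Halton(d, scramble=True, seed=seed),
               "lhs": qmc.LatinHypercube(d, seed=seed)}[kind]
    return qmc.scale(sampler.random(n), lb, ub)

def triangular_itm_design(kind, n_sites, K, lb, ub, seed=0):
    """Oversample in the rectangle, then keep points with x1 + x2 <= 2K."""
    pts = space_filling(kind, 4 * n_sites, lb, ub, seed)
    pts = pts[pts.sum(axis=1) <= 2 * K]
    return pts[:n_sites]
```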
Not surprisingly, fully capturing the shape of the exercise boundary with just a handful of design sites (even in the idealized setting where one obtains noise-free samples of the continuation value at an optimally chosen collection of sites) is impossible. Nevertheless, compared with the tens of thousands of design sites involved in LSMC, being able to get away with a couple of hundred macro sites seems almost magical. In this example, we also see that the probabilistic design performs very well, partly because the put was originally ATM. The LHS design is a little off, while the two low-discrepancy sequences perform essentially identically. We stress that differences of less than 0.1 cents in the price estimate are negligible, so the main conclusion from Table 1 is that the precise structure of a space-filling DoE plays only a minor role.
Relative to a conventional LSMC approach, the more sophisticated kriging metamodel introduces significant regression overhead. With a standard RMC, about 95% of the computation time is spent on path generation and the rest on regression; in our approach, this allocation is turned on its head, with more than 90% of the time devoted to regression. This becomes even more extreme in sequential methods, where the time cost of the simulations becomes essentially negligible. The regression cost arises from the complexity of making predictions (3.5) with the kriging model; as the size of the macro design increases, this becomes the dominant expense of the whole algorithm. As a result, the running time of our method is very sensitive to the macro design size. By way of comparison, for the two-dimensional put problem a single run in our computing environment took 8 seconds at the smallest macro design considered, 20 seconds at the intermediate size and 190 seconds at the largest. The situation is even more acute for the sequential DoE, which is an order of magnitude slower; for this reason, we did not run its most expensive configuration (which takes many minutes). These facts indicate that batching is essential for the adequate performance of the kriging RMC. A full investigation into the optimal amount of batching (the choice of the batch size in (3.7)) is beyond the scope of this paper; however, based on our experience, a moderate batch size is appropriate in a typical financial context, offering a good trade-off between the computational speed-up from replication and maintaining an adequate number of distinct design sites.
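The batching just described amounts to replicating the one-step simulation at every macro site and averaging, which also yields a site-specific noise estimate; a minimal sketch, with one_step_value a hypothetical user-supplied simulator, is given below. With scikit-learn, for instance, the resulting variances of the batch means can be passed as the array-valued alpha argument of GaussianProcessRegressor, which adds them to the diagonal of the kernel matrix.

```python
# Sketch of batching: at each macro design site, run Nb replicated one-step
# simulations, then use the batch mean as the regression response and the
# variance of the batch mean as the site-specific noise level.
# `one_step_value` is a hypothetical user-supplied simulator.
import numpy as np

def batch_design(sites, one_step_value, Nb, rng=None):
    rng = np.random.default_rng(rng)
    means = np.empty(len(sites))
    noise_var = np.empty(len(sites))
    for i, x in enumerate(sites):
        samples = np.array([one_step_value(x, rng) for _ in range(Nb)])
        means[i] = samples.mean()
        noise_var[i] = samples.var(ddof=1) / Nb   # variance of the batch mean
    return means, noise_var
```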
Table 1 Two-dimensional basket put of Section 6.1: price estimates (SD in parentheses) for different designs and three batch sizes.

| Design / batch size | | | |
|---|---|---|---|
| Probabilistic | 1.458 (0.002) | 1.448 (0.003) | 1.443 (0.006) |
| LHS | 1.453 (0.002) | 1.446 (0.004) | 1.416 (0.033) |
| Sobol QMC | 1.454 (0.002) | 1.448 (0.002) | 1.454 (0.002) |
| Halton QMC | 1.456 (0.003) | 1.447 (0.003) | 1.452 (0.003) |
| Sequential ZC–SUR | – | 1.438 (0.003) | 1.450 (0.003) |
Table 2 Three-dimensional max call of Section 6.1: price estimates (SD in parentheses) for different designs and two batch sizes.

| Design / batch size | | |
|---|---|---|
| Probabilistic | 11.115 (0.015) | 11.128 (0.015) |
| LHS | 11.071 (0.025) | 11.110 (0.029) |
| Sobol QMC | 11.131 (0.020) | 11.177 (0.015) |
| Halton QMC | 11.120 (0.017) | 11.183 (0.017) |
| Sequential ZC–SUR | 11.160 (0.025) | 11.147 (0.014) |
Table 3 illustrates the effect of choosing different kriging kernels; we consider the Matérn 5/2 (3.4), the Gaussian/squared exponential (3.3) and the Matérn 3/2 families (Roustant et al 2012). In each case, the kernel hyperparameters are fitted in a black-box manner by the DiceKriging software, and the underlying design is space-filling, based on the Sobol sequence. As expected, optimizing the kernel gives more accurate results, improving on the fixed-kernel answers presented in Tables 1 and 2; it also increases the simulation time by 20%. The other conclusion is that the choice of kernel family has a negligible effect on performance. Hence, for the rest of the experiments we work with the default choice of a squared exponential kernel.
Table 3 Effect of the kriging kernel family: price estimates (SD in parentheses) for the two benchmark contracts.

| Kernel | 2D basket put | 3D max call |
|---|---|---|
| Matérn 5/2 | 1.452 (0.00243) | 11.191 (0.0139) |
| Matérn 3/2 | 1.450 (0.00252) | 11.191 (0.0136) |
| Gaussian | 1.453 (0.00273) | 11.172 (0.0160) |
From the above experiments, the following guidance can be given for implementing kriging RMC. Select a batch size in the range 50–100: this is sufficient to estimate the local simulation noise reasonably well while significantly reducing the macro design size. Use whichever kernel family and hyperparameter optimization the software recommends. As a goodness-of-fit check, examine the posterior variance along the estimated stopping boundary; it should not be too large.
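The guidance above can be prototyped as follows, using scikit-learn as a stand-in for DiceKriging; kernel hyperparameters are fitted by maximum likelihood inside .fit(), and the boundary check simply reports the posterior standard deviation at user-supplied sites near the estimated stopping boundary.

```python
# Sketch of the kernel comparison and the goodness-of-fit check, with
# scikit-learn standing in for DiceKriging; hyperparameters are fitted by
# maximum likelihood during .fit(). `noise_var` holds the batch-based
# site-specific noise variances.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

KERNELS = {"matern52": Matern(nu=2.5), "matern32": Matern(nu=1.5), "gauss": RBF()}

def fit_surrogate(X, y, noise_var, family="gauss"):
    gp = GaussianProcessRegressor(kernel=KERNELS[family], alpha=noise_var,
                                  normalize_y=True, n_restarts_optimizer=5)
    return gp.fit(X, y)

def boundary_sd_check(gp, X_boundary):
    """Posterior sd at sites near the estimated stopping boundary; per the
    guidance above, these values should not be too large."""
    _, sd = gp.predict(X_boundary, return_std=True)
    return {"max_sd": float(sd.max()), "mean_sd": float(sd.mean())}
```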
6.2 Scalability in state dimension
The max-call setting is convenient for investigating the scalability of our approach in the dimension of the problem. In contrast to the basket put, the stopping region of a max call consists of several disconnected pieces: it is optimal to stop if exactly one asset is ITM while all the other assets are below the strike, generating as many disconnected stopping subregions as there are assets. One consequence is that selecting a good domain for a space-filling design, such as one of the low-discrepancy sequences, is difficult, since the continuation region is nonconvex and the stopping region is disconnected. We use a rectangular input domain that encompasses the relevant parts of both the stopping and continuation regions; hence, in this situation, we build our metamodel for both ITM and OTM paths. This domain is rather large, which hurts the accuracy of the space-filling designs.
Following Andersen and Broadie (2004), we benchmark max calls in dimensions two, three and five. Table 4 shows the results from two different implementations of the LSMC algorithm, namely a global polynomial ("Poly") basis and the localized Bouchard–Warin ("BW11") scheme (Bouchard and Warin 2011). It also compares them with two kriging metamodels: one using a space-filling design based on the low-discrepancy Sobol sequence ("KrigSobol") and the other a sequential ZC–SUR design ("KrigSUR"). To show the advantage of our approach, we use a design for KrigSobol that is an order of magnitude smaller, and an even smaller one for the adaptive KrigSUR method. For large dimensions, the standard LSMC approach requires hundreds of thousands of trajectories, which may impose memory constraints; these savings are thus impressive.
In terms of the overall simulation budget, measured by the number of one-step simulations of the state process (see the "No. of sims" column of Table 4), the use of better experimental designs and kriging metamodels yields savings by a factor of two to five. This is smaller than the reduction in the number of design sites because we do not exploit the "global grid" trick of the conventional approach. The five-dimensional example also exposes a limitation of the kriging model, namely the excessive regression overhead incurred as the macro design grows. As a result, despite using less than 20% of the simulation budget, the overall running time is much longer than with the LSMC. With the present implementation using DiceKriging, this overhead crowds out the simulation savings around dimension five. Table 4 also suggests that, while conceptually attractive, the gains from a sequential design strategy are marginal in terms of memory and expensive in terms of additional overhead; to rephrase, one can squeeze out most of the savings with a space-filling design. This observation offers pragmatic guidance for the DoE aspect, indicating that one must balance simulation-budget savings against regression/DoE overhead. Of course, this trade-off keeps changing as more efficient large-scale kriging/RSM software is developed and implemented (see, for example, the localized kriging algorithm described in Gramacy and Apley (2015)).
Table 4 Results for the Bermudan max call in dimensions two, three and five (Section 6.2): price estimates (SD in parentheses), number of one-step simulations and running time.

(a) 2D max call

| Method | Price | (SD) | No. of sims | Time (s) |
|---|---|---|---|---|
| LSMC Poly | 7.93 | (0.018) | 3.60 | 3.9 |
| LSMC BW11 | 7.89 | (0.023) | 3.60 | 4.0 |
| KrigSobol | 8.12 | (0.019) | 2.54 | 5.3 |
| KrigSUR | 8.10 | (0.024) | 1.44 | 10.8 |

(b) 3D max call

| Method | Price | (SD) | No. of sims | Time (s) |
|---|---|---|---|---|
| LSMC Poly | 10.95 | (0.01) | 2.70 | 15 |
| LSMC BW11 | 11.12 | (0.01) | 2.70 | 22 |
| KrigSobol | 11.18 | (0.02) | 0.48 | 33 |
| KrigSUR | 11.15 | (0.02) | 0.51 | 161 |

(c) 5D max call

| Method | Price | (SD) | No. of sims | Time (s) |
|---|---|---|---|---|
| LSMC Poly | 15.81 | (0.01) | 7.20 | 42 |
| LSMC BW11 | 16.32 | (0.02) | 5.76 | 87 |
| KrigSobol | 16.31 | (0.02) | 0.86 | 205 |
| KrigSUR | 16.30 | (0.02) | 0.66 | 666 |
Table 5 Results for the Bermudan put under the stochastic volatility model (6.1) (Section 6.3): price estimates (SD in parentheses), number of one-step simulations and running time.

(a) Rambharat and Brockwell (2010), case SV5

| Method | Price | (SD) | No. of sims | Time (s) |
|---|---|---|---|---|
| LSMC BW11 | 15.89 | (0.08) | 2.50 | 24 |
| LSMC BW11 | 16.03 | (0.05) | 6.25 | 52 |
| KrigLHS | 15.58 | (0.22) | 0.96 | 25 |
| KrigLHS | 16.06 | (0.05) | 4.25 | 155 |
| KrigSUR | 16.24 | (0.04) | 2.32 | 283 |
| KrigSUR | 16.30 | (0.05) | 5.69 | 565 |

(b) Agarwal et al (2016), ITM case

| Method | Price | (SD) | No. of sims | Time (s) |
|---|---|---|---|---|
| LSMC BW11 | 14.82 | (0.03) | 1.00 | 36 |
| KrigLHS | 14.73 | (0.05) | 0.26 | 13 |
| KrigLHS | 14.81 | (0.02) | 1.05 | 85 |
| KrigSUR | 14.91 | (0.03) | 0.25 | 50 |
| KrigSUR | 14.96 | (0.02) | 0.58 | 121 |

(c) Agarwal et al (2016), OTM initial condition (kriging metamodels reused from the ITM case)

| Method | Price | (SD) | No. of sims | Time (s) |
|---|---|---|---|---|
| LSMC BW11 | 8.25 | (0.02) | 1.25 | 26.7 |
| LSMC BW11 | 8.30 | (0.02) | 3.12 | 67.3 |
| KrigLHS | 8.27 | (0.02) | – | – |
| KrigLHS | 8.30 | (0.01) | – | – |
6.3 Stochastic volatility examples
Lastly, we discuss stochastic volatility (SV) models. Because there are no (cheap) exact methods for simulating the respective asset prices, discretization methods such as the Euler scheme must be employed, which makes this class a good case study for settings where simulation costs are more significant. Our case study uses a mean-reverting SV model with a two-dimensional state, comprising the asset price and the log volatility factor:
$$ \mathrm{d}S_t = S_t\bigl(r\,\mathrm{d}t + \mathrm{e}^{Y_t}\,\mathrm{d}W^1_t\bigr),\qquad \mathrm{d}Y_t = \alpha\,(m - Y_t)\,\mathrm{d}t + \nu\,\mathrm{d}W^2_t,\qquad \mathrm{d}\langle W^1, W^2\rangle_t = \rho\,\mathrm{d}t, \tag{6.1} $$
where $S$ is the asset price and $Y$ is the log volatility factor. The volatility follows a mean-reverting stationary Ornstein–Uhlenbeck process with mean-reversion rate $\alpha$ and base level $m$, and a leverage effect is expressed via the correlation $\rho$ between the two driving Brownian motions. Simulations of $(S, Y)$ are based on an Euler scheme with a fine time step. We consider a put on the asset, with payoff $(K - S)_+$; exercise opportunities are equally spaced. Related experiments have been carried out by Rambharat and Brockwell (2010) and, recently, by Agarwal et al (2016).
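A minimal Euler scheme for a model of this form is sketched below; the parameter names (alpha for the mean-reversion rate, m for the base level, nu for the volatility of volatility) are placeholders for the quantities described above, and the drift convention is an assumption.

```python
# Sketch of an Euler scheme for an SV model of the form (6.1): asset price S
# with log volatility Y following a mean-reverting OU process, correlated via
# rho. Parameter names and the drift convention are illustrative assumptions.
import numpy as np

def euler_sv_paths(S0, Y0, r, alpha, m, nu, rho, dt, n_steps, n_paths, rng=None):
    rng = np.random.default_rng(rng)
    S = np.full(n_paths, float(S0))
    Y = np.full(n_paths, float(Y0))
    for _ in range(n_steps):
        Z1 = rng.standard_normal(n_paths)
        Z2 = rho * Z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)
        S = S * (1.0 + r * dt + np.exp(Y) * np.sqrt(dt) * Z1)   # asset price
        Y = Y + alpha * (m - Y) * dt + nu * np.sqrt(dt) * Z2    # log volatility
    return S, Y
```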
Table 5 lists results for three different configurations. Once again, we consider the kriging metamodels (here implemented with an LHS design) compared with LSMC implementations based on the BW11 algorithm (Bouchard and Warin 2011). Table 5 confirms that the combination of kriging and thoughtful simulation design allows the design size to be reduced by an order of magnitude. Compare, for the first parameter set, the kriging model with an LHS design (which estimates the option value at 16.06) with an LSMC model built on a far larger design (which estimates the option value to be 16.05). For the set of examples used in Agarwal et al (2016), we observe that the space-filling design does not perform as well, most likely because of the inefficiency of its rectangular LHS domain. Nonetheless, the sequential design method wins handily.
Table 5 showcases another advantage of building a full-scale metamodel for the continuation values. Because we obtain the full continuation-value surface, our method can immediately generate an approximate exercise strategy for a whole range of initial conditions. By contrast, in conventional LSMC, all trajectories start from a fixed initial condition. In Table 5, we illustrate this capability by repricing the put from Agarwal et al (2016) after switching to an OTM initial condition (keeping the remaining parameters unchanged). For the kriging approach, we use the same metamodels that were built for the ITM case, meaning that the estimated exercise policy remains the same; all that was needed to obtain the reported put values was a different set of test trajectories to generate fresh out-of-sample payoffs. We note that the resulting estimates still beat the reestimated LSMC solution by several cents, while requiring no additional design budget.
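A sketch of this repricing step is given below: fresh test paths are simulated from the new initial condition and each path is exercised the first time the immediate payoff is positive and exceeds the fitted continuation value. The surrogates list (one fitted model per exercise date, in forward order), the step simulator, the payoff function and the per-period discount factor are all assumed inputs.

```python
# Sketch of repricing with already-fitted surrogates: simulate fresh test paths
# from a new initial condition and exercise the first time the immediate payoff
# is positive and at least as large as the fitted continuation value.
# `surrogates`: fitted models per exercise date (forward order); `step` advances
# the state over one exercise interval; `disc` is the per-period discount factor.
import numpy as np

def reprice(x0, surrogates, payoff, step, n_paths, disc, rng=None):
    rng = np.random.default_rng(rng)
    x = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
    alive = np.ones(n_paths, dtype=bool)          # paths not yet exercised
    cashflow = np.zeros(n_paths)
    for t, gp in enumerate(surrogates, start=1):
        x[alive] = step(x[alive], rng)            # one exercise interval forward
        h = payoff(x[alive])
        cont = gp.predict(x[alive])               # fitted continuation value
        stop = (h > 0) & (h >= cont)              # surrogate-implied exercise rule
        idx = np.flatnonzero(alive)[stop]
        cashflow[idx] = disc**t * h[stop]
        alive[idx] = False
    if alive.any():                               # remaining paths pay off at maturity
        x[alive] = step(x[alive], rng)
        cashflow[alive] = disc**(len(surrogates) + 1) * payoff(x[alive])
    return cashflow.mean(), cashflow.std(ddof=1) / np.sqrt(n_paths)
```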
7 Conclusion
We investigated the use of kriging metamodels for policy approximation based on (2.11) in optimal stopping problems. As shown, kriging allows flexible modeling of the respective continuation values, internally learning the spatial structure of the model. Kriging is also well suited to a variety of DoE approaches, including batched designs that alleviate non-Gaussianity in the noise distributions and adaptive designs that refine the local accuracy of the metamodel in regions of interest (which demands a localized, nonparametric fit). These features translate into savings of between 50% and 80% of the simulation budget. While the time savings are often negative, I believe that kriging is a viable and attractive regression approach to RMC. First, in many commercial-grade applications memory constraints are as important as speed constraints, and speed is generally much more scalable/parallelizable than memory. Second, as implemented, the main speed bottleneck is the cost of building a kriging model; alternative formulations get around this scalability issue, and further speed-ups can be expected down the pike. Third, kriging brings a suite of diagnostic tools, such as a transparent credible interval for the fitted continuation value, that can be used to self-assess accuracy. Fourth, thanks to its analytic structure, kriging is well suited to automated DoE strategies.
Yet another advantage of kriging is the consistent probabilistic framework it offers for the joint modeling of the continuation value and its derivatives. The latter is of course crucial for hedging; Delta hedging, in particular, is related to the derivative of the value function with respect to the asset price. In the continuation region the value function coincides with the continuation value, so a natural estimator of the Delta is the derivative of the metamodel surrogate (see, for example, Wang and Caflisch 2010; Jain and Oosterlee 2015). For kriging metamodels, the posterior distribution of the response-surface derivative is again Gaussian and thus available analytically, with formulas similar to (3.5) and (3.6). We refer to Chen et al (2013) for details, including an example of applying kriging to computing the Delta of European options. The application of this approach to American-style contracts will be explored separately.
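To illustrate the idea (not the paper's formulas (3.5)–(3.6) or those of Chen et al (2013)), the gradient of the kriging posterior mean is available in closed form for a squared exponential kernel; a numpy sketch, with the kriging weights w assumed to have been precomputed, follows. In the continuation region, the Delta estimate is then the component of this gradient corresponding to the asset price.

```python
# Sketch of a Delta estimate from the kriging surrogate. For a squared
# exponential kernel k(x, xi) = s2 * exp(-||x - xi||^2 / (2 l^2)), the gradient
# of the posterior mean sum_i w_i k(x, xi) is analytic; the kriging weights w
# (roughly K^{-1} times the centred responses) are assumed precomputed.
import numpy as np

def posterior_mean_grad(x, X, w, lengthscale, s2=1.0):
    """Gradient in x of sum_i w_i * k(x, X_i) for an isotropic SE kernel."""
    diff = x[None, :] - X                               # shape (n, d)
    k = s2 * np.exp(-0.5 * np.sum(diff**2, axis=1) / lengthscale**2)
    grads = -(diff / lengthscale**2) * k[:, None]       # d/dx k(x, X_i)
    return grads.T @ w                                  # shape (d,)
```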
The presented metamodeling perspective modularizes DoE and regression so that one can mix and match each element. Consequently, one can swap kriging with another strategy or, alternatively, use a conventional LSMC with a space-filling design. In the opinion of the author, this is an important insight, highlighting the range of choices that can and need to be made available within the broader RMC framework. In a similar vein, by adding a separate preaveraging step that resembles a nested simulation or Monte Carlo forest scheme (see Bender et al 2008), our proposal for batched designs highlights the distinction between smoothing (removing simulation noise) and interpolation (predicting at new input sites), which are typically lumped together in regression models. To this end, kriging metamodels admit a natural transition between the modeling of deterministic (noise-free) and stochastic simulators, and they are well suited to batched designs.
Kriging can also be used to build a classification-like framework that swaps out the cross-sectional regression for a direct approach to estimating the stopping region. One solution is to convert the continuous-valued pathwise samples into binary labels indicating whether immediate exercise beats continuation along that path, and to build a statistical classification model for these labels, eg, a kriging probit model. The stopping region is then estimated as the set of inputs where the predicted probability of the exercise label exceeds one-half. Modeling labels rather than continuous payoffs might alleviate much of the ill behavior in the observation noise. These ideas are left for future work.
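A minimal sketch of this classification variant, using scikit-learn's GaussianProcessClassifier (which employs a logistic rather than probit link) as a stand-in, is shown below; the label construction from pathwise samples is an assumption about how the comparison between immediate exercise and continuation would be encoded.

```python
# Sketch of the classification variant: encode each pathwise sample as a +/-1
# label indicating whether immediate exercise beat continuation along that
# path, then fit a GP classifier. scikit-learn uses a logistic link, standing
# in for the probit model mentioned in the text.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def fit_stop_classifier(X, exercise_minus_continue):
    z = np.where(exercise_minus_continue > 0, 1, -1)   # +1: stopping looked better
    return GaussianProcessClassifier(kernel=RBF()).fit(X, z)

def in_stopping_region(clf, x_new):
    p_stop = clf.predict_proba(x_new)[:, list(clf.classes_).index(1)]
    return p_stop > 0.5
```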
In our experiments, we purposely used a generic DoE strategy with minimal fine-tuning. In fact, the DoE perspective suggests a number of further enhancements for implementing RMC. Recall that the original optimal stopping formulation is reduced to iteratively building metamodels across the time steps. These metamodels are, of course, highly correlated, offering opportunities for "warm starts" of the corresponding surrogates. The warm start can be used both for constructing adaptive designs (eg, a form of importance sampling that preferentially targets the exercise boundary estimated at the previously solved time step) and for constructing the metamodel (eg, to better train the kriging hyperparameters). Conversely, one can apply different approaches at different time steps, such as building experimental designs of varying size or shrinking the surrogate domain as the time step decreases. These ideas will be explored in a separate forthcoming paper.
As mentioned, ideally the experimental design should be concentrated along the stopping boundary. While this boundary is a priori unknown, partial information is frequently available (with Bermudan puts, for instance, rules of thumb indicate that the stopping boundary is moderately ITM). This could be combined with an importance sampling strategy that implements the path-generation approach of a more standard LSMC while producing a more efficient design (see Del Moral et al (2012), Juneja and Kalra (2009) and Gobet and Turkedjiev (2017) for various aspects of importance sampling for optimal stopping). Another variant would be a "double importance sampling" procedure: first generate a nonadaptive (say, density-based) design, and then add (nonsequentially) design points/paths in the vicinity of the estimated stopping boundary.
Declaration of interest
The author reports no conflicts of interest. The author alone is responsible for the content and writing of the paper.
Acknowledgements
I am grateful to the anonymous referee and the editor for many useful suggestions, which have improved the final version of this work. This work was partially supported by NSF grant CDSE-1521743.
References
- Agarwal, A., Juneja, S., and Sircar, R. (2016). American options under stochastic volatility: control variates, maturity randomization & multiscale asymptotics. Quantitative Finance 16(1), 17–30 (https://doi.org/10.1080/14697688.2015.1068443).
- Andersen, L., and Broadie, M. (2004). A primal–dual simulation algorithm for pricing multi-dimensional American options. Management Science 50, 1222–1234 (https://doi.org/10.1287/mnsc.1040.0258).
- Ankenman, B., Nelson, B. L., and Staum, J. (2010). Stochastic kriging for simulation metamodeling. Operations Research 58, 371–382 (https://doi.org/10.1287/opre.1090.0754).
- Belomestny, D., Dickmann, F., and Nagapetyan, T. (2015). Pricing Bermudan options via multilevel approximation methods. SIAM Journal on Financial Mathematics 6, 448–466 (https://doi.org/10.1137/130912426).
- Bender, C., Kolodko, A., and Schoenmakers, J. (2008). Enhanced policy iteration for American options via scenario selection. Quantitative Finance 8(2), 135–146 (https://doi.org/10.1080/14697680701253013).
- Bouchard, B., and Warin, X. (2011). Monte Carlo valorisation of American options: facts and new algorithms to improve existing methods. In Numerical Methods in Finance, Carmona, R., Moral, P. D., Hu, P., and Oudjane, N. (eds). Springer Proceedings in Mathematics, Volume 12. Springer.
- Boyle, P. P., Kolkiewicz, A. W., and Tan, K. S. (2013). Pricing Bermudan options using low-discrepancy mesh methods. Quantitative Finance 13(6), 841–860 (https://doi.org/10.1080/14697688.2013.776699).
- Carrière, J. F. (1996). Valuation of the early-exercise price for options using simulations and nonparametric regression. Insurance: Mathematics & Economics 19(1), 19–30 (https://doi.org/10.1016/S0167-6687(96)00004-2).
- Chaudhary, S. K. (2005). American options and the LSM algorithm: quasi-random sequences and Brownian bridges. The Journal of Computational Finance 8(4), 101–115 (https://doi.org/10.21314/JCF.2005.132).
- Chen, X., Ankenman, B. E., and Nelson, B. L. (2013). Enhancing stochastic kriging metamodels with gradient estimators. Operations Research 61(2), 512–528 (https://doi.org/10.1287/opre.1120.1143).
- Chevalier, C., Ginsbourger, D., and Emery, X. (2014). Corrected kriging update formulae for batch-sequential data assimilation. In Mathematics of Planet Earth, Pardo-Igúzquiza, E., Guardiola-Albert, C., Heredia, J., Moreno-Merino, L., Durán, J., and Vargas-Guzmán, J. (eds), pp. 119–122. Springer (https://doi.org/10.1007/978-3-642-32408-6_29).
- Cohn, D. A. (1996). Neural network exploration using optimal experiment design. Neural Networks 9(6), 1071–1083 (https://doi.org/10.1016/0893-6080(95)00137-9).
- Del Moral, P., Hu, P., and Oudjane, N. (2012). Snell envelope with small probability criteria. Applied Mathematics & Optimization 66(3), 309–330 (https://doi.org/10.1007/s00245-012-9173-1).
- Gobet, E., and Turkedjiev, P. (2017). Adaptive importance sampling in least-squares Monte Carlo algorithms for backward stochastic differential equations. Stochastic Processes and Applications 127(4), 1171–1203 (https://doi.org/10.1016/j.spa.2016.07.011).
- Gramacy, R. (2007). tgp: an R package for treed Gaussian process models. Journal of Statistical Software 19(9), 1–48 (https://doi.org/10.18637/jss.v019.i09).
- Gramacy, R., and Apley, D. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics 24(2), 561–578 (https://doi.org/10.1080/10618600.2014.914442).
- Gramacy, R., and Ludkovski, M. (2015). Sequential design for optimal stopping problems. SIAM Journal on Financial Mathematics 6(1), 748–775 (https://doi.org/10.1137/140980089).
- Gramacy, R., and Polson, N. (2011). Particle learning of Gaussian process models for sequential design and optimization. Journal of Computational and Graphical Statistics 20, 102–118 (https://doi.org/10.1198/jcgs.2010.09171).
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer (https://doi.org/10.1007/978-0-387-84858-7).
- Hepperger, P. (2013). Pricing high-dimensional Bermudan options using variance-reduced Monte Carlo methods. The Journal of Computational Finance 16(3), 99–126 (https://doi.org/10.21314/JCF.2013.273).
- Jain, S., and Oosterlee, C. W. (2015). The stochastic grid bundling method: efficient pricing of Bermudan options and their Greeks. Applied Mathematics and Computation 269, 412–431 (https://doi.org/10.1016/j.amc.2015.07.085).
- Jones, D., Schonlau, M., and Welch, W. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4), 455–492 (https://doi.org/10.1023/A:1008306431147).
- Juneja, S., and Kalra, H. (2009). Variance reduction techniques for pricing American options using function approximations. The Journal of Computational Finance 12(3), 79–102 (https://doi.org/10.21314/JCF.2009.208).
- Kan, K. H., and Reesor, R. M. (2012). Bias reduction for pricing American options by least-squares Monte Carlo. Applied Mathematical Finance 19(3), 195–217 (https://doi.org/10.1080/1350486X.2011.608566).
- Kleijnen, J. P. C. (2015). Design and Analysis of Simulation Experiments, 2nd edn, Volume 111. Springer (https://doi.org/10.1007/978-3-319-18087-8).
- Kohler, M. (2008). A regression-based smoothing spline Monte Carlo algorithm for pricing American options in discrete time. Advances in Statistical Analysis 92(2), 153–178 (https://doi.org/10.1007/s10182-008-0067-0).
- Kohler, M. (2010). A review on regression-based Monte Carlo methods for pricing American options. In Recent Developments in Applied Probability and Statistics, Devroye, L., Karasözen, B., Kohler, M., and Korn, R. (eds), pp. 37–58. Springer (https://doi.org/10.1007/978-3-7908-2598-5_2).
- Kohler, M., and Krzyżak, A. (2012). Pricing of American options in discrete time using least squares estimates with complexity penalties. Journal of Statistical Planning and Inference 142(8), 2289–2307 (https://doi.org/10.1016/j.jspi.2012.02.031).
- Létourneau, P., and Stentoft, L. (2014). Refining the least squares Monte Carlo method by imposing structure. Quantitative Finance 14, 495–507 (https://doi.org/10.1080/14697688.2013.787543).
- Longstaff, F., and Schwartz, E. (2001). Valuing American options by simulation: a simple least-squares approach. Review of Financial Studies 14, 113–148 (https://doi.org/10.1093/rfs/14.1.113).
- MacKay, D. (1992). Information-based objective functions for active data selection. Neural Computation 4, 590–604 (https://doi.org/10.1162/neco.1992.4.4.590).
- McKay, M., Beckman, R., and Conover, W. (1979). Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239–245 (https://doi.org/10.1080/00401706.1979.10489755).
- Picheny, V., and Ginsbourger, D. (2013). A nonstationary space–time Gaussian process model for partially converged simulations. SIAM/ASA Journal on Uncertainty Quantification 1, 57–78 (https://doi.org/10.1137/120882834).
- Picheny, V., Ginsbourger, D., Roustant, O., Haftka, R. T., and Kim, N.-H. (2010). Adaptive designs of experiments for accurate approximation of a target region. Journal of Mechanical Design 132(7), 071008 (https://doi.org/10.1115/1.4001873).
- Rambharat, B. R., and Brockwell, A. E. (2010). Sequential Monte Carlo pricing of American-style options under stochastic volatility models. Annals of Applied Statistics 4, 222–265 (https://doi.org/10.1214/09-AOAS286).
- Riihimäki, J., and Vehtari, A. (2010). Gaussian processes with monotonicity information. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Teh, Y. W., and Titterington, M. (eds), pp. 645–652. Proceedings of Machine Learning Research, Volume 9. URL: https://bit.ly/2jqsxPl.
- Roustant, O., Ginsbourger, D., and Deville, Y. (2012). DiceKriging, DiceOptim: two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization. Journal of Statistical Software 51(1), 1–55 (https://doi.org/10.18637/jss.v051.i01).
- Santner, T. J., Williams, B. J., and Notz, W. I. (2013). The Design and Analysis of Computer Experiments. Springer Series in Statistics. Springer.
- Stentoft, L. (2004). Assessing the least squares Monte Carlo approach to American option valuation. Review of Derivatives Research 7(2), 129–168 (https://doi.org/10.1023/B:REDR.0000031176.24759.e6).
- Stentoft, L. (2014). Value function approximation or stopping time approximation: a comparison of two recent numerical methods for American option pricing using simulation and regression. The Journal of Computational Finance 18(1), 65–120 (https://doi.org/10.21314/JCF.2014.281).
- Tompaidis, S., and Yang, C. (2014). Pricing American-style options by Monte Carlo simulation: alternatives to ordinary least squares. The Journal of Computational Finance 18(1), 121–143 (https://doi.org/10.21314/JCF.2014.279).
- Tsitsiklis, J. N., and Van Roy, B. (2001). Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks 12(4), 694–703 (https://doi.org/10.1109/72.935083).
- van Beers, W. C. M., and Kleijnen, J. P. C. (2003). Kriging for interpolation in random simulation. Journal of the Operational Research Society 54(3), 255–262 (https://doi.org/10.1057/palgrave.jors.2601492).
- Wang, Y., and Caflisch, R. (2010). Pricing and hedging American-style options: a simple simulation-based approach. The Journal of Computational Finance 13, 95–125 (https://doi.org/10.21314/JCF.2010.220).
- Rasmussen, C. E., and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
- Yang, C., and Tompaidis, S. (2013). An iterative simulation approach for solving stochastic control problems in finance. Working Paper, Social Science Research Network (https://doi.org/10.2139/ssrn.2295591).