On Optimal Weighting Scheme in Model Averaging
Model averaging is an alternative to model selection and involves assigning weights to different models. A natural question that arises is whether there is an optimal weighting scheme. Various authors have shown that such schemes exist in other methodological frameworks. This paper investigates the derivation of optimal weights for model averaging using square error loss. It is shown that, although these weights may exist in theory, they depend on model parameters, and once estimated they are no longer optimal. It is demonstrated using an example of linear regression that model averaging estimators with these estimated weights are unlikely to outperform post-model selection and other model averaging estimators. We provide a theoretical justification for this phenomenon.
Keywords: model averaging, model selection, optimal weight, square error loss, model uncertainty
American Journal of Applied Mathematics and Statistics, 2014, 2(3), 150-156
Received April 14, 2014; Revised May 07, 2014; Accepted May 13, 2014
Copyright © 2013 Science and Education Publishing. All Rights Reserved.
Cite this article:
- Nguefack-Tsague, Georges. "On Optimal Weighting Scheme in Model Averaging." American Journal of Applied Mathematics and Statistics 2.3 (2014): 150-156.
1. Introduction
In most statistical modeling applications, there are several models that are a priori plausible. It is quite common nowadays to apply some model selection procedure to select a single one. Overviews, explanations, discussion and examples of such methods may be found in the books by Linhart and Zucchini [1], McQuarrie and Tsai [2], Burnham and Anderson [3] and Claeskens and Hjort [4].
An alternative to selecting a single model for estimation purposes is to give weights to all plausible models and to work with the resulting weighted estimator. This leads to the class of model averaging estimators. Once the weights have been decided upon (these can be the result of a model selection criterion such as Akaike’s information criterion (AIC), or can arise from Bayesian motivations), the problem is not so much with the construction of the estimator as with its properties.
Since model selection corresponds to the special case of assigning weight one to the selected model and weight zero to all other considered models, the question is equally relevant for estimators obtained after model selection. We refer to these estimators as post-model selection estimators (PMSE). The fact that the selection was data-based is often ignored in the subsequent analysis, which leads to invalid inferences. Literature on this topic includes but is not limited to Bancroft [5] for pre-test estimators, Breiman [6], Hjorth [7], Chatfield [8], Draper [9], Buckland et al. [10], Zucchini [11], Candolo et al. [12], Hjort and Claeskens [13], Efron [14], Leeb and Pötscher [15], Longford [17], Claeskens and Hjort [4], Schomaker et al. [18], Zucchini et al. [19], Liu and Yang [20], Nguefack-Tsague and Zucchini [21], Nguefack-Tsague et al. [26], and Nguefack-Tsague [22, 23, 24, 25]. Bayesian model averaging can be found in Hoeting et al. [27] and Wasserman [28]. Wan et al. [29] provide a review of frequentist model averaging estimators.
Many optimal weighting schemes have evolved recently for model averaging. Hansen [30] discusses model averaging in least squares estimation and proposes a method that selects the weights by minimizing Mallows’ criterion. Furthermore, Hansen [31] suggests using Mallows’ model averaging for forecasting and shows that Mallows’ criterion is an asymptotically unbiased estimator of both the in-sample mean squared error and the out-of-sample one-step-ahead mean squared forecast error. Hansen [32] studies least squares estimation of an autoregressive model with a root close to unity by proposing two measures to evaluate the efficiency of the estimators: the asymptotic mean squared error and the forecast expected squared error. Numerical comparison of Mallows’ model averaging method with many other methods shows that the Mallows’ model averaging estimator often has smaller risk. Hansen [33] applies the same idea to model averaging for autoregressions with a near unit root. Since Hansen [30] assumes that the models are nested and the weights are discrete, Wan et al. [34] relax these two assumptions to obtain other versions of model averaging by minimizing Mallows’ criterion. Their proofs are based on Li [35]. Liang et al. [36] develop a model weighting mechanism that involves minimizing the trace of an unbiased estimator of the model average estimator’s MSE. Hansen and Racine [37] propose to select the weights of the least squares model averaging estimator by minimizing a delete-one cross-validation criterion (jackknife model averaging, JMA). The solutions of the above methods are obtained by quadratic programming. Zhang et al. [38] propose a model averaging scheme for linear mixed-effects models and prove their method to be asymptotically optimal under some regularity conditions.
Several of the optimal weighting schemes above do not use the most common and straightforward loss function, the square error loss. The intention here is to use this loss to derive optimal weights. Unlike in the other methods, the risk function of the resulting model averaging estimator depends on the model parameters, so comparisons with other post-model selection and model averaging estimators should be made over the parameter space. The important question is whether, within this framework, optimal weights are really optimal. In particular, in terms of the risk function, is it preferable to perform model selection or model averaging? Other existing methods (which do not use the risk function over the parameter space) advocate model averaging over model selection. Section 2 describes model averaging and PMSEs conceptually, while Section 3 describes the concept of optimal weights. Section 4 illustrates the point with a simple linear regression model, including the derivation of optimal weights in this case, and Section 5 provides a theoretical justification for the fact that optimality no longer holds when parameters are estimated. The article ends with concluding remarks.
2. Model Averaging and Post-Model Selection Estimators
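Let M $= \{M_1, \dots, M_K\}$ be a set of $K$ plausible models to estimate $\mu$, the quantity of interest. Denote by $\hat{\mu}_k$ the estimator of $\mu$ obtained from using model $M_k$. Model averaging involves finding non-negative weights, $w_1, \dots, w_K$, that sum to one, and then estimating $\mu$ by
$$\hat{\mu} = \sum_{k=1}^{K} w_k \hat{\mu}_k. \qquad (1)$$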
Clearly, by taking only one of the weights equal to one, and the other weights all zero, the model averaged estimator reduces to the estimator in a single model. This important sub-class of model averaging estimators is arrived at by model selection. There the weight of model $M_k$ is set to one if and only if the model selection method selects model $M_k$, and the weight is zero otherwise.
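Some classical model averaging weights involve penalized likelihood values. Let $I_k$ denote an information criterion of the form
$$I_k = -2 \log L_k + q_k, \qquad (2)$$
with $q_k$ a penalty for model $M_k$ and $L_k$ the maximized likelihood value at model $M_k$. Buckland et al. [10] define Akaike-type weights:
$$w_k = \frac{\exp(-I_k/2)}{\sum_{j=1}^{K} \exp(-I_j/2)}. \qquad (3)$$
In particular, if we use the Akaike information criterion (AIC, Akaike [39]) with $q_k = 2p_k$, two times the number of parameters $p_k$ of model $M_k$, (3) simplifies to
$$w_k = \frac{\exp(-\mathrm{AIC}_k/2)}{\sum_{j=1}^{K} \exp(-\mathrm{AIC}_j/2)}. \qquad (4)$$
When the Bayesian information criterion (BIC, Schwarz [40]) is used, with $q_k = p_k \log n$ and $n$ the sample size, the resulting weights are
$$w_k = \frac{\exp(-\mathrm{BIC}_k/2)}{\sum_{j=1}^{K} \exp(-\mathrm{BIC}_j/2)}.$$
With equal prior weights to each of the models $M_k$, this may be interpreted as an approximation to the posterior probability of model $M_k$ given the data.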
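As a rough illustration of the weights described above, the following Python sketch computes information-criterion weights from maximized log-likelihoods. The function name `ic_weights`, the example log-likelihoods, the parameter counts and the sample size are hypothetical and are not taken from the paper; the criterion is assumed to be minus two times the maximized log-likelihood plus a penalty.

```python
import math


def ic_weights(logliks, penalties):
    """Weights proportional to exp(-I_k / 2), where I_k = -2 * loglik_k + penalty_k."""
    scores = [-2.0 * ll + q for ll, q in zip(logliks, penalties)]
    best = min(scores)  # subtracting the minimum avoids underflow in exp
    raw = [math.exp(-(s - best) / 2.0) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical maximized log-likelihoods and parameter counts for three models.
logliks = [-120.4, -118.9, -118.7]
n_params = [2, 3, 4]
n = 50  # assumed sample size

aic_w = ic_weights(logliks, [2 * p for p in n_params])            # penalty 2 * p_k
bic_w = ic_weights(logliks, [p * math.log(n) for p in n_params])  # penalty p_k * log(n)
lik_w = ic_weights(logliks, [0.0] * len(logliks))                 # no penalty
```

With a zero penalty the weights are simply proportional to the maximized likelihoods, which is the non-penalized variant mentioned in the text.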
In the context of regression and classification, LeBlanc and Tibshirani [41] propose to use a non-penalized likelihood value, resulting in
$$w_k = \frac{L_k}{\sum_{j=1}^{K} L_j}, \qquad (5)$$
where $L_k$ is the maximized likelihood value at model $k$. Hjort and Claeskens [13] use the smooth focused information criterion (FIC) and other model averaging schemes to study this type of model averaged, or compromise, estimators, together with their limiting distributions and risk properties. Model averaging in semiparametric regression with AIC or BIC type weights is studied by Claeskens and Carroll [42]. For detailed discussion of model averaging and its applications, see e.g. [43-46]. Zou and Yang [47] apply model averaging to time series, while Yuan and Yang [48] explain under which conditions one should apply model averaging. Shan and Yang [49] apply model averaging to quantile estimation, while Liu and Yang apply it in longitudinal data analysis. If one is able to find closed-form expressions for the model selection probabilities for each model (note that the expectation of a Bernoulli variable is its probability of success), then an obvious weighting scheme is to use an estimator of these probabilities.
There is also a special case of model averaging estimator where only zero/one weights apply, the post-model selection estimator (PMSE, [15, 16]). We use a model selection criterion to decide on a selected model $\hat{M}$, and use this model to estimate the parameter of interest $\mu$ by $\hat{\mu}_{\hat{M}}$, that is, the estimator of $\mu$ in the selected model. Using the notation introduced above, we may write the PMSE as
$$\hat{\mu}_{\mathrm{PMSE}} = \sum_{k=1}^{K} \mathbf{1}(\hat{M} = M_k)\, \hat{\mu}_k.$$
It is important to stress that since the model selection method depends on the actual data, the selected model is random as well. This implies that even when the same set of models M and the same selection criterion are used, different samples can lead to different models ($\hat{M}$) being selected. The selected model depends also on the selection procedure and the set of models M.
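The zero/one weighting can be sketched in a few lines of Python. The helper names `model_average` and `selection_weights`, and all numbers below, are hypothetical; the only assumption is that the model with the smallest criterion value is selected.

```python
def model_average(estimates, weights):
    """Weighted combination of the per-model estimates."""
    return sum(w * m for w, m in zip(weights, estimates))


def selection_weights(criteria):
    """Zero/one weights: weight one for the model with the smallest criterion value."""
    best = min(range(len(criteria)), key=lambda k: criteria[k])
    return [1.0 if k == best else 0.0 for k in range(len(criteria))]


# Hypothetical per-model estimates and criterion values from two different samples.
estimates_a, criteria_a = [1.8, 2.3], [244.8, 243.8]
estimates_b, criteria_b = [2.1, 2.2], [250.1, 251.0]

pmse_a = model_average(estimates_a, selection_weights(criteria_a))  # second model selected
pmse_b = model_average(estimates_b, selection_weights(criteria_b))  # first model selected
```

The two samples lead to different selected models, which is the sense in which the selected model is itself random.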
3. Optimal Model Averaging Estimator
A question that arises is whether one can select the weights so as to optimize the performance of this averaged estimator, in terms of some specified measure, say a loss function $\ell$. Finding the optimal weights involves solving the following optimization problem over $w = (w_1, \dots, w_K)$:
$$\min_{w} \; E\left[\ell\left(\hat{\mu}(w), \mu\right)\right], \qquad \hat{\mu}(w) = \sum_{k=1}^{K} w_k \hat{\mu}_k, \qquad (6)$$
where the expectation is taken with respect to the true model $M_0$, which may, or may not, be in the set of competing models, M. If the true model is known then, at least in theory, it is possible to find the optimal weights, if they exist. However, the optimal weights, obtained by minimizing (6), depend on the parameters of $M_0$, which are unknown and thus have to be estimated. Using estimates of these parameters may lead to weights that are no longer optimal.
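To make the dependence on unknown quantities concrete, consider a sketch for two candidate estimators under squared error loss, with a fixed (non-random) weight $w$ on the first estimator and $1 - w$ on the second. This derivation is an illustration under those assumptions, not a result quoted from the paper.

```latex
\begin{align*}
R(w) &= E\bigl[(w\hat{\mu}_1 + (1-w)\hat{\mu}_2 - \mu)^2\bigr]
      = E\bigl[\bigl((\hat{\mu}_2 - \mu) - w(\hat{\mu}_2 - \hat{\mu}_1)\bigr)^2\bigr], \\
R'(w) &= -2\,E\bigl[(\hat{\mu}_2 - \mu)(\hat{\mu}_2 - \hat{\mu}_1)\bigr]
        + 2w\,E\bigl[(\hat{\mu}_2 - \hat{\mu}_1)^2\bigr], \\
w^{*} &= \frac{E\bigl[(\hat{\mu}_2 - \mu)(\hat{\mu}_2 - \hat{\mu}_1)\bigr]}
              {E\bigl[(\hat{\mu}_2 - \hat{\mu}_1)^2\bigr]},
\qquad R''(w) = 2\,E\bigl[(\hat{\mu}_2 - \hat{\mu}_1)^2\bigr] > 0 .
\end{align*}
```

Both expectations involve $\mu$ and the true sampling distribution, so $w^{*}$ cannot be computed from the data alone; if $w^{*}$ falls outside $[0, 1]$, the constrained minimizer is the nearest endpoint because $R$ is convex in $w$.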
Since the optimal weight (if it exists) depends on the parameters, Hjort and Claeskens [13] suggest minimizing the estimated risk. A closed-form solution for optimal weights is unlikely to exist in general when the models are complex, but the intention here is to investigate the long-run properties of the model averaging estimator when a closed-form solution exists.
4. Illustration with Simple Linear Regression
4.1. Problem Set-Up
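Consider a simple linear regression model in which $x$ is a covariate and $y$ is the response variable, given by
$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \dots, n, \qquad (7)$$
where the $\varepsilon_i$ are independent $N(0, \sigma^2)$ errors, with $\sigma^2$ known (for simplicity).
The OLS estimators are given by
$$\hat{\beta} = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}.$$
For simplicity of computations, suppose $\bar{x} = 0$, without loss of generality, since linear regression model (7) can be parametrised as
$$y_i = \alpha^{*} + \beta x_i^{*} + \varepsilon_i,$$
where $x_i^{*} = x_i - \bar{x}$ and $\alpha^{*} = \alpha + \beta\bar{x}$.
Thus, under $\bar{x} = 0$, $\hat{\alpha} = \bar{y}$, $\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2$ and Cov$(\hat{\alpha}, \hat{\beta}) = 0$; also, $\hat{\alpha}$ and $\hat{\beta}$ are normally distributed, therefore $\hat{\alpha}$ and $\hat{\beta}$ are independent.
Let $x_0$ be a future value of the covariate. The aim is to estimate the mean $\mu = \alpha + \beta x_0$.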
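The effect of centering the covariate can be checked numerically. The Python sketch below is an illustration with made-up data; the function names are hypothetical, and the covariance expression used is the standard fixed-design result that the covariance of the OLS intercept and slope estimators equals minus the error variance times the covariate mean divided by the centered sum of squares.

```python
def ols_simple(x, y):
    """OLS intercept and slope for a single covariate."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def cov_intercept_slope(x, sigma2):
    """Cov(intercept_hat, slope_hat) = -sigma2 * x_bar / Sxx for a fixed design."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return -sigma2 * x_bar / sxx


# Made-up data, used only to illustrate the reparametrisation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
x_centered = [xi - sum(x) / len(x) for xi in x]

a_raw, b_raw = ols_simple(x, y)
a_cen, b_cen = ols_simple(x_centered, y)  # slope unchanged, intercept becomes mean(y)
```

Centering leaves the slope estimate unchanged and makes the covariance between the two estimators zero, which together with normality gives independence.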
Consider two models
; =Var. Let , then and , where (standardized intercept). Let , then , , where (standardized slope).
4.2. Post-model Selection Estimators
Consider a selection criterion of the form (2) where , , , . is chosen if .
PMSE estimator can be written as
where and are, respectively, indicator functions under and with .
It follows that
The expression (11) is equivalent to
We used simulated data to investigate the properties of different PMSEs, namely samples of size , with , , , , and . The reported results are not sensitive to the choice of these selected values and data; in particular, increasing the sample size has minimal impact on the results. All expectations here were taken with respect to the full model, . As the risk functions are symmetric around zero, we only display the graphs for . All computations were performed with the software R [50].
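A simplified version of such a simulation can be sketched in Python rather than R. The sketch rests on assumptions that are not taken from the paper: with known error variance the standardized slope estimate `z` is normal with mean `b` (the standardized slope) and unit variance, and the larger model is selected when `z**2` exceeds a threshold `c` (for example `c = 2` for an AIC-type penalty difference with one extra parameter). All function names are hypothetical.

```python
import math
import random


def normal_cdf(t):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def prob_select_larger(b, c):
    """P(z**2 > c) for z ~ N(b, 1): probability that the larger model is selected."""
    r = math.sqrt(c)
    return (1.0 - normal_cdf(r - b)) + normal_cdf(-r - b)


def simulate_selection(b, c, n_rep=20000, seed=1):
    """Monte Carlo estimate of the same selection probability."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_rep) if rng.gauss(b, 1.0) ** 2 > c)
    return hits / n_rep


p_exact = prob_select_larger(1.0, 2.0)
p_sim = simulate_selection(1.0, 2.0)
```

The closed-form and simulated probabilities should agree up to Monte Carlo error, and the selection probability depends on the unknown standardized slope.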
For some classical selection procedures, values of are given by
with as the non central Fisher distribution, namely .
4.3. Model Averaging
For any weighting scheme and for model and model , , the model averaging estimator is
Using formulae of Akaike weights and likelihood weights given in (4), (3) and (5), we have and . Akaike weights are then given by
and BIC weights by
More generally here, weights using a penalized information criterion of the form (2) where are given by
Thus for non-penalized information criterion, likelihood weights are given by
Consider the loss here to be the square error, therefore the measure for the optimal weight is the mean square error.
Proposition 1. Under Equations (7) and (15) the optimal weighting scheme is and
where is constant i.e does not depend on .
, therefore is a minimum.
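The kind of calculation behind such an optimal weight can be sketched as follows. The set-up is an assumption made for illustration, not a transcription of the paper's proof: the smaller model estimates the mean by $\hat{\alpha}$, the larger model by $\hat{\alpha} + \hat{\beta}x_0$ with $x_0 \neq 0$, the covariate is centered so that $\hat{\alpha}$ and $\hat{\beta}$ are independent and unbiased under the larger model, and $w$ is a fixed weight on the larger model.

```latex
\begin{align*}
\hat{\mu}(w) &= (1-w)\,\hat{\alpha} + w\,(\hat{\alpha} + \hat{\beta}x_0)
              = \hat{\alpha} + w\,\hat{\beta}x_0, \\
\mathrm{MSE}(w) &= \mathrm{Var}(\hat{\alpha})
   + x_0^2\bigl[w^2\sigma_\beta^2 + (1-w)^2\beta^2\bigr],
   \qquad \sigma_\beta^2 = \mathrm{Var}(\hat{\beta}), \\
\frac{d\,\mathrm{MSE}}{dw} &= 2x_0^2\bigl[w\,\sigma_\beta^2 - (1-w)\,\beta^2\bigr] = 0
   \;\Longrightarrow\;
   w^{*} = \frac{\beta^2}{\beta^2 + \sigma_\beta^2} = \frac{b^2}{1+b^2},
   \qquad b = \beta/\sigma_\beta, \\
\frac{d^2\,\mathrm{MSE}}{dw^2} &= 2x_0^2\bigl(\sigma_\beta^2 + \beta^2\bigr) > 0 .
\end{align*}
```

Under these assumptions the second derivative does not depend on $w$, so the stationary point is a minimum, and the minimizing weight depends on the unknown standardized slope $b$.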
Corollary 1. The model averaging estimator based on estimates of the optimal weights is
Proof. From Proposition 1, depends on the unknown parameter , which needs to be estimated. Thus . Replacing the weights in Equation (15) yields the result.
Figure 1 shows, for each PMSE, its risk together with the risks of the various weighting schemes. It shows that none of the weighting schemes (including the optimal weights) is better than any PMSE over the whole range of . The optimal weighting scheme is even worse for larger .
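A comparison of this kind can be explored with a small Monte Carlo sketch in Python. It is not the paper's simulation: it assumes a known error variance, a standardized slope estimate `z` that is normal with mean `b` and unit variance, AIC-based selection of the larger model when `z**2 > 2`, smooth AIC weights based on the same criterion difference, and an optimal weight of the form `b**2 / (1 + b**2)` in which `b` is replaced by `z` for the estimated version. Risks are in standardized units and omit the variance of the intercept estimator, which is common to all schemes.

```python
import math
import random


def aic_weight(z):
    """Smooth AIC weight on the larger model: 1 / (1 + exp(-(z**2 - 2) / 2))."""
    return 1.0 / (1.0 + math.exp(-(z * z - 2.0) / 2.0))


def risks(b, n_rep=20000, seed=1):
    """Monte Carlo risk of w(z) * z as an estimator of b, for several weighting schemes."""
    rng = random.Random(seed)
    w_star = b * b / (1.0 + b * b)  # oracle weight, uses the unknown b
    acc = {"pmse_aic": 0.0, "aic_weights": 0.0,
           "estimated_optimal": 0.0, "oracle_optimal": 0.0}
    for _ in range(n_rep):
        z = rng.gauss(b, 1.0)
        weights = {
            "pmse_aic": 1.0 if z * z > 2.0 else 0.0,   # zero/one weight after selection
            "aic_weights": aic_weight(z),
            "estimated_optimal": z * z / (1.0 + z * z),  # plug-in version of w_star
            "oracle_optimal": w_star,
        }
        for name, w in weights.items():
            acc[name] += (w * z - b) ** 2
    return {name: total / n_rep for name, total in acc.items()}


risk_at_2 = risks(2.0)
risk_at_0 = risks(0.0)
```

Evaluating `risks(b)` over a grid of `b` values gives risk curves that can be compared scheme by scheme. In this sketch the oracle weight attains risk `b**2 / (1 + b**2)`, whereas the plug-in weight is random and no longer inherits that guarantee.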
5. Why Optimal Is Not Optimal?
Consider the regression model
where is the value of a p-dimensional design variable at the ith observation, is the response, is the true regression function and the random errors are assumed to be independent and normally distributed with mean 0 and variance .
For simplicity, let us assume that the model is parametric and can be written in vector form as
where for each model , the family is a linear family of regression functions with of finite dimension .
Assumption 1 (Yang , pp.941). There exist two models and such that:
(a) is a linear subspace of ;
(b) there exists a function orthogonal to at the design points, with bounded between two positive constants, at least for large enough ;
(c) there exists a function such that is not in any family , for , that does not contain .
Consider a model averaging method . Let be the resulting data-dependent weight for model satisfying and . The regression estimator is thus . Let .
Definition 1. A model averaging method is said to be consistent in weighting if, when the true model , we have that as .
Theorem 1 (Theorem 2 of Yang , pp.943). Under Assumption 1, if any model averaging method is consistent in weighting, then we must have
Theorem 1 clearly explains why, within this framework, none of the weighting schemes can be expected to dominate all the others in terms of the risk function.
6. Concluding Remarks
The aim of this paper has been to show that, though many optimal model averaging schemes have evolved recently, they may fail to exist under square error loss when different estimators are compared in the parameter space using the risk function. We show this by deriving the optimal weighting scheme and demonstrating that these weights are no longer optimal when the parameters are estimated. In particular, within this framework, model averaging is not preferable to model selection. The example used is very simple but is enough to illustrate the problem.
I wish to thank Professor Walter Zucchini for valuable comments, which led to an improvement of this paper.
References
[1] Linhart, H. and Zucchini, W. Model selection, John Wiley and Sons, New York, 1986.
[2] McQuarrie, A. D. R. and Tsai, C. L. Regression and time series model selection, World Scientific, Singapore, 1998.
[3] Burnham, K. P. and Anderson, D. R. Model selection and multimodel inference, a practical information-theoretic approach, Second Edition, Springer-Verlag, New York, 2002.
[4] Claeskens, G. and Hjort, N. L. Model selection and model averaging, Cambridge University Press, Cambridge, 2008.
[5] Bancroft, T. A. On bias in estimation due to the use of preliminary tests of significance, Annals of Mathematical Statistics 15 1944, 190-204.
[6] Breiman, L. The little bootstrap and other methods for dimensionality selection in regression: X-Fixed predictor error, Journal of the American Statistical Association 87 1992, 738-754.
[7] Hjorth, J. Computer intensive statistical methods: Validation, model selection, and bootstrap, Chapman and Hall, London, 1994.
[8] Chatfield, C. Model Uncertainty, data mining and statistical inference (with discussion), Journal of the Royal Statistical Society, series B 158 1995, 419-466.
[9] Draper, D. Assessment and propagation of model uncertainty (with discussion), Journal of the Royal Statistical Society, series B 57 1995, 45-97.
[10] Buckland, S. T., Burnham, K. P. and Augustin, N. H. Model selection: An integral part of inference, Biometrics 53 1997, 603-618.
[11] Zucchini, W. An introduction to model selection, Journal of Mathematical Psychology 44 2000, 41-61.
[12] Candolo, C., Davison, A. C. and Demétrio, C. G. B. A note on model uncertainty in linear regression, The Statistician 158 2003, 165-177.
[13] Hjort, N. L. and Claeskens, G. Frequentist model average estimators, Journal of the American Statistical Association 98 2003, 879-899.
[14] Efron, B. The estimation of prediction error: covariance penalties and cross-validation, Journal of the American Statistical Association 99 2004, 619-642.
[15] Leeb, H. and Pötscher, B. M. Model selection and inference: Fact and fiction, Econometric Theory 21 2005, 21-59.
[16] Berk, R., Brown, L. D., Buja, A., Zhang, K. and Zhao, L. Valid post-selection inference, Submitted to Annals of Statistics, 2012.
[17] Longford, N. T. Editorial: Model selection and efficiency-is ’which model ...?’ the right question?, Journal of Royal Statistical Society-A 168, Part 3 2005, 469-472.
[18] Schomaker, M., Wan, A. T. K. and Heumann, C. Frequentist model averaging with missing observations, Computational Statistics and Data Analysis 54 (12) 2010, 3336-3347.
[19] Zucchini, W., Claeskens, G. and Nguefack-Tsague, G. Model selection, In International Encyclopedia of Statistical Sciences, Editor: M. Lovric, Springer. Part 13, 830-833, 2011.
[20] Liu, W. and Yang, Y. Parametric or nonparametric? a parametricness index for model selection, Annals of Statistics 39 (4) 2011, 2074-2102.
[21] Nguefack-Tsague, G. and Zucchini, W. Post-model selection inference and model averaging, Pakistan Journal of Statistics and Operation Research 7(2-Sp) 2011, 347-361.
[22] Nguefack-Tsague, G. An alternative derivation of some commons distributions functions: A post-model selection approach, International Journal of Applied Mathematics and Statistics 42(12) 2013, 138-147.
[23] Nguefack-Tsague, G. On bootstrap and post-model selection inference, International Journal of Mathematics and Computation 21(4) 2013, 51-64.
[24] Nguefack-Tsague, G. Bayesian estimation of a multivariate mean under model uncertainty, International Journal of Mathematics and Statistics 13(1) 2013, 83-92.
[25] Nguefack-Tsague, G. Estimation of a multivariate mean under model selection uncertainty, Pakistan Journal of Statistics and Operation Research, 2014, forthcoming.
[26] Nguefack-Tsague, G., Zucchini, W. and Fotso, S. On correcting the effects of model selection on inference in linear regression, Syllabus Review 2(3) 2011, 122-140.
[27] Hoeting, J., Madigan, D., Raftery, A. and Volinsky, C. Bayesian model averaging: A tutorial, Statistical Science 4 1999, 382-417.
[28] Wasserman, L. Bayesian model selection and model averaging, Journal of Mathematical Psychology 44 2000, 92-107.
[29] Wan, A. T. K., Zhang, X. and Zou, G. Frequentist model averaging estimation: a review, Journal of Systems Science and Complexity 22 (4) 2009, 732-748.
[30] Hansen, B. E. Least squares model averaging, Econometrica 75 2007, 1175-1189.
[31] Hansen, B. E. Least squares forecast averaging, Journal of Econometrics 146 2008, 342-350.
[32] Hansen, B. E. Averaging estimators for regressions with a possible structural break, Econometric Theory 35 2009, 1498-1514.
[33] Hansen, B. E. Averaging estimators for autoregressions with a near unit root, Journal of Econometrics 158 2010, 142-155.
[34] Wan, A. T. K., Zhang, X. and Zou, G. Least squares model averaging by Mallows criterion, Journal of Econometrics 156(2) 2010, 277-283.
[35] Li, K. C. Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set, Annals of Statistics 15 1987, 958-975.
[36] Liang, H., Zou, G., Wan, A. T. K. and Zhang, X. Optimal weight choice for frequentist model average estimators, Journal of the American Statistical Association 106 2011, 1053-1066.
[37] Hansen, B. E. and Racine, J. S. Jackknife model averaging, Journal of Econometrics 167 (1) 2012, 38-46.
[38] Zhang, X., Zou, G. and Liang, H. Model averaging and weight choice in linear mixed-effects models, Biometrika 101 (1) 2014, 205-218.
[39] Akaike, H. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, eds. B. Petrov and F. Csáki, Budapest: Akadémiai Kiadó 1973, 267-281.
[40] Schwarz, G. Estimating the dimension of a model, The Annals of Statistics 6 1978, 461-464.
[41] LeBlanc, M. and Tibshirani, R. Combining estimates in regression and classification, Journal of the American Statistical Association 91 1996, 1641-1650.
[42] Claeskens, G. and Carroll, R. J. An asymptotic theory for model selection inference in general semiparametric problems, Biometrika 94 2007, 249-265.
[43] Yang, Y. Combining different procedures for adaptive regression, Journal of Multivariate Analysis 74 2000, 135-161.
[44] Yang, Y. Regression with multiple candidate models: selecting or mixing?, Statistica Sinica 13 2003, 783-809.
[45] Yang, Y. Combining forecasting procedures: some theoretical results, Econometric Theory 20 2004, 176-222.
[46] Yang, Y. Aggregating regression procedures to improve performance, Bernoulli 10 2004, 25-47.
[47] Zou, H. and Yang, Y. Combining time series models for forecasting, International Journal of Forecasting 20 2004, 69-84.
[48] Yuan, Z. and Yang, Y. Combining linear regression models: when and how?, Journal of the American Statistical Association 100 2005, 1202-1204.
[49] Shan, K. and Yang, Y. Combining regression quantile estimators, Statistica Sinica 19 2009, 1171-1191.
[50] R Development Core Team. R: A language and environment for statistical computing, R Foundation for Statistical Computing, Austria, 2011.
[51] Mallows, C. L. Some comments on Cp, Technometrics 15 1973, 661-675.