The purpose of this study was to evaluate the methods used to assess "optimism" in regression models. The focus was on the Cox & Snell and Nagelkerke pseudo R-square values, with the aims of identifying the best statistic for measuring "optimism" in regression models, measuring model performance, and determining the relationship between "optimism" and overfitting. Different underlying data sets call for different models that fit them accurately; however, a fitted regression model usually fits the data it was built on better than new data. This is what we call "optimism". Specific focus was on determining the best statistic for measuring optimism in regression models, assessing model performance using "optimism" through cross-validation, and determining the relationship between optimism and overfitting of regression models. The study considered three models (Cox regression, logistic regression, and linear regression), and a bootstrap procedure was used.
Regression models are powerful tools that are widely used by researchers and scholars. [1], in work on regression models and forecasting, found that regression models provided the analysis and parameter estimates needed for forecasts. Regression models can also handle partially observed (censored) responses; in the study of survival analysis by [2], censoring marked the difference between survival analysis and other statistical analyses. A fitted regression model will fit the data it was based on better than any new data: [3], in studying the procedure for adjusting for optimism and overfitting in measures of predictive ability using bootstrapping, found that prognostic models performed differently on test data than on the training data. Analysts are therefore required to build prognostic models that accurately reflect the patterns present in different underlying data sets.
1.1. Measuring Prediction Error
It is usually paramount for a researcher to assess the quality of every model before applying it to any data set. By their nature, and in common practice, most models are highly optimized for the data on which they were trained; [4], investigating the role of noise variables in model building, showed that the characteristics of the training data play a major role in the complexity and overall performance of any model. The expected error on new data will always be higher than the expected error on the training data; [5] used model validation to study the optimism of the training error, and discovered that test data yielded higher values of optimism whenever used to assess model performance.
The more optimistic the model, the more the training error flatters it relative to the true error.
From this perspective it might seem that a model which minimizes training error automatically minimizes the prediction error for a new data set, so that the distinction between training error and prediction error could be ignored during model selection; [6], in laying down criteria for model selection, showed how training and prediction errors shape the method the modeler adopts and the assumptions made. The distinction matters because optimism is a function of model complexity: as complexity increases, so does optimism. The relationship for the true prediction error takes the following form:

$$\text{True prediction error} = \text{Training error} + \text{Optimism}$$
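To make this relationship concrete, here is a minimal R sketch (the simulated data and variable names are our own, purely illustrative): a linear model is fit on a training set and its mean squared error is compared with the error on a fresh test set drawn from the same population; the gap between the two is the optimism.

```r
## Minimal illustration: the training error understates the true error.
set.seed(1)
n <- 50
train <- data.frame(x = rnorm(n)); train$y <- 1 + 2 * train$x + rnorm(n)
test  <- data.frame(x = rnorm(n)); test$y  <- 1 + 2 * test$x  + rnorm(n)

fit <- lm(y ~ x, data = train)

mse <- function(obs, pred) mean((obs - pred)^2)
train_err <- mse(train$y, predict(fit, newdata = train))  # apparent error
test_err  <- mse(test$y,  predict(fit, newdata = test))   # error on new data
test_err - train_err   # the optimism; positive on average
```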
Model complexity increases as the number of parameters increases, and added parameters let the model do a better job on the training data; this is a fundamental property of statistical models. [7], in a study of mathematical and computer modeling, observed that the lack of a methodology for choosing the best-fitting model stems from a poor understanding of the ways in which a study relies on the particular model.
Well-fitted models have always been excellent tools for prediction; [8], studying the level of crime in the city of Salinas, found that the absence of statistical prediction tools was a major handicap until regression models were applied. Central to this is the total number of predictor degrees of freedom (d.f.), the number of parameters examined during the analysis.
Informal analytical methods such as graphical exploration make it impossible to pin this number down; instead, one should estimate the effective number of parameters considered, according to the flexibility of the fits entertained at the initial stages of the analysis. The predictor degrees of freedom, $p$, is the number of parameters allowed for consideration; in other words, it is the number of regression coefficients estimated without algebraic restrictions. As a rough rule of thumb, for a model to validate on a new sample using predictive discrimination, the predictor degrees of freedom should not exceed $m/10$, where $m$ is the number of uncensored event times in the training sample. For a binary outcome, $m$ is the number of outcomes in the less frequent category. For example, with 120 uncensored events, at most 12 candidate parameters should be examined. If $p > m/10$, then the analyst has to choose a data reduction technique that takes care of this, and shrinkage is the best method.
1.2. Problem Statement
Assessing "optimism" in regression models has been approached with a variety of methods, and there is a need to evaluate some of them so that a better one can be identified. Fitted regression models usually fit the data they are based on better than any new data; this is what we call the "optimism" of the model.
1.3. Justification of the Study
The main goal of this study is to assess the methods for evaluating "optimism" in regression models. The first step is to identify a statistic for measuring "optimism", focusing on the Cox & Snell and Nagelkerke pseudo R-square values. Using these statistics, model performance is then evaluated, and the relationship between "optimism" and overfitting determined. The available methods differ in their degree of measurement and assessment, and the challenge is identifying the best among them. The most important property of any such technique is its ability to ascertain the estimated model performance together with the model's variance and stability.
1.4. Objectives of the Study
The main objective of this study is to evaluate the methods of assessing "optimism" in regression models. The specific objectives are:
i. To determine the best statistic for assessing "optimism" in regression models.
ii. To assess model performance using "optimism" through cross-validation.
iii. To determine the relationship between "optimism" and overfitting of regression models.
1.5. Hypotheses of the Study
This study seeks to evaluate the methods of assessing "optimism" in regression models.
The study seeks to test the following hypotheses;
i. Null hypothesis: there exists no statistic for assessing "optimism" in regression models.
ii. Null hypothesis: there is no significant difference in performance among the three models (Cox, logistic, and linear regression models).
iii. Null hypothesis: there is no relationship between "optimism" and overfitting of regression models.
1.6. Significance of the Study
The common goal of every model builder, researcher, scholar, and academician is to come up with prognostic models that are reliable and accurate both for the training data and for unseen data. [9], studying prognostic models in chronic liver disease, noted that the prognostic structure in data can be studied in many different ways, but that the most recent and accurate approach was the use of regression models, in particular Cox regression models.
1.7. Scope of the Study
The study focuses on assessing the optimism exhibited by Cox regression, logistic regression, and linear regression models. Bootstrap resampling was used to evaluate the two pseudo R-square measures, Cox & Snell and Nagelkerke, for assessing optimism in these three categories of regression models. These models were chosen because of their nature and their wide use by scholars and researchers in different fields and applications.
In this chapter the study gives a detailed review of past studies on regression models and on the different types of regression models, specifically Cox, logistic, and linear regression models.
2.2. Types of Regression Models
In predictive modeling, most people consider linear and logistic regression the first algorithms to reach for; [10] lists them as the most used models in data modeling and research. It is nevertheless important to recognize that several other types and forms of regression models exist, with uses and applications for different kinds of research data; [7], for instance, tried to model binary data with linear regression but resorted to logistic regression when no valid inference could be drawn.
Among the modeling techniques, linear regression occupies the first position, and the R-square metric can easily be used to evaluate model performance. Multicollinearity, however, can inflate the variance of the coefficient estimates, making them very sensitive to minor changes in the model and producing pronounced optimism in the prediction errors; [11], in a study of multiple regression, found that multicollinearity occurs when independent variables in a regression model are correlated. For the simple linear regression model we assume a model of the form

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where $\beta_0$ and $\beta_1$ are two unknown constants representing the intercept and the slope. They are also known as the coefficients or parameters, and $\varepsilon$ is the error term. Given estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of the model parameters, we predict future values using

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,$$

where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X = x$.
Multiple linear regression
The multiple linear regression model assumes the following form:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$

Each coefficient $\beta_j$ gives the average effect on $Y$ of a one-unit increase in $X_j$, holding all other factors constant.
Interpretation of the regression coefficients
Each of the coefficients can be estimated and tested separately. Correlations amongst predictors cause problems: when correlated predictors change together, interpretation becomes hazardous, since everything else changes too. For observational data, claims of causality should be avoided.
Obtaining the likelihood function for the linear regression model
The simple linear regression model states that the errors are independent and normally distributed with mean $0$ and variance $\sigma^2$.
The linearity condition

$$E(Y_i \mid X_i = x_i) = \beta_0 + \beta_1 x_i$$

therefore implies

$$Y_i \sim N\left(\beta_0 + \beta_1 x_i,\ \sigma^2\right).$$

Therefore the likelihood function is

$$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right).$$

This can be written as

$$L(\beta_0, \beta_1, \sigma^2) = \left(2\pi\sigma^2\right)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\right).$$

Taking the log of both sides, we obtain

$$\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$
This is the log-likelihood function of the linear regression model.
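As a numerical check, the following R sketch (our own illustration, with simulated data) evaluates this log-likelihood at the least-squares fit and compares it with R's built-in logLik(). Note that logLik() for lm uses the maximum-likelihood variance estimate, dividing the residual sum of squares by $n$ rather than $n-2$.

```r
## Checking the derived log-likelihood against R's logLik() for lm().
set.seed(2)
n <- 100
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

rss    <- sum(residuals(fit)^2)
sigma2 <- rss / n                                  # MLE of the error variance
ll     <- -n/2 * log(2 * pi) - n/2 * log(sigma2) - rss / (2 * sigma2)

all.equal(ll, as.numeric(logLik(fit)))             # TRUE
```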
Logistic regression applies strictly to binary data (0/1, True/False, Yes/No). It is of great importance to note that the predicted values of the response range from 0 to 1. The logistic regression equation is

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.$$
Because logistic regression predicts probabilities rather than classes, we can fit it using the likelihood function. For each training data point we have a vector of features $x_i$ and an observed class $y_i \in \{0, 1\}$. The probability of that class is $p(x_i)$ if $y_i = 1$, or $1 - p(x_i)$ if $y_i = 0$.
The likelihood is then

$$L(\beta) = \prod_{i=1}^{n} p(x_i)^{y_i} \left(1 - p(x_i)\right)^{1 - y_i}.$$

Taking the log of both sides, the resulting equation becomes

$$\ell(\beta) = \sum_{i=1}^{n} \left[\, y_i \log p(x_i) + (1 - y_i)\log\left(1 - p(x_i)\right) \right].$$
This is the log likelihood function for a logistic regression model.
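The same kind of numerical check works here; in this illustrative sketch (our own data) the fitted probabilities from glm() are plugged into the formula above and compared with logLik().

```r
## Checking the logistic log-likelihood against R's logLik() for glm().
set.seed(3)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))   # binary response
fit <- glm(y ~ x, family = binomial)

phat <- fitted(fit)                          # p(x_i) under the fitted model
ll   <- sum(y * log(phat) + (1 - y) * log(1 - phat))

all.equal(ll, as.numeric(logLik(fit)))       # TRUE
```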
The Cox regression model provides an estimate of treatment effects on survival after adjustment for the explanatory variables; the regression employed is proportional hazards (PH) regression analysis. The Cox PH model takes the following form:

$$h(t, X) = h_0(t)\, \exp\left(\sum_{i=1}^{p} \beta_i X_i\right)$$

where $X = (X_1, \ldots, X_p)$ are the predictor/explanatory variables. The formula is the product of two quantities: $h_0(t)$, called the baseline hazard, and the exponential of the sum $\sum_{i=1}^{p} \beta_i X_i$. When all the $X_i$ are zero, the model reduces to the baseline hazard, which is an unspecified function.
Important properties of the Cox PH formula:
The baseline hazard $h_0(t)$ depends on $t$ but not on the covariates $X$, while the exponential term involves the $X_i$ but not $t$ (the $X_i$ are time-independent). The proportional hazards property follows from this separation. There are a number of reasons that make the Cox PH model popular:
1. Robustness: the Cox model is a "safe" choice of model in most modeling situations that researchers can go for.
2. The model form ensures that the estimated hazards are always non-negative.
3. Even though $h_0(t)$ is unspecified, the $\beta_i$, and hence $h(t, X)$ and the corresponding survival curves, can be estimated for a Cox model using a minimum of assumptions.
4. The Cox model is preferred to the logistic model for survival data because the logistic model ignores survival times and censoring information.
Computing the hazard ratio
The hazard ratio is defined as

$$\widehat{HR} = \frac{\hat{h}(t, X^{*})}{\hat{h}(t, X)} = \exp\left(\sum_{i=1}^{p} \hat{\beta}_i \left(X_i^{*} - X_i\right)\right)$$

where $X^{*} = (X_1^{*}, \ldots, X_p^{*})$ and $X = (X_1, \ldots, X_p)$ are the covariate vectors for the two individuals being compared.
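In R, the survival package's coxph() fits the model and reports exp(coef), the hazard ratio per unit increase of each covariate; the sketch below uses the package's bundled lung data purely for illustration.

```r
## Hazard ratios from a Cox PH fit; the "exp(coef)" column of the
## summary is exp(beta_i), the hazard ratio per unit of covariate i.
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)

## Hazard ratio comparing two individuals who differ by 10 years of age:
exp(10 * coef(fit)["age"])
```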
Obtaining the likelihood function under censored data
Assume we have $n$ units whose lifetimes are governed by a survivor function $S(t)$ with associated density $f(t)$ and hazard $\lambda(t)$, so that $f(t) = S(t)\,\lambda(t)$. Suppose unit $i$ is observed for a time $t_i$. If the unit died at $t_i$, its contribution to the likelihood function is the density at that duration, which is the product of the survivor and hazard functions:

$$L_i = f(t_i) = S(t_i)\,\lambda(t_i).$$

If the unit is still alive at time $t_i$, then under non-informative censoring the lifetime exceeds $t_i$ with probability $S(t_i)$, which becomes the contribution of the censored observation to the likelihood. Let $d_i$ be a death indicator taking the value one if unit $i$ died and zero otherwise; then the likelihood can be written as

$$L = \prod_{i=1}^{n} \lambda(t_i)^{d_i}\, S(t_i).$$

Taking the log, and recalling the expression $S(t) = \exp\left(-\Lambda(t)\right)$ linking the survival function and the cumulative hazard function $\Lambda(t)$, we obtain the log-likelihood function for the survival model:

$$\log L = \sum_{i=1}^{n} \left[\, d_i \log \lambda(t_i) - \Lambda(t_i) \right].$$
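For a quick numerical illustration (toy data of our own), take a constant hazard $\lambda(t) = \lambda$, so that $\Lambda(t) = \lambda t$; the log-likelihood reduces to $\sum_i (d_i \log \lambda - \lambda t_i)$, which is maximized at $\hat{\lambda} = \sum_i d_i / \sum_i t_i$.

```r
## Censored log-likelihood under a constant (exponential) hazard.
t <- c(5, 8, 12, 3, 9)      # observation times (toy values)
d <- c(1, 0, 1,  1, 0)      # death indicators: 1 = died, 0 = censored

lam    <- sum(d) / sum(t)               # MLE: total deaths / total exposure
loglik <- sum(d * log(lam) - lam * t)   # sum of d_i*log(lambda) - Lambda(t_i)
```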
Interpretation of the Hazard Ratio
The hazard for one individual is divided by the hazard for a different individual. During interpretation one usually wants $\widehat{HR} \geq 1$, that is, $\hat{h}(t, X^{*}) \geq \hat{h}(t, X)$. This means that $X^{*}$ denotes the group with the larger hazard and $X$ the group with the smaller hazard.
2.3. Model Validation and Assessment
Scrutinizing the apparent accuracy of a multivariable model on the training dataset alone is not useful; [12], applying parametric spectral analysis to multichannel event-related potentials during cognitive experiments, found that model assessment was crucial for proper data processing and prediction.
This chapter describes the methodological features of the study.
3.2. Design of the Study
The study design was a simulation. The simulated data formed our population and the original data from which the bootstrap samples were obtained. The discrimination statistics, the two pseudo R-square values of Cox & Snell and Nagelkerke (Cragg & Uhler's), were compared.
The Cox & Snell value,

$$R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n},$$

together with the Nagelkerke value,

$$R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{\,2/n}},$$

were obtained from the simulated data, where:
$L_0$ is the likelihood of the intercept-only model (the model without predictors);
$L_M$ is the likelihood of the model with parameters;
$n$ is the number of observations in the data set.
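A small R helper (our own sketch, with illustrative data) computes both statistics from the null-model and full-model log-likelihoods exactly as defined above.

```r
## Cox & Snell and Nagelkerke pseudo R-squares from two log-likelihoods.
pseudo_r2 <- function(ll0, llm, n) {
  cs     <- 1 - exp((2 / n) * (ll0 - llm))   # R^2_CS = 1 - (L0/LM)^(2/n)
  cs_max <- 1 - exp((2 / n) * ll0)           # attainable maximum, 1 - L0^(2/n)
  c(cox_snell = cs, nagelkerke = cs / cs_max)
}

## Illustration with a logistic model on a built-in data set:
fit  <- glm(am ~ wt + hp, data = mtcars, family = binomial)
null <- glm(am ~ 1,       data = mtcars, family = binomial)
pseudo_r2(as.numeric(logLik(null)), as.numeric(logLik(fit)), nrow(mtcars))
```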
1. The first step is to develop the model using all $n$ subjects and carry out any testing that may be necessary. Let $R^2_{app}$ denote the apparent $R^2$ from the model formed; this is the scaled chi-square computed on the same sample from which the fit was derived.
2. Next, generate a bootstrap sample of size $n$ with replacement from the original sample, considering both predictors and response.
3. Fit the model on the bootstrap sample and compute its apparent $R^2$; denote it $R^2_{boot}$.
4. 'Freeze' the model developed on the bootstrap sample and evaluate its performance on the original dataset; let $R^2_{orig}$ denote the resulting $R^2$.
5. Compute the optimism as $O = R^2_{boot} - R^2_{orig}$.
6. Steps 2 to 5 are then repeated $B$ times, with $B$ at least 100.
The corrected bootstrap performance of the original model is $R^2_{corrected} = R^2_{app} - \bar{O}$, where $\bar{O}$ is the average optimism over the $B$ repetitions. This difference is close to an unbiased estimate of the expected value of the external predictive discrimination of the process that generated $R^2_{app}$, and is therefore an estimate of internal validity that penalizes for overfitting.
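Putting the steps together, the following R sketch implements the procedure for a logistic model under the Cox & Snell statistic; the data set (mtcars) and formula are stand-ins for the study's simulated data, and fitted probabilities are clamped away from 0 and 1 to guard against separation in small bootstrap samples.

```r
## Bootstrap optimism (steps 1-6) for a logistic model, Cox & Snell R^2.
ll_binom <- function(y, p) {
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)       # guard against log(0)
  sum(y * log(p) + (1 - y) * log(1 - p))
}
cs_r2 <- function(ll0, llm, n) 1 - exp((2 / n) * (ll0 - llm))

dat <- mtcars; n <- nrow(dat)
fit    <- glm(am ~ wt + hp, data = dat, family = binomial)   # step 1
ll0    <- ll_binom(dat$am, mean(dat$am))                     # intercept-only
r2_app <- cs_r2(ll0, ll_binom(dat$am, fitted(fit)), n)       # apparent R^2

set.seed(4)
B <- 200
opt <- replicate(B, {
  b    <- dat[sample(n, replace = TRUE), ]                   # step 2
  bfit <- glm(am ~ wt + hp, data = b, family = binomial)
  r2_boot <- cs_r2(ll_binom(b$am, mean(b$am)),
                   ll_binom(b$am, fitted(bfit)), n)          # step 3
  p_orig  <- predict(bfit, newdata = dat, type = "response") # step 4: freeze
  r2_orig <- cs_r2(ll0, ll_binom(dat$am, p_orig), n)
  r2_boot - r2_orig                                          # step 5
})                                                           # step 6: B times
r2_app - mean(opt)                                           # corrected R^2
```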
Using data from a hypothetical population, simulations were conducted at each individual setting of the variables. Averages of the performance measures were taken over the $B$ repetitions for a chosen number $m$ of events on the predictor variables.
From the model for obtaining optimism, $O = R^2_{boot} - R^2_{orig}$, optimism was computed for the two statistics and each of the three models as follows.

Optimism 1 (Cox & Snell): $O_{CS} = R^2_{CS,boot} - R^2_{CS,orig}$, computed separately for the Cox regression, logistic regression, and linear regression models.

Optimism 2 (Nagelkerke): $O_{N} = R^2_{N,boot} - R^2_{N,orig}$, again computed for the Cox regression, logistic regression, and linear regression models.

From these measures the smallest optimism was identified, and the statistic whose optimism was closest to zero was taken to be the best.
Using the optimal statistic from the procedure above (denote it $R^2_{opt}$), values were compared across the three models. The original model was built from the training data and its apparent values of $R^2_{opt}$ obtained; bootstrap samples then served as the testing sets for the Cox, logistic, and linear models. For each model, the differences in $R^2_{opt}$ were averaged to give the value of "optimism". The model whose "optimism" was closest to zero was regarded as the best performing.
Models with at least three parameters were built and the value of $R^2_{opt}$ obtained for each of the three categories of interest (Cox, logistic, and linear regression models). From the models initially built, the number of parameters was increased from three to four, five, six, seven and, where possible, eight, and the values of $R^2_{opt}$ obtained each time. The values of "optimism" were then computed and compared across the models with reference to the number of parameters modeled; a sketch of this loop is given below.
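The sketch shows the shape of that experiment. Here `boot_optimism` stands for a hypothetical wrapper around the bootstrap routine above, `simdat` stands for the simulated data described in the next section, and the predictor names are illustrative.

```r
## Hypothetical sketch: optimism as the number of predictors grows.
## boot_optimism() is assumed to wrap the bootstrap loop above and
## return the mean optimism for a given model formula.
preds <- paste0("x", 1:8)                       # hypothetical predictor names
opt_by_size <- sapply(3:8, function(k) {
  f <- reformulate(preds[1:k], response = "y")  # y ~ x1 + ... + xk
  boot_optimism(f, data = simdat)
})
```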
The data for this study were obtained through simulation of hypothetical populations. For the Cox regression and logistic models, at least two categorical variables formed part of the predictor variables. The R packages "simstudy", "survival", and "BaylorEdPsych" were used.
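A hedged sketch of how such data can be generated with simstudy is given below; the variable names, distributions, and coefficients are illustrative choices, not the study's actual settings.

```r
## Illustrative simulation of a population with simstudy.
library(simstudy)

def <- defData(varname = "x1", dist = "normal", formula = 0, variance = 1)
def <- defData(def, varname = "grp", dist = "categorical", formula = "0.4;0.6")
def <- defData(def, varname = "y", dist = "binary",
               formula = "-1 + 0.8 * x1 + 0.5 * (grp - 1)", link = "logit")

simdat <- genData(1000, def)   # 1000 simulated subjects
```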
The study was mainly based on simulated data; to ensure the data were fit for use in achieving the objectives of the study, they were exposed to a number of checks.
To perform a trend test, the Cox model was fit with a factor predictor variable scored as 1, 2, 3, …, and a post hoc trend test was conducted afterwards.
Note that the level status = 0 of the status variable is implicitly part of the contrast. The full coefficient vector is (0, 0.369, 0.916, 2.208) and the linear contrast is (0, 1, 2, 3). Thus the data are fit for continued analysis.
These give the average probability of surviving a given period of time, as shown in the graph below.
The Q-Q plots below provide visual diagnostics for the binary data and for the data simulated for linear modeling.
4.2. Identifying the Best Statistic for Assessing "Optimism" in Regression Models
The two inferential pseudo R-square discrimination statistics were obtained, forming our apparent values of interest. The values from the three models were as shown below.
From Table 2, the original values were 0.07347012 and 0.078304671 for the Cox regression model under the Cox & Snell and Nagelkerke pseudo R-square statistics respectively. The average pseudo R-square values after running 800 bootstrap samples were 0.07998 and 0.085951625, giving optimism values of 0.00651 and 0.00765 for the Cox & Snell and Nagelkerke statistics respectively. For the linear regression model the original values were 0.516475 and 0.711327, again for Cox & Snell and Nagelkerke in that order, while the average pseudo R-square values after 800 bootstrap samples were 0.519114 and 0.672839, giving optimism values of 0.00264 and 0.03849. Thirdly, for the logistic regression model the original values were 0.018877 and 0.012092, while the average pseudo R-square values were 0.121288 and 0.148447, yielding "optimism" values of 0.10241 and 0.13636. The different values of "optimism" exhibited by the two statistics for the three models confirm that "optimism" in a measure of predictive ability is a function of the size of the data set, other things held constant. In line with [3], this does not rule out the fact that "optimism" is also a function of the complexity of the fitted model.
The Cox & Snell pseudo R-square is given as

$$R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n},$$

while the Nagelkerke pseudo R-square takes the form

$$R^2_{N} = \frac{1 - \left(L_0 / L_M\right)^{2/n}}{1 - L_0^{\,2/n}},$$

where $L_0$ is the likelihood function of the intercept-only model, i.e. the model with only the intercept term, and $L_M$ is the likelihood function of the full model.
The ratio of the likelihoods reflects the improvement of the full model over the intercept-only model (the smaller the ratio, the greater the improvement). If there are $n$ observations in the data set, the likelihood is the product of $n$ per-observation probabilities; taking the $n$-th root of the product thus provides an estimate of the likelihood per observation.
It is clear from the results above that the Cox & Snell pseudo R-square statistic indeed provides a good measure of "optimism".
4.3. Assessing Model Performance Using "Optimism" through Cross-validation
For the second objective, assessing model performance using "optimism" through cross-validation, the focus was on the optimal statistic from the first objective, the Cox & Snell pseudo R-square. The simulated data were partitioned into two: seventy-five percent (75%) was used for developing (training) the model, while twenty-five percent (25%) was used to test and validate it. The optimism values for the three models were obtained as shown in the table below: the linear regression model had the lowest "optimism" value of 0.04438, followed by the Cox regression model with an "optimism" value of 0.06473, with the logistic regression model third at 0.15682.
This means that linear regression models perform better in prediction than Cox and logistic regression models. According to Oredein et al. (2011), model validity is reasonableness and stability of performance on the prognostic measures of interest. These results show that the value of "optimism" under the Cox & Snell pseudo R-square statistic can indeed be used to measure model performance.
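The partition itself is a one-liner in R; this sketch reuses the placeholder simulated data and helper functions from earlier, and only indicates where the train and test Cox & Snell values would be compared.

```r
## 75/25 train/test split; simdat, ll_binom and cs_r2 as defined earlier.
set.seed(5)
idx   <- sample(nrow(simdat), size = floor(0.75 * nrow(simdat)))
train <- simdat[idx, ]                    # 75%: develop the model
test  <- simdat[-idx, ]                   # 25%: validate it
fit <- glm(y ~ x1 + factor(grp), data = train, family = binomial)
## "optimism" here: Cox & Snell R^2 on the training data minus the
## R^2 obtained from the frozen model's predictions on the test data.
```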
4.4. Determining the Relationship between "Optimism" and Overfitting of Regression Models
To achieve this objective, the study varied two factors that influence the prediction and performance of prognostic models: the sample size and the number of predictor variables.
Bootstrap samples of different sizes were drawn and the size of the "optimism" under the Cox & Snell statistic determined, then compared across the three models (Cox, logistic, and linear regression). The results were as shown in the table below: small samples had low "optimism" while larger samples showed increasing "optimism". For the Cox regression model, n = 400 gave an optimism of 0.00805740 while n = 3000 gave 0.0520398, and "optimism" likewise increased with sample size for the logistic and linear models. This relates to Peduzzi and Concato's (1995) rule of 5-10 events per variable, under which too few events per variable lead to overfitting and optimism. The correlation between sample size and optimism was positive for all three models: 0.3696021 for the Cox regression model, 0.4388737 for logistic regression, and 0.6382342 for the linear regression model.
This is a clear indication that there exists a positive relationship between sample size and "optimism" for prognostic models.
To achieve this, the study obtained the "optimism" values of models fit with at least four predictors. Using the three-predictor model as the reference measure of "optimism", the values of "optimism" for the other models were obtained as shown in the table below: "optimism" increased with the number of predictor variables for all three models. With four predictors, the optimism was 0.083384358 for the Cox regression model, 0.00711203 for the linear model, and 0.081523 for the logistic model. With eight predictors, the optimism values were 0.088940566 for the Cox regression model, 0.03582399 for the linear model, and 0.0853161 for the logistic model.
According to [13], overfitting results in "optimism" about a model's performance on new data. In overfitting, a model describes the random error, or noise, instead of the underlying relationship. This agrees with the theory of model complexity: an attempt to overfit a prognostic model will automatically make the model optimistic. Hence it can be concluded from the results that overfitting has a direct positive relationship with optimism.
The multiple line graphs below give a pictorial presentation of the relationship between "optimism" and overfitting for the three prognostic models.
It can be seen that the logistic model has the highest "optimism" values as the number of predictors is increased, while linear regression models have the least tendency to be "optimistic".
The main objective of the study was to evaluate methods used to assess "optimism" in regression models; the use of inferential pseudo R-square statistics through bootstrapping proved very informative and reliable. The Cox & Snell pseudo R-square statistic provided a way to measure optimism in models for which the ordinary R-square cannot be computed, the logistic regression model being a special case. Note that choosing Cox & Snell as the best statistic for measuring optimism does not rule out the Nagelkerke statistic, since both measure model performance, an important element for every model builder, and under large samples they give similar results. Levels of optimism have a direct influence on model performance: optimistic models give unreliable results, since they predict well only the data that was used to develop them.
Larger samples minimize prediction errors; however, when noise variables are modeled instead of the underlying variables of interest, the model fails the test of good fit and prediction. Large samples give a wide window for modeling noise variables compared with smaller samples, yet care should also be taken when deciding the sample size to avoid underfitting, where the model fails to capture the trend of the data at hand. The more predictor variables, the more optimistic the model becomes, rendering it less reliable in prediction. From the results of the study it is plausible to recommend the use of pseudo R-square statistics for determining the "optimism" of regression models.
Further studies should assess the discrimination ability of models using the pseudo R-square statistics, as well as the possibility of using the pseudo statistics, rather than the ordinary R-square, when inferring on model fit, since they can be computed for all prognostic models.
References
[1] Curtis, K. (2012). Book Review: Ward, M. D., & Gleditsch, K. S. (2008). Spatial Regression Models. Thousand Oaks, CA: Sage. ISBN 978-1-4129-5415-0. Sociological Methods & Research, 41(4), 671-674.
[2] Fahrmeir, L. (2013). Regression. Berlin: Springer.
[3] Bartlett, J. (2014). Adjusting for Optimism/Overfitting in Measures of Predictive Ability Using Bootstrapping.
[4] Kasza, J., & Wolfe, R. (2013). Interpretation of commonly used statistical regression models. Respirology, 19(1), 14-21.
[5] Rispoli, F. J., & Shah, V. (2015). Using Simulation to Test the Reliability of Regression Models. Energy and Environment Research, 5(1).
[6] Sugiyama, M. (2016). Model Selection for Maximum Likelihood Estimation. Introduction to Statistical Machine Learning, 147-156.
[7] Ziegel, E. R., & Staff, S. I. (1996). Logistic Regression Examples Using the SAS System. Technometrics, 38(1), 86.
[8] Shingleton, J. (2003). Crime Trend Prediction Using Regression Models for Salinas, California.
[9] Christensen, E. (1997). Prognostic models in chronic liver disease: validity, usefulness and future role.
[10] Smith, H. (2014). Regression Models, Types of. Wiley StatsRef: Statistics Reference Online.
[11] Mannan, H. R., & McNeil, J. J. (2012). Computer programs to estimate overoptimism in measures of discrimination for predicting the risk of cardiovascular diseases. Journal of Evaluation in Clinical Practice, 19(2), 358-362.
[12] Leon, L., & Cai, T. (2012). Model checking techniques for assessing functional form specifications in censored linear regression models. Statistica Sinica, 22(2).
[13] Steyerberg, E. (1999). Stepwise Selection in Small Data Sets: A Simulation Study of Bias in Logistic Regression Analysis. Journal of Clinical Epidemiology, 52(10), 935-942.
[14] Kazak, A., & Kazak, R. (2003). Does cross validation provide additional information in the evaluation of regression models? Canadian Journal of Forest Research, 33(6), 976-987.
Published with license by Science and Education Publishing, Copyright © 2018 Daniel Thoya, Antony Waititu, Thomas Magheto and Antony Ngunyi
This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/