Modification of the Sandwich Estimator in Generalized Estimating Equations with Correlated Binary Outcomes in Rare Event and Small Sample Settings

Paul Rogers; Julie Stoner

doi:10.12691/ajams-3-6-5

Research Article

Open Access Peer-reviewed

American Journal of Applied Mathematics and Statistics. 2015, 3(6), 243-251. DOI: 10.12691/ajams-3-6-5

Published online: November 23, 2015

Abstract

Regression models for correlated binary outcomes are commonly fit using a Generalized Estimating Equations (GEE) methodology. GEE uses the Liang and Zeger sandwich estimator to produce unbiased standard error estimators for regression coefficients in large sample settings even when the covariance structure is misspecified. The sandwich estimator performs optimally in balanced designs when the number of participants is large and there are few repeated measurements. The sandwich estimator is not without drawbacks; its asymptotic properties do not hold in small sample settings. In these situations, the sandwich estimator is biased downwards, underestimating the variances. In this project, a modified form of the sandwich estimator is proposed to correct this deficiency. The performance of this new sandwich estimator is compared to the traditional Liang and Zeger estimator as well as alternative forms proposed by Morel, Pan, and Mancl and DeRouen. The performance of each estimator was assessed with 95% coverage probabilities for the regression coefficient estimators using simulated data under various combinations of sample sizes and outcome prevalence values with Independence (IND), Autoregressive (AR), and Compound Symmetry (CS) correlation structures. This research is motivated by investigations involving rare-event outcomes in aviation data.

Keywords: sandwich estimator, generalized estimating equation, rare event, finite sample, binary outcome, correlated outcome

1. Introduction

Regression models with binary outcome variables are prevalent in all research disciplines. If the data are independent, then the covariance between two measured values, which is a measure of linear dependence, is zero. If the data are dependent, Generalized Estimating Equations (GEE) can be used to account for the correlation, which is a function of the covariance among repeated or clustered measurements ¹. The GEE framework contains options for the working covariance structure based upon the assumed pattern of correlation within the data. One of the strengths of using GEE is that the sandwich or robust variance estimator produces unbiased standard errors in large sample sizes for the regression coefficients even when the covariance structure is misspecified. This is a tremendous advantage, but the sandwich estimator of variance is not without drawbacks.

It is well known that the GEE methodology has issues with small sample sizes due to the asymptotic properties inherent in the covariance sandwich estimator ², ³. Fitzmaurice et al. noted that in small or finite sample sizes, Wald tests using the Liang-Zeger sandwich estimator tend to produce p-values that are too small ³. The sandwich estimator of variance is biased downward; that is, it underestimates the variability of parameter estimators in small sample sizes. Much research has been performed to improve the performance of GEE under these circumstances. This is evidenced in the works of Mancl and DeRouen as well as Pan ⁴, ⁵. Rare outcomes pose a problem as well. Even with a large sample size, a rare outcome can be viewed as a small sample problem. That is, the information concerning the event of interest is, by itself, a small sample. Adding records that do not have the outcome of interest gives no additional information to the model. If an event becomes rare enough, it becomes extremely difficult to collect enough information to construct an informative regression model. The problems GEE experiences with finite sample sizes can be exacerbated when coupled with a rare outcome.

Rare events, defined here as binary outcomes that have tens of thousands to hundreds of thousands of non-events (zeroes) relative to the outcomes of interest (ones), can be a challenge in observational studies or clinical trials. Logistic regression methods for independent data accommodate binary or ordinal outcomes but can produce predicted probabilities that grossly underestimate the true probability of a rare event ⁶. At present, very few methods are available for modeling and analyzing longitudinal rare event data. The methods currently available are based upon the Poisson distribution and are appropriate when the dependent variable involves count data. In rare event situations with dependent data, the variance matrix for the regression coefficients of the standard logistic regression model is biased; the estimated variances are smaller than the true variances. Furthermore, Carroll and colleagues found that under rare event conditions the use of the sandwich estimator with the logistic regression model produced undercoverage of Wald-type confidence intervals. In the case of logistic regression using the sandwich estimator, “an important part of sample size considerations is the number of events” ². In other words, decreasing sample sizes with rare outcomes can worsen the bias of the sandwich estimator.

The GEE methods are fairly robust and compensate for correlation among repeated measures or clustered data. However, in rare event and finite sample size settings, the variances and covariances generated by these models are underestimated and lead to erroneous inferences. Other investigators have proposed corrections for rare events and finite sample sizes with correlated data but there is no universally agreed upon solution for dealing with these circumstances. These solutions have resulted in alternative sandwich estimators that still have performance issues.

We propose the use of an improved sandwich estimator that has the ability to produce unbiased estimates of variances and covariances in studies of correlated data with rare event and small sample sizes. Our approach will be to adjust the sandwich estimator to compensate for underestimation in these situations. In general, this adjustment is performed by taking an alternate sandwich estimator, developed by Pan, and improving its performance in small sample size and rare event settings by adding an appropriate inflation factor, while still preserving the asymptotic nature of the sandwich estimator. The performance of this improved sandwich estimator will be evaluated with simulated and real-world datasets.

2. Generalized Estimating Equations and the Sandwich Covariance Estimator

In general, if $Y$ is a response variable and $X$ is a covariate of interest for $K$ subjects, a regression model can be utilized to describe their relationship. In the case of longitudinal data, $j$ is the index for the observations within a given subject. The number of repeated measurements on individual $i$ will be represented as $n_i$, with $Y_{ij}$ being the measurement at the $j$th interval for the $i$th subject. Marginal models are based on quasi-likelihood and are similar in form to the Generalized Linear Model (GLM) in that a link function is used to specify a mathematical relationship, involving regression coefficients, between a marginal mean response and one or more independent variables.

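Regarding the GEE methodology, if $\mu_i$ is the vector of predicted means for the $i$th individual and $p$ is the number of regression coefficients, then $D_i = \partial \mu_i / \partial \beta^\top$ will be used to represent the partial derivatives of the vector of predicted means with respect to the vector of regression coefficients $\beta$. Then $D_i$ is an $n_i \times p$ matrix of these partial derivatives and appears as follows:

$$D_i = \begin{pmatrix} \partial\mu_{i1}/\partial\beta_1 & \cdots & \partial\mu_{i1}/\partial\beta_p \\ \vdots & \ddots & \vdots \\ \partial\mu_{in_i}/\partial\beta_1 & \cdots & \partial\mu_{in_i}/\partial\beta_p \end{pmatrix}$$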

The variance of the dependent variable, $\operatorname{Var}(Y_{ij})$, in the quasi-likelihood method, just as it is in GLM, can be expressed as a function of the mean: $\operatorname{Var}(Y_{ij}) = \phi\, v(\mu_{ij})$. Here $\phi$ is a scale parameter estimated from the data and is sometimes referred to as a nuisance parameter, as it is typically not of primary interest.

If $Y_i$ is used to indicate the $n_i \times 1$ vector of outcomes for individual $i$, then let $v(\mu_i)$ be the vector of variances for these outcomes. $A_i$ is a diagonal matrix that has taken on the values of the vector $v(\mu_i)$. Let $\alpha$ represent the correlation within the clustered measurements; then $R(\alpha)$ is the working correlation matrix for these same quantities. In this study, it is assumed that there is a correlation structure common to all subjects. If $A_i$ is an $n_i \times n_i$ matrix with the variances of $Y_{ij}$ on the diagonal, then let $V_i = \phi A_i^{1/2} R(\alpha) A_i^{1/2}$ indicate the working covariance matrix for these same measurements; $V_i$ depends on the correlation structure.

In the GEE method, when the dependent variable comes from the exponential family, the following are the score equations for the regression coefficients:

$$\sum_{i=1}^{K} D_i^\top V_i^{-1} (Y_i - \mu_i) = 0$$

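Liang and Zeger (1986) demonstrated that as the number of subjects or clusters ($K$) increases, $\hat{\beta}$ is a consistent estimator for $\beta$. That is, as $K \to \infty$, $\sqrt{K}(\hat{\beta} - \beta)$ is asymptotically multivariate Gaussian with zero mean, and the covariance matrix of $\hat{\beta}$ is estimated as follows.

$$V_{LZ} = \left(\sum_{i=1}^{K} D_i^\top V_i^{-1} D_i\right)^{-1} \left(\sum_{i=1}^{K} D_i^\top V_i^{-1} \operatorname{Cov}(Y_i) V_i^{-1} D_i\right) \left(\sum_{i=1}^{K} D_i^\top V_i^{-1} D_i\right)^{-1} \qquad (1)$$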

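When estimates of $\beta$ and $\alpha$ are inserted and $\operatorname{Cov}(Y_i)$ is replaced by $r_i r_i^\top$, with $r_i = Y_i - \hat{\mu}_i$, $V_{LZ}$ is referred to as the empirical, or robust sandwich, variance matrix.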
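As an illustration of equation (1), the following Python sketch computes the model-based and Liang-Zeger sandwich covariance matrices for a marginal logistic model. It assumes NumPy, a logit link, a scale parameter fixed at 1, a balanced design with a common working correlation matrix R, and a coefficient vector that has already been estimated. The names lz_sandwich, X_list, and y_list are hypothetical and are not taken from the article.

```python
import numpy as np

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))

def lz_sandwich(beta, X_list, y_list, R):
    """Liang-Zeger sandwich covariance for a logistic marginal model.

    X_list[i] is the (n x p) design matrix and y_list[i] the length-n 0/1
    outcome vector of cluster i; R is the common (n x n) working
    correlation matrix. Assumption: scale parameter phi = 1.
    Returns (sandwich covariance, model-based covariance).
    """
    p = len(beta)
    bread = np.zeros((p, p))  # sum of D_i' V_i^{-1} D_i
    meat = np.zeros((p, p))   # sum of D_i' V_i^{-1} r_i r_i' V_i^{-1} D_i
    for X, y in zip(X_list, y_list):
        mu = expit(X @ beta)
        a = mu * (1.0 - mu)               # binomial variance function
        D = a[:, None] * X                # d mu_i / d beta' under the logit link
        s = np.sqrt(a)
        V = s[:, None] * R * s[None, :]   # A_i^{1/2} R A_i^{1/2}
        Vinv_D = np.linalg.solve(V, D)    # V_i^{-1} D_i
        u = Vinv_D.T @ (y - mu)           # D_i' V_i^{-1} r_i
        bread += D.T @ Vinv_D
        meat += np.outer(u, u)
    model_based = np.linalg.inv(bread)
    return model_based @ meat @ model_based, model_based
```

The square roots of the diagonal of the first returned matrix would serve as robust standard errors. With few clusters the meat is an average over few outer products, which is where the downward bias discussed in this article enters.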

3. Summary of Small-Sample Covariance Estimators

The Liang-Zeger sandwich estimator is used frequently in GEE since it produces valid standard errors asymptotically, even if the covariance structure is misspecified. The bias of the sandwich estimator shrinks as the sample size, or number of independent clusters, increases.

The problems caused by rare outcomes relative to the use of GEE models were first noted by Gunsolley while exploring the performance of GEEs with binary outcomes using a compound symmetry covariance structure ⁷.

3.1. Pan Estimator

Pan argued that the covariance calculated within the sandwich estimator, $r_i r_i^\top$ with $r_i = Y_i - \hat{\mu}_i$, is not an optimal estimator of $\operatorname{Cov}(Y_i)$ because it is based on data from only the $i$th subject and is neither efficient nor consistent ⁵. Pan proposed an improvement to the sandwich estimator by using a pooled, or averaged, covariance based upon all subjects. This enhancement depends on two assumptions to preserve the asymptotic nature of Pan’s estimator:

Assumption 1. The marginal variance of $Y_{ij}$ needs to be modeled correctly.

Assumption 2. There is a common correlation structure across all subjects.

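That is, $\operatorname{Cov}(Y_i) = \phi A_i^{1/2} R_0 A_i^{1/2}$, and $R_0$ is estimated by the pooled matrix $\hat{R}_U = \frac{1}{\phi K} \sum_{i=1}^{K} A_i^{-1/2} r_i r_i^\top A_i^{-1/2}$, with $r_i = Y_i - \hat{\mu}_i$, a correlation matrix obtained without any parametric specification.

Pan’s sandwich estimator ($V_{Pan}$), in matrix notation, can then be written as:

$$V_{Pan} = V_M \left(\sum_{i=1}^{K} D_i^\top V_i^{-1} \left[\phi A_i^{1/2} \hat{R}_U A_i^{1/2}\right] V_i^{-1} D_i\right) V_M, \qquad V_M = \left(\sum_{i=1}^{K} D_i^\top V_i^{-1} D_i\right)^{-1} \qquad (2)$$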

Pan claimed this modified sandwich estimator has greater efficiency than that proposed by Liang and Zeger as the covariance is based upon all subjects. The results of his initial simulations using an exchangeable and independence covariance structure with both a binary and continuous outcome variable support this claim ⁵.
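The pooling idea behind Pan's estimator can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes NumPy, a logit link, a scale parameter of 1, a balanced design (every cluster has the same number of observations), and an already estimated coefficient vector. The function name pan_sandwich is hypothetical.

```python
import numpy as np

def pan_sandwich(beta, X_list, y_list, R):
    """Pan-type sandwich covariance with a pooled residual covariance.

    Assumptions: logit link, phi = 1, balanced clusters, common working
    correlation matrix R.
    """
    K = len(X_list)
    p = len(beta)
    # Pass 1: pool the standardized residual cross-products over all clusters.
    pooled = 0.0
    cached = []
    for X, y in zip(X_list, y_list):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        a = mu * (1.0 - mu)
        e = (y - mu) / np.sqrt(a)          # A_i^{-1/2} r_i
        pooled = pooled + np.outer(e, e)
        cached.append((X, a))
    pooled = pooled / K                    # unstructured correlation estimate
    # Pass 2: plug the pooled covariance into the middle of the sandwich.
    bread = np.zeros((p, p))
    meat = np.zeros((p, p))
    for X, a in cached:
        s = np.sqrt(a)
        D = a[:, None] * X
        V = s[:, None] * R * s[None, :]       # working covariance
        W = s[:, None] * pooled * s[None, :]  # pooled estimate of Cov(Y_i)
        Vinv_D = np.linalg.solve(V, D)
        bread += D.T @ Vinv_D
        meat += Vinv_D.T @ W @ Vinv_D
    V_model = np.linalg.inv(bread)
    return V_model @ meat @ V_model
```

When every cluster shares the same design matrix, the pooled middle term reduces to the Liang-Zeger one, so the two estimators coincide; they differ once the covariates vary across clusters.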

3.2. Mancl and DeRouen Estimator

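Mancl and DeRouen proposed replacing the covariance of Liang and Zeger’s (1986) sandwich estimator ($r_i r_i^\top$) with one corrected for bias. That is, $V_{LZ}$ from equation (1) becomes the bias-corrected sandwich estimator ($V_{MD}$) ⁴:

$$V_{MD} = V_M \left(\sum_{i=1}^{K} D_i^\top V_i^{-1} (I - H_{ii})^{-1} r_i r_i^\top (I - H_{ii}^\top)^{-1} V_i^{-1} D_i\right) V_M \qquad (3)$$

where $r_i = Y_i - \hat{\mu}_i$, $I$ is an identity matrix, $V_M = \left(\sum_{i=1}^{K} D_i^\top V_i^{-1} D_i\right)^{-1}$ is the “naïve” or model-based variance estimator, and $H_{ii} = D_i V_M D_i^\top V_i^{-1}$. Mancl and DeRouen justify this correction on the grounds that the true expected value of $r_i r_i^\top$ is approximately expressed as $(I - H_{ii}) \operatorname{Cov}(Y_i) (I - H_{ii})^\top$ rather than $\operatorname{Cov}(Y_i)$.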
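The leverage adjustment in equation (3) can be sketched in Python as below. As with any such sketch, the setup is an assumption: NumPy, a logit link, a scale parameter of 1, a common working correlation matrix R, and a previously estimated coefficient vector. The function name md_sandwich is hypothetical.

```python
import numpy as np

def md_sandwich(beta, X_list, y_list, R):
    """Bias-corrected sandwich covariance in the style of Mancl and DeRouen.

    Assumptions: logit link, phi = 1, common working correlation matrix R.
    """
    p = len(beta)
    bread = np.zeros((p, p))
    cached = []
    # Pass 1: build the model-based covariance from all clusters.
    for X, y in zip(X_list, y_list):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        a = mu * (1.0 - mu)
        s = np.sqrt(a)
        D = a[:, None] * X
        V = s[:, None] * R * s[None, :]
        Vinv_D = np.linalg.solve(V, D)
        bread += D.T @ Vinv_D
        cached.append((D, Vinv_D, y - mu))
    V_model = np.linalg.inv(bread)
    # Pass 2: inflate each residual vector by (I - H_ii)^{-1}.
    meat = np.zeros((p, p))
    for D, Vinv_D, r in cached:
        H = D @ V_model @ Vinv_D.T         # H_ii = D_i V_M D_i' V_i^{-1}
        r_adj = np.linalg.solve(np.eye(len(r)) - H, r)
        u = Vinv_D.T @ r_adj
        meat += np.outer(u, u)
    return V_model @ meat @ V_model
```

With one observation per cluster and an independence structure, this reduces to dividing each residual by one minus its leverage, which makes the direction of the correction easy to see: high-leverage clusters have their residuals inflated the most.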

3.3. Morel Estimator

Morel originally explored the covariance matrix estimate in logistic regression in complex survey designs as a product of the application of a Taylor series expansion ⁸. These results were later extended to the sandwich estimator used within the GEE framework ⁹. Morel and colleagues clearly delineated the source of the bias suffered by the sandwich estimator in small sample sizes. It was demonstrated that most software implementations of the sandwich estimator omit the term $\frac{m^* - 1}{m^* - p} \cdot \frac{K}{K - 1}$, where $m^* = \sum_{i=1}^{K} n_i$ and $n_i$ represents the number of units in cluster $i$. This term is part of the Taylor series estimation of the sandwich estimator. The omission of these terms is less serious when the sample size or number of clusters is large but becomes increasingly significant as the sample size is reduced. Morel (2003) proposed re-introducing these terms to adjust for bias in the sandwich estimator. He also recommended inflating the sandwich estimator by adding a scaled version of the sandwich estimator trace to itself. This concept is unlike those previously proposed in that it applies the adjustment to the entire sandwich estimator. Whereas most adjustments took place inside the calculation of the covariance of the sandwich estimator, this one was applied outside of the summation and not to the individual residuals. Morel’s version of the sandwich estimator, adjusted with the trace, is referred to as $V_{Morel}$.

(4)

where:

Simulation results supported this modified approach because Type I error rates were nominal, even in small sample sizes, unlike the unmodified GEE and model-based covariance estimators.

A variant of Morel’s original estimator was included in this comparative study for evaluation purposes. It is identical to the estimator described in equation (4) but was inflated with the determinant rather than the trace. Morel had originally suggested evaluating the performance of the sandwich estimator inflated with the determinant but had never done so. This paper will be the first to evaluate the performance of this variant of the sandwich estimator.

Incorporating the changes proposed by Pan (2001) and Morel (2003), with some additional modifications, a new hybrid sandwich estimator will be constructed. Building on a fusion of these concepts, we believe a modified GEE estimator can be constructed that delivers accurate probabilities, nominal Type I error rates, and confidence intervals with proper coverage. The performance of this hybrid sandwich estimator is compared to the estimators of Liang and Zeger (1986), Mancl and DeRouen (2001), Morel (2003), and Pan (2001). A summary of the sandwich estimators is given in Table 1.

Table 1. Summary of sandwich estimators

4. A New Hybrid Sandwich Estimator

A new hybrid sandwich estimator was created by inflating Pan’s estimator with a scaled version of the determinant. The determinant is a physical representation of the area or volume of the variances and covariances of the sandwich estimator ¹⁰. In terms of the volume, the determinant of the sandwich estimator can be expressed as:

Our recommended solution will use an averaged or pooled covariance, just as Pan (2001) did, and scale these values using the corrections originally proposed by Morel (2003).

An advantage of this strategy is that as long as the model-based estimate of variance is a positive definite matrix, the hybrid sandwich estimator will also be positive definite. Starting from Pan’s version of the sandwich estimator in equation (2), these improvements give the Rogers hybrid sandwich estimator the following general form:

(5)

5. Asymptotic Properties

The asymptotic properties of the hybrid estimator follow directly from the properties of Pan’s estimator. As the number of clusters increases, the Pan and Rogers estimators become more similar. To assure the asymptotic validity of his estimator, Pan needed the two assumptions listed in section 3.1 to hold true ⁵.

Our estimator is similar to Pan’s but with an additional inflation factor; as the number of subjects increases they converge to the same values. This can be demonstrated with the following limit:

Limit 1. As the number of subjects goes to infinity the following will hold true:

If assumption 2 does not hold, then, as Pan recommended, subjects can be classified into groups such that the subjects within each group have the same correlation matrix. Therefore, as the sample size increases and the marginal variance of $Y_{ij}$ is modeled correctly, we expect the values of the Pan and Rogers sandwich estimators to become more similar. If assumptions 1 and 2 hold, then with a large enough sample size we expect $\hat{\beta} - \beta$ to be asymptotically multivariate Gaussian with zero mean under the Pan and Rogers methodologies as well, with the covariance matrix estimated by the corresponding sandwich estimator. In addition to these similarities, if the sample size and prevalence are both increased, we expect to see a convergence of similar values and performance in coverage probabilities from all of the sandwich estimators.

6. Simulation Studies

Due to the asymptotic nature of the sandwich estimators, simulations were conducted to assess their performance under varying small sample and rare event conditions. The sandwich estimators compared included the traditional Liang-Zeger, Mancl and DeRouen, Pan, Morel, a version of Morel inflated by the determinant rather than the trace, and the Rogers sandwich estimators. A model with one continuous covariate was used for the simulation study. The number of clusters, prevalence, and correlation of the outcome variable were varied.

The single covariate model was fit on a series of simulated datasets with outcomes of differing prevalence values (0.01, 0.05, 0.10, 0.30, 0.50). The simulations were run with sample sizes of 500, 100, 50, 30, and 20 subjects. The various estimators’ performance was also compared when the simulated within-cluster correlation structure was either exchangeable or autoregressive, with correlation set at 0.005 or 0.05, and when observations within clusters were simulated to be independent. All simulations involve balanced designs with four observations per subject. These correlation values were selected based on the relationship between the prevalence of the outcome and the correlation among longitudinal measures. That is, the probability of the outcome restricts the range of possible correlation values ³. Due to this relationship between the prevalence and correlation, it was not practical to simulate all combinations of prevalence values and correlations.
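The restriction that outcome probabilities place on the correlation can be made concrete with the standard bounds on the correlation between two Bernoulli variables. The Python sketch below is only an illustration of that general fact; the function name binary_corr_bounds is hypothetical, and the article does not state which bounds were checked in the simulations.

```python
import math

def binary_corr_bounds(p1, p2):
    """Feasible range of the correlation between two Bernoulli variables.

    p1 and p2 are the two success probabilities, each strictly between 0 and 1.
    """
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    q1, q2 = 1.0 - p1, 1.0 - p2
    lower = max(-math.sqrt(p1 * p2 / (q1 * q2)),
                -math.sqrt(q1 * q2 / (p1 * p2)))
    upper = min(math.sqrt(p1 * q2 / (p2 * q1)),
                math.sqrt(p2 * q1 / (p1 * q2)))
    return lower, upper
```

For example, binary_corr_bounds(0.01, 0.05) gives an upper bound well below 1, so when the covariate makes the success probability vary across the repeated measurements of a rare outcome, large positive correlations are not attainable.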

Simulated correlated binary data were generated with the binarySimCLF R library, which is based on the work by Qaqish ¹¹. The correlations were kept low due to the difficulty of generating large numbers of valid outcome vectors with the binarySimCLF library in small sample size and low outcome prevalence conditions. That is to say, as the sample size and outcome prevalence decreased, binarySimCLF produced a large number of vectors that failed its own validity check. In all simulations, the covariance structure was correctly specified. The total number of configurations, as well as the number of simulations, is summarized in Table 2. Each simulation configuration was run 1,000 times, reporting the average of the sandwich estimator undergoing testing.

Table 2. Simulation design settings for each of the six estimators

The true values for the intercept and regression coefficient were both set to one for all tests. This model consisted of a single, normally distributed covariate with a variance of one centered at a mean appropriate to the simulated prevalence of the outcome. The relationship between the outcome prevalence and continuous covariate is given by:

The standard deviation of the estimated regression coefficients and their average estimated standard error were calculated and recorded. The performance of each sandwich estimator was assessed primarily by the 95% coverage probabilities for the regression coefficients.
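One way to center a unit-variance normal covariate so that a target marginal prevalence is reached can be sketched as follows. This Python example rests on assumptions that are not confirmed by the text: a logit link, intercept and slope both equal to one, Gauss-Hermite quadrature for the expectation, and bisection for the root. The function names marginal_prevalence and covariate_mean_for are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Gauss-Hermite nodes and weights for integrals against exp(-t^2).
_NODES, _WEIGHTS = hermgauss(80)

def marginal_prevalence(mean, beta0=1.0, beta1=1.0, sd=1.0):
    """E[expit(beta0 + beta1 * x)] for x ~ Normal(mean, sd^2)."""
    x = mean + sd * np.sqrt(2.0) * _NODES
    probs = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    return float(np.sum(_WEIGHTS * probs) / np.sqrt(np.pi))

def covariate_mean_for(prevalence, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection for the covariate mean that yields the target prevalence.

    Relies on the prevalence being increasing in the mean (beta1 > 0).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if marginal_prevalence(mid) < prevalence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Under these assumptions a 50% prevalence corresponds to a covariate mean of -1, because the linear predictor is then symmetric about zero; rarer outcomes push the mean further into the negative range.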

6.1. Demonstration of Bias as a Poor Performance Measure

A measure of performance usually used in evaluating a new statistic is the bias: the difference between the estimator’s true variance and its mean estimated variance. Estimators with a positive bias have underestimated the true variance, while those with a negative bias have overestimated the true variance.

Coverage probabilities, which are related to confidence intervals, are an alternative way of assessing performance. These confidence intervals are centered around the estimated regression coefficients, which are the same for each covariance estimation method in this simulation study. Therefore, the coverage probability in this study is only a function of the estimated variance.
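The coverage calculation can be sketched in Python as follows. The sketch assumes Wald intervals built from a normal quantile, which the text does not state explicitly; the function name wald_coverage is hypothetical.

```python
import numpy as np

def wald_coverage(estimates, variances, true_value, z=1.959964):
    """Share of simulated Wald intervals that contain the true coefficient.

    estimates and variances hold one entry per simulated dataset; z is the
    normal quantile for a 95% interval (an assumption of this sketch).
    """
    estimates = np.asarray(estimates, dtype=float)
    half_width = z * np.sqrt(np.asarray(variances, dtype=float))
    covered = (estimates - half_width <= true_value) & \
              (true_value <= estimates + half_width)
    return float(covered.mean())
```

Because the point estimates are shared by all covariance estimators, passing the same estimates with the variance column of each sandwich estimator in turn isolates the effect of the variance estimator on coverage.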

The simulation environment was designed to reproduce coverage probabilities analogous to a 95% confidence interval. Inspection of the completed simulations showed that the distributions of the variance estimates created by each sandwich estimator in small samples were skewed. For all covariance structures, as the simulated prevalence and sample size diminish, the distribution of the variances for each sandwich estimator becomes steeper on the lower end and right-skewed, each to a different degree. The implication is that the distribution of the variances is no longer symmetric, and the mean is no longer in the center of the distribution under these extreme conditions. These differences are so great that measures of bias are not adequate performance indicators; therefore, coverage probabilities will be reported as the performance measure.

6.2. Coverage Probabilities

The coverage probability results are only shown for the autoregressive covariance structure, as the results were similar among the three types of correlation. Coverage probabilities were similar between the regression coefficient and the intercept; therefore, figures are only shown for the regression coefficient. A composite figure is used for outcome prevalence values of 5% through 50% under a 0.005 correlation. A single graph is dedicated to the 1% prevalence level to highlight the differences that occur under these extreme conditions. When the correlation increases to 0.05, a composite figure is again used to display prevalence values of 10%, 30%, and 50%. Coverage probabilities are displayed in Figures 1 through 3 for the autoregressive covariance structure.

Figure 1. Coverage probabilities when estimating the regression coefficient under a simulated autoregressive covariance structure for 0.05 through 0.5 prevalence values with 0.005 correlation

The coverage probability of our estimator for the regression coefficient (Figure 1) is very competitive with that of Pan’s. These two estimators outperform the remaining estimators at 20 and 30 subjects under a 5% prevalence. At 10%, 30%, and 50% prevalence values, the performances of the estimators begin to cluster and converge with increasing sample size, while the Liang-Zeger estimator lags behind the rest at fewer than 50 subjects.

Figure 2. Coverage probabilities when estimating the regression coefficient under a simulated autoregressive covariance structure for 0.01 prevalence with 0.005 correlation

The Rogers and Pan estimators are very close in their performance in terms of coverage probabilities under a 1% outcome prevalence, with the Rogers estimator slightly edging out Pan at the smaller sample sizes (Figure 2). The remaining estimators’ performance is poor at 20 and 30 subjects but steadily improves as the sample size increases. At simulated sample sizes of 500 subjects, the estimators have converged to roughly the same coverage probabilities.

Figure 3. Coverage probabilities when estimating the regression coefficient under a simulated autoregressive covariance structure for 0.1 through 0.5 prevalence values with 0.05 correlation

When the simulated correlation is 0.05, a similar trend can be seen for the regression coefficient (Figure 3). As the sample size and prevalence decrease, the estimators begin to diverge from one another in their performance. Typically, the Liang-Zeger estimator lags behind the others as it underestimates the variance, which decreases its coverage probability. At an outcome prevalence of 10%, our proposed estimator performs better than the other estimators. It is interesting to note that when a simulated outcome prevalence as low as 10% is coupled with a sample size of 100, all the sandwich estimators underestimate the true variance. As the outcome prevalence increases to 50%, our estimator slightly overestimates the variance at sample sizes of 50 subjects or fewer.

7. Practical Application

In this section, we demonstrate the application of our sandwich estimator in a practical setting. The two datasets used are random samples of size 500 and 30 airmen, sampled independently, from the Federal Aviation Administration’s Decision Support Systems (DSS) and constructed as a longitudinal dataset, as described by Peterman ¹². Airmen undergo a flight physical from an Aviation Medical Examiner (AME) and must meet certain physical requirements to hold a Class I, II, or III medical certificate. The random samples taken from the DSS were restricted to airmen who took a Class I, II, or III flight physical in each of the four years over 2002-2005.

The outcome of interest, expressed as a binary variable, concerns the occurrence of an Accident and Incident Data System (AIDS) event, which can include anything from a major aircraft accident to a minor incident with only slight damage. The covariate of interest, a continuous variable, indicates the number of flight hours over the last six months self-reported by the airmen at the time of their last medical exam. The results should give us insight as to whether the number of accumulated flight hours over the last six months is associated with the occurrence of an AIDS event. The question of interest is represented in equation (6), where $Y_{ij}$ represents a binary outcome for airman $i$ at examination $j$, with a one and zero indicating the occurrence or lack of an AIDS event, respectively.

$$\operatorname{logit}\{P(Y_{ij} = 1)\} = \beta_0 + \beta_1\,\text{FlightHours}_{ij} \qquad (6)$$

The outcome’s prevalence, slightly under 0.5%, is lower than the values investigated in our simulation study, which were 1% or higher. Among all years for the sample of 500 pilots, the median flight time was 26.62 hours (interquartile range: 24.43-29.16 hours). In the sample of 500 subjects, one subject reported a flight time of 20,750 hours. This outlier was omitted and imputed as the average of the subject’s three previously reported flight-hour values for the previous six months (300 hours).

An autoregressive correlation structure reflects correlation decay with increasing intervals of time between measurements. Use of an autoregressive structure was reasonable, given the design of the study. The analytical results for the 500 and 30 subjects are displayed in Table 3 and Table 4, respectively.
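The difference between the autoregressive and exchangeable (compound symmetry) working correlation structures can be seen by building both matrices for four repeated measurements. The Python sketch below assumes NumPy; the function names ar1_corr and exchangeable_corr are hypothetical helpers, not part of any package used in the article.

```python
import numpy as np

def ar1_corr(n, rho):
    """AR(1) correlation matrix: entry (j, k) equals rho ** |j - k|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def exchangeable_corr(n, rho):
    """Exchangeable correlation matrix: rho off the diagonal, 1 on it."""
    return np.full((n, n), rho) + (1.0 - rho) * np.eye(n)
```

With four yearly examinations and rho = 0.05, the AR(1) matrix assigns a correlation of 0.05 to adjacent years and 0.05 cubed to the first and last year, whereas the exchangeable matrix assigns 0.05 to every pair.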

Table 3. Estimated regression coefficients, odds ratios (OR), 95% confidence intervals (CI) and sandwich estimators from analysis of AIDS events in self-reported 100 flight hour changes for last six months for 500 airmen

Table 4. Estimated regression coefficients, odds ratios (OR), 95% confidence intervals (CI) and sandwich estimators from analysis of AIDS events in self-reported 100 flight hour changes for last six months for 30 airmen

In our analysis of 500 subjects, the differences among the variances of the sandwich estimators for the covariate of interest from equation (6) are very small. This is not surprising, as the simulation results showed that the sandwich estimators’ coverage probabilities converge to the same values in large sample sizes, even with outcome prevalence values as low as 1%. When we analyze the sample of 30 subjects, the variability among the sandwich estimators’ variances is much larger, as reflected in their values differing from one another by several orders of magnitude. When performing statistical hypothesis testing in a situation where the outcome is of low prevalence and the sample size is small, the choice of sandwich estimator affects the outcome of hypothesis testing concerning the regression coefficients. The estimated odds ratio for a 100-hour increase of flight time in the sample of 30 subjects is 1.1664. The associated 95% confidence intervals for the Liang-Zeger and Rogers sandwich estimators are (1.0374, 1.3120) and (0.6016, 2.2625), respectively. The Liang-Zeger interval excludes an odds ratio of 1 while the Rogers interval includes it, so for the purposes of this question the choice between the Liang-Zeger and Rogers sandwich estimators changed the statistical significance of the covariate of interest.
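The practical effect of the two intervals can be checked with a short Python sketch. It assumes the intervals are Wald intervals that are symmetric on the log odds ratio scale, which the text does not state, so the implied standard errors are approximations; the function names are hypothetical.

```python
import math

def implied_log_se(ci_low, ci_high, z=1.959964):
    """Standard error on the log odds ratio scale implied by a 95% Wald interval."""
    return (math.log(ci_high) - math.log(ci_low)) / (2.0 * z)

def excludes_null(ci_low, ci_high, null=1.0):
    """True when the interval does not contain the null odds ratio."""
    return not (ci_low <= null <= ci_high)

# Intervals reported in the text for the sample of 30 airmen.
lz_interval = (1.0374, 1.3120)
rogers_interval = (0.6016, 2.2625)

se_lz = implied_log_se(*lz_interval)
se_rogers = implied_log_se(*rogers_interval)
```

Under that assumption the implied standard errors are about 0.06 for the Liang-Zeger interval and about 0.34 for the Rogers interval, and excludes_null returns True for the former and False for the latter.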

8. Conclusion

This research explored a novel way of building a hybrid sandwich estimator that would achieve superior performance over that of the standard Liang-Zeger sandwich estimator in settings with low outcome prevalence and reduced sample sizes. The performance of this estimator was also compared with other sandwich estimators adjusted for improved performance in small sample sizes. As the outcome prevalence dropped below 30% and the sample size below that of 50 subjects, the choice of estimators matters, and one should consider using an alternative to the Liang-Zeger estimator. In our limited simulation settings, the Rogers sandwich estimator outperformed the Liang-Zeger and typically outperformed all other estimators as the prevalence and sample size both dropped. The Rogers estimator is an extension of the Pan estimator, which also performed very well in these simulations. The performance of the Rogers estimator is dependent on the determinant calculated in the inflation factor. It is possible that the performance of the Rogers estimator may be inferior in comparison to the Pan estimator under different correlation settings. The performance of the Mancl and DeRouen sandwich estimator deteriorated to coverage probabilities only slightly better than that of the Liang-Zeger in prevalence values of 1% and 5% in sample sizes of 20 and 30 subjects. The Morel sandwich estimators, at the 1% outcome prevalence level, performed better than that of Mancl and DeRouen but not as well as the Pan or Rogers’ estimators. Overall, it is wise to select any of these other estimators, if available, over the Liang-Zeger in a situation involving low sample size or low outcome prevalence.

The true or simulated covariance structure had little bearing on the estimators’ performance. All of the sandwich estimators performed similarly across the three different covariance structures. This result was also observed by Mancl and DeRouen ⁴. It is likely that the simulated covariance structure played no role in the estimators’ performance due to the low correlation values used in the simulations. The correlations were kept low due to the difficulty of generating large numbers of valid outcome vectors with the binarySimCLF library in small sample size and low outcome prevalence conditions. It is possible that the performance of the sandwich estimators may differ under simulation conditions using greater correlation values than were used in this project.

The coverage probabilities observed in our simulations for the Liang-Zeger and Pan sandwich estimators under both the independence and compound symmetry structures were similar to those reported by Pan ⁵. The results of Gunsolley et al.’s 1995 simulation study exploring the performance of the Liang-Zeger sandwich estimator were similar to ours: as the outcome prevalence or sample size increased, the performance of the Liang-Zeger estimator improved as well, in terms of Type I error rates ⁷.

In summary, the performance of the Liang-Zeger sandwich estimator suffers when the sample size drops below 50 subjects and the outcome prevalence is less than 30%. This drop-off in performance is further exacerbated at the lower outcome prevalence values and smaller sample sizes. Under these extreme conditions, the Rogers and Pan estimators would be good choices for variance estimators, followed by either of the two estimators proposed by Morel. The Mancl and DeRouen estimator outperformed the Liang-Zeger estimator under all outcome prevalence values as the sample size dropped below 50 subjects. With outcome prevalence values of 30% or higher and sample sizes less than 50 subjects, the Liang-Zeger estimator still consistently underestimated the coefficient variances even in these nominal conditions.

Future work will evaluate the performance of the various sandwich estimators with higher correlations in moderate sample sizes. We will also include different numbers and types of covariates in the assessment of sandwich estimator performance.

Acknowledgements

Partial funding was provided by the National Institutes of Health, National Institute of General Medical Sciences [grant 1 U54GM104938].

Competing Interest

The authors declare no competing financial interests.

References

[1]	Liang, K.-Y. and S.L. Zeger, “Longitudinal Data Analysis Using Generalized Linear Models”, Biometrika, 1986. 73(1): p. 13-22.

[2]	Carroll, R.J., Wang, S., Simpson, D. G., Stromberg, A. J., and Ruppert, D., The Sandwich (Robust Covariance Matrix) Estimator. 1998; Available from: https://www.stat.tamu.edu/ftp/pub/rjcarroll/sandwich.pdf.

[3]	Fitzmaurice, G.M., N.M. Laird, and J.H. Ware, Applied Longitudinal Analysis. Wiley series in probability and statistics. 2004, Hoboken, N.J.: Wiley-Interscience. 506 p.

[4]	Mancl, L.A. and T.A. DeRouen, “A Covariance Estimator for GEE with Improved Small-Sample Properties”, Biometrics, 2001. 57(1): p. 126-134.

[5]	Pan, W., “On the Robust Variance Estimator in Generalised Estimating Equations”, Biometrika, 2001. 88(3): p. 901-906.

[6]	King, G. and L. Zeng, “Logistic Regression in Rare Events Data”, Political Analysis, 2001. 9: p. 137-163.

[7]	Gunsolley, J.C., C. Getchell, and V.M. Chinchilli, “Small Sample Characteristics of Generalized Estimating Equations”, Communications in Statistics: Simulation and Computation, 1995. 24: p. 869-878.

[8]	Morel, J.G., “Logistic Regression Under Complex Survey Designs”, Survey Methodology, 1989. 15(2): p. 203-223.

[9]	Morel, J.G., M.C. Bokossa, and N.K. Neerchal, “Small Sample Correction for the Variance of GEE Estimators”, Biometrical Journal, 2003. 45(4): p. 395-409.

[10]	Johnson, R.A. and D.W. Wichern, Applied Multivariate Statistical Analysis. 6th ed. 2007, Upper Saddle River, N.J.: Pearson Prentice Hall. 773 p.

[11]	Qaqish, B.F., “A Family of Multivariate Binary Distributions for Simulating Correlated Binary Variables with Specified Marginal Means and Correlations”, Biometrika, 2003. 92: p. 455-463.

[12]	Peterman, C.L., Rogers, P. B., Veronneau, S. J. H., and Whinnery, J. E., “Development of an Aeromedical Scientific Information System for Aviation Safety”, Office of Aerospace Medicine, 2008. Report No. DOT/FAA/AM-08/01.

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

Cite this article:

Paul Rogers, Julie Stoner. Modification of the Sandwich Estimator in Generalized Estimating Equations with Correlated Binary Outcomes in Rare Event and Small Sample Settings. American Journal of Applied Mathematics and Statistics. Vol. 3, No. 6, 2015, pp 243-251. https://pubs.sciepub.com/ajams/3/6/5


Figure 1. Coverage probabilities when estimating the regression coefficient under a simulated autoregressive covariance structure for 0.05 through 0.5 prevalence values with 0.005 correlation

Figure 2. Coverage probabilities when estimating the regression coefficient under a simulated autoregressive covariance structure for 0.01 prevalence with 0.005 correlation

Figure 3. Coverage probabilities when estimating the regression coefficient under a simulated autoregressive covariance structure for 0.1 through 0.5 prevalence values with 0.05 correlation

Table 1. Summary of sandwich estimators

Table 2. Simulation design settings for each of the six estimators

Table 3. Estimated regression coefficients, odds ratios (OR), 95% confidence intervals (CI) and sandwich estimators from analysis of AIDS events in self-reported 100 flight hour changes for last six months for 500 airmen

Table 4. Estimated regression coefficients, odds ratios (OR), 95% confidence intervals (CI) and sandwich estimators from analysis of AIDS events in self-reported 100 flight hour changes for last six months for 300 airmen

