Modification of the Sandwich Estimator in Generalized Estimating Equations with Correlated Binary Outcomes in Rare Event and Small Sample Settings

Regression models for correlated binary outcomes are commonly fit using a Generalized Estimating Equations (GEE) methodology. GEE uses the Liang and Zeger sandwich estimator to produce unbiased standard error estimators for regression coefficients in large sample settings even when the covariance structure is misspecified. The sandwich estimator performs optimally in balanced designs when the number of participants is large, and there are few repeated measurements. The sandwich estimator is not without drawbacks; its asymptotic properties do not hold in small sample settings. In these situations, the sandwich estimator is biased downwards, underestimating the variances. In this project, a modified form for the sandwich estimator is proposed to correct this deficiency. The performance of this new sandwich estimator is compared to the traditional Liang and Zeger estimator as well as alternative forms proposed by Morel, Pan and Mancl and DeRouen. The performance of each estimator was assessed with 95% coverage probabilities for the regression coefficient estimators using simulated data under various combinations of sample sizes and outcome prevalence values with an Independence (IND), Autoregressive (AR) and Compound Symmetry (CS) correlation structure. This research is motivated by investigations involving rare-event outcomes in aviation data.


Introduction
Regression models with binary outcome variables are prevalent in all research disciplines. If the data are independent, then the covariance between two measured values, which is a measure of linear dependence, is zero. If the data are dependent, Generalized Estimating Equations (GEE) can be used to account for the correlation, which is a function of the covariance among repeated or clustered measurements [1]. The GEE framework contains options for the working covariance structure based upon the assumed pattern of correlation within the data. One of the strengths of using GEE is that the sandwich or robust variance estimator produces unbiased standard errors in large sample sizes for the regression coefficients even when the covariance structure is misspecified. This is a tremendous advantage, but the sandwich estimator of variance is not without drawbacks.
It is well known that the GEE methodology has issues with small sample sizes due to the asymptotic properties inherent in the covariance sandwich estimator [2,3]. Fitzmaurice et al. noted that in small or finite sample sizes, Wald tests using the Liang-Zeger sandwich estimator tend to produce p-values that are too small [3]. The sandwich estimator of variance is biased downward; that is, it underestimates the variability of parameter estimators in small sample sizes. Much research has been performed to improve the performance of GEE under these circumstances. This is evidenced in the works of Mancl and DeRouen as well as Pan [4,5]. Rare outcomes pose a problem as well. Even with a large sample size, a rare outcome can be viewed as a small sample problem. That is, the information concerning the event of interest is, by itself, a small sample. Adding records that do not have the outcome of interest gives no additional information to the model. If an event becomes rare enough, it becomes extremely difficult to collect enough information to construct an informative regression model. The problems GEE experiences with finite sample sizes can become exacerbated when coupled with a rare outcome.
Rare events defined as binary outcomes, which have tens of thousands to hundreds of thousands of non-events (zeroes) compared to the outcome of interest (ones), can be a challenge in observational studies or clinical trials. Logistic regression methods for independent data have binary or ordinal outcomes but can produce predicted probabilities that grossly underestimate the true probability of a rare event [6]. At present, very few methods are available for modeling and analyzing longitudinal rare event data. The methods currently available are models based upon the Poisson distribution and are appropriate when the dependent variable involves count data. In the rare event situations, with dependent data, the variance matrix for the regression coefficients of the standard logistic regression model is biased; the estimated variances are smaller than the true variances. Furthermore, Carroll and colleagues discovered that under rare event conditions the use of the sandwich estimator with the logistic regression model produced under coverage of Wald-type tests. In the case of logistic regression using the sandwich estimator, "an important part of sample size considerations is the number of events" [2]. In other words, decreasing sample sizes with rare outcomes can worsen the bias of the sandwich estimator.
The GEE methods are fairly robust and compensate for correlation among repeated measures or clustered data. However, in rare event and finite sample size settings, the variances and covariances generated by these models are underestimated and lead to erroneous inferences. Other investigators have proposed corrections for rare events and finite sample sizes with correlated data but there is no universally agreed upon solution for dealing with these circumstances. These solutions have resulted in alternative sandwich estimators that still have performance issues.
We propose the use of an improved sandwich estimator that has the ability to produce unbiased estimates of variances and covariances in studies of correlated data with rare event and small sample sizes. Our approach will be to adjust the sandwich estimator to compensate for underestimation in these situations. In general, this adjustment is performed by taking an alternate sandwich estimator, developed by Pan, and improving its performance in small sample size and rare event settings by adding an appropriate inflation factor, while still preserving the asymptotic nature of the sandwich estimator. The performance of this improved sandwich estimator will be evaluated with simulated and real-world datasets.

Generalized Estimating Equations and the Sandwich Covariance Estimator
In general, if i Y is a response variable and i X is a covariate of interest for 1,..... , K i = subjects, a regression model can be utilized to describe their relationship. In the case of longitudinal data, j is the index for the number of observations within a given subject. The number of repeated measurements on an individual will be represented as i n with ij n being the measurement at the th j interval for the th i subject. Marginal models are based on quasi-likelihood and are similar in form to the Generalized Linear Model (GLM) in that a link function ( ), g is used to specify a mathematical relationship, involving regression coefficients ( ) β , between a marginal mean response ( ) ij µ , and one or more independent variables ( ) ij X . will be used to represent the partial derivatives of the vector of predicted means with respect to the vector of regression coefficients ( ) β . Then i D is an i n x p matrix of these partial derivatives and appears as follows: indicate the working covariance matrix for these same measurements; i V depends on the correlation structure ( ).
i α R In the GEE method, when the dependent variable comes from the exponential family, the following are the score equations for the regression coefficients ( ' ) s β : Liang and Zeger (1986) demonstrated that as the number of subjects or clusters (K) increased in size, that β is a consistent estimator for .
β That is, as is asymptotically multivariate Gaussian with zero mean and covariance matrix ( ) LZ V estimated as follows.
( )( ) When estimates of β and α are inserted, LZ V is referred to as the empirical-based,or robust sandwich, variance matrix.

Summary of Small-Sample Covariance Estimators
The Liang-Zeger sandwich estimator ( ) LZ V is used frequently in GEE since it produces valid standard errors asymptotically, even if the covariance structure is misspecified. The degree of bias of the sandwich estimator is an asymptotic property that is reduced as the sample size, or number of independent clusters, increases.
The problems caused by rare outcomes relative to the use of GEE models were first noted by Gunsolley while exploring the performance of GEEs with binary outcomes using a compound symmetry covariance structure [7].

Pan Estimator
Pan argued that the covariance calculated within the sandwich estimator is not an optimal estimator of ( ) i Cov Y because it is based on data from the th i subject and is neither efficient nor consistent [5]. Pan proposed an improvement to the sandwich estimator by using a pooled, or averaged, covariance based upon all subjects. This enhancement depends on two assumptions to preserve the asymptotic nature of Pan's estimator: Assumption 1. The marginal variance of ij y needs to be modeled correctly. Assumption 2. There is a common correlation structure across all subjects. In reference to the sandwich estimator proposed by Liang and Zeger in equation (1), Pan proposedreplacing the and u R is a correlation matrix obtained without any parametric specification ( ) α .
Pan's sandwich estimator ( P V ), in matrix notation, can then be written as: Pan claimed this modified sandwich estimator has greater efficiency than that proposed by Liang and Zeger as the covariance is based upon all subjects. The results of his initial simulations using an exchangeable and independence covariance structure with both a binary and continuous outcome variable support this claim [5].

Mancl and DeRouen Estimator
Mancl and DeRouen proposed replacing the covariance of Liang and Zeger's (1986) sandwich estimator ( LZ V ) with one corrected for bias. That is LZ V from equation (1) becomes the bias-corrected sandwich estimator ( MD V ) [4]: where I is an i i n xn identity matrix, N V is the "naïve" or model-based variance estimator and ˆˆT Mancl and DeRouen justify this correction on the grounds that the true expected value is expressed

Morel Estimator
Morel originally explored the covariance matrix estimate in logistic regression in complex survey designs as a product of the application of a Taylor series expansion [8]. These results were later extended to the sandwich estimator used within the GEE framework [9]. They clearly delineated the source of the bias suffered by the sandwich estimator in small samplesizes. It was demonstrated that most software implementations of the sandwich estimator omit the term 1 1 This term is part of the Taylor series estimation of the sandwich estimator. The omission of these terms is less serious when the sample size or number of clusters is large but becomes increasingly significant as the sample size is reduced. Morel (2003) proposed re-introducing these terms to adjust for bias in the sandwich estimator. He also recommended inflating the sandwich estimator by adding a scaled version of the sandwich estimator trace to itself. This concept is unalike those previously proposed in that it applies the adjustment to the entire sandwich estimator. Whereas most adjustments took place inside the calculation of the covariance of the sandwich estimator, this one was applied outside of the summation and not to the individual residuals. Morel's version of the sandwich estimator, adjusted with the trace,is referred to as where: min 0.5, Simulation results supported this modified approach because Type I error rates were nominal, even in small sample sizes, unlike the unmodified GEE and modelbased covariance estimators ( ) A variant of Morel's original estimator was included in this comparative study for evaluation purposes. It is identical to the estimator described in equation (4) but was inflated with the determinant rather than the trace. Morel had originally suggested evaluating the performance of the sandwich estimator inflated with the determinant but had never done so. This paper will be the first to evaluate the performance of this variant of the sandwich estimator.
Incorporating the changes proposed by Pan (2001) and Morel (2003), with some additional modifications, a new hybrid sandwich estimator ( ) R V will be constructed. Building on a fusion of these concepts, we believe a modified GEE estimator can be constructed that delivers accurate probabilities, nominal Type I error rates, and confidence intervals with proper coverage. The performance of this hybrid sandwich estimator is compared to the estimators of Liang and Zeger (1986), Mancl and DeRouen (2001), Morel(2003), and Pan (2001). A summary of the sandwich estimators is given in Table 1.

A New Hybrid Sandwich Estimator
A new hybrid sandwich estimator was created by inflating Pan's estimator with a scaled version of the determinant. The determinant is a physical representation of the area or volume of the variances and covariances of the sandwich estimator [10]. In terms of the volume, the determinant of the sandwich estimator can be expressed as: Our recommended solution will use an averaged or pooled covariance, just as Pan (2001) did, and scale these values using the corrections originally proposed by Morel (2003).
An advantage of this strategy is that as long as the model-based estimate of variance is a positive definite matrix, the hybrid sandwich estimator will also be positive definite. Referencing Pan's version of the sandwich estimator from equation (2)

Asymptotic Properties
The asymptotic properties of the hybrid estimator follow directly from the properties of Pan's estimator. As the number of clusters increases, the Pan and Rogers estimators become more similar. To assure the asymptotic validity of his estimator, Pan needed the two assumptions, listed in section 3.1,to hold true [5].
Our estimator is similar to Pan's but with an additional inflation factor; as the number of subjects increases they converge to the same values. This can be demonstrated with the following limit; Limit 1. As the number of subjects goes to infinity K → ∞ the following will hold true: If assumption 2 does not hold, as Pan recommended, subjects can be classified into groups such that the i Y have the same correlation matrix. Therefore, as the sample size increases and the marginal variance of i Y is modeled correctly we expect the values of the Pan and Rogers sandwich estimators to be more similar. If assumptions 1 and 2 hold then with a large enough sample size we expect the differences in 1/2( ) K − β β to be asymptotically multivariate Gaussian with zero mean and covariance matrix ( ) V under the Pan and Rogers methodologies as well. In addition to these similarities, if the sample size and prevalence are both increased, we expect to see a convergence of similar values and performance in coverage probabilities from all of the sandwich estimators.

Simulation Studies
Due to the asymptotic nature of the sandwich estimators, simulations were conducted to assess their performance under varying small sample and rare event conditions. The sandwich estimators compared included the traditional Liang-Zeger ( ) The single covariate model was fit on a series of simulated datasets with outcomes of differing prevalence values (0.01, 0.05, 0.10, 0.30, 0.50). The simulations were run with data sizesof500, 100, 50, 30,and 20 subjects. The various estimators' performance was also compared when the simulated within-cluster correlation structure was either exchangeable or autoregressive, with correlation set at 0.005 or 0.05, and when observations within clusters were simulated to be independent. All simulations involve balanced designs with four observations per subject. These correlation values were selected based on the relationship between the prevalence of the outcome and the correlation among longitudinal measures. That is, the probability of the outcome restricts the range of possible correlation values [3]. Due to this relationship between the prevalence and correlation, it was not practical to simulate all combinations of prevalence values and correlations.
Simulated correlated binary data were generated with the binary SimCLF R-code library, which is based on the work by Qaqish [11]. The correlations were kept low due to the simulation difficulties in generating large numbers of valid outcome vectors with the binary SimCLF library in small sample size and low outcome prevalence conditions. That is to say, as the sample size and outcome prevalence decreased, the binarySimCLF produced a large number of vectors which failed its own validity check.In all simulations, the covariance structure was correctly specified. The total number of configurations, as well as the number of simulations is summarized in Table 2. Eachsimulation configuration was run 1,000 times, reporting the average of the sandwich estimator undergoing testing. The true values for the intercept ( )

Demonstration of Bias as a Poor Performance Measure
A measure of performance usually used in evaluating a new statistic is the bias; the difference between the estimator's true variance and its mean estimated variance. Estimators with a positive bias have underestimated the true variance, while those with a negative bias have overestimated the true variance.
Coverage probabilities, which are related to confidence intervals, are an alternative way of assessing performance. These confidence intervals are centered around the estimated regression coefficients( ) β , which are the same for each covariance estimation method in this simulation study. Therefore, the coverage probability in this study is only a function of the estimated variance.
The simulation environment was designed to reproduce coverage probabilities analogous to a 95% confidence interval. After completion of the simulations, the distributions of the variance estimates created by each sandwich estimator in small samples were skewed. For all covariance structures, as the simulated prevalence and sample size diminish, the distribution of the variances for each sandwich estimator becomes steeper on the lower end and right-skewed, both to a different degree. The implication is that the distribution of the variances is no longer symmetric, and the mean is no longer in the center of the distribution under these extreme conditions. These differences are so great that measures of bias are not adequate performance indicators and therefore, coverage probabilities will be reported as the performance measure.

Coverage Probabilities
The coverage probability results are only shown for the autoregressive covariance structures as the results were similar among the three types of correlation. Coverage probabilities were similar between 1 β and the intercept 0 ( ) β therefore, figures are only shown for the 1 β regression coefficient. A composite figure is used for outcome prevalence values of 5% through 50% under a .005 correlation. A single graph is dedicated to the 1% prevalence level to highlight the differences that occur under these extreme conditions. When the correlation increases to .05,a composite figure is again used to display prevalence values of 10%, 30%, and 50%. Coverage probabilities are displayed in Figure1 through 3 for the autoregressive covariance structure. The coverage probability of our estimator for the regression coefficient (Figure 1) is very competitive with that of Pan's. These two estimators outperform the remaining estimators at 20 and 30 subjects under a 5% prevalence.
At 10%, 30%, and 50% prevalence values, the performance of the estimators begin to cluster and converge with increasing sample size, while the Liang-Zeger estimator lags behind the rest at fewer than 50 subjects.  When the simulated correlation is .05, a similar trend can be seen for the regression coefficient ( Figure 3). As the sample size and prevalence decrease, the estimators begin to diverge from one another in their performance. Typically, the Liang-Zeger estimator lags behind the others as it underestimates the variance, which decreases its coverage probability. At an outcome prevalence of 10%, our proposed estimator performs better than the other estimators. It is interesting to note that when a simulated outcome prevalence is as low as 10% is coupled with a sample size of 100, all the sandwich estimators underestimate the true variance. As the outcome prevalence increases to 50%, our estimator slightly overestimates the variance at sample sizes of 50 subjects or fewer.

Practical Application
In this section, we demonstrate the application of our sandwich estimator in a practical setting. The two datasets used are random samples of size 500 and 30 airmen, sampled independently, from the Federal Aviation Administration's Decision Support Systems (DSS) and constructed as a longitudinal dataset, as described by Peterman [12]. Airmen undergo a flight physical from an Aviation Medical Examiner (AME) and must meet certain physical requirements to hold a Class I, II, or III medical certificate. The random samples taken from the DSS were restricted to airmen who took a Class I, II, or III flight physical in each of the four years over 2002-2005. The outcome of interest, expressed as a binary variable, concerns the occurrence of an Accident and Incident Data System (AIDS) event, which can include anything from a major aircraft accident to a minor incident with only slight damage. The covariate of interest, a continuous variable, indicates the number of flight hours over the last six months self-reported by the airmen at the time of their last medical exam. The results should give us insight as to whether the number of accumulated flight hours over the last six months is associated with the occurrence of an AIDS event. The question of interest is represented in equation (6), where Y represents a binary outcome, with a one and zero indicating the occurrence or lack of an AIDS event, respectively.
The outcome's prevalence, slightly under 0.5%, is lower than those investigated in our simulation study which was 1% or higher. Among all years for the sample of 500 pilots, the median flight time was 26.62 hours (inter quartile range: 24.43-29.16 hours). In the sample size of 500 subjects, one subject reporteda flight time of 20,750 hours. This outlier was omitted and imputed as the average of the three previously reported flight hours for the previous six months (300 hours).
An autoregressive correlation structure reflects correlation decay with increasing intervals of time between measurements. Use of an autoregressive structure was reasonable, given the design of the study. The analytical results for the 500 and 30 subjects are displayed in Table 3 and Table 4, respectively. In our analysis of 500 subjects, the differences among the variances of the sandwich estimators for the covariate of interest 1 ( ) β from equation (6)

Conclusion
This research explored a novel way of building a hybrid sandwich estimator that would achieve superior performance over that of the standard Liang-Zeger sandwich estimator in settings with low outcome prevalence and reduced sample sizes. The performance of this estimator was also compared with other sandwich estimators adjusted for improved performance in small sample sizes. As the outcome prevalence dropped below 30% and the sample size below that of 50 subjects, the choice of estimators matters, and one should consider using an alternative to the Liang-Zeger estimator. In our limited simulation settings, the Rogers sandwich estimator outperformed the Liang-Zeger and typically outperformed all other estimators as the prevalence and sample size both dropped. The Rogers estimator is an extension of the Pan estimator, which also performed very well in these simulations. The performance of the Rogers estimator is dependent on the determinant calculated in the inflation factor. It is possible that the performance of the Rogers estimator may be inferior in comparison to the Pan estimator under different correlation settings. The performance of the Mancl and DeRouen sandwich estimator deteriorated to coverage probabilities only slightly better than that of the Liang-Zeger in prevalence values of 1% and 5% in sample sizes of 20 and 30 subjects. The Morel sandwich estimators, at the 1% outcome prevalence level, performed better than that of Mancl and DeRouen but not as well as the Pan or Rogers' estimators. Overall, it is wise to select any of these other estimators, if available, over the Liang-Zeger in a situation involving low sample size or low outcome prevalence.
The true or simulated covariance structure had little bearing on the estimators' performance. Mirrored performances were observed by all of the sandwich estimators among the three different covariance structures. This result was also observed by Mancl and DeRouen [4]. It is likely that the simulated covariance structure played no role in the estimators' performance due to the low correlation values used in the simulations. The correlations were kept low due to the simulation difficulties in generating large numbers of valid outcome vectors with the binary SimCLF library in small sample size and low outcome prevalence conditions. It is possible that the performance of the sandwich estimators may differ under simulation conditions using greater correlation values than were used in this project.
A similar performance was observed in our simulations to that of Pan's, in terms of coverage probabilities, for the Liang-Zeger and Pan sandwich estimators under both the independence and compound symmetry structures [5]. The results of Gunsolley et al.'s 1995 simulation study exploring the performance of the Liang-Zeger sandwich estimator were similar to ours: as the outcome prevalence or sample size increased, the performance of the Liang-Zeger improved, as well in terms of Type I error rates [7].
In summary, the performance of the Liang-Zeger sandwich estimator suffers as the sample sizes dropped below 50 subjects, and the outcome prevalence values were less than 30%. This drop off in performance is further exacerbated at the lower outcome prevalence values and smaller sample sizes. Under these extreme conditions, the Rogers and Pan estimators would be good choices for variance estimators followed by any of the two estimators proposed by Morel. The Mancl and DeRouen estimator outperformed the Liang-Zeger estimator under all outcome prevalence values as the sample size dropped below 50 subjects. With outcome prevalence values of 30% or higher and sample sizes less than 50 subjects, the Liang-Zeger estimator still consistently underestimated the coefficient variances even in these nominal conditions. Future work will be conducted to evaluate the performances of the various sandwich estimators with higher correlations in moderate sample sizes. We will also include different numbers and types of covariates in the assessment of sandwich estimator performances.