Deconvolving Kernel Density Estimation of Right Censored Duration Data with Recall Errors


2.1. Recall Error and Its Characteristics

As mentioned earlier, retrospectively reported data have the drawback that people may not accurately report when events occurred ^{[15, 18]}. It is a stylized fact that, when people report the timing of events that happened in the distant past, they tend to round the year, or the time since the event, up or down. Consequently, events tend to be “heaped” on multiples of chronological or calendar units (e.g. on multiples of five or ten for data recorded in years). In the case of retrospectively reported ages at weaning, the data commonly display marked heaping at durations of 6, 12, 18 and 24 months. We illustrate this in Figure 1 by plotting, against age in months, the proportion of births in each of the two status categories: children who are weaned and children who are still breastfed.

Figure 1 demonstrates that heaping of ages at weaning is pronounced only for those children who are already weaned at the time of interview. At the same time, we also have the following hypothesis to be validated from the sample at hand:

H₀₁: Heaping of weaning times at multiples of 6 months is genuine and is due to existing social norms.

Secondly, people are notoriously poor at recalling events and the timing of events ^[11]. In general, the greater the time lag between an event and its recall, the more errors occur. Thus, the second hypothesis to be validated is as follows:

H₀₂: For the weaned children, the greater the time lag between weaning and its recall, the more pronounced the heaping.

Figure 1. Histograms of Weaning durations for weaned and currently breastfeeding children


2.1.1. Is Heaping a Result of True Behaviour?

If heaping results from true behavior, such as a social norm of weaning at ages that are multiples of 6 months (denoted T6), we would expect to find (i) substantially more mass at T6 than at T6-1 or T6+1, and (ii) significant differences between the masses at T6-1 and T6+1 months. Table 2 below provides the probabilities of weaning at ages that are multiples of 6 months, along with those at the preceding and succeeding months. It also reports the p-values for testing the hypothesis H₀: p_I[T6-1] = p_I[T6+1].

Table 2. Probability of weaning at ages multiple of 6 months


The results show no strong evidence against the equality of the proportions of weaning at ages T6-1 and T6+1. Moreover, if most children were truly weaned at ages T6, i.e. at multiples of 6 months, we should have observed p_I[T6-1] > p_I[T6+1], which is not the case in the present sample.
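As an illustration of how equality of the masses at T6-1 and T6+1 might be tested, the sketch below treats the two counts as cells of a single multinomial sample: conditional on their total m, the count at T6-1 is Binomial(m, 1/2) under H₀. This is only one possible test and is not necessarily the one behind Table 2; the function name `equal_cell_test` and the counts are hypothetical.

```python
from math import comb

def equal_cell_test(count_before, count_after):
    """Exact two-sided test of H0: p[T6-1] == p[T6+1].

    Conditional on m = count_before + count_after, the first count is
    Binomial(m, 1/2) under H0, so the p-value is twice the smaller tail.
    """
    m = count_before + count_after
    k = min(count_before, count_after)
    tail = sum(comb(m, j) for j in range(k + 1)) / 2 ** m
    return min(1.0, 2.0 * tail)

# Hypothetical counts of children reported weaned at 11 and at 13 months.
p_value = equal_cell_test(40, 52)
print(f"two-sided p-value: {p_value:.3f}")
```

A large p-value here would be read the same way as in the text: no strong evidence that the masses on either side of the heaping point differ.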

2.1.2. Age Dependent Heaping of Weaning Times

As before, let T6 denote an age that is a multiple of 6 months, and let I[T6] be an indicator that the reported weaning age is a multiple of 6 months, so that I[T6] = 1 if the reported weaning time is a multiple of 6 months and 0 otherwise. Our objective here is to test the hypothesis H₀₂ stated earlier. To carry this out formally, we fit the logistic regression model

$$\operatorname{logit} P(I[T6] = 1) = \beta_0 + \beta_1 \cdot \text{currentAge}$$

where ‘currentAge’ is the proxy covariate for the time lag between the timing of weaning and the date of interview at which the event is recalled. The results of the test

$$H_0: \beta_1 = 0$$

are provided in Table 3. We find that the effect of this proxy covariate is highly significant.
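A logistic regression of the heaping indicator on current age, with a Wald test of the slope, can be sketched as follows. The data are simulated and the coefficients (-2.0, 0.05) are arbitrary assumptions chosen only to make the example run; `fit_logistic` is a hypothetical helper, not the software used for Table 3.

```python
import math
import random

def fit_logistic(x, y, iterations=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; return (b0, b1, se_b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g for the 2x2 information matrix.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)  # from the inverse information at (near) convergence
    return b0, b1, se_b1

# Simulated stand-in data: current age in months and a heaping indicator.
rng = random.Random(0)
current_age = [rng.uniform(0, 59) for _ in range(2000)]
heaped = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(-2.0 + 0.05 * a))) else 0
          for a in current_age]

b0, b1, se_b1 = fit_logistic(current_age, heaped)
wald_z = b1 / se_b1
p_value = math.erfc(abs(wald_z) / math.sqrt(2.0))  # two-sided normal p-value
print(f"b1 = {b1:.4f}, se = {se_b1:.4f}, z = {wald_z:.2f}, p = {p_value:.3g}")
```

A positive, significant slope would indicate that the odds of reporting a multiple of 6 months increase with the time elapsed since weaning.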

Table 3. Estimate of the effect of recall age on heaping


3. Density Function Estimators

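We consider a nonparametric procedure for estimating the probability S(x) of surviving to time x, using a random sample X₁, X₂, …, X_n of death times from a distribution F(x). The X_i are censored on the right by random variables C_i, so that one observes only Y_i = min(X_i, C_i), i = 1, …, n. The C_i constitute a random sample, drawn independently of the X_i, from a distribution G(c). We let Y_(1) ≤ … ≤ Y_(n) denote the ordered observations, and let δ_(i) be an indicator for the event that Y_(i) is uncensored. Kaplan and Meier ^[9] developed the nonparametric estimator of S(x) as

$$\hat{S}(x) = \prod_{i:\, Y_{(i)} \le x} \left(1 - \frac{d_i}{r_i}\right) \qquad (3.1)$$

where r_i = # alive at time Y_(i)-, d_i = # died at time Y_(i).

Greenwood’s formula for the variance of the estimated survival function is

$$\widehat{\mathrm{Var}}\big[\hat{S}(x)\big] = \hat{S}(x)^2 \sum_{i:\, Y_{(i)} \le x} \frac{d_i}{r_i (r_i - d_i)} \qquad (3.2)$$

The endpoints of a 100(1-α)% confidence interval for S(x) on the cumulative hazard or log-survival scale are given by

$$\hat{S}(x) \exp\left\{ \pm z_{\alpha/2} \left[ \sum_{i:\, Y_{(i)} \le x} \frac{d_i}{r_i (r_i - d_i)} \right]^{1/2} \right\} \qquad (3.3)$$

where z_{α/2} is the upper α/2 quantile of the standard normal distribution.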

Figure 2 shows the Kaplan-Meier and Nelson-Aalen estimates of the survival function. Jumps in these estimates can be observed at multiples of 6 months, as discussed earlier.
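The product-limit estimate, Greenwood's standard error and a log-scale confidence interval can be computed along the following lines. This is a minimal sketch on made-up durations (in months); `kaplan_meier` is a hypothetical helper, and the fixed value 1.96 assumes a 95% interval.

```python
import math

def kaplan_meier(times, events, z=1.96):
    """Return rows (t, S, se, lower, upper) at each distinct event time.

    events[i] is 1 if times[i] is an observed event and 0 if it is censored.
    """
    data = sorted(zip(times, events))
    n = len(data)
    surv, gw_sum = 1.0, 0.0
    rows = []
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i                      # r_i: number alive just before t
        deaths, j = 0, i
        while j < n and data[j][0] == t:     # gather ties at time t
            deaths += data[j][1]             # d_i: number of events at t
            j += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            if deaths == at_risk:
                # Everyone still at risk fails: S = 0, Greenwood's sum is undefined.
                rows.append((t, 0.0, float("nan"), 0.0, 0.0))
                break
            gw_sum += deaths / (at_risk * (at_risk - deaths))
            half_width = z * math.sqrt(gw_sum)
            rows.append((t, surv, surv * math.sqrt(gw_sum),
                         surv * math.exp(-half_width),
                         min(1.0, surv * math.exp(half_width))))
        i = j
    return rows

# Made-up weaning durations in months; 0 marks a child still breastfed.
times = [6, 6, 12, 12, 12, 18, 24, 24, 30, 36]
events = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]
rows = kaplan_meier(times, events)
for t, s, se, lo, hi in rows:
    print(f"t={t:>2}  S={s:.4f}  se={se:.4f}  CI=({lo:.4f}, {hi:.4f})")
```

The upper limit is clipped at 1 because the log-scale interval is not guaranteed to stay within [0, 1].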

Figure 2. (a) Kaplan-Meier Survival Curve with 95% confidence band; (b) Comparison of K-M and Nelson-Aalen Estimates of Survival


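A kernel density estimator of f_X can be motivated through the Kaplan-Meier estimator of the distribution function F_X, which is given by

$$\hat{F}_X(x) = 1 - \hat{S}(x)$$

The kernel estimator of f_Y(x) induced by $\hat{F}_X$ is then

$$\hat{f}_Y(x; h) = \frac{1}{h} \sum_{j=1}^{n} s_j \, K\!\left(\frac{x - Y_{(j)}}{h}\right) \qquad (3.4)$$

where K is a kernel function, h > 0 is the bandwidth, and s_j is the size of the jump of $\hat{F}_X$ at Y_(j).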
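The idea of smoothing the Kaplan-Meier jumps with a kernel can be sketched as below. The Gaussian kernel, the bandwidth h = 3 and the durations are assumptions made for illustration, and `km_jumps` and `smoothed_km_density` are hypothetical names.

```python
import math

def km_jumps(times, events):
    """Return (event_time, jump) pairs: the mass the Kaplan-Meier
    distribution function places on each distinct event time."""
    data = sorted(zip(times, events))
    n = len(data)
    surv = 1.0
    jumps = []
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i
        deaths, j = 0, i
        while j < n and data[j][0] == t:
            deaths += data[j][1]
            j += 1
        if deaths > 0:
            new_surv = surv * (1.0 - deaths / at_risk)
            jumps.append((t, surv - new_surv))   # s_j = drop in the survival curve
            surv = new_surv
        i = j
    return jumps

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def smoothed_km_density(x, jumps, h):
    """Kernel density estimate with weights s_j: (1/h) * sum_j s_j K((x - t_j)/h)."""
    return sum(s * gaussian_kernel((x - t) / h) for t, s in jumps) / h

# Made-up durations in months; 0 marks a censored (still breastfed) child.
times = [6, 6, 12, 12, 12, 18, 24, 24, 30, 36]
events = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]
jumps = km_jumps(times, events)
total_mass = sum(s for _, s in jumps)
print("mass carried by the jumps:", round(total_mass, 4))
print("density at 12 months:", round(smoothed_km_density(12.0, jumps, h=3.0), 4))
```

When the largest observation is censored, the jumps sum to less than one, so the estimate integrates to the total jump mass rather than to one.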

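We now assume that the observable times Y₁, Y₂, …, Y_n are contaminated with non-negligible recall error such that

$$W_i = Y_i + Z_i, \qquad i = 1, \ldots, n,$$

and, for each i, Z_i is a random variable that is independent of Y_i and has known density f_Z, which we call the error density. If we apply the ordinary kernel estimate to W₁, …, W_n, we obtain a consistent estimate of the convolution

$$f_W(w) = \int f_Y(w - z)\, f_Z(z)\, dz$$

rather than of f_Y, which we aim to estimate. Estimation of f_Y requires that we take into account the fact that it is convolved with f_Z to give the density of the error-contaminated data. Thus the estimation of f_Y is a problem of deconvolution type. A kernel-type solution is obtained by using Fourier transform (or characteristic function) properties and noting that

$$\varphi_{f_W}(t) = \varphi_{f_Y}(t)\, \varphi_{f_Z}(t)$$

where $\varphi_g$ is used to denote the c.f. of a density g. According to the Fourier inversion theorem, the target density can be written as

$$f_Y(y) = \frac{1}{2\pi} \int e^{-ity}\, \frac{\varphi_{f_W}(t)}{\varphi_{f_Z}(t)}\, dt$$

provided $\varphi_{f_Z}(t) \neq 0$ for all t. An estimate of f_Y(y) is obtained by replacing f_W by its kernel estimator

$$\hat{f}_W(w; h) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{w - W_i}{h}\right)$$

to obtain

$$\hat{f}_Y(y; h) = \frac{1}{2\pi} \int e^{-ity}\, \frac{\varphi_{\hat{f}_W(\cdot;\, h)}(t)}{\varphi_{f_Z}(t)}\, dt$$

which is the deconvolving kernel density estimator ^[17]. It can be shown ^[19] that the deconvolving kernel density estimator of the target density is

$$\hat{f}_Y(y; h) = \frac{1}{nh} \sum_{i=1}^{n} K^Z\!\left(\frac{y - W_i}{h};\, h\right) \qquad (3.5)$$

where

$$K^Z(x; h) = \frac{1}{2\pi} \int e^{-itx}\, \frac{\varphi_K(t)}{\varphi_{f_Z}(t/h)}\, dt \qquad (3.6)$$

$\varphi_K$ being the characteristic function of the kernel K used in estimating f_W. Thus, the kernel $K^Z$ is to be used for estimating f_Y instead of K. This effective kernel differs from K in that its shape depends on the bandwidth. We now use (3.6) to rewrite (3.4) as

$$\hat{f}_Y(y; h) = \frac{1}{h} \sum_{j=1}^{n} s_j \, K^Z\!\left(\frac{y - W_{(j)}}{h};\, h\right) \qquad (3.7)$$

where W_(j) denotes the j-th ordered contaminated observation.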
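To make the effective kernel concrete, the sketch below evaluates it by numerical Fourier inversion for a standard normal K and a real, symmetric error characteristic function, and then plugs it into a weighted kernel sum. Everything named here is hypothetical: `effective_kernel`, `deconvolved_density`, the truncation point `t_max`, the Laplace scale parametrization in `laplace_cf`, and the equal weights used in the example.

```python
import math

def effective_kernel(x, h, error_cf, t_max=8.0, steps=4000):
    """Approximate (1/pi) * int_0^inf cos(t x) exp(-t^2/2) / error_cf(t/h) dt.

    Assumes a standard normal K (c.f. exp(-t^2/2)) and a real, symmetric
    error c.f.; the integral is truncated at t_max (trapezoidal rule).
    """
    dt = t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * dt
        value = math.cos(t * x) * math.exp(-0.5 * t * t) / error_cf(t / h)
        total += value * (0.5 if k in (0, steps) else 1.0)
    return total * dt / math.pi

def laplace_cf(t, b=1.0):
    """C.f. of a Laplace error with density exp(-|z|/b) / (2b) (assumed form)."""
    return 1.0 / (1.0 + (b * t) ** 2)

def deconvolved_density(y, observed, weights, h, error_cf):
    """Weighted deconvolving estimate: (1/h) * sum_j s_j K^Z((y - W_j)/h; h)."""
    return sum(s * effective_kernel((y - w) / h, h, error_cf)
               for w, s in zip(observed, weights)) / h

# With no recall error (c.f. identically 1) the effective kernel reduces to K.
print(effective_kernel(0.0, 2.0, lambda t: 1.0))   # close to 1/sqrt(2*pi)

# Made-up contaminated durations with equal weights 1/n (no censoring).
observed = [6.0, 12.0, 12.0, 18.0, 24.0]
weights = [0.2] * 5
print(deconvolved_density(12.0, observed, weights, h=3.0, error_cf=laplace_cf))
```

With censored data, the equal weights would be replaced by the Kaplan-Meier jump sizes, which is the only change needed to move from the uncensored estimator to the weighted one.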

3.1. Simulation Study of Small Sample Bias

We generate data through simulation to examine the small sample bias of the estimator in (3.7). The effective kernel is obtained for two different error density functions. When the error density is Laplacian

and K(x) = Φ(x) is the standard normal kernel, the effective kernel for deconvolution of Laplacian error is

(3.1.1)
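As a worked illustration, suppose the Laplacian error density is parametrized by a scale b > 0 (an assumption; the parametrization used in the original display may differ, for example by expressing b through the error variance 2b²). Writing the effective kernel in its Fourier-inversion form and using the standard normal kernel, whose characteristic function is $e^{-t^2/2}$, gives:

```latex
\begin{align*}
f_Z(z) &= \frac{1}{2b}\, e^{-|z|/b}
  \quad\Longrightarrow\quad
  \varphi_{f_Z}(t) = \frac{1}{1 + b^2 t^2}, \\
K^Z(x; h)
  &= \frac{1}{2\pi} \int e^{-itx}\, \frac{\varphi_K(t)}{\varphi_{f_Z}(t/h)}\, dt
   = \frac{1}{2\pi} \int e^{-itx}\, e^{-t^2/2}
     \left(1 + \frac{b^2 t^2}{h^2}\right) dt \\
  &= \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}
     \left[\, 1 + \frac{b^2}{h^2}\left(1 - x^2\right) \right].
\end{align*}
```

The last step uses the fact that multiplying a Fourier transform by t² corresponds to taking minus the second derivative of the standard normal density, which equals (1 - x²) times that density. Under this parametrization the kernel integrates to one but becomes negative for x² > 1 + h²/b².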

Table 4. Small sample Bias B₀, B_L and B_N for three different sample sizes and three different censoring percentages at selected time points


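Supposing instead that the error variable has a N(0,σ²) distribution, the effective kernel, written $K^Z(x; h)$, would be

$$K^Z(x; h) = \frac{1}{\sqrt{2\pi\,(1 - \sigma^2/h^2)}}\, \exp\left\{ -\frac{x^2}{2\,(1 - \sigma^2/h^2)} \right\}, \qquad h > \sigma \qquad (3.1.2)$$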
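For normal error and a standard normal K, the effective kernel is itself a normal density with variance 1 - σ²/h², which exists only when the bandwidth exceeds σ. The sketch below encodes that closed form and checks numerically that it integrates to one; `normal_effective_kernel` and `integrate` are hypothetical helper names, and the values h = 2, σ = 1 are arbitrary.

```python
import math

def normal_effective_kernel(x, h, sigma):
    """Effective kernel for N(0, sigma^2) error and a standard normal K.

    It is the N(0, 1 - sigma^2/h^2) density, so it requires h > sigma.
    """
    if h <= sigma:
        raise ValueError("bandwidth h must exceed sigma for the kernel to exist")
    v = 1.0 - (sigma / h) ** 2            # variance of the effective kernel
    return math.exp(-0.5 * x * x / v) / math.sqrt(2.0 * math.pi * v)

def integrate(f, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal rule on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * dx)
    return total * dx

h, sigma = 2.0, 1.0
area = integrate(lambda x: normal_effective_kernel(x, h, sigma))
print("kernel at 0:", normal_effective_kernel(0.0, h, sigma))
print("integral   :", area)
```

Because the effective kernel is narrower than K, the deconvolved estimate is less smooth than the ordinary kernel estimate at the same bandwidth, which is one reason larger bandwidths are typically needed when correcting for normal error.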

Further, we use and . The choices of pair of values for give rise to desired censoring proportion to indicate no censoring, moderate censoring and heavy censoring. Observations have been simulated from F_X and G_C for three different sample sizes n=30, 60, 100. The pulling mechanism to contaminate the simulated data is defined through the following function

Table 4 presents the results showing bias at selected time points. We denote by B₀, B_L and B_N the bias due to the density estimate in (3.4) without accounting for recall error, the bias due to the Laplacian error corrected density estimate and the bias due to the normal error corrected density estimate respectively. Table 4 illustrates the following general findings: (i) All estimators are fairly unbiased. (ii) The kernel density estimator by smoothing the Kaplan-Meier distribution function.

3.2. Density Estimate from Duration of Breastfeeding Data

We now obtain smooth density estimates from the duration of breastfeeding data of NFHS-3. Figure 3(a) provides smooth estimates of the density of breastfeeding duration for two choices of bandwidth (h = 2.5, 4.5). Figure 3(b) compares (i) the smoothed estimate from the Kaplan-Meier distribution function, (ii) the estimate using Laplacian error correction, and (iii) the estimate using normal error correction. Figure 3(a) shows that, for the smaller bandwidth (h = 2.5), the density estimate is marked by peaks at ages 12, 24, 36, and 48 months. The larger bandwidth (h = 4.5), however, smooths out these peaks, which are thought not to be genuine. The right skewness of the distribution of weaning time is evident from the plots. This is reasonable, as only a few children are breastfed for longer durations. The median age at weaning is estimated to be 24 months with a 95% confidence interval (23.9, 24.1). Overall, an estimated proportion of about 60% of children continued breastfeeding until the age of 24 months in the north-eastern states of India.

The plots in Figure 3(b) show a location shift in the density curve after the error correction. Both the normal and the Laplacian error models shift the error-corrected density towards the left. However, there is hardly any difference between the error-corrected densities obtained under the normal and the Laplacian error models.

The error-corrected smooth density rises sharply, reaching its maximum at approximately 12 months, followed by a plateau until 24 months, and then trails off gradually. It describes well the situation in the whole population, where more than three-fourths of the children continue breastfeeding until the age of 12 months. The long right tail is a result of the few subjects who were breastfed for long durations.

Figure 3. (a) Heaped and Smooth Density of Weaning Time; (b) Error Corrected Densit
