Deconvolving Kernel Density Estimation of Right Censored Duration Data with Recall Errors

T. K. Chakrabarty

Department of Statistics, North Eastern Hill University, Shillong, India

Abstract

Demographic and Health Surveys (DHS) collect information on several landmark events retrospectively, from the life or birth histories of individuals and their recollections of past events. Retrospective information of this sort is known to be affected by recall errors, which result in the misplacement of dates and the distortion of reported durations. For example, the retrospectively reported ages at weaning for all births that occurred during the three or five years preceding the survey are right censored and commonly display marked heaping at durations of 12, 18 and 24 months. The present article first examines whether the heaping reflects true behavior and societal norms or is an age-dependent outcome. Then, under an additive error model, a kernel-type deconvolving density estimator of weaning time is proposed, obtained by smoothing the increments of the Kaplan-Meier (KM) cumulative distribution function. Simulated data show that in small and moderately censored samples these estimators can reduce the bias substantially. Finally, an empirical illustration is provided using National Family Health Survey (NFHS-3, 2005-06) data from India.

Cite this article:

  • Chakrabarty, T. K. "Deconvolving Kernel Density Estimation of Right Censored Duration Data with Recall Errors." American Journal of Applied Mathematics and Statistics 2.6 (2014): 416-422.

1. Introduction

Duration data, or time-to-event data, occur in various fields of the biological, engineering and social sciences. In demographic event history analysis, for instance, a primary focus has always been the timing and occurrence of events and the consequences of these events for the population stock. Data on the timing and date of events are often collected retrospectively [11] by interview, usually through a survey or census covering a wide variety of topics. Among others, these topics include marital events collected through marital histories in the US Panel Study of Income Dynamics (PSID) [15], births from fertility histories in the World Fertility Survey (WFS) [16], post-partum amenorrhea (the interval after a pregnancy before menstruation returns) in the Malaysian Family Life Surveys (MFLS) [4], and maternal and child health and nutrition in the Demographic and Health Surveys (DHS) [1, 3].

The probability distribution of the time until an event occurs is useful for several reasons. Examples include comparing declines in the average duration of breastfeeding across developing countries, and thereby recommending that breastfeeding promotion be placed high on their health policy agendas [18]; comparing the duration and stability of marital and non-marital unions in developed countries [6]; and understanding the factors that affect the timing and patterns of differential progression to the third birth [5, 12]. Estimation of this distribution is an important inferential issue [10].

Distributions of event durations or times can be estimated from either current-status or retrospective history data [8]. Current-status data comprise information on whether the milestone has or has not been reached at the time of a survey. If the milestone has been reached, we have incomplete information on when this occurred. Conversely, for those respondents who have not achieved the milestone at the time of the survey, we do not know when it will be reached (if ever). Current-status data thus correspond to the extreme situation where all the survival time data are either right-censored or left-censored. While these data constraints are restrictive, it is still possible to estimate the distribution using parametric models [13, 14], and to estimate some non-parametric models [10].

Retrospective data, on the other hand, comprise information from the respondents that the milestone (1) occurred at a certain age or (2) has not yet occurred, i.e. right-censored data (age at milestone > age at survey). The major attraction of retrospective data is the obvious ease of collection, and researchers often opt for cross-sectional studies in order to save time and cost. Despite these advantages, retrospectively reported data also have the drawback that people may not accurately report when events occurred. It is a stylized fact that, when people report the timing of events that happened in the distant past, they tend to round the year or the time since the event up or down. Consequently, events tend to be “heaped” on multiples of chronological or calendar units. In the case of age in months at weaning, for instance, as we shall see in Section 2, retrospectively reported ages commonly display marked heaping at durations of 6, 12, 18 and 24 months.

Various studies concerning the distribution of weaning times [8, 18] have observed that the current-status measures lead to unbiased estimates of the survival function for a sample of births that occur during a fixed period. The estimates of the survival function so obtained, however, often fail to decline monotonically and are subject to larger variations. Retrospective history data are to be preferred because the sample size used to estimate the hazard rate at each duration is larger than is the case for current-status information, so that sampling variation is lower. If, however, the accuracy of the retrospective information is suspect, then the current-status data may be preferable despite the higher sampling variability. Thus, the method of analysis must be chosen by taking into account knowledge of the accuracy of a specific data set and an assessment of the extent of misreporting of duration information.

In dealing with retrospectively reported right-censored duration data, the present article examines whether recall errors and the heaping of ages are universal, and whether such heaping is genuine, for reasons such as societal norms, or depends on the time since the occurrence of the event. The main purpose of this article is to obtain the probability density function of the right-censored duration variable from data contaminated with recall errors. The paper is organized as follows. In Section 2, I discuss the sources of data used in the present work and some of the methodological issues in analyzing the information on breastfeeding from large-scale health surveys. Kaplan-Meier and Nelson-Aalen estimates of survival functions are obtained in Section 3 using National Family Health Survey data for the north-eastern states of India. Following this, I propose a kernel-type deconvolving density estimator [2, 19] of the duration of breastfeeding under non-negligible additive recall error and apply it to the data mentioned above.

2. Data on Weaning Time

Demographic and Health Surveys (DHS) collect information on the timing and age of several landmark events from a representative cross-sectional sample of ever-married women and their children, retrospectively, from life or birth histories and recollections of past events. For instance, respondent mothers are asked the following questions for each of their children born during the three or five years before the survey:

a. Did you ever breastfeed the child?

b. Are you still breastfeeding the child?

c. For how many months did you breastfeed the child?

The dichotomous answers (“yes” or “no”) to the first two questions are known as current-status data. In addition, if the answer to (b) is “no” and the respondent reports the age at which the child was weaned, the result is called retrospective data. When the child is still breastfeeding at the time of the survey, the number of months for which the child has already been breastfed is taken as a right-censored observation of the weaning time.

For the present analysis, the information on breastfeeding in the eight north-eastern states of India (Arunachal Pradesh, Assam, Manipur, Meghalaya, Mizoram, Nagaland, Sikkim and Tripura) has been pooled from the National Family Health Survey (NFHS-3) [7]. The NFHS-3, conducted from December 2005 to August 2006, gathered information on 21,843 ever-married women aged 15-49. Information on breastfeeding was collected for the children of interviewed women born in the five years preceding NFHS-3. For any given woman, a maximum of three births were included in the analysis. Information on breastfeeding duration was available for a total of 7600 children, of whom 3759 (49.5%) were still breastfeeding at the date of interview or were breastfed until they died. The remaining children (50.5%) had completed their breastfeeding. The duration of breastfeeding for children who were still breastfeeding was calculated as the difference between their birth date and the date of the survey. Table 1 below reports the state-wise distribution of the number of children by completeness of information.

Table 1. Statewise Distribution of Sample Data

2.1. Recall Error and Its Characteristics

As mentioned earlier, retrospectively reported data have the drawback that people may not accurately report when events occurred [15, 18]. It is a stylized fact that, when people report the timing of events that happened in the distant past, they tend to round the year or the time since the event up or down. Consequently, events tend to be “heaped” on multiples of chronological or calendar units (e.g. on units of five or ten for data that naturally occur over years). In the case of retrospectively reported ages at weaning, the data commonly display marked heaping at durations of 6, 12, 18 and 24 months. We illustrate this in Figure 1 by plotting, against age in months, the proportion of births in each status category: children who are weaned and children who are still breastfed.

Figure 1 demonstrates that heaping of ages at weaning is pronounced only for those children who are already weaned at the time of interview. At the same time, we also have the following hypothesis to be validated from the sample at hand:

H01: Heaping of weaning time at multiples of 6 months is genuine and is due to existing social norms.

Secondly, people are notoriously poor at recalling events and the timing of events [11]. In general, the greater the time lag between an event and its recall, the more errors occur. Thus, the second hypothesis to be validated is as follows:

H02: For the weaned children, the greater is the time lag between weaning and its recall, the higher is the heaping.

Figure 1. Histograms of Weaning durations for weaned and currently breastfeeding children

2.1.1. Is Heaping a Result of True Behaviour?

If heaping results from true behavior, such as a social norm of weaning at ages that are multiples of 6 months (denoted T6), we would expect to find (i) substantially more mass at T6 than at T6-1 or T6+1, and (ii) significant differences between the masses at T6-1 and T6+1 months. Table 2 below provides the probabilities of weaning at ages that are multiples of 6 months, along with those at the preceding and succeeding months. It also reports the p-values for testing the hypothesis of equal proportions, pI[T6-1] = pI[T6+1].

Table 2. Probability of weaning at ages multiple of 6 months

The results indicate that there is no strong evidence against the equality of the proportions of weaning at ages T6-1 and T6+1. Moreover, if most children were truly weaned at ages T6, i.e. at multiples of 6 months, we should have observed pI[T6-1] > pI[T6+1], which is not the case in the present sample.
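The comparison of the masses at T6-1 and T6+1 can be sketched with an exact binomial test. The test behind Table 2 is not specified in the text, so this is only one plausible choice, and the counts used below are hypothetical rather than taken from the survey.

```python
from math import comb

def equal_mass_pvalue(n_before: int, n_after: int) -> float:
    """Two-sided exact binomial p-value for H0: P(T6-1) = P(T6+1).

    Conditional on a report falling at T6-1 or T6+1, H0 makes each
    neighbour equally likely, so n_before ~ Binomial(n, 1/2).
    """
    n = n_before + n_after
    if n == 0:
        return 1.0
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = pmf[n_before]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical counts of reports at 11 and 13 months (not taken from Table 2).
print(equal_mass_pvalue(41, 35))
```

A large p-value here would be read, as in the text, as no strong evidence against equal masses on either side of the heaping point.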


2.1.2. Age Dependent Heaping of Weaning Times

As before, let T6 denote an age that is a multiple of 6 months, and let I[T6] be an indicator that the reported weaning age is a multiple of 6 months, so that I[T6] = 1 if the reported weaning time is a multiple of 6 months and 0 otherwise. Our objective here is to test the hypothesis H02 stated earlier. To carry this out formally, we fit the logistic regression model

$$\operatorname{logit} P(I[T6] = 1) = \beta_0 + \beta_1\, \text{currentAge},$$

where ‘currentAge’ is the proxy covariate for the time lag between weaning and the recall of the event at the date of interview. The results of the test of

$$H_0: \beta_1 = 0$$

are provided in Table 3. We find that the effect of this proxy covariate is highly significant.
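A minimal sketch of such a fit is given below, assuming a single covariate and a Wald test of its coefficient. The data are simulated with made-up coefficients, and the names `fit_logistic`, `current_age` and `heaped` are illustrative only; the survey data and the software used for Table 3 are not reproduced here.

```python
import math
import random

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x.

    Returns (b0, b1, se_b1). x is centred internally for numerical stability.
    """
    mean_x = sum(x) / len(x)
    xc = [xi - mean_x for xi in x]
    a = b1 = 0.0  # intercept on the centred scale, slope
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(xc, y):
            p = 1.0 / (1.0 + math.exp(-(a + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * xi   # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = score.
        a += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)  # slope variance from the inverse information
    return a - b1 * mean_x, b1, se_b1

# Simulated stand-in for the survey: heaping becomes likelier with current age.
rng = random.Random(1)
true_b0, true_b1 = -1.5, 0.03  # made-up values for illustration
current_age = [rng.uniform(0, 59) for _ in range(4000)]
heaped = [1 if rng.random() < 1 / (1 + math.exp(-(true_b0 + true_b1 * age))) else 0
          for age in current_age]

b0, b1, se = fit_logistic(current_age, heaped)
z = b1 / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided Wald p-value
print(round(b1, 4), round(se, 4), p_value)
```

A positive and significant slope would correspond to H02: the longer the lag since the event, the more likely the report is heaped.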

Table 3. Estimate of the effect of recall age on heaping

3. Density Function Estimators

We consider a non-parametric procedure for estimating the probability $S(x)$ of surviving to time $x$, using a random sample $X_1, X_2, \ldots, X_n$ of death times from a distribution $F(x)$. The $X_i$ are censored on the right by random variables $C_i$, so that one observes only $Y_i = \min(X_i, C_i)$, $i = 1, \ldots, n$. The $C_i$ constitute a random sample, drawn independently of the $X_i$, from a distribution $G(c)$. We let $Y_{(1)} \le \cdots \le Y_{(n)}$ denote the ordered observations, and let $\delta_{(i)}$ be an indicator for the event that $Y_{(i)}$ is uncensored. Kaplan and Meier [9] developed the nonparametric estimator of $S(x)$ as

$$\hat{S}(x) = \prod_{i:\, Y_{(i)} \le x} \left(1 - \frac{d_i}{r_i}\right) \qquad (3.1)$$

where $r_i$ = # alive at time $Y_{(i)}-$ and $d_i$ = # died at time $Y_{(i)}$.

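Greenwood’s formula gives the variance of the estimated survival function as

$$\widehat{\mathrm{Var}}\big[\hat{S}(x)\big] = \hat{S}(x)^2 \sum_{i:\, Y_{(i)} \le x} \frac{d_i}{r_i\,(r_i - d_i)} \qquad (3.2)$$

The endpoints of a $100(1-\alpha)\%$ confidence interval for $S(x)$, constructed on the cumulative hazard or log-survival scale, are given by

$$\hat{S}(x)\, \exp\!\big(\pm z_{1-\alpha/2}\, \hat{\sigma}(x)\big), \qquad \hat{\sigma}^2(x) = \sum_{i:\, Y_{(i)} \le x} \frac{d_i}{r_i\,(r_i - d_i)} \qquad (3.3)$$

where $z_{1-\alpha/2}$ is the corresponding standard normal quantile.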
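The product-limit estimate, Greenwood's standard error and confidence limits on the log-survival scale can be computed in a few lines. The sketch below assumes the standard forms of these formulas, uses toy durations rather than the survey data, and returns NaN for the interval once the estimate reaches zero, where Greenwood's sum is undefined.

```python
import math

def kaplan_meier(times, events, z=1.96):
    """KM estimate with Greenwood standard error and log-scale confidence limits.

    times  : observed durations Y_i = min(X_i, C_i)
    events : 1 if uncensored, 0 if right censored
    Returns a list of (t, S, se, lower, upper), one row per distinct event time.
    """
    rows, surv, gw = [], 1.0, 0.0
    for t in sorted(set(times)):
        r = sum(1 for y in times if y >= t)                        # r_i: at risk just before t
        d = sum(1 for y, e in zip(times, events) if y == t and e)  # d_i: events at t
        if d == 0:
            continue
        surv *= 1.0 - d / r
        if r == d:  # S drops to zero; Greenwood's sum is undefined here
            rows.append((t, 0.0, float("nan"), float("nan"), float("nan")))
            continue
        gw += d / (r * (r - d))
        sigma = math.sqrt(gw)
        lower = surv * math.exp(-z * sigma)
        upper = min(1.0, surv * math.exp(z * sigma))
        rows.append((t, surv, surv * sigma, lower, upper))
    return rows

# Toy durations in months; 0 marks a child still breastfeeding at interview.
for row in kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]):
    print(row)
```

Ties are handled by counting all events at a given time together, which matters for heaped durations reported in whole months.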

Figure 2 shows the Kaplan-Meier and Nelson-Aalen estimates of the survival function; jumps in these estimates can be observed at times that are multiples of 6 months, as discussed earlier.

Figure 2. (a) Kaplan-Meier Survival Curve with 95% confidence band; (b) Comparison of K-M and Nelson-Aalen Estimates of Survival

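A kernel density estimator of $f_X$ can be motivated through the Kaplan-Meier estimator of the distribution function $F_X$, which is given by

$$\hat{F}_X(x) = 1 - \hat{S}(x),$$

where $\hat{S}(x)$ is the Kaplan-Meier estimator of $S(x)$ in (3.1). The kernel estimator of $f_Y(x)$ induced by $\hat{F}_X$ is then

$$\hat{f}_Y(x; h) = \frac{1}{h} \sum_{j=1}^{n} s_j\, K\!\left(\frac{x - Y_{(j)}}{h}\right) \qquad (3.4)$$

where $s_j$ is the size of the jump of $\hat{F}_X$ at $Y_{(j)}$, $K$ is a kernel function and $h > 0$ is the bandwidth.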
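The estimator in (3.4) can be sketched as a weighted kernel sum over the uncensored times, with the Kaplan-Meier jumps as weights. The version below assumes a standard normal kernel; the function names are illustrative. With no censoring every jump equals 1/n, so the estimate reduces to the ordinary kernel density estimate.

```python
import math

def normal_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def km_jumps(times, events):
    """(time, jump) pairs of the Kaplan-Meier distribution function."""
    jumps, surv = [], 1.0
    for t in sorted(set(times)):
        r = sum(1 for y in times if y >= t)                        # at risk just before t
        d = sum(1 for y, e in zip(times, events) if y == t and e)  # events at t
        if d:
            new_surv = surv * (1.0 - d / r)
            jumps.append((t, surv - new_surv))  # s_j = drop in survival at t
            surv = new_surv
    return jumps

def km_kernel_density(x, times, events, h):
    """Kernel estimate obtained by smoothing the KM jumps with bandwidth h."""
    return sum(s * normal_kernel((x - t) / h) for t, s in km_jumps(times, events)) / h

# Toy example: censored observations carry no jump of their own, but they
# shift mass to later uncensored times through the product-limit weights.
print(km_kernel_density(2.0, [1, 2, 2, 3, 4], [1, 1, 0, 1, 0], h=1.0))
```

If the largest observation is censored, the jumps sum to less than one, so the estimate integrates to the Kaplan-Meier distribution function at the last event time rather than to one.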

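We now assume that the observable times $Y_1, Y_2, \ldots, Y_n$ are contaminated with non-negligible recall error, such that we observe

$$W_i = Y_i + Z_i, \qquad i = 1, \ldots, n,$$

and, for each $i$, $Z_i$ is a random variable that is independent of $Y_i$ and has known density $f_Z$, which we call the error density. If we apply the ordinary kernel estimate to $W_1, \ldots, W_n$, we obtain a consistent estimate of the convolution

$$f_W(w) = (f_Y * f_Z)(w) = \int f_Y(w - z)\, f_Z(z)\, dz$$

rather than of $f_Y$, which we aim to estimate. Estimation of $f_Y$ requires that we take into account the fact that it is convolved with $f_Z$ to give the density of the error-contaminated data. Thus the estimation of $f_Y$ is a problem of deconvolution type. A kernel-type solution is obtained by using Fourier transform (or characteristic function) properties and noting that

$$\varphi_{f_W}(t) = \varphi_{f_Y}(t)\, \varphi_{f_Z}(t),$$

where $\varphi_g(t) = \int e^{itx} g(x)\, dx$ is used to denote the c.f. of a density $g$. According to the Fourier inversion theorem, the target density can be written as

$$f_Y(y) = \frac{1}{2\pi} \int e^{-ity}\, \frac{\varphi_{f_W}(t)}{\varphi_{f_Z}(t)}\, dt,$$

provided $\varphi_{f_Z}(t) \neq 0$ for all $t$. An estimate of $f_Y(y)$ is obtained by replacing $f_W$ by its kernel estimator

$$\hat{f}_W(w; h) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{w - W_i}{h}\right)$$

to obtain

$$\hat{f}_Y(y; h) = \frac{1}{2\pi} \int e^{-ity}\, \frac{\varphi_{\hat{f}_W(\cdot\,; h)}(t)}{\varphi_{f_Z}(t)}\, dt,$$

which is the deconvolving kernel density estimator [17]. It can be shown [19] that the deconvolving kernel density estimator of the target density is

$$\hat{f}_Y(y; h) = \frac{1}{nh} \sum_{i=1}^{n} K^Z\!\left(\frac{y - W_i}{h};\, h\right) \qquad (3.5)$$

where

$$K^Z(u; h) = \frac{1}{2\pi} \int e^{-itu}\, \frac{\varphi_K(t)}{\varphi_{f_Z}(t/h)}\, dt, \qquad (3.6)$$

$\varphi_K$ being the characteristic function of the kernel $K$ used in estimating $f_W$. Thus, the kernel $K^Z$ is to be used for estimating $f_Y$ instead of $K$. This effective kernel differs from $K$ in that its shape depends on the bandwidth. We now use $K^Z$ of (3.6) to rewrite (3.4) as

$$\hat{f}_Y(y; h) = \frac{1}{h} \sum_{j=1}^{n} s_j\, K^Z\!\left(\frac{y - W_{(j)}}{h};\, h\right) \qquad (3.7)$$

where $s_j$ is now the jump of the Kaplan-Meier distribution function at the ordered contaminated observation $W_{(j)}$.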
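The passage from the Fourier inversion formula to (3.5) and (3.6) is a single change of variable, written out here as a worked step. The notation is assumed for illustration: $\varphi_g$ denotes the characteristic function of $g$, $W_i$ the contaminated observations, and $K^Z$ the effective kernel.

```latex
\begin{align*}
% characteristic function of the ordinary kernel estimator built from W_1, ..., W_n
\varphi_{\hat f_W}(t)
  &= \frac{1}{n}\sum_{i=1}^{n} e^{itW_i}\,\varphi_K(ht), \\
% insert into the inversion formula, then substitute s = ht (dt = ds / h)
\hat f_Y(y;h)
  &= \frac{1}{2\pi}\int e^{-ity}\,
     \frac{\varphi_{\hat f_W}(t)}{\varphi_{f_Z}(t)}\,dt \\
  &= \frac{1}{nh}\sum_{i=1}^{n}\frac{1}{2\pi}\int e^{-is(y-W_i)/h}\,
     \frac{\varphi_K(s)}{\varphi_{f_Z}(s/h)}\,ds \\
  &= \frac{1}{nh}\sum_{i=1}^{n} K^Z\!\left(\frac{y-W_i}{h};\,h\right).
\end{align*}
```

The factor $\varphi_{f_Z}(s/h)$ in the denominator is what makes the shape of the effective kernel depend on the bandwidth.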

3.1. Simulation Study of Small Sample Bias

We generate data through simulation to examine the small sample bias of the estimator in (3.7). The effective kernel is obtained for two different error density functions. When the error density is Laplacian

and that K(x) = Φ(x), the standard normal kernel, the effective kernel for deconvolution of Laplacian error is

(3.1.1)
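As a check on the Laplacian case, the sketch below compares a closed-form effective kernel with a direct numerical evaluation of the inversion integral in (3.6). It assumes that the Laplacian error is parametrised by its variance σ², so that its characteristic function is 1/(1 + σ²t²/2), and that K is the standard normal kernel; under a different parametrisation the constant multiplying (u² − 1) changes accordingly.

```python
import math

def normal_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def laplace_effective_kernel(u, h, sigma):
    """Closed form K(u) * [1 - sigma^2 / (2 h^2) * (u^2 - 1)].

    Assumes Laplace error with variance sigma^2 and a standard normal K.
    """
    return normal_kernel(u) * (1.0 - sigma ** 2 / (2.0 * h ** 2) * (u * u - 1.0))

def effective_kernel_by_inversion(u, h, sigma, t_max=12.0, n=6000):
    """Trapezoidal value of (1/pi) * int_0^inf cos(t u) phi_K(t) / phi_Z(t/h) dt."""
    def integrand(t):
        phi_k = math.exp(-0.5 * t * t)                # c.f. of the normal kernel
        inv_phi_z = 1.0 + 0.5 * (sigma * t / h) ** 2  # 1 / phi_Z(t / h)
        return math.cos(t * u) * phi_k * inv_phi_z
    dt = t_max / n
    total = 0.5 * (integrand(0.0) + integrand(t_max))
    total += sum(integrand(k * dt) for k in range(1, n))
    return total * dt / math.pi

for u in (0.0, 0.7, 2.0):
    print(u, laplace_effective_kernel(u, 2.5, 1.0),
          effective_kernel_by_inversion(u, 2.5, 1.0))
```

The correction term grows as h shrinks relative to σ, and the effective kernel becomes negative in the tails, so the corrected density estimate is not guaranteed to be non-negative.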

Table 4. Small sample Bias B0, BL and BN for three different sample sizes and three different censoring percentages at selected time points

Supposing instead that the error variable has a N(0, σ²) distribution, the effective kernel would be

(3.1.2)
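One way to see what a normal-error effective kernel looks like is to keep the standard normal kernel, which is an assumption made here for illustration and may differ from the kernel behind (3.1.2). The ratio of characteristic functions is then exp(−c t²/2) with c = 1 − σ²/h², which is integrable only when h > σ, and inversion gives a normal density with variance c.

```python
import math

def normal_effective_kernel(u, h, sigma):
    """Effective kernel for N(0, sigma^2) error, assuming a standard normal K.

    phi_K(t) / phi_Z(t / h) = exp(-c t^2 / 2) with c = 1 - sigma^2 / h^2,
    so Fourier inversion yields the N(0, c) density, provided h > sigma.
    """
    c = 1.0 - (sigma / h) ** 2
    if c <= 0.0:
        raise ValueError("a normal kernel with normal error needs h > sigma")
    return math.exp(-0.5 * u * u / c) / math.sqrt(2.0 * math.pi * c)

# The effective kernel is narrower and taller than K itself.
print(normal_effective_kernel(0.0, 2.5, 1.0), 1.0 / math.sqrt(2.0 * math.pi))
```

The restriction h > σ is specific to this kernel choice; kernels whose characteristic function has compact support keep the inversion integral finite for any bandwidth.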

Further, we use and . The choices of pair of values for give rise to desired censoring proportion to indicate no censoring, moderate censoring and heavy censoring. Observations have been simulated from FX and GC for three different sample sizes n=30, 60, 100. The pulling mechanism to contaminate the simulated data is defined through the following function

Table 4 presents the results showing bias at selected time points. We denote by B0, BL and BN the bias due to the density estimate in (3.4) without accounting for recall error, the bias due to the Laplacian error corrected density estimate and the bias due to the normal error corrected density estimate respectively. Table 4 illustrates the following general findings: (i) All estimators are fairly unbiased. (ii) The kernel density estimator by smoothing the Kaplan-Meier distribution function.
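A Monte Carlo comparison in the spirit of B0 and BL can be sketched as follows. Everything specific in it is an assumption made for illustration: exponential event and censoring times, additive Laplace recall error with variance σ², a normal kernel, and the particular bandwidth and rates. It is not the design that produced Table 4.

```python
import math
import random

def normal_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def laplace_kernel(u, h, sigma):
    # Effective kernel for Laplace error with variance sigma^2 and normal K.
    return normal_kernel(u) * (1.0 - sigma ** 2 / (2.0 * h ** 2) * (u * u - 1.0))

def km_jumps(obs, events):
    """(time, jump) pairs of the KM distribution function; assumes no ties."""
    jumps, surv = [], 1.0
    order = sorted(range(len(obs)), key=lambda i: obs[i])
    for rank, i in enumerate(order):
        if events[i]:
            at_risk = len(obs) - rank
            new_surv = surv * (1.0 - 1.0 / at_risk)
            jumps.append((obs[i], surv - new_surv))
            surv = new_surv
    return jumps

def bias_at(x0, n=60, reps=200, h=0.4, sigma=0.3, rate=1.0, cens_rate=0.3, seed=0):
    """Monte Carlo bias of the uncorrected and Laplace-corrected estimates at x0."""
    rng = random.Random(seed)
    b = sigma / math.sqrt(2.0)  # Laplace scale giving variance sigma^2
    naive = corrected = 0.0
    for _ in range(reps):
        obs, events = [], []
        for _ in range(n):
            x, c = rng.expovariate(rate), rng.expovariate(cens_rate)
            # Difference of two exponentials with mean b is Laplace(0, b).
            z = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
            obs.append(min(x, c) + z)
            events.append(1 if x <= c else 0)
        jumps = km_jumps(obs, events)
        naive += sum(s * normal_kernel((x0 - t) / h) for t, s in jumps) / h
        corrected += sum(s * laplace_kernel((x0 - t) / h, h, sigma) for t, s in jumps) / h
    truth = rate * math.exp(-rate * x0)  # exponential density of X at x0
    return naive / reps - truth, corrected / reps - truth

b0, bl = bias_at(1.0)
print(round(b0, 4), round(bl, 4))
```

Varying `n` and `cens_rate` changes the sample size and censoring proportion, which is how a grid like the one described for Table 4 could be explored.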

3.2. Density Estimate from Duration of Breastfeeding Data

We now obtain smooth density estimates from the NFHS-3 data on duration of breastfeeding. Figure 3(a) provides smooth estimates of the density of breastfeeding duration for two choices of bandwidth (h = 2.5, 4.5). Figure 3(b) compares (i) the smoothed estimate from the Kaplan-Meier distribution function, (ii) the estimate using Laplacian error correction, and (iii) the estimate using normal error correction. Figure 3(a) shows that for the smaller bandwidth (h = 2.5) the density estimate is marked by peaks at ages 12, 24, 36 and 48 months. The larger bandwidth (h = 4.5), however, smooths out these peaks, which are thought not to be genuine. The right skewness of the distribution of weaning time is evident from the plots. This is quite reasonable, as only a few children are breastfed for longer durations. The median age at weaning is estimated to be 24 months, with a 95% confidence interval of (23.9, 24.1). Overall, an estimated 60% of children in the north-eastern states of India continued breastfeeding until the age of 24 months.

The plots in Figure 3(b) show a location shift in the density curve after the error correction. Both the normal and Laplacian error models shift the error-corrected density towards the left. However, the choice between the normal and Laplacian error models makes hardly any difference to the error-corrected density.

The error-corrected smooth density rises sharply, reaching its maximum at approximately 12 months, followed by a plateau until 24 months, and then trails off gradually. It describes well the situation in the whole population, in which more than three-fourths of children continue breastfeeding until the age of 12 months. The long right tail results from the few subjects who had long breastfeeding experiences.

Figure 3. (a) Heaped and Smooth Density of Weaning Time; (b) Error Corrected Density

4. Concluding Remarks

In this paper, a well-recognized feature of retrospectively collected right-censored duration data on events that occurred in the distant past, namely heaping on some natural time unit, has been investigated. It is demonstrated that such heaping is pronounced for subjects with a completed outcome. Two hypotheses characterizing this feature, (1) genuineness and (2) time dependence, have been examined. There is no evidence that the heaping is genuine, i.e. due to social norms and practice. Moreover, the pattern is age dependent: the greater the time lag of reporting, the more pronounced the heaping.

Having verified the presence of non-negligible recall error in the data, we have computed a smooth error-corrected density under an additive error model. The kernel estimator of the density function for breastfeeding duration proposed in this article uses the idea of convolving a kernel weight with the density estimate induced by the natural estimate of the cumulative distribution function. The estimator has an attractive mean squared error property [19] and is pointwise consistent [2].

For large samples, the effect of bias correction has been found to be small, provided the bandwidth parameter is selected appropriately. One can make judicious use of any of the several hi-tech bandwidth selectors to implement the proposed estimator.

The methodology proposed in this article is of practical use for nations [18] where government intervention programmes are under way to increase the percentage of infants exclusively breastfed until six months, and where it is necessary to evaluate the efficacy of such programmes through bias-free estimates of the proportion of weaned cases.

References

[1]  Anandaiah, R. and Choe, M. K. (2000). Are the WHO guidelines on Breastfeeding Appropriate for India? National Family Health Survey Subject Reports No. 16, International Institute for Population Sciences, Mumbai, India.

[2]  Andersen, P.K., Borgan, Ø., Gill, R.D., and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag New York, Inc.

[3]  Arnold, F., Parasuraman, S., Arokiasamy, P., and Kothari, M. (2009). Nutrition in India. National Family Health Survey (NFHS-3), India, 2005-06. Mumbai: International Institute for Population Sciences; Calverton, Maryland, USA: ICF Macro.

[4]  Beckett, M., DaVanzo, J., Sastry, N., Panis, C., and Peterson, C. (2001). The quality of retrospective data: an examination of long term recall in a developing country. Journal of Human Resources, 36(3): 593-625.

[5]  Heckman, J.J., and Walker, J.R. (1992). Understanding third births in Sweden. In: Demographic applications of event history analysis, Trussell, J., Hankinson, R., and Tilton, J. (Eds.), Oxford University Press, Oxford, pp: 157-208.

[6]  Hoem, B. and Hoem, J. N. (1992). The disruption of marital and non-marital unions in contemporary Sweden. In: Demographic applications of event history analysis, Trussell, J., Hankinson, R., and Tilton, J. (Eds.), Oxford University Press, Oxford, pp: 61-93.

[7]  International Institute for Population Sciences and ORC Macro. (2002). National Family Health Survey (NFHS-3), 2005-06: Northeastern States, IIPS, Bombay.

[8]  John, A.M., Menken, J.A., and Trussell, J. (1988). Estimating the distribution of interval length: current status and retrospective history data. Population Studies, 42: 115-127.

[9]  Kaplan, E.L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53: 457-481.

[10]  Mirzaei, S.S., and Sengupta, D. (2013). Parametric estimation of menarcheal age distribution based on recall data. Technical Report No. ASD/2013/3, Applied Statistical Unit, Indian Statistical Institute, 3. Available at www.isical.ac.in/asu/TR/TechRepASD201303.pdf.

[11]  Moss, L., and Goldstein, H. (Eds.) (1985). The Recall Method in Social Surveys, Heinemann, Portsmouth.

[12]  Murphy, M. (1992). The progression to the third birth in Sweden. In: Demographic applications of event history analysis, Trussell, J., Hankinson, R., and Tilton, J. (Eds.), Oxford University Press, Oxford, pp: 141-156.

[13]  Nelson, W. (1978). Life Data Analysis for Units Inspected for Failure (Quantal Response Data). IEEE Transactions on Reliability, R-27, 274-279.

[14]  Nelson, W. (1982). Applied Life Data Analysis. John Wiley & Sons, New York.

[15]  Peters, H. E. (1988). Retrospective Versus Panel Data in Analyzing Lifecycle Events. Journal of Human Resources, 23(4): 488-513.

[16]  Smith, D.J. and Ferry, B. (1984). Correlates of Breastfeeding: World Fertility Survey, Comparative Studies, 41, Int. Stat. Inst., Voorburg.

[17]  Stefanski, L. and Carroll, R.J. (1990). Deconvolving kernel density estimators. Statistics, 21: 169-184.

[18]  Trussell, J., Strawn, L.G., Rodriguez, G., and Vanlandingham, M. (1992). Trends and differentials in breastfeeding behavior: Evidence from WFS and DHS. Population Studies, 46: 285-307.

[19]  Wand, M.P., and Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall/CRC, www.crcpress.com.