Research Article
Open Access Peer-reviewed

Semi-Parametric Models for Longitudinal Data Analysis

Liu Yang, Xu-Feng Niu
Journal of Finance and Economics. 2021, 9(3), 93-105. DOI: 10.12691/jfe-9-3-1
Received April 23, 2021; Revised May 27, 2021; Accepted June 07, 2021

Abstract

Longitudinal studies are widely used in fields such as public health, clinical trials, and financial data analysis. A major challenge in longitudinal studies is that repeated measurements on each subject induce time-dependent correlation within subjects. Generalized Estimating Equations (GEE) handle correlated outcomes in longitudinal data through marginal effects. Our proposed model builds on GEE with a semi-parametric approach that provides a flexible structure for regression models: coefficients are estimated for the parametric covariates, while nuisance covariates are fitted with kernel smoothers in the non-parametric part. The profile kernel estimator and the seemingly unrelated kernel estimator (SUR) are used to obtain consistent and efficient semi-parametric estimators. We provide simulation results for semi-parametric models with one or multiple non-parametric terms. Because financial market data is a major component of data analysis, the application focuses on the financial market. Credit card loan data with payment information for each customer across six months is used to investigate whether gender, income, age, or other factors significantly influence payment status. Furthermore, we propose model comparisons to evaluate whether different models should be fitted for different subgroups of consumers, such as male and female.

1. Introduction

In statistical studies, the experimental design depends on the type of system under study and the goals of the research. Longitudinal studies allow the investigation of change across time points and of the effects of different factors on that change. One distinctive feature of longitudinal studies is that measurements are repeated at different time points within each subject (or cluster), so the correlation over time within a subject must be taken into account. For example, the financial market plays an important role in daily life. Financial institutions such as commercial banks, investment banks, insurance companies, and brokerages are major players trading in financial markets. Most financial data analysis involves time series because time is valuable and we want to track the temporal tendency of subjects. Therefore, once we obtain time-varying measurements for each subject, together with covariates, we can conduct longitudinal studies for financial data analysis.

A variety of longitudinal models have been applied in financial analysis. Petersen 1 pointed out that previous research focuses mainly on three methods: estimates from the Fama-MacBeth procedure (Fama and MacBeth) 2, dummy variables for each cluster, as in the fixed effect model, and adjustments for within-cluster correlation, as in Generalized Estimating Equations (GEE). The appropriate method depends on the question of interest. For subject-specific effects, the Generalized Linear Mixed Model (GLMM) provides a good estimator for individual subjects. When the covariates represent general factors or policies, GEE can be applied to investigate the relationship between the response and the covariates.

In order to capture the complex relationship in longitudinal data analysis, semi-parametric and non-parametric models have been developed for financial data analysis in longitudinal studies. Sam and Jiang 3 propose a non-parametric estimator for a short rate diffusion process with yields in longitudinal structure. In this paper we will introduce a class of semi-parametric regression models with GEE, which provide a flexible structure for longitudinal data analysis. Simulation studies will be conducted to compare the performance of our proposed models with other types of models. The semi-parametric regression models will be applied to credit card loan data and models for different subgroups will be examined.

Different estimation methods have been developed for non-parametric and semi-parametric regression models when observations of the response are independent. For non-parametric regression models, kernel estimation methods based on local likelihoods and splines based on penalized likelihoods can be used. For semi-parametric regression models, partial linear models can be used; these specify the mean of the outcome variable as a parametric function of some covariates and a non-parametric function of the other covariates. More specifically, local polynomial kernels, smoothing splines, regression splines, and penalized splines have been introduced as estimation methods for non-parametric and semi-parametric regression. Local polynomial kernels assign different weights to observations in the neighborhood of a target point. Smoothing splines fit the non-parametric function by a spline function with a roughness penalty. Regression splines model the non-parametric part with spline basis functions and a small number of knots, and penalized splines apply the penalty of smoothing splines to regression splines.
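To make the idea of kernel weighting concrete, the following is a minimal sketch of a local-average (Nadaraya-Watson) smoother with a Gaussian kernel, written with the Python standard library only. The function names, the toy data, and the bandwidth value are illustrative assumptions and are not taken from the paper.

```python
import math

def gaussian_weights(z, z0, h):
    """Normalized Gaussian kernel weights of the observations z around the target z0."""
    raw = [math.exp(-0.5 * ((zi - z0) / h) ** 2) for zi in z]
    total = sum(raw)
    return [w / total for w in raw]

def local_average(z, y, z0, h):
    """Local-constant (Nadaraya-Watson) estimate of E[y | z = z0]."""
    return sum(w * yi for w, yi in zip(gaussian_weights(z, z0, h), y))

# Hypothetical toy data: y = z^2 observed without noise on an even grid.
z = [i / 10 for i in range(11)]
y = [zi ** 2 for zi in z]
w = gaussian_weights(z, 0.5, h=0.1)   # weights peak at the observation closest to 0.5
fit = local_average(z, y, 0.5, h=0.1)
```

Observations near the target receive the largest weights, and the bandwidth h controls how quickly the weights decay with distance.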

For longitudinal data analysis, non-parametric and semi-parametric regression should be able to deal with within-subject correlation for repeated measurements. Estimating equations based methods and likelihood based methods can be used on non-parametric regression and semi-parametric regression with kernel and spline smoothing methods. Lin and Carroll 4 proposed a kernel GEE estimator through local polynomial kernel estimating equations by the extension of a generalized linear model. Unlike the parametric GEE developed by Zeger and Liang 5, kernel GEE has limited conditions for a consistent estimator and cannot reach efficiency bound if accounting for within-subject association. Wang 6 provided the seemingly unrelated kernel (SUR) estimator which fulfills both consistency and efficiency if we consider within-subject association. For likelihood based settings, spline smoothing includes the generalized smoothing spline estimator, P-splines, and regression splines, and the smoothing spline estimator has a close relationship with linear mixed models.

Whether semi-parametric regression is applied in marginal models or in linear mixed models depends on the goal. For semi-parametric regression in marginal models, several estimation methods have been developed to deal with within-subject correlation. Lin and Carroll 7 developed profile-kernel estimating equations, which estimate the parametric part by a profile method and the non-parametric part by the kernel GEE with local polynomial kernels mentioned above. The profile-kernel estimator is consistent only when within-subject correlation is ignored, and it is not semi-parametric efficient even when the non-parametric part has no within-subject correlation. Wang, Carroll, and Lin 8 used the SUR kernel method for the non-parametric part and continued to estimate the parametric part with the profile method, which yields an estimator that is both consistent and semi-parametric efficient. For semi-parametric linear mixed models, the profile SUR kernel method and the spline method can also be used to fit the model.

The rest of this article is organized as follows. Section 2 gives the mathematical details of the semi-parametric model and the semi-parametric kernel estimating equations. Estimators based on different approaches, such as the kernel average estimator (Lin and Carroll) 7 and the SUR kernel estimator (Wang, Carroll, and Lin) 8, are developed with closed form solutions. Section 3 presents a simulation study that follows the models in Section 2. Estimated coefficients and overall mean square errors are reported for parametric and semi-parametric estimators, showing the difference between parametric and semi-parametric models. For each model, we consider two setups with separate training and test datasets. Section 4 is the data application. We first describe the predictors and response variables in the credit card loan dataset. We then fit an overall model and report results from models fitted separately for different levels of a factor. Section 5 gives the conclusion and discussion, with a summary and some challenges that arise when fitting semi-parametric models.

2. Models

In this section we propose semi-parametric models for GEE. We present the local polynomial kernel GEE estimator and the seemingly unrelated kernel estimator, which are the two main tools for model fitting. We also describe how these estimators differ in consistency and efficiency when the association within subjects is taken into account.

Suppose that $Y_{ij}$ is a scalar outcome for subject $i$ at time period $j$. Given covariates $X_{ij}$ and $Z_{ij}$, where $X_{ij}$ is a $p \times 1$ vector and $Z_{ij}$ is a $q \times 1$ vector, the semi-parametric regression model is:

$$g(\mu_{ij}) = X_{ij}^{T}\beta + \sum_{k=1}^{q} \theta_k(Z_{ijk}),$$

where $\mu_{ij} = E(Y_{ij} \mid X_{ij}, Z_{ij})$, $g(\cdot)$ is a known monotonic link function, and $\theta_k(\cdot)$ are kernel smooth functions. For normally distributed responses we use the identity link $g(\mu_{ij}) = \mu_{ij}$, and for binary responses we use the logit link $g(\mu_{ij}) = \log\{\mu_{ij}/(1-\mu_{ij})\}$, with $\mu_{ij}$ the probability that $Y_{ij} = 1$. $X_{ij}$ is the covariate vector for the parametric part, $Z_{ij}$ is the covariate vector for the non-parametric part, and $\beta$ is the coefficient vector of the parametric part. We provide a profile-kernel estimator and a profile SUR estimator for this semi-parametric regression model with multiple kernel smoothers.

2.1. Profile-Kernel Estimating Equations with Two Kernel Smoothers

We follow the method of Lin and Carroll 7 and use a back-fitting algorithm to calculate the profile-kernel estimator, which has three general steps. For a given $\beta$ and given values of the other kernel smoother terms, we estimate one of the non-parametric terms using non-parametric estimating equations. After estimating that term, we estimate the remaining kernel smoother terms. Once all non-parametric terms have been estimated, a traditional generalized estimating equation is used to obtain the estimator of $\beta$.

Suppose we have a semi-parametric model with two kernel smoother terms:

$$g(\mu_{ij}) = X_{ij}^{T}\beta + \theta_1(Z_{1ij}) + \theta_2(Z_{2ij}),$$

where $Z_{ij} = (Z_{1ij}, Z_{2ij})^{T}$ is the covariate for the non-parametric part.

Step 1: Given $\beta$ and $\theta_2(\cdot)$, the estimating equation for $\theta_1(\cdot)$ is:

where is an matrix with the row is and are vectors: and we use the identity link function. are kernel weighs of the target value for subject. and is the first derivative of

and in which is a scale parameter and is known weight. is an invertible working correlation matrix for where we construct some structures such as AR(1) or exchangeable correlation forms.

Through the estimating equations, the local average kernel GEE estimator has a closed form solution:

Step 2: After we obtain $\hat{\theta}_1(\cdot)$, and given $\beta$, we calculate $\hat{\theta}_2(\cdot)$ from another estimating equation. Again, the local average kernel GEE estimator has a closed form solution.

Step 3: After estimating the non-parametric parts $\hat{\theta}_1(\cdot)$ and $\hat{\theta}_2(\cdot)$, we estimate $\beta$ by solving the adjusted generalized estimating equations:

where

and is a working correlation matrix.

We followed Fan and Li 9, providing that the estimating equation has a closed form solution for :

where is covariates matrix and is response variable. is the coefficient for the non-parametric regression estimator is the coefficient for the non-parametric regression estimator and If we write then and is the true correlation matrix for

Once we obtain $\hat{\beta}$, we update $\hat{\theta}_1(\cdot)$ and $\hat{\theta}_2(\cdot)$ and repeat the three steps until convergence.
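The three-step cycle can be illustrated with a deliberately simplified sketch. It assumes an identity link, a working independence correlation, a single scalar parametric covariate, and a local-average Gaussian smoother, and it updates $\beta$ by plain least squares on the partial residual rather than by the adjusted estimating equations of the profile method. All names, the simulated data, and the bandwidth are hypothetical.

```python
import math
import random

def nw_fit(z, r, h):
    """Local-average Gaussian-kernel fit of r on z, evaluated at every z[i]."""
    fitted = []
    for z0 in z:
        w = [math.exp(-0.5 * ((zi - z0) / h) ** 2) for zi in z]
        fitted.append(sum(wi * ri for wi, ri in zip(w, r)) / sum(w))
    return fitted

def backfit(y, x, z1, z2, h=0.1, n_iter=10):
    """Cycle through theta1, theta2 and beta; returns (beta, theta1_hat, theta2_hat)."""
    n = len(y)
    beta, t1, t2 = 0.0, [0.0] * n, [0.0] * n
    for _ in range(n_iter):
        # Step 1: smooth the partial residual on z1 with beta and theta2 held fixed.
        t1 = nw_fit(z1, [y[i] - beta * x[i] - t2[i] for i in range(n)], h)
        # Step 2: smooth on z2, then center theta2 so the intercept stays in theta1.
        t2 = nw_fit(z2, [y[i] - beta * x[i] - t1[i] for i in range(n)], h)
        mean_t2 = sum(t2) / n
        t2 = [v - mean_t2 for v in t2]
        # Step 3: least-squares update of beta on what the smoothers leave unexplained.
        r = [y[i] - t1[i] - t2[i] for i in range(n)]
        beta = sum(xi * ri for xi, ri in zip(x, r)) / sum(xi * xi for xi in x)
    return beta, t1, t2

# Hypothetical data: true beta = 2, theta1(z) = sin(2*pi*z), theta2(z) = z^2.
rng = random.Random(0)
n = 200
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
z1 = [rng.random() for _ in range(n)]
z2 = [rng.random() for _ in range(n)]
y = [2.0 * x[i] + math.sin(2 * math.pi * z1[i]) + z2[i] ** 2 + rng.gauss(0.0, 0.1)
     for i in range(n)]
beta_hat, theta1_hat, theta2_hat = backfit(y, x, z1, z2)
```

Centering the second smoother is one simple way to keep the two non-parametric terms identifiable; other conventions are possible.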

2.2. Profile SUR Kernel Estimator with Two Kernel Smoothers

Following the method of Wang, Carroll, and Lin 8, we propose the SUR kernel estimator for the semi-parametric model with two kernel smoother terms. As before, a three-step back-fitting iteration is used for the estimation.

Step 1: Let be the current estimator of Given and let

be the solution to the kernel equation

where the element of is

and is the first derivative of the function evaluated at

The updated estimator of is and is an matrix of zeros except the column is where is an vector of zeros except with the entry being 1 and denotes the bandwidth parameter.

A closed form solution with identity link can be obtained by:

Where

Here is an vector and is the total number of observations, is an matrix, and

Step 2: Given $\beta$ and the $\hat{\theta}_1(\cdot)$ obtained in the last step, the estimator of $\theta_2(\cdot)$ is calculated from another kernel equation with a closed form solution.

Step 3: After we obtain the estimators of the two kernel smoothers, we calculate $\hat{\beta}$ by solving the adjusted estimating equation. As before, we update $\beta$ by:

Where is the coefficient for the non-parametric regression estimator is the coefficient for the non-parametric regression estimator

Then we can run a full iteration through those backfitting steps until convergence.
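As a rough illustration of the SUR idea, the sketch below performs one local-average update of a single smoother under an identity link. When the kernel is centered on observation j of a subject, the other observations of that subject enter through their residuals from the current estimate, weighted by the off-diagonal entries of the inverse working covariance. This is a simplified reading of the SUR kernel update, not the paper's closed form; the function names, the common inverse covariance matrix Vinv, and the toy data are assumptions.

```python
import math

def gauss_kernel(u, h):
    """Gaussian density kernel with bandwidth h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def sur_update(z_grid, Y, Z, Vinv, theta_prev, h):
    """One SUR-style local-average update of theta at each point of z_grid.

    Y[i][j], Z[i][j]: response and covariate of subject i at time j.
    Vinv[j][l]: inverse working covariance, assumed common to all subjects.
    theta_prev: callable giving the current estimate of theta.
    """
    out = []
    for z0 in z_grid:
        num = den = 0.0
        for Yi, Zi in zip(Y, Z):
            resid = [yv - theta_prev(zv) for yv, zv in zip(Yi, Zi)]
            for j in range(len(Yi)):
                k = gauss_kernel(Zi[j] - z0, h)
                # Contribution of the subject's other time points via their residuals.
                off = sum(Vinv[j][l] * resid[l] for l in range(len(Yi)) if l != j)
                num += k * (Vinv[j][j] * Yi[j] + off)
                den += k * Vinv[j][j]
        out.append(num / den)
    return out

# Hypothetical toy data: two subjects, three time points each.
Y = [[1.0, 2.0, 1.5], [0.5, 1.0, 2.5]]
Z = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8]]
identity = [[1.0 if j == l else 0.0 for l in range(3)] for j in range(3)]
est = sur_update([0.5], Y, Z, identity, lambda z: 0.0, h=0.2)
```

With an identity inverse covariance the off-diagonal terms vanish and the update reduces to the ordinary local-average kernel estimator, which gives a convenient sanity check. In practice the update would be repeated, feeding each new estimate back in as theta_prev.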

3. Simulation Results

In this section, simulations are conducted to compare different estimation methods. Bias, standard deviation, and mean square error of the estimators are used to evaluate the performance of different approaches in parametric and semi-parametric models. Scenarios based on the local polynomial kernel GEE estimator and the SUR estimator are used to show which unbiased estimator achieves the smallest standard deviation under given conditions. For the non-parametric part, the Gaussian density kernel is used to construct the kernel weights of the smoother, and the least squares cross validation method (Silverman) is used to select the bandwidth parameter, which is critical for kernel regression models. In this simulation we focus on the estimation of $\beta$ and on the overall fit of the different estimators. The mean and standard deviation of the estimated $\beta$ are reported, and the overall fit of each approach is assessed by its mean square error. A training dataset and a test dataset are used to evaluate the performance of semi-parametric and parametric models.
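Bandwidth selection by cross validation can be sketched as follows. The example uses a leave-one-out squared-error criterion with a Gaussian kernel and a local-average smoother; the exact criterion, the candidate grid, and the simulated data are assumptions for illustration only.

```python
import math
import random

def loo_prediction(z, y, i, h):
    """Local-average prediction at z[i] computed without observation i."""
    num = den = 0.0
    for j, (zj, yj) in enumerate(zip(z, y)):
        if j != i:
            w = math.exp(-0.5 * ((zj - z[i]) / h) ** 2)
            num += w * yj
            den += w
    return num / den

def cv_score(z, y, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    n = len(y)
    return sum((y[i] - loo_prediction(z, y, i, h)) ** 2 for i in range(n)) / n

def select_bandwidth(z, y, grid):
    """Bandwidth in the candidate grid with the smallest cross validation score."""
    return min(grid, key=lambda h: cv_score(z, y, h))

# Hypothetical data: a sine curve observed with noise at 100 uniform points.
rng = random.Random(2)
z = [rng.random() for _ in range(100)]
y = [math.sin(2 * math.pi * zi) + rng.gauss(0.0, 0.2) for zi in z]
grid = [0.02, 0.05, 0.1, 0.2, 0.5]
h_best = select_bandwidth(z, y, grid)
```

A bandwidth that is too large oversmooths the curve and one that is too small chases the noise; the cross validation score penalizes both.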

3.1. Semi-Parametric Model with One Kernel Smoother

Consider a model with a non-parametric part and a linear part of the form:

$$Y_{ij} = X_{ij}^{T}\beta + \theta(Z_{ij}) + \varepsilon_{ij} \qquad (3.1)$$

where $i$ denotes the subject and $j$ denotes the time point. In the equation, $\theta(\cdot)$ is a kernel smooth function, $Z_{ij}$ denotes the covariates in the non-parametric part, $X_{ij}$ denotes the covariates in the parametric part, and $\beta$ is the coefficient vector. In this simulation, data are generated with the following set-up:

• Each run has 100 subjects, each subject has 4 or 10 time points, and there are 200 replicates.

in the first setup and in the second setup.

and are both scalars and time-varying covariates with where and are independent to each other and follow uniform distribution

• The error vector $\varepsilon_i$ follows a multivariate normal distribution with mean zero and an AR(1) correlation matrix whose correlation parameter $\rho$ is either 0.3 (lower correlation) or 0.7 (higher correlation).
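The correlation structures in this set-up can be sketched as follows: AR(1) and exchangeable correlation matrices, plus a draw of AR(1)-correlated normal errors for each subject. The function names are hypothetical, and the recursive construction of the errors is one standard way to obtain an AR(1) correlation; it is not necessarily how the original simulation was coded.

```python
import math
import random

def ar1_corr(m, rho):
    """m x m AR(1) correlation matrix with entries rho**|j - k|."""
    return [[rho ** abs(j - k) for k in range(m)] for j in range(m)]

def exchangeable_corr(m, rho):
    """m x m exchangeable correlation matrix: 1 on the diagonal, rho elsewhere."""
    return [[1.0 if j == k else rho for k in range(m)] for j in range(m)]

def simulate_ar1_errors(m, rho, rng):
    """One subject's errors: standard normal margins, corr(e_j, e_k) = rho**|j - k|."""
    e = [rng.gauss(0.0, 1.0)]
    for _ in range(m - 1):
        e.append(rho * e[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
    return e

rng = random.Random(1)
# 100 subjects with 4 time points each and correlation parameter 0.7.
errors = [simulate_ar1_errors(4, 0.7, rng) for _ in range(100)]
R_true = ar1_corr(4, 0.7)
```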

For estimating the semi-parametric model in (3.1), a semi-WI estimator with independent working correlation matrix and semi-True estimator with true working correlation matrix will be used in different scenarios. Parametric estimating approaches, such as estimators based on the following three parametric models, will be used in this simulation:


3.1.1. Local Kernel Estimator with One Kernel Smoother

In this section, we first show the results of local polynomial kernel GEE estimator with one kernel smoother for semi-parametric regression and other estimators for parametric regression with various scenarios, as we discussed in Section 3.1.

Table 1 shows estimates in the semi-parametric model (3.1) and the parametric models (para1–para3). In both setups, the standard errors of the semi-parametric estimators are smaller than those of the estimators from the three parametric models (para1–para3) by a factor of at least three. Similarly, in the first setup the standard errors for the correlation parameter $\rho$ = 0.7 are lower than those for $\rho$ = 0.3 in both the parametric estimators and the semi-parametric estimator, while in the second setup the standard errors are higher when $\rho$ = 0.7. For the overall fit, the mean square errors of the parametric estimators are larger than those of the semi-parametric estimators in both the training dataset and the test dataset, and the gain is larger in the second setup.

Figure 1 above shows the non-parametric part fitting when by the profile kernel estimator. The black line shows the true value, the blue line shows the fitting result using the independence working correlation matrix, while the red line shows fitting result using the true working correlation matrix. The three lines almost overlapped, which indicates that the results, using different working correlation matrices, deliver similar results in the non-parametric fitting part.

Table 2 shows estimates with the Gaussian density kernel and ten time periods. The results are similar to those with four time periods; however, the standard deviations of the semi-parametric estimators are smaller than when only four time points are involved, which indicates that the semi-parametric estimator gains more efficiency relative to parametric estimators when we have longer time periods. Moreover, the standard errors for the correlation parameter $\rho$ = 0.7 are slightly higher than those for $\rho$ = 0.3 in the parametric estimators. For the overall fit with the Gaussian density kernel and ten time periods, the training and test mean square errors of the semi-parametric approach are much lower than those of the parametric estimates in all cases. Among the parametric models, the polynomial model performs best, but it is still much worse than the semi-parametric fit. Furthermore, the mean square errors of the semi-parametric estimators are smaller than when only four time points are involved, which indicates that the semi-parametric estimator gains more accuracy relative to parametric estimators when we have longer time periods. Finally, the mean square errors in the training and test datasets for $\rho$ = 0.7 are slightly lower than those for $\rho$ = 0.3 in the semi-parametric estimators. When we extend the time periods to ten, the coefficient estimators in the second setup have less bias and a smaller standard deviation than in the first setup. The semi-parametric estimator still gains more when we have longer time periods.

The results in the two tables show that, relative to parametric estimators in the same scenarios, semi-parametric estimators benefit more when the correlation is stronger, the time period is longer, and the pattern in the non-parametric part is more complicated. According to the conclusion in Lin and Carroll 7 and the results of our simulation, WI estimators perform better than estimators fitted with the true correlation structure, which conflicts with the properties of the parametric GEE estimator. Another approach, proposed by Wang 6, is presented in the next part; it delivers the estimator with the highest efficiency when fitted with the true within-subject association.


3.1.2. The SUR Estimator with One Kernel Smoother

The SUR estimator (Wang) 6 will be displayed in this part for the semi-parametric regression, running a simulation that follows the setups in the first part. Still, different setups will be applied in the simulation results.

Table 3 shows estimates in the semi-parametric model (3.1) and the parametric models (para1–para3). The standard errors of the semi-parametric estimators are smaller than those of the estimators from the three parametric models (para1–para3) by a factor of at least three, and among the semi-parametric estimators, semi-True has smaller standard errors than semi-WI. Similarly, the standard errors for the correlation parameter $\rho$ = 0.7 are not higher than those for $\rho$ = 0.3 in the parametric and semi-parametric estimators. Among the parametric models, the polynomial model (para) has the smallest mean square error for the test dataset. As with the kernel estimation in the last part, the gain in overall fitting accuracy of the semi-parametric model (3.1) over the parametric models (para1–para3) is larger in the second setup.

3.2. Semi-Parametric Model with Multiple Kernel Smoothers

Consider another model with two kernel smoothers in the non-parametric part:

(3.2)

As before, $i$ denotes the subject and $j$ denotes the time point. In the equation, $\theta_1(\cdot)$ and $\theta_2(\cdot)$ are kernel smooth functions, and $Z_{1ij}$ and $Z_{2ij}$ denote the covariates in the non-parametric part.

In this simulation, data are generated with the following set-up:

• Each run with 100 subjects, each subject with four time points and 200 replicates.

• The first setup is in the first non-parametric term and in the second non-parametric term; the second setup is in the first non-parametric term and in the second non-parametric term.

and are all scalars and time-varying covariates with and . where and are independent to each other and follow uniform distribution

We use the same setting for the working correlation matrix and parametric models as the last part: semi-WI estimator with independent working correlation matrix and semi-True estimator with true working correlation matrix will be used in estimating non-parametric part. Estimators based on the following three parametric models and one generalized additive model will also be used in this simulation:

As in Section 3.1, this simulation focuses on the estimation of $\beta$ and on the overall fit of the different estimators. The mean and standard deviation of the estimated $\beta$ are reported. We first show the results of the local polynomial kernel GEE estimator with two kernel smoothers for semi-parametric regression, together with the other estimators for parametric regression, under various scenarios such as different kernel densities, correlation entries, and time periods, as discussed in the last paragraph.

Table 4 shows estimates in the semi-parametric model (3.2) and the parametric models (para4–para6). With the Gaussian kernel density, the standard errors of the semi-parametric estimators are smaller than those of the estimators from the three parametric models (para4–para6) by a factor of at least three. For the overall fit, the mean square errors of the parametric estimators are higher than those of the semi-parametric estimators in both the training dataset and the test dataset. Again, the results in this table show that semi-parametric estimators benefit more, relative to parametric estimators in the same scenarios, when the correlation is stronger and the pattern in the non-parametric part is more complicated. Compared with the models with one kernel smoother, the MSE of the parametric models (para4–para6) increased by a factor of at least four, whereas the MSE of the profile kernel GEE model (semi-WI and semi-True) increased by a factor of two, indicating that the profile kernel GEE estimator is more robust in terms of MSE than the parametric models.

4. Application

Credit card loan data are a major type of financial data owned by banks and other financial institutions and play an important role for longitudinal data analysis as we discussed in the introduction: for each subject, which is the customer, we have records of monthly payment history for multiple time points. The semi-parametric models and the GEE method can be applied to this dataset and first we give a detailed description of a credit card loan dataset. Our main purpose for this application is to investigate which factors will influence the customer's payment status by using different approaches and to explore the difference between parametric estimators and semi-parametric estimators.

4.1. Description of the Dataset (Statistics and Data Analysis)

The dataset used in this application comes from the UCI (University of California, Irvine) Machine Learning Repository 10, with 30000 subjects and eight variables. A basic summary of those variables is as follows:

1 Bill amount: Amount to be paid by each customer for the current month, with minimum -339603 and maximum 1664089. A negative number shows that there are credits from the last month.

2 Payment amount: Amount the customer paid for the current month, with minimum 0 and maximum 1684259.

3 PAY: A categorical variable with values from -2 to 8 (11 categories), denoting how many periods the customer's payment was delayed. A negative number shows that the payment was made before the due day.

4 LIMIT BAL: Limit amount for each customer, with minimum 10000 and maximum 1000000.

5 SEX: With 1 denoting male and 2 denoting female.

6 EDUCATION: Education level for each customer: 1 denotes graduate school; 2 denotes university; 3 denotes high school; 4, 5, 6 denotes others.

7 MARRIAGE: Marital status: 1 denotes married; 2 denotes single; 3 denotes others.

From the eight variables, two response variables can be constructed to address our main concerns. The first, called remaining amount, is the difference between bill amount and payment amount and shows whether the customer made the full payment. The second is a delay indicator derived from PAY: PAY = 1 denotes a delay, regardless of its length, and PAY = 0 denotes no delay, meaning that the payment was made on or before the due day. The remaining five variables serve as predictors: gender, education, marriage, age, and limit balance.
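The construction of the two response variables can be sketched as below. The record values are made up for illustration, and treating any positive PAY code as a delay is an assumption based on the description above; the dataset's own coding should be checked before use.

```python
def build_responses(bill_amount, payment_amount, pay_code):
    """Return (remaining amount, delay indicator) for one customer-month.

    Assumption: a positive PAY code means the payment was delayed;
    zero or negative codes mean it was made on or before the due day.
    """
    remaining = bill_amount - payment_amount
    delayed = 1 if pay_code > 0 else 0
    return remaining, delayed

# Hypothetical records: (bill amount, payment amount, original PAY code).
records = [(3913, 0, 2), (2682, 1000, -1), (1500, 1500, 0)]
responses = [build_responses(*rec) for rec in records]
```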

A preliminary parametric GEE regression is the first step in analyzing the credit card loan data. For example, after fitting a linear GEE regression of the response variable remaining amount on the five predictors discussed in the last paragraph, we find that four predictors are statistically significant with p-values less than 0.05, while the variable age is not. In our semi-parametric models, the four significant predictors are used in the parametric part, while the variable age is treated as a non-parametric covariate. Semi-parametric models are estimated with different working correlation matrices, and their results are compared with the results from parametric models.

4.2. Results and Discussion: Overall Analysis
4.2.1. Using Remaining Amount as Response Variable

In this part, the remaining amount defined in Section 4.1 is used as the response variable to explore the relationship between the amount of owed payments and predictors such as gender, education level, limit balance, marital status, and age. The following two parametric GEE models will be fitted:

and we consider a semi-parametric model with non-parametric form on the predictor age:

where is a kernel smoother.

Table 5 shows the estimation results for the first parametric model (Para1) and the second parametric model (Para2) using different working correlation matrices. Based on the signs of the estimated coefficients, female consumers have a smaller remaining amount than male consumers. Relative to consumers with graduate degrees, customers with only college degrees or high school degrees have a larger remaining amount. Relative to married customers, single customers tend to have a larger remaining amount. The predictor limit balance has small positive coefficients, indicating a positive association between limit balance and remaining amount in both parametric model setups. Age has a p-value larger than 0.05, so it is not statistically significant in either parametric model.

Different working correlation matrices, namely independence, exchangeable, and AR(1), are used in the parametric models. The parameters estimated under these three working correlation matrices are quite similar across the different assumptions about the association between time periods.

Since age is not significant in parametric forms such as linear and quadratic terms, we consider semi-parametric models with a kernel smoother on the predictor age. We investigate how the estimated coefficients of the other predictors, which are fitted linearly, change and whether the semi-parametric models perform better than the purely parametric models.

The semi-parametric model with a kernel smoother on the predictor age is estimated using different working correlation matrices. The estimated coefficients are similar to those in the parametric models. Based on the signs of the estimated coefficients, female consumers have a smaller remaining amount than male consumers. Relative to consumers with graduate degrees, customers with only college degrees or high school degrees have a larger remaining amount. Relative to married customers, the semi-parametric model shows that single customers tend to have a larger remaining amount, and the coefficient for the predictor single (0.086) is higher than the coefficient in the parametric models (0.048 in Para3). The predictor limit balance has a coefficient of 0.002, indicating a positive association between limit balance and remaining amount.


4.2.2. Using Payment Status as Response Variable

In this part, the payment status defined in Section 4.1, which indicates whether the client has defaulted, is used as the response variable. We explore the relationship between whether the customer defaults on the bills and predictors such as gender, education level, limit balance, marital status, and age. We consider the following parametric GEE model with a linear form:

where is the probability of default.
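For a binary response, the logit link maps a linear predictor to a probability. The sketch below shows that mapping; the coefficient values and covariate names are hypothetical and are not the estimates reported in the tables.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: converts a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def default_probability(coefs, covariates):
    """Probability implied by a logit-link model for one set of covariates."""
    eta = coefs["intercept"] + sum(coefs[name] * value
                                   for name, value in covariates.items())
    return inv_logit(eta)

# Hypothetical coefficients, for illustration only.
coefs = {"intercept": -1.0, "female": -0.2, "single": 0.1}
p_male = default_probability(coefs, {"female": 0, "single": 1})
p_female = default_probability(coefs, {"female": 1, "single": 1})
```

Under this link, a negative coefficient lowers the probability, and its exponential is the corresponding odds ratio.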

Table 6 shows the estimation results for the parametric model (Para4) using different working correlation matrices. Based on the signs of the estimated coefficients, female consumers have a lower probability of default than male consumers. Relative to consumers with graduate degrees, customers with college degrees or high school degrees have a higher probability of default. Relative to married customers, single customers tend to have a higher probability of default. The predictor limit balance has a negative coefficient, indicating a negative association between limit balance and the probability of default, and age is also negatively associated with the probability of default. Independence, exchangeable, and AR(1) working correlation matrices are used in Para4, and the parameters estimated under the three are quite similar.

4.3. Results and Discussion: Gender Analysis

In this section, we evaluate the difference between the models for male customers and female customers. Following the overall analysis in Section 4.2, three parametric models and one semi-parametric model are fitted, with two different response variables: remaining amount and payment status. Estimated coefficients are reported for all models in order to explore the differences among the fitted models for each gender. When remaining amount is the response variable, we use the mean square error to compare parametric and semi-parametric models.


4.3.1. Using Remaining Amount as Response Variable

Model Setups:

The remaining amount defined in Section 4.1 is used as the response variable to explore the relationship between the amount of owed payments and predictors such as education level, limit balance, marital status, and age for male and female customers. As in the overall analysis in Section 4.2, the following three parametric GEE models will be fitted for male and female customers separately:

and we consider a semi-parametric model with non-parametric form on the predictor age:

where is a kernel smoother, and are indicator variables, defined as follows:

Table 7 shows the estimation results for the first parametric model (Para1) using different working correlation matrices for male and female customers separately. Unlike in the overall analysis, different working correlation matrices identify different significant variables. For male customers, the variables single and age are significant under the independence working correlation matrix but not under the exchangeable or AR(1) working correlation matrix; for female customers, age fails to be significant only under the independence working correlation matrix. Relative to those with graduate degrees, male and female customers with only college or high school degrees have a larger remaining amount. Furthermore, male customers with a high school degree have a slightly larger remaining amount than those with a college degree under the independence, exchangeable, and AR(1) working correlation matrices. In contrast, female customers with a high school degree have a smaller remaining amount than those with a college degree under all three types of working correlation matrices.

Table 8 shows the estimation results for the second parametric model (Para2), which includes a quadratic form on age, using different working correlation matrices for male and female customers separately. Again, unlike in the overall analysis, different working correlation matrices identify different significant variables: for male customers, the quadratic form on age is significant only under the independence working correlation matrix; for female customers, the quadratic term is significant under the exchangeable or AR(1) working correlation matrix. Although the quadratic form of age is significant in these cases, it has only a tiny impact on the response variable remaining amount because its estimated coefficient is nearly zero. For both male and female customers, those with only college or high school degrees again have a larger remaining amount than those with graduate degrees. The marriage factor single is significant for male customers under the independence working correlation matrix, but it is not significant for female customers under any working correlation matrix.

The results from the two parametric models show that age may not be significant under some working correlation matrices, or may have only a tiny effect when it enters through a parametric form such as a quadratic term. We therefore consider semi-parametric models with a kernel smoother on the predictor age, and investigate whether they perform better than purely parametric models for male or female customers.
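The idea of placing a kernel smoother on age while keeping the other coefficients parametric can be illustrated with a simplified profile-type estimator. This sketch assumes a single parametric covariate, a Gaussian kernel, a fixed bandwidth, and working independence, so it ignores the within-subject correlation that the profile kernel and SUR kernel estimators account for; all names and data below are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_smooth(ages, values, age0, bandwidth):
    """Nadaraya-Watson estimate of E[value | age = age0]."""
    weights = [gaussian_kernel((a - age0) / bandwidth) for a in ages]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def profile_slope(ages, x, y, bandwidth):
    """Estimate beta in y = beta * x + theta(age) + error.

    Both x and y are smoothed on age, and beta is the least-squares slope
    between the two sets of partial residuals (working independence only).
    """
    x_res = [xi - kernel_smooth(ages, x, a, bandwidth) for xi, a in zip(x, ages)]
    y_res = [yi - kernel_smooth(ages, y, a, bandwidth) for yi, a in zip(y, ages)]
    return sum(u * v for u, v in zip(x_res, y_res)) / sum(u * u for u in x_res)


# Toy data (hypothetical): true slope 2 and a flat age effect theta(age) = 10.
ages = [25, 30, 35, 40, 45, 50]
x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
y = [2.0 * xi + 10.0 for xi in x]
beta_hat = profile_slope(ages, x, y, bandwidth=5.0)
```

Because the age effect is removed from both the response and the covariate before the slope is computed, the parametric coefficient is estimated without committing to a linear or quadratic form for age.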

Table 9 shows the estimation results for the semi-parametric model (Semi) with a kernel smoother on the predictor age using different working correlation matrices. The estimated coefficients are similar to those from the parametric models. Based on the signs of the estimated coefficients, every predictor has the same direction of effect for male and female customers. Relative to consumers with graduate degrees, male and female customers with only college or high school degrees have a larger remaining amount. Relative to married customers, the semi-parametric model shows that single customers, whether male or female, tend to have a larger remaining amount.


4.3.2. Using Payment Status as Response Variable

In this part, the payment status defined in Section 4.1 is used as the response variable to evaluate whether male and female customers differ in their tendency to default on their bills. Predictors such as education level, limit balance, marital status, and age are used in our models. In particular, we investigate whether male and female customers should be fitted with different models.

We first consider three parametric GEE models with a linear form of age for male and female customers, as follows:

where is the probability of default.

Table 10 shows the estimation results for the parametric model (Para4) using different working correlation matrices. For both male and female consumers, customers with college or high school degrees are more likely to default than those with graduate degrees. For male customers, single customers tend to be more likely to default than married customers. The predictor limit balance has a negative coefficient for both male and female customers, which indicates that a higher limit balance is associated with a lower probability of default. For male customers, age is not significant under any of the working correlation matrices, whereas for female customers age is significant, with a negative coefficient, under all of them. Independence, exchangeable, and AR(1) working correlation matrices are used in Para4 for both male and female customers, and the estimated coefficients under the three structures are quite similar.
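The link between the sign of a coefficient and the probability of default can be illustrated with a short sketch. It assumes a logit link for the binary payment status; the baseline linear predictor and the coefficient value are made up for illustration and are not taken from Table 10.

```python
import math


def default_probability(linear_predictor):
    """Inverse logit: maps a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def odds_ratio(coefficient):
    """Multiplicative change in the odds of default per unit of the predictor."""
    return math.exp(coefficient)


# Hypothetical values: baseline linear predictor and a negative coefficient.
baseline = -1.0
coef_limit_balance = -0.3

p_before = default_probability(baseline)
p_after = default_probability(baseline + coef_limit_balance)  # one unit higher
ratio = odds_ratio(coef_limit_balance)
```

A negative coefficient gives an odds ratio below one, so the probability of default falls as the predictor increases, which is how the signs in the table are read.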

We applied the same approach to the Education analysis, using two parametric models and one semi-parametric model to detect differences between education levels: customers with a high school degree versus customers with a university or graduate degree. When remaining balance is the response variable, the results from the parametric models show that age may be significant or may have only a tiny effect when it enters through a parametric form. When we apply the semi-parametric models with a kernel smoother on the predictor age to investigate whether customers with different education levels differ, the estimated coefficients are broadly similar to those from the parametric models. Based on the signs of the estimated coefficients, female consumers have a smaller remaining amount than male consumers. The predictor limit balance has a positive coefficient, which indicates that a higher limit balance is associated with a larger remaining amount. Except for single, all predictors are significant in all models under any of the three working correlation matrices. When payment status is the response variable, we used only the parametric model and found that, in all models, female customers tend to be less likely to default, as indicated by the negative coefficients. The predictor limit balance has a negative coefficient, which indicates that a higher limit balance is associated with a lower probability of default. Age is not significant for customers with high school degrees but is significant for customers with university or graduate degrees. The variable single is not significant in any model under any of the working correlation matrices.

We also applied this approach to the Marriage analysis, using two parametric models and one semi-parametric model to detect differences between marital statuses: single customers versus married customers. When remaining balance is the response variable, the results from the two parametric models show that age may not be significant under some working correlation matrices, or may have only a tiny effect when it enters through a parametric form such as a quadratic term. When we apply the semi-parametric models with a kernel smoother on the predictor age to investigate whether customers with different marital statuses differ, the estimated coefficients are again broadly similar to those from the parametric models. Based on the signs of the estimated coefficients, limit balance, university, and graduate are significant and positively correlated with remaining amount, while the variable female is significant and negatively correlated with remaining amount, for both single and married customers. When payment status is the response variable, we again used only the parametric model and found that the results are very similar to those of the Education analysis.

5. Conclusion and Discussion

In summary, the simulation results show that semi-parametric estimators are more robust, with smaller standard errors, than parametric estimators. For overall fit, semi-parametric models have smaller mean square errors. Furthermore, the advantage of semi-parametric estimators over parametric estimators in the same scenarios grows with stronger correlation, a longer time period, and a more complicated pattern in the non-parametric part.

In the application part, we ran the analysis on credit card loan data, and the results show that the parametric estimates display a clear pattern when some features are modeled with kernel smoothers. We recommend fitting a parametric model first, identifying the non-significant features, and then modeling them with kernel smoothers. By modeling with a semi-parametric structure, we find different behavior for customers of different gender, education level, and marital status.
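The recommended workflow of screening a parametric fit for non-significant features can be sketched with a two-sided Wald test. The sketch assumes that coefficient estimates and (robust) standard errors are available from a fitted parametric GEE model; the numbers, the 5% level, and the function names are hypothetical.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0 under a normal approximation."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


def smoother_candidates(fit, continuous, level=0.05):
    """Continuous predictors whose linear term is not significant at `level`."""
    return [name for name in continuous if wald_p_value(*fit[name]) > level]


# Hypothetical (estimate, standard error) pairs from a parametric fit.
fit = {
    "limit_balance": (0.80, 0.10),
    "age": (0.02, 0.05),
    "single": (0.30, 0.25),
}
candidates = smoother_candidates(fit, continuous=["limit_balance", "age"])
```

Only continuous predictors are screened here, since a kernel smoother is not meaningful for an indicator variable such as single.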

In a semi-parametric GEE study, estimating the working correlation matrix is critical when using data from the real world. One challenge in the application part is the availability of data. Most financial datasets used in longitudinal studies contain individual-level records, and sharing them would violate the privacy policies of most institutions in the United States. Our dataset, which comes from the UCI website, is based on credit information in Taiwan. In the future, we would like to apply our model to other available credit loan datasets or other types of financial datasets in the United States.

Another challenge arises in the application of semi-parametric models. Based on evaluation metrics such as mean square error and predictive accuracy, we observed that the advantage of the semi-parametric model with a kernel smoother is not large. We would like to apply semi-parametric models to other longitudinal financial datasets to investigate whether our semi-parametric models with a kernel smoother outperform other types of parametric models on such data.

The third challenge comes from the design of the semi-parametric approach. When choosing the non-parametric term, most applications to biological datasets rely on prior experience. In our approach, we used age as the non-parametric term because it is not significant under several parametric GEE approaches. We would like to try other continuous variables and develop a robust approach for identifying which variable should be used as the non-parametric term.

The last challenge is a traditional issue for the GEE approach: estimating the working correlation matrix. In a semi-parametric GEE study, estimating the working correlation matrix is critical but more difficult than in the parametric GEE approach. Fan, Huang, and Li [11] proposed an estimation procedure that uses a profile weighted least squares approach to estimate the working correlation matrix. We would like to try this approach in the future to investigate whether it provides more efficient semi-parametric estimators with a fully specified working correlation matrix when applied to financial datasets.
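For orientation, the simplest way to estimate a working correlation parameter is a moment estimator based on residuals. The sketch below estimates a single exchangeable correlation from standardized residuals and omits degrees-of-freedom corrections; it is not the profile weighted least squares procedure of [11], and the function name and data are hypothetical.

```python
def exchangeable_alpha(residuals):
    """Moment estimate of a common within-subject correlation.

    `residuals` holds one list of standardized residuals per subject.
    The average cross-product over all within-subject pairs is divided by
    the average squared residual (the dispersion estimate).
    """
    n_obs = sum(len(r) for r in residuals)
    phi = sum(e * e for r in residuals for e in r) / n_obs
    pair_sum = 0.0
    n_pairs = 0
    for r in residuals:
        for j in range(len(r)):
            for k in range(j + 1, len(r)):
                pair_sum += r[j] * r[k]
                n_pairs += 1
    return pair_sum / (n_pairs * phi)


# Hypothetical residuals: each customer's residuals share the same sign.
residuals = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
alpha_hat = exchangeable_alpha(residuals)
```

In practice such an estimate is updated iteratively together with the regression coefficients, which is where the additional difficulty of the semi-parametric setting arises.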

Copyrights

Copyright for this article is retained by the author(s), with first publication rights granted to the journal. This is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license.

References

[1]  Petersen, M.A. (2009), Estimating standard errors in finance panel data sets: Comparing approaches, The Review of Financial Studies, 22.1, 435-480.
 
[2]  Fama, E. and MacBeth, J. (1973), Risk, return, and equilibrium: Empirical tests, The Journal of Political Economy, 81.3, 607-636.
 
[3]  Sam, A.G. and Jiang, G. (2009), Nonparametric estimation of the short rate diffusion from a panel of yields, Journal of Financial and Quantitative Analysis (JFQA), Vol. 44, No. 5.
 
[4]  Lin, X., and Carroll, R. J. (2000), Nonparametric function estimation for clustered data when the predictor is measured without/with error, Journal of the American Statistical Association, 95, 520-534.
 
[5]  Zeger, S.L. and Liang, K.Y. (1986), Longitudinal data analysis for discrete and continuous outcomes, Biometrics, 42, 121-130.
 
[6]  Wang, N. (2003), Marginal nonparametric kernel regression accounting for within-subject correlation, Biometrika, 90, 43-52.
 
[7]  Lin, X. and Carroll, R.J. (2001), Semiparametric regression for clustered data, Biometrika, 88.4, 1179-1185.
 
[8]  Wang, N., Carroll, R.J., and Lin, X. (2005), Efficient semiparametric marginal estimation for longitudinal/clustered data, Journal of the American Statistical Association, 100, 147-157.
 
[9]  Fan, J. and Li, R. (2004), New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis, Journal of the American Statistical Association, 99, 710-723.
 
[10]  Yeh, I. C. and Lien, C. H. (2009), UCI Machine Learning Repository [https://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
 
[11]  Fan, J., Huang, T., and Li, R. (2007), Analysis of longitudinal data with semiparametric estimation of covariance function, Journal of the American Statistical Association, 102.478, 632-641.
 

Published with license by Science and Education Publishing, Copyright © 2021 Liu Yang and Xu-Feng Niu

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

Cite this article:

Normal Style
Liu Yang, Xu-Feng Niu. Semi-Parametric Models for Longitudinal Data Analysis. Journal of Finance and Economics. Vol. 9, No. 3, 2021, pp 93-105. https://pubs.sciepub.com/jfe/9/3/1