Choice of Appropriate Power Transformation of Skewed Distribution for Quantile Regression Model

Onyegbuchulem B.O.; Nwakuya M.T; Nwabueze J.C; Otu Archibong Otu

doi:10.12691/ajams-7-3-4


Research Article

Open Access Peer-reviewed


American Journal of Applied Mathematics and Statistics. 2019, 7(3), 105-111. DOI: 10.12691/ajams-7-3-4

Received January 18, 2019; Revised March 28, 2019; Accepted May 04, 2019

Abstract

Quantile regression (QR) performs better than ordinary least squares (OLS) when the data are skewed, and its best results can be achieved when the data are transformed. The quantreg package of R was used to illustrate the fit of various power transformations for the quantile regression model. The analysis shows that the best result was obtained from the square root of y transformation, with average error terms of 0.9539, -0.0494, 0.0238, -0.5309 and -0.7544 for the 10th, 25th, 50th, 75th and 90th quantiles respectively. These results show that transforming the response can greatly improve the fit of a quantile regression model.

Keywords: quantile regression, skewed distribution, power transformation, model selection

1. Introduction

Conditional-median regression is a special case of quantile regression in which the conditional 50th quantile is modeled as a function of covariates. More generally, other quantiles can be used to describe non-central positions of a distribution; the quantile notion generalizes specific terms like quartile, quintile, decile, and percentile. The pth quantile denotes the value of the dependent variable below which the proportion of the population is p. Thus, quantiles can specify any position of a distribution ⁷.

The first-order quantile regression model was introduced by Koenker and Bassett ⁸. It has the form

$$Q_{y_i}(p \mid x_i) = \beta_0 + \beta_1 x_i + F_u^{-1}(p) \qquad (1)$$

where

$Q_{y_i}(p \mid x_i)$ is the conditional value of the dependent variable given $p$ in the $i$th trial,

$\beta_0$ is the intercept,

$\beta_1$ is a parameter,

$p$ denotes the quantile (e.g., $p = 0.5$ for the median),

$x_i$ is the value of the independent variable in the $i$th trial,

$F_u$ is the common distribution function of the error,

e.g. $F_u^{-1}(0.5)$ is the median or 0.5 quantile.

This model expresses the conditional quantile as a function of covariates. The quantile regression model is therefore a natural extension of the linear regression model. While the linear regression model specifies the change in the conditional mean of the dependent variable associated with a change in the covariates, the quantile regression model specifies changes in the conditional quantile. Since any quantile can be used, it is possible to model any predetermined position of the distribution. Thus, researchers can choose positions that are tailored to their specific inquiries.

However, the expected error term of quantile regression models, especially the median regression model, which is closely related to linear regression in terms of precision, often does not approximate zero. Reference ¹ showed that the expected error term of a multiple quantile regression can be improved by transforming the response variable using the log transformation. Reference ² uses the relationship between variances and means over several groups to find the transformation of the study data that makes the variance independent of the mean. Reference ² shows that the procedure for determining the appropriate transformation is to determine the coefficient of regression of the natural logarithm of the group standard deviation on the natural logarithm of the group average, and explains that the most popular and common transformations are power transformations. Reference ¹ empirically analysed the monthly earnings distribution of Pakistan using the logarithm transformation, where the log of monthly earnings is taken as the response variable, while education, experience, age, sex, marital status, nature of work, region and province are used as explanatory variables.

Therefore, this study applies the five power transformations stated by ² to the response variable to ascertain the appropriate power transformation for modelling quantile regression in the presence of a skewed distribution. The study aims to investigate the best power transformation of a skewed distribution for the quantile regression model. Specifically, the study will: (i) assess the best transformation fit of the model based on some selected power transformations, (ii) assess the impact of the covariate on the response variable, and (iii) conduct diagnostic tests on the suggested model.
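The group-based procedure attributed to ² above can be sketched in a few lines. This is only an illustration: the function names and the toy groups are hypothetical, and the mapping from the fitted slope b to the power 1 - b is the usual variance-stabilizing rule (with power 0 read as the logarithm), which is assumed here rather than taken from the study.

```python
import math
import statistics

def log_sd_on_log_mean_slope(groups):
    """OLS slope of ln(group SD) regressed on ln(group mean)."""
    xs = [math.log(statistics.mean(g)) for g in groups]
    ys = [math.log(statistics.stdev(g)) for g in groups]
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    return sxy / sxx

def suggested_power(slope):
    """Assumed rule: if SD is proportional to mean**slope, use y**(1 - slope).
    A power of 0 is conventionally read as the log transformation."""
    return 1.0 - slope

# Hypothetical groups: each is a rescaled copy of the same base sample,
# so the SD is exactly proportional to the mean and the slope is 1.
base = [1.0, 2.0, 3.0, 4.0, 6.0]
groups = [[m * v for v in base] for m in (1.0, 2.0, 5.0, 10.0)]

slope = log_sd_on_log_mean_slope(groups)
power = suggested_power(slope)
print(f"slope = {slope:.3f}, suggested power = {power:.3f}")
```

With these toy groups the slope is 1 and the suggested power is 0, i.e. the log transformation; a slope near 0.5 would instead point towards the square root.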

2. Methodology

This paper investigates the best power transformation for the quantile regression model. The data were generated by Monte Carlo simulation from a Weibull distribution, using the shape and scale parameters of data on the annual salaries, income and wages of health workers in Nigeria. The generated data are analysed using a transformed quantile regression model, fitted with the quantreg package of R. The hypotheses are stated as:

$H_0$: The covariate has no significant effect on the response variable

$H_1$: The covariate has a significant effect on the response variable

2.1. Quantile Regression Model

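If we consider the i.i.d. sample $y_1, \dots, y_n$, the unconditional sample mean can be defined as the solution to the problem of minimizing the sum of squared residuals

$$\min_{\mu \in \mathbb{R}} \sum_{i=1}^{n} (y_i - \mu)^2 \qquad (2)$$

Similarly, the sample median is the minimizer of the sum of absolute errors (deviations)

$$\min_{\xi \in \mathbb{R}} \sum_{i=1}^{n} |y_i - \xi| \qquad (3)$$

To see why the median can be defined as the solution of a minimization problem, the expected absolute deviation of a random variable $Y$ with density $f$ and distribution function $F$ can be written as

$$E|Y - \xi| = \int_{-\infty}^{\xi} (\xi - y) f(y)\,dy + \int_{\xi}^{\infty} (y - \xi) f(y)\,dy \qquad (4)$$

Differentiating with respect to ξ and setting the partial derivative to zero leads to the solution of the minimization problem. The partial derivative of the first term is:

$$\frac{\partial}{\partial \xi} \int_{-\infty}^{\xi} (\xi - y) f(y)\,dy = \int_{-\infty}^{\xi} f(y)\,dy = F(\xi)$$

And the partial derivative of the second term is:

$$\frac{\partial}{\partial \xi} \int_{\xi}^{\infty} (y - \xi) f(y)\,dy = -\int_{\xi}^{\infty} f(y)\,dy = -(1 - F(\xi))$$

Combining these two partial derivatives leads to:

$$\frac{\partial}{\partial \xi} E|Y - \xi| = F(\xi) - (1 - F(\xi)) = 2F(\xi) - 1 \qquad (5)$$

By setting $2F(\xi) - 1 = 0$ we solve for the value of $F(\xi) = 1/2$, that is, the median, to satisfy the minimization problem.

The general $p$th sample quantile $\xi(p)$, which is the analogue of the population quantile $Q(p)$, may be formulated as the solution of the optimization problem

$$\min_{\xi \in \mathbb{R}} \sum_{i=1}^{n} \rho_p(y_i - \xi), \qquad \rho_p(u) = u\,(p - I(u < 0)) \qquad (6)$$

Repeating the above argument for quantiles, the partial derivative corresponding to (6) is

$$\frac{\partial}{\partial \xi} E[\rho_p(Y - \xi)] = (1 - p)F(\xi) - p(1 - F(\xi)) = F(\xi) - p \qquad (7)$$

We set the partial derivative $F(\xi) - p = 0$ and solve for the value of $F(\xi) = p$ that satisfies the minimization problem. Equation (7) is illustrated thus: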

Figure 1. Quantile Regression

Just as the unconditional sample mean in (2) minimizes the sum of squared residuals (error loss), the conditional sample mean also minimizes a sum of squared residuals: by replacing the scalar $\mu$ with a parametric function $\mu(x_i, \beta)$, an estimate of the conditional mean function $E(Y \mid x)$ is obtained from

$$\min_{\beta} \sum_{i=1}^{n} (y_i - \mu(x_i, \beta))^2 \qquad (8)$$

Quantile regression proceeds in the same way. To obtain an estimate of the conditional median function, the scalar ξ in (3) is replaced by the parametric function $\xi(x_i, \beta)$:

$$\min_{\beta} \sum_{i=1}^{n} |y_i - \xi(x_i, \beta)| \qquad (9)$$

To obtain estimates of the other conditional quantile functions, the conditional $p$th quantile is considered and the absolute value is replaced by the check function $\rho_p(u) = u\,(p - I(u < 0))$:

$$\min_{\beta} \sum_{i=1}^{n} \rho_p(y_i - \xi(x_i, \beta)) \qquad (10)$$

Minimizing (10) results in a quantile regression model. When $\xi(x, \beta)$ is formulated as a linear function of the parameters, the resulting minimization problem can be solved very efficiently by linear programming methods. The progression of ideas that led to (10) motivated the original quantile regression model presented in ⁸.
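For a single covariate, the minimization can be illustrated without a linear programming solver. The sketch below is not the algorithm used by quantreg; it relies on the linear-programming fact that, when the covariate values are not all equal, an optimal line can always be found among the lines passing through two data points, and simply tries every such pair. The data, seed and function names are hypothetical, and the brute-force search is only practical for small samples.

```python
import itertools
import random

def check_loss(u, p):
    """Check loss: p * u for u >= 0 and (p - 1) * u for u < 0."""
    return u * (p - (1.0 if u < 0 else 0.0))

def objective(x, y, b0, b1, p):
    """Total check loss of the residuals y_i - b0 - b1 * x_i."""
    return sum(check_loss(yi - b0 - b1 * xi, p) for xi, yi in zip(x, y))

def fit_quantile_line(x, y, p):
    """Brute-force fit: evaluate every line through two data points."""
    best = None
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a regression line
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        loss = objective(x, y, b0, b1, p)
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]

# Hypothetical data: a linear trend plus normal noise
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(40)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

fits = {p: fit_quantile_line(x, y, p) for p in (0.1, 0.5, 0.9)}
for p, (b0, b1) in fits.items():
    print(f"p = {p}: intercept = {b0:.3f}, slope = {b1:.3f}")
```

A useful check on any such fit is that roughly a proportion p of the residuals is negative; for an exact minimizer with an intercept, the number of negative residuals is at most np, and np is at most the number of negative plus zero residuals.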

2.2. Model Specification

Following ⁸ and ⁷, our proposed model takes the form:

$$y_i^{*} = \beta_0^{(p)} + \beta_1^{(p)} x_i + \varepsilon_i^{(p)}, \qquad i = 1, \dots, n \qquad (11)$$

where

$y_i^{*}$ = transformed response variable containing n observations simulated from the parameters of data on health workers' allowances,

$\beta_0^{(p)}$ = intercept,

$\beta_1^{(p)}$ = unknown parameter,

$\varepsilon_i^{(p)}$ = a classical error term,

$p$ = specified quantile of the simulated data. This research examines the following quantiles: 0.1, 0.25, 0.5, 0.75, 0.9,

$x_i$ = the covariate.

One covariate was used.

2.3. Goodness of Fit of QRM

An analog of the $R^2$ statistic can be readily developed for quantile regression models. Reference ⁵ stated that, since linear regression model fits are based on the least squares criterion while quantile regression fits are based on minimizing a sum of weighted distances, with different weights used depending on whether $y_i > \hat{y}_i$ or $y_i < \hat{y}_i$, goodness of fit should be measured in a manner that is consistent with this criterion. Reference ¹⁰ suggested measuring goodness of fit by comparing the sum of weighted distances for the model of interest with the sum in which only the intercept parameter appears.

Let $V^1(p)$ be the sum of weighted distances for the full $p$th quantile regression model, and $V^0(p)$ be the sum of weighted distances for the model that includes only a constant term. Then, using the one-covariate model with fitted values $\hat{y}_i = \hat\beta_0^{(p)} + \hat\beta_1^{(p)} x_i$,

$$V^1(p) = \sum_{y_i \ge \hat{y}_i} p\,|y_i - \hat{y}_i| + \sum_{y_i < \hat{y}_i} (1 - p)\,|y_i - \hat{y}_i| \qquad (12)$$

For the model that only includes a constant term, the fitted constant is the sample $p$th quantile $\hat{Q}^{(p)}$ of the sample $y_1, \dots, y_n$, so $V^0(p)$ is obtained from (12) with $\hat{y}_i = \hat{Q}^{(p)}$. The goodness of fit is then defined as

$$R(p) = 1 - \frac{V^1(p)}{V^0(p)} \qquad (13)$$

Since $V^0(p)$ and $V^1(p)$ are nonnegative, R(p) is at most 1. Also, because the sum of weighted distances is minimized for the full fitted model, $V^1(p)$ is never greater than $V^0(p)$, so R(p) is greater than or equal to zero. Thus R(p) lies within the range [0, 1], and a larger R(p) indicates a better model fit. The R(p) defined above allows a fitted model with any number of covariates beyond the intercept term to be compared with the model in which only the intercept term is present. This is a restricted form of the goodness-of-fit measure introduced by Koenker and Machado (1999) ¹⁰ for nested models.
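The goodness-of-fit measure R(p) described above can be computed directly from the responses and the fitted values. The following sketch uses hypothetical data and hand-picked fitted values rather than an actual quantile regression fit, so it only illustrates the arithmetic; the guarantee that R(p) is nonnegative applies when the fitted values come from the minimizing model.

```python
def weighted_distance(y, fitted, p):
    """Sum of weighted absolute distances: weight p when y_i >= fitted_i,
    weight 1 - p when y_i < fitted_i."""
    total = 0.0
    for yi, fi in zip(y, fitted):
        r = yi - fi
        total += p * r if r >= 0 else (p - 1.0) * r
    return total

def sample_quantile(y, p):
    """An observed value minimizing the weighted distance (a p-th sample quantile)."""
    return min(sorted(y), key=lambda q: weighted_distance(y, [q] * len(y), p))

def goodness_of_fit(y, fitted, p):
    """R(p) = 1 - V1(p) / V0(p), with V0 from the intercept-only model."""
    v1 = weighted_distance(y, fitted, p)
    q = sample_quantile(y, p)
    v0 = weighted_distance(y, [q] * len(y), p)
    return 1.0 - v1 / v0

# Hypothetical responses and hypothetical fitted median values
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0]
fitted_good = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

r_good = goodness_of_fit(y, fitted_good, 0.5)
r_perfect = goodness_of_fit(y, y, 0.5)
r_null = goodness_of_fit(y, [sample_quantile(y, 0.5)] * len(y), 0.5)
print(round(r_good, 4), r_perfect, r_null)
```

A perfect fit gives R(p) = 1, while fitted values equal to the sample quantile itself give R(p) = 0, the two ends of the range.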

3. Data Simulation

For the study data, the fitted Weibull distribution was found to be positively (right) skewed, as shown in Figure 2, with shape and scale parameters 1.78292 and 122560 respectively. To simulate the data, the quantile function of the Weibull distribution, which is simply the inverse of its CDF, was derived by equating the CDF to F(y) and solving theoretically for y.

Figure 2. Graph of Original Data Simulated with Weibull Distribution

The probability density function of the Weibull distribution is given as

$$f(y) = \frac{k}{\lambda} \left(\frac{y}{\lambda}\right)^{k-1} \exp\left[-\left(\frac{y}{\lambda}\right)^{k}\right], \qquad y \ge 0 \qquad (14)$$

Let the cumulative distribution function (CDF) of the Weibull distribution be denoted by

$$F(y) = 1 - \exp\left[-\left(\frac{y}{\lambda}\right)^{k}\right] \qquad (15)$$

Solving (15) for y gives the quantile function:

$$y = \lambda \left[-\ln(1 - F(y))\right]^{1/k} \qquad (16)$$

where

$\lambda$ = scale parameter

$k$ = shape parameter

Following the derivation of the quantile function, Monte Carlo simulation is applied to the derived function using both the shape parameter (1.78292) and the scale parameter (122560) to generate a sample of size 3000 for the response variable, while the explanatory variable is simulated using the normal distribution with the shape and scale parameters of 11.27641 and 322999 respectively.
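The simulation step for the response can be sketched as inverse-transform sampling: uniform draws on [0, 1) are passed through the Weibull quantile function. The shape, scale and sample size below are the values quoted in the text; the seed and function names are arbitrary choices for this illustration, and the study itself was carried out in R, not Python.

```python
import math
import random

SHAPE = 1.78292    # shape parameter quoted in the text
SCALE = 122560.0   # scale parameter quoted in the text

def weibull_quantile(u, shape, scale):
    """Inverse of the Weibull CDF: scale * (-ln(1 - u)) ** (1 / shape)."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)

def simulate_weibull(n, shape, scale, seed=0):
    """Inverse-transform sampling with uniform draws on [0, 1)."""
    rng = random.Random(seed)
    return [weibull_quantile(rng.random(), shape, scale) for _ in range(n)]

y = simulate_weibull(3000, SHAPE, SCALE, seed=2019)  # arbitrary seed

theoretical_median = weibull_quantile(0.5, SHAPE, SCALE)
sample_median = sorted(y)[len(y) // 2]
print(f"theoretical median = {theoretical_median:.0f}, "
      f"sample median = {sample_median:.0f}")
```

Comparing the sample median with the theoretical median gives a quick sanity check that the simulated values follow the intended distribution.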

3.1. Choice of Data Transformation

Figure 3. Histograms and the Normality plots

The choice of an appropriate data transformation is based on the histograms of the transformed data, the normality plots, the skewness and kurtosis of the transformed values, a comparison of the mean and the median of the transformed data, and an assessment of the expected error terms of the quantile regression model for the various transformed series. The following transformations were therefore considered; they are presented in equations (17) to (21), and their graphs are shown in Figure 3.

i. Log transformation:

$$y^{*} = \ln(y) \qquad (17)$$

ii. Square root of y transformation:

$$y^{*} = \sqrt{y} \qquad (18)$$

iii. Inverse of square root of y transformation:

$$y^{*} = \frac{1}{\sqrt{y}} \qquad (19)$$

iv. Inverse of y transformation:

$$y^{*} = \frac{1}{y} \qquad (20)$$

v. y-squared transformation:

$$y^{*} = y^{2} \qquad (21)$$
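The comparison of the five transformations by mean, median, skewness and kurtosis can be sketched as follows. This is an illustration on freshly simulated Weibull data with the shape and scale quoted in the text, not a reproduction of Table 1: the seed is arbitrary, the log is taken as the natural logarithm, and the moment-based skewness and kurtosis formulas (kurtosis equal to 3 for a normal distribution) are assumed conventions.

```python
import math
import random
import statistics

def skewness(v):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    m = sum(v) / len(v)
    s2 = sum((a - m) ** 2 for a in v) / len(v)
    return sum((a - m) ** 3 for a in v) / len(v) / s2 ** 1.5

def kurtosis(v):
    """Moment-based kurtosis (equal to 3 for a normal distribution)."""
    m = sum(v) / len(v)
    s2 = sum((a - m) ** 2 for a in v) / len(v)
    return sum((a - m) ** 4 for a in v) / len(v) / s2 ** 2

TRANSFORMS = {
    "log": math.log,
    "sqrt": math.sqrt,
    "inverse sqrt": lambda t: 1.0 / math.sqrt(t),
    "inverse": lambda t: 1.0 / t,
    "square": lambda t: t ** 2,
}

rng = random.Random(2019)  # arbitrary seed
# random.weibullvariate takes the scale first, then the shape
y = [rng.weibullvariate(122560.0, 1.78292) for _ in range(3000)]

summary = {}
for name, g in TRANSFORMS.items():
    z = [g(v) for v in y]
    summary[name] = {
        "mean": statistics.mean(z),
        "median": statistics.median(z),
        "skewness": skewness(z),
        "kurtosis": kurtosis(z),
    }

for name, s in summary.items():
    print(f"{name:>12}: skewness = {s['skewness']:.3f}, "
          f"kurtosis = {s['kurtosis']:.3f}")
```

A transformation whose skewness is close to 0, whose kurtosis is close to 3 and whose mean is close to its median is the one that brings the data closest to symmetry.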

3.2. Results and Discussion

The histogram of the log of y transformation and the plot of the estimated log of y against the actual log of y show that the data were not adequately transformed. The histogram of the log of y is negatively (left) skewed, and the plot of the estimated log of y against the actual log of y is partially curved at the centre, showing that the estimate is not in full agreement with the actual log of y. Table 1 also shows that the mean and the median are equal, but the skewness is less than 0 while the kurtosis is greater than 3, which means that the transformed data are not normally distributed. Table 2 shows that, although the expected error term of the 50th quantile is approximately zero, that of the 25th quantile is less than that of the 50th quantile, which means that the result may be spurious and cannot be relied upon.

The histogram of the inverse of square root of y transformation and the plot of the estimated against the actual inverse of square root of y show that the data were over-transformed. The histogram of the inverse of square root of y is positively (right) skewed, while the plot of the estimated against the actual inverse of square root of y shows a vertical straight line, meaning that the estimate is not in agreement with the actual inverse of square root of y. Table 1 shows that, while the median is approximately equal to the mean, the skewness is greater than zero and the kurtosis is far greater than three. Table 2 shows that, while the expected error term of the 50th quantile is approximately zero, those of the 10th, 25th, 75th and 90th quantiles are all less than that of the 50th quantile, which suggests that the estimates may be spurious.

The histogram of the inverse of y transformation and the plot of the estimated against the actual inverse of y show that the data were not adequately transformed. The histogram of the inverse of y is positively (right) skewed, with all the observations grouped into one bar. The plot of the estimated against the actual inverse of y shows a vertical straight line, meaning that the estimate is not in agreement with the actual inverse of y. Table 1 shows that the mean is not equal to the median, the skewness is slightly greater than zero and the kurtosis is far greater than three. Table 2 shows that the 10th, 25th, 50th, 75th and 90th quantiles all have an expected error term of zero, which means that the estimates made at any of the quantiles will be the same, making the quantile regression model uninformative.

Table 1. Test of Normality of the Transformed Data

Table 2. Expected Error Term of the Model

The histogram of the y-squared transformation and the plot of the estimated against the actual y-squared show that the data were not adequately transformed. The histogram of the y-squared transformation is positively (right) skewed, and the plot of the estimated against the actual y-squared shows a curved line, meaning that the two are not fully in agreement. Table 1 shows that the mean is not equal to the median and that the skewness and the kurtosis are far greater than 0 and 3 respectively, meaning that the transformed data are not normally distributed. Table 2 shows that the expected error terms of the 50th quantile and of all the other quantiles are not equal to zero, meaning that the transformation did not normalize the data.

The histogram of the square root of y transformation and the plot of the estimated against the actual square root of y show that the data have been successfully transformed. The histogram of the square root of y is symmetric as well as mesokurtic. This can also be seen from the plot of the estimated against the actual square root of y, which shows a straight line, meaning that the estimated data are in agreement with the actual data. From Table 1, it can be observed that the mean is equal to the median, the skewness is approximately zero and the kurtosis is approximately three, meaning that the square-root-transformed data are approximately normally distributed. The expected error terms of the quantile regression estimates in Table 2 show that the 50th quantile is approximately zero (-0.3788), while the 10th, 25th, 75th and 90th quantiles are distributed around the 50th quantile with 6.0081, 1.1496, -2.0249 and -4.4646 respectively, meaning that the model meets the assumption that the expected error term must be zero and hence can be said to be efficient.

Table 3 shows that the intercept has coefficient values of 25.5475, 25.5516, 25.5583, 25.5560 and 25.5556 for the 10th, 25th, 50th, 75th and 90th quantiles respectively. The explanatory variable has a coefficient value of 0.00004 for all five quantiles. The p-values show that all the coefficients are significant, that is, the covariate has a significant effect on the response variable. The individual standard errors show a minimal error in the model; hence the square root transformed quantile regression model is efficient at the 50th quantile and can be relied on. With the efficiency and reliability of the study model confirmed, the model can be used to make a conclusive remark: Figure 4 shows the existence of a wide discrepancy between the upper and lower income earners in the health institutions.

Table 3. Coefficient and p-value of Square Root Transformed Model

Figure 4. Graph of low and upper class

4. Conclusion

Having examined the transformation of the data using different power transformations, namely the logarithm of y, the inverse of y, the inverse of the square root of y, the square root of y and y-squared, the results show that the square root of y transformation is the best transformation fit for Weibull-distributed data in the quantile regression model, based on the histograms of the transformed data, the plots of the estimated data against the actual data, the expected error term, and the normality assessment using the mean, median, skewness and kurtosis.

References

[1] Arshad, I. A., Younas, U., Shaikh, A. W. & Chandio, M. S. (2016). Quantile Regression Analysis of Monthly Earnings in Pakistan. Sindh Univ. Res. Jour. (Sci. Ser.), 48(4), 919-924.

[2] Bartlett, M. S. (1974). The use of Transformation. Biometrica, 3, 39-52.

[3] Chaudhuri, P. & Loh, W.-Y. (2002). Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, 8, 561-576.

[4] Frost, J. (2012). How to Identify the Distribution of Your Data using Minitab. http://www.scribd.com/doc/84506538/Body-Fat-Data-for-Identifying-Distribution-in-Minitab.

[5] Hao, L. & Naiman, D. Q. (2007). Quantile Regression.

[6] Iwueze, S. I., Nwogu, E. C., Ohakwe, J. & Ajaraogu, J. C. (2011). Uses of the Buys-Ballot Table in Time Series Analysis. Applied Mathematics Journal, (2), 633-645.

[7] Koenker, R. (2005). Quantile Regression. Econometric Society Monograph Series, Cambridge University Press. (6)6.

[8] Koenker, R. & Bassett, G. (1978). Regression Quantiles. Econometrica, 46(1), 33-50.

[9] Koenker, R. & D’Orey, V. (1987). Algorithm AS229: Computing regression quantiles. Applied Statistics, 36, 383-393.

[10] Koenker, R. & Machado, J. A. (1999). Goodness of fit and related inference processes for quantile regression. Journal of Econometrics, 93, 327-344.

[11] Lee, B.-J. & Lee, M. J. (2006). Quantile regression analysis of wage determinants in the Korean labor market. The Journal of the Korean Economy, 7, 1-31.

[12] Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.

[13] McMillen, D. P. (2013). Quantile Regression for Spatial Data. Springer Briefs in Regional Science.

[14] Meinshausen, N. (2006). Quantile Regression Forests. Journal of Machine Learning Research, (7), 983-99.

[15] Deng, Wen-Shuenn, Lin, Yi-Chen & Gong, Jinguo (2012). A smooth coefficient quantile regression approach to the social capital–economic growth nexus. Economic Modelling.

[16] Young, T. M., Shaffer, L. B., Guess, F. M., Bensmail, H. & Leon, R. V. (2008). A comparison of multiple linear regression and quantile regression for modeling the internal bond of medium density fiberboard. Forest Products Journal, 58(4).

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

