Research Article
Open Access Peer-reviewed

### Choice of Appropriate Power Transformation of Skewed Distribution for Quantile Regression Model

Onyegbuchulem B.O., Nwakuya M.T., Nwabueze J.C., Otu Archibong Otu
American Journal of Applied Mathematics and Statistics. 2019, 7(3), 105-111. DOI: 10.12691/ajams-7-3-4
Received January 18, 2019; Revised March 28, 2019; Accepted May 04, 2019

### Abstract

Quantile regression (QR) performs better than ordinary least squares (OLS) when the data are skewed, and its best results can be achieved when the data are transformed. The quantreg package of the R software was used to illustrate the fit of various power transformations for the quantile regression model. The analysis shows that the best result was obtained from the square root of y transformation, with average error terms of 0.9539, -0.0494, 0.0238, -0.5309 and -0.7544 for the 10th, 25th, 50th, 75th and 90th quantiles respectively. These results show that transforming the response can greatly improve the fit of a quantile regression model.

### 1. Introduction

Conditional-median regression is a special case of quantile regression in which the conditional 50th quantile is modeled as a function of covariates. More generally, other quantiles can be used to describe non-central positions of a distribution; the notion of a quantile generalizes specific terms like quartile, quintile, decile, and percentile. The pth quantile denotes the value of the dependent variable below which the proportion of the population is p. Thus, quantiles can specify any position of a distribution [7].

The first-order quantile regression model was introduced by Koenker and Bassett [8]. It has the form

$$Q_{y_i}(p \mid x_i) = \beta_0 + \beta_1 x_i + F_u^{-1}(p) \qquad (1)$$

where

$Q_{y_i}(p \mid x_i)$ is the conditional value of the dependent variable given $p$ in the $i$th trial,

$\beta_0$ is the intercept,

$\beta_1$ is a parameter,

$p$ denotes the quantile (e.g., $p = 0.5$ for the median),

$x_i$ is the value of the independent variable in the $i$th trial,

$F_u$ is the common distribution function of the error,

e.g., $F_u^{-1}(0.5)$ is the median or 0.5 quantile.

In this model, the conditional quantile is a function of the covariates. The quantile regression model is therefore a natural extension of the linear regression model. While the linear regression model specifies the change in the conditional mean of the dependent variable associated with a change in the covariates, the quantile regression model specifies changes in the conditional quantile. Since any quantile can be used, it is possible to model any predetermined position of the distribution. Thus, researchers can choose positions that are tailored to their specific inquiries.

However, the expected error term of quantile regression models, especially the median regression model (which is closely related to linear regression in terms of precision), often does not approximate zero. Reference [1] showed that the expected error term of multiple quantile regression can be improved by transforming the response variable using the log transformation. That study empirically analysed the monthly earnings distribution of Pakistan, taking the log of monthly earnings as the response variable and education, experience, age, sex, marital status, nature of work, region, and province as explanatory variables. Reference [2] uses the relationship between variances and means over several groups to find the transformation that makes the variance independent of the mean. The procedure is to determine the coefficient of the regression of the natural logarithm of the group standard deviations on the natural logarithm of the group averages. The most popular and common transformations are the power transformations, such as the logarithm, the square root, the inverse square root, the inverse, and the square of the response.

Therefore, this study applies the five power transformations stated by [2] to the response variable to ascertain the appropriate power transformation for modelling quantile regression in the presence of a skewed distribution. The aim is to investigate the best power transformation of a skewed distribution for the quantile regression model. The study will specifically:

i. Assess the best transformation fit of the model based on some selected power transformations,

ii. Assess the impact of the covariate on the response variable, and

iii. Conduct diagnostic tests on the suggested model.
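The group-based procedure attributed to [2] in the introduction (regressing the log of the group standard deviations on the log of the group means) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tolerance `tol`, and the toy groups are hypothetical, and the mapping from the slope to a power (use y raised to 1 minus the slope, with the logarithm when the slope is near 1) is the usual variance-stabilising rule, assumed here because the paper does not spell it out.

```python
import math
from statistics import mean, stdev


def log_sd_on_log_mean_slope(groups):
    """Least-squares slope of ln(group sd) regressed on ln(group mean)."""
    log_means = [math.log(mean(g)) for g in groups]
    log_sds = [math.log(stdev(g)) for g in groups]
    x_bar, y_bar = mean(log_means), mean(log_sds)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(log_means, log_sds))
    sxx = sum((x - x_bar) ** 2 for x in log_means)
    return sxy / sxx


def suggested_transformation(beta, tol=0.25):
    """Assumed rule: use y**(1 - beta), or log(y) when beta is within tol of 1."""
    if abs(beta - 1.0) < tol:
        return "log(y)"
    return f"y**{1.0 - beta:g}"


# Hypothetical groups whose standard deviation equals the square root of the mean.
groups = [[m - math.sqrt(m), m, m + math.sqrt(m)] for m in (4.0, 9.0, 16.0, 25.0)]
beta = log_sd_on_log_mean_slope(groups)
choice = suggested_transformation(beta)
```

With these toy groups the slope is one half, so the rule points to the square root transformation. Real data would need enough groups for the slope to be estimated with any stability.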

### 2. Methodology

This paper investigates the best power transformation for the quantile regression model. The data were generated by Monte Carlo simulation from a Weibull distribution, using the shape and scale parameters of the annual salaries, income and wages of health workers in Nigeria. The generated data are analysed using the transformed quantile regression model, fitted with the quantreg package of the R software. The hypotheses are stated as:

$H_0$: The covariate has no significant effect on the response variable

$H_1$: The covariate has a significant effect on the response variable

2.1. Quantile Regression Model

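If we consider the i.i.d. sample $y_1, \dots, y_n$, the unconditional sample mean can be defined as the solution to the problem of minimizing the sum of squared residuals:

$$\min_{\mu \in \mathbb{R}} \sum_{i=1}^{n} (y_i - \mu)^2 \qquad (2)$$

Similarly, the sample median is the minimizer of the sum of absolute deviations:

$$\min_{\xi \in \mathbb{R}} \sum_{i=1}^{n} |y_i - \xi| \qquad (3)$$

To see why the median can be defined as the solution of a minimization problem, the expected absolute deviation of a random variable $Y$ with density $f$ and distribution function $F$ can be written as

$$E|Y - \xi| = \int_{-\infty}^{\xi} (\xi - y) f(y)\,dy + \int_{\xi}^{\infty} (y - \xi) f(y)\,dy \qquad (4)$$

Differentiating with respect to ξ and setting the partial derivative to zero leads to the solution of the minimization problem. The partial derivative of the first term is:

$$\frac{\partial}{\partial \xi} \int_{-\infty}^{\xi} (\xi - y) f(y)\,dy = \int_{-\infty}^{\xi} f(y)\,dy = F(\xi)$$

And the partial derivative of the second term is:

$$\frac{\partial}{\partial \xi} \int_{\xi}^{\infty} (y - \xi) f(y)\,dy = -\int_{\xi}^{\infty} f(y)\,dy = -\left(1 - F(\xi)\right)$$

Combining these two partial derivatives leads to:

$$\frac{\partial}{\partial \xi} E|Y - \xi| = F(\xi) - \left(1 - F(\xi)\right) = 2F(\xi) - 1 \qquad (5)$$

By setting $2F(\xi) - 1 = 0$ we solve for the value of $F(\xi) = 1/2$, that is, the median, to satisfy the minimization problem.

For the general case, the $p$th sample quantile $\hat{\xi}(p)$, which is the analogue of the population quantile $Q(p)$, may be formulated as the solution of the optimization problem

$$\min_{\xi \in \mathbb{R}} \sum_{i=1}^{n} \rho_p(y_i - \xi), \qquad \rho_p(u) = u\left(p - I(u < 0)\right) \qquad (6)$$

Repeating the above argument for quantiles, the partial derivative corresponding to (6) is

$$\frac{\partial}{\partial \xi} E\left[\rho_p(Y - \xi)\right] = (1 - p)F(\xi) - p\left(1 - F(\xi)\right) = F(\xi) - p \qquad (7)$$

We set the partial derivative $F(\xi) - p = 0$ and solve for the value $F(\xi) = p$ that satisfies the minimization problem. Equation (7) is illustrated thus: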

• Figure 1. Quantile Regression
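The claim that a sample quantile minimizes a sum of asymmetrically weighted absolute deviations can be checked numerically. The sketch below is illustrative only: the function names and the small data set are made up, and the search is restricted to the data points, which is sufficient because the objective is piecewise linear with kinks only at the observations.

```python
def check_loss(u, p):
    """Asymmetric absolute loss: p*u for u >= 0 and (p - 1)*u for u < 0."""
    return u * (p - (1.0 if u < 0 else 0.0))


def total_check_loss(ys, xi, p):
    """Sum of the check loss of the deviations y_i - xi."""
    return sum(check_loss(y - xi, p) for y in ys)


def sample_quantile_by_minimisation(ys, p):
    """Return a data point that minimises the total check loss."""
    return min(sorted(ys), key=lambda xi: total_check_loss(ys, xi, p))


ys = [2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0]  # hypothetical sample
median_hat = sample_quantile_by_minimisation(ys, 0.5)
q90_hat = sample_quantile_by_minimisation(ys, 0.9)
```

For p = 0.5 the loss is half the sum of absolute deviations, so the minimizer is the ordinary sample median; for p = 0.9 under-predictions are penalized nine times as heavily as over-predictions, which pushes the minimizer into the upper tail.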

Just as the unconditional sample mean in (2) minimizes the sum of squared residuals (error loss), the conditional sample mean also minimizes a sum of squared residuals. By replacing the scalar $\mu$ with the parametric function $\mu(x_i, \beta)$, an estimate of the conditional mean function is obtained:

$$\min_{\beta} \sum_{i=1}^{n} \left(y_i - \mu(x_i, \beta)\right)^2 \qquad (8)$$

Quantile regression proceeds in the same way. To obtain an estimate of the conditional median function, the scalar ξ in (3) is replaced by the parametric function $\xi(x_i, \beta)$:

$$\min_{\beta} \sum_{i=1}^{n} \left|y_i - \xi(x_i, \beta)\right| \qquad (9)$$

To obtain estimates of the other conditional quantile functions, the conditional $p$th quantile is considered and the absolute value is replaced by the check function $\rho_p$ of (6):

$$\min_{\beta} \sum_{i=1}^{n} \rho_p\left(y_i - \xi(x_i, \beta)\right) \qquad (10)$$

Minimizing (10) results in a quantile regression model. The resulting minimization problem, when $\xi(x_i, \beta)$ is formulated as a linear function of the parameters, can be solved very efficiently by linear programming methods. The progression of ideas that led to (10) motivated the original quantile regression model presented in [8].
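The minimization in (10) for a single covariate can be sketched without a linear programming solver. The code below is not the algorithm used by quantreg; it is a brute-force illustration that relies on the fact that a linear program attains its optimum at a vertex, which for a two-parameter model means a line passing through two data points. The function names and the toy data (nine points on a line plus one outlier) are hypothetical, and the cubic running time makes the approach suitable only for small samples.

```python
from itertools import combinations


def check_loss(u, p):
    """Asymmetric absolute loss: p*u for u >= 0 and (p - 1)*u for u < 0."""
    return u * (p - (1.0 if u < 0 else 0.0))


def fit_quantile_line(xs, ys, p):
    """Return (b0, b1) minimising the total check loss over all lines
    that pass through two distinct data points."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be written as b0 + b1 * x
        b1 = (ys[j] - ys[i]) / (xs[j] - xs[i])
        b0 = ys[i] - b1 * xs[i]
        loss = sum(check_loss(y - b0 - b1 * x, p) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]


# Hypothetical data: y = 1 + 2x exactly, except for one large outlier.
xs = [float(x) for x in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[9] = 100.0

median_fit = fit_quantile_line(xs, ys, 0.5)  # intercept and slope at p = 0.5
upper_fit = fit_quantile_line(xs, ys, 0.9)   # intercept and slope at p = 0.9
```

The median fit recovers the line through the nine regular points and ignores the outlier, which illustrates the robustness that motivates quantile regression for skewed data.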

2.2. Model Specification

Following [8] and [7], our proposed model takes the form:

$$y_i^{*} = \beta_0^{(p)} + \beta_1^{(p)} x_i + \varepsilon_i^{(p)} \qquad (11)$$

where

$y_i^{*}$ = transformed response variable containing n observations simulated from the parameters of the data on health workers' allowances,

$\beta_0^{(p)}$ = intercept,

$\beta_1^{(p)}$ = unknown parameter,

$\varepsilon_i^{(p)}$ = classical error term,

$p$ = specified quantile of the simulated data; this research examines the quantiles 0.1, 0.25, 0.5, 0.75 and 0.9,

$x_i$ = the covariate; one covariate was used.

2.3. Goodness of Fit of QRM

An analog of the $R^2$ statistic can be readily developed for quantile regression models. Reference [5] stated that “Since linear regression model fits are based on the least square criterion and quantile regression models are based on minimizing a sum of weighted distance with different weights used depending on whether $y_i > \hat{y}_i$ or $y_i < \hat{y}_i$, the goodness of fit will be measured in a manner that is consistent with this criterion”, and [10] suggested measuring goodness of fit by comparing the sum of weighted distances for the model of interest with the sum in which only the intercept parameter appears.

Let $V^1(p)$ be the sum of weighted distances for the full quantile regression model, and $V^0(p)$ be the sum of weighted distances for the model that includes only a constant term. Therefore, using the one-covariate model,

$$V^1(p) = \sum_{y_i \ge \hat{y}_i} p\,|y_i - \hat{y}_i| + \sum_{y_i < \hat{y}_i} (1 - p)\,|y_i - \hat{y}_i|, \qquad \hat{y}_i = \hat{\beta}_0^{(p)} + \hat{\beta}_1^{(p)} x_i \qquad (12)$$

For the model that only includes a constant term, the fitted constant is the sample $p$th quantile $\hat{Q}^{(p)}$ of the sample $y_1, \dots, y_n$, and $V^0(p)$ is obtained from (12) with $\hat{y}_i = \hat{Q}^{(p)}$. The goodness of fit is then defined as

$$R(p) = 1 - \frac{V^1(p)}{V^0(p)} \qquad (13)$$

Since $V^0(p)$ and $V^1(p)$ are nonnegative, R(p) is at most 1. Also, because the sum of weighted distances is minimized for the full fitted model, $V^1(p)$ is never greater than $V^0(p)$, so R(p) is greater than or equal to zero. Thus, R(p) lies within the range [0, 1], and a larger R(p) indicates a better model fit. The R(p) defined above allows a fitted model with any number of covariates beyond the intercept term to be compared with the model in which only the intercept term is present. This is the restricted form of the goodness-of-fit measure introduced by Koenker and Machado (1999) for nested models.
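The goodness-of-fit measure R(p) compares two sums of weighted distances, and its two extreme values are easy to verify on toy data. The sketch below assumes the fitted coefficients `b0` and `b1` are already available from some quantile regression fit; the function names and the data are hypothetical. The intercept-only sum is obtained by searching the data points for the constant with the smallest weighted distance, which is a sample pth quantile.

```python
def check_loss(u, p):
    """Weighted distance: p*|u| when u >= 0 and (1 - p)*|u| when u < 0."""
    return u * (p - (1.0 if u < 0 else 0.0))


def goodness_of_fit(xs, ys, b0, b1, p):
    """R(p) = 1 - V1/V0 for a fitted one-covariate quantile regression line."""
    # V1: weighted distances around the fitted line b0 + b1 * x.
    v1 = sum(check_loss(y - b0 - b1 * x, p) for x, y in zip(xs, ys))
    # V0: weighted distances around the best constant (a sample p-th quantile).
    v0 = min(sum(check_loss(y - c, p) for y in ys) for c in ys)
    return 1.0 - v1 / v0


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 + 3.0 * x for x in xs]  # hypothetical data lying exactly on a line

r_perfect = goodness_of_fit(xs, ys, 2.0, 3.0, 0.5)     # line fits exactly
r_constant = goodness_of_fit(xs, ys, 11.0, 0.0, 0.5)   # median only, no slope
```

A line that reproduces the data gives R(p) = 1, while a model that uses only the sample median and no covariate gives R(p) = 0; real fits fall between the two.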

### 3. Data Simulation

For the study data, the fitted Weibull distribution was found to be right skewed, as shown in Figure 2, with shape and scale parameters 1.78292 and 122560 respectively. To simulate the data, the quantile function of the Weibull distribution, $Q(p)$, which is simply the inverse of its CDF, was derived by equating the CDF to $F(y)$ and solving for $y$.

• Figure 2. Graph of Original Data Simulated with Weibull Distribution

The probability density function (PDF) of a Weibull distribution is given as

$$f(y) = \frac{k}{\lambda} \left(\frac{y}{\lambda}\right)^{k-1} \exp\left[-\left(\frac{y}{\lambda}\right)^{k}\right], \quad y \ge 0 \qquad (14)$$

Let the cumulative distribution function (CDF) of a Weibull distribution be denoted by

$$F(y) = 1 - \exp\left[-\left(\frac{y}{\lambda}\right)^{k}\right] \qquad (15)$$

Setting $F(y) = p$ and solving for $y$ gives the quantile function:

$$Q(p) = \lambda \left[-\ln(1 - p)\right]^{1/k} \qquad (16)$$

where

$\lambda$ = scale parameter

$k$ = shape parameter

Following the derivation of the quantile function, Monte Carlo simulation is applied to the derived function using both the shape parameter (1.78292) and the scale parameter (122560) to generate a sample of size 3000 for the response variable, while the explanatory variable is simulated using the normal distribution with the shape and scale parameters of 11.27641 and 322999 respectively.
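The simulation of the response by inverting the Weibull CDF can be sketched as follows. The shape, scale and sample size are the values reported in the text; the seed, the function names and the use of Python's standard library (rather than R, which the authors used) are assumptions made for illustration only.

```python
import math
import random

SHAPE = 1.78292    # shape parameter reported in the paper
SCALE = 122560.0   # scale parameter reported in the paper


def weibull_cdf(y, shape, scale):
    """F(y) = 1 - exp(-(y / scale) ** shape) for y >= 0."""
    return 1.0 - math.exp(-((y / scale) ** shape))


def weibull_quantile(p, shape, scale):
    """Inverse of the CDF: scale * (-ln(1 - p)) ** (1 / shape)."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def simulate_weibull(n, shape, scale, seed=0):
    """Inverse-transform sampling: push uniform draws through the quantile function."""
    rng = random.Random(seed)  # the seed is arbitrary, chosen for reproducibility
    return [weibull_quantile(rng.random(), shape, scale) for _ in range(n)]


y = simulate_weibull(3000, SHAPE, SCALE)
theoretical_median = weibull_quantile(0.5, SHAPE, SCALE)
sample_median = sorted(y)[len(y) // 2]
```

Comparing `sample_median` with `theoretical_median` gives a quick sanity check that the simulated sample follows the intended distribution.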

3.1. Choice of Data Transformation

• Figure 3. Histograms and the Normality plots

The choice of an appropriate data transformation is based on the histograms of the transformed data, the normality plots, and the skewness and kurtosis of the transformed values. It is also based on comparing the mean and the median of the transformed data and on assessing the expected error terms of the quantile regression model for the various transformed series. The following transformations were therefore considered; they are presented in equations (17) to (21), and their graphs are shown in Figure 3.

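i. Log transformation:

$$y^{*} = \log y \qquad (17)$$

ii. Square root of y transformation:

$$y^{*} = \sqrt{y} \qquad (18)$$

iii. Inverse of square root of y transformation:

$$y^{*} = \frac{1}{\sqrt{y}} \qquad (19)$$

iv. Inverse of y transformation:

$$y^{*} = \frac{1}{y} \qquad (20)$$

v. y-squared transformation:

$$y^{*} = y^{2} \qquad (21)$$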
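The five transformations and the summary statistics used to judge them (mean, median, skewness, kurtosis) can be sketched as below. The sample is a fresh Weibull draw with the shape and scale reported in the text, so its numbers will not match the paper's tables; the seed, the names, the use of the natural logarithm, and the moment-based formulas for skewness and kurtosis are assumptions. One known property is relevant here: the square root of a Weibull variable with shape k is again Weibull with shape 2k, and a shape near 3.6 gives a nearly symmetric distribution.

```python
import math
import random


def summarise(values):
    """Mean, median, moment skewness and (non-excess) moment kurtosis."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    ordered = sorted(values)
    mid = n // 2
    median = ordered[mid] if n % 2 else 0.5 * (ordered[mid - 1] + ordered[mid])
    return {"mean": m, "median": median,
            "skewness": m3 / m2 ** 1.5, "kurtosis": m4 / m2 ** 2}


TRANSFORMS = {
    "log": math.log,                           # natural log assumed
    "sqrt": math.sqrt,
    "inv_sqrt": lambda v: 1.0 / math.sqrt(v),
    "inverse": lambda v: 1.0 / v,
    "square": lambda v: v * v,
}

rng = random.Random(0)  # arbitrary seed
y = [122560.0 * (-math.log(1.0 - rng.random())) ** (1.0 / 1.78292)
     for _ in range(3000)]
y = [v for v in y if v > 0.0]  # guard against a zero draw before log and 1/y

summary = {name: summarise([f(v) for v in y]) for name, f in TRANSFORMS.items()}
```

Inspecting `summary["sqrt"]` against the other entries shows which transformation brings the skewness closest to 0 and the kurtosis closest to 3 for this particular draw.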

3.2. Results and Discussion

The histogram of the log of y transformation, as well as the plot of the estimated log of y against the actual log of y, shows that the data were not adequately transformed. The distribution of the log of y is left skewed, and the plot of the estimates against the actual values is partially curved at the center, showing that the estimated log of y is not in full agreement with the actual log of y. Table 1 also shows that the mean and the median are equal, but the skewness is less than 0 and the kurtosis is greater than 3, which means that the distribution is not normal. Table 2 shows that although the expected error term of the 50th quantile is approximately zero, that of the 25th quantile is less than that of the 50th quantile, which means that the result may be spurious and cannot be relied upon.

The histogram of the inverse of square root of y transformation, as well as the plot of the estimated against the actual inverse of square root of y, shows that the data were over-transformed. The distribution of the inverse of square root of y is right skewed, while the plot of the estimates against the actual values shows a vertical straight line, meaning that the estimated inverse of square root of y is not in agreement with the actual inverse of square root of y. Table 1 shows that while the median is roughly equal to the mean, the skewness is greater than zero and the kurtosis is far greater than three. Table 2 shows that while the expected error term of the 50th quantile is approximately zero, those of the 10th, 25th, 75th and 90th quantiles are all less than that of the 50th quantile, which suggests that the estimates may be spurious.

The histogram of the inverse of y transformation, as well as the plot of the estimated against the actual inverse of y, shows that the data were not adequately transformed. The distribution of the inverse of y is right skewed, with all the observations grouped into one bar. The plot of the estimated against the actual inverse of y shows a vertical straight line, meaning that the estimated inverse of y is never in agreement with the actual inverse of y. Table 1 shows that the mean is not equal to the median, the skewness is slightly greater than zero and the kurtosis is far greater than three. Table 2 shows that the 10th, 25th, 50th, 75th and 90th quantiles all have an expected error term of zero, which means that the estimates made at any of the quantiles will be the same, and this makes the quantile regression model insufficient.

The histogram of the y-squared transformation, as well as the plot of the estimated against the actual y-squared, shows that the data were not adequately transformed. The distribution of y-squared is also right skewed, and the plot of the estimated against the actual y-squared shows a curved line, meaning that the two distributions are not fully in agreement. Table 1 shows that the mean is not equal to the median and that the skewness and the kurtosis are far greater than 0 and 3 respectively, meaning that the distribution is not normal. Table 2 shows that the expected error terms of the 50th quantile and of all the other quantiles are not equal to zero, meaning that the transformation was not effective.

The histogram of the square root of y transformation, as well as the plot of the estimated against the actual square root of y, shows that the data have been adequately transformed. The distribution of the square root of y is symmetric as well as mesokurtic. This can also be seen from the plot of the estimated against the actual square root of y, which shows a straight line, meaning that the estimated data are in agreement with the actual data. From Table 1, it can be observed that the mean is equal to the median, the skewness is approximately zero and the kurtosis is approximately three, meaning that the square-root-transformed data are approximately normally distributed. The expected error terms of the quantile regression estimates in Table 2 show that the 50th quantile is approximately zero (-0.3788), while the 10th, 25th, 75th and 90th quantiles are distributed around the 50th quantile with 6.0081, 1.1496, -2.0249 and -4.4646 respectively, meaning that the model has met the assumption that the expected error term must be zero and hence can be said to be efficient.

Table 3 shows that the intercept has coefficient values of 25.5475, 25.5516, 25.5583, 25.5560 and 25.5556 for the 10th, 25th, 50th, 75th and 90th quantiles respectively. The explanatory variable has a coefficient value of 0.00004 for all five quantiles. The p-values show that all the coefficients have a significant effect on the response variable. The individual standard errors show a minimal error in the model; hence the square root transformation model is efficient at the 50th quantile and can be relied on. With the efficiency and reliability of the study model confirmed, the model can be used to make a concluding remark: Figure 4 shows a wide discrepancy between the upper and lower income earners in the health institutions.

• Figure 4. Graph of low and upper class

### 4. Conclusion

Having examined the transformation of the data using different kinds of power transformation, namely the logarithm of y, the inverse of y, the inverse of the square root of y, the square root of y, the inverse of y-squared and y-squared transformations, the results show that the square root of y transformation is the best transformation fit for Weibull distributed data in the quantile regression model. This conclusion is based on the histograms of the transformed data, the plots of the estimated data against the actual data, the expected error terms, and the normality assessment using the mean, median, skewness and kurtosis.

### References

[1] Arshad, I. A., Younas, U., Shaikh, A. W. & Chandio, M. S. (2016). Quantile Regression Analysis of Monthly Earnings in Pakistan. Sindh Univ. Res. Jour. (Sci. Ser.), 48(4), 919-924.

[2] Bartlett, M. S. (1947). The use of transformations. Biometrics, 3, 39-52.

[3] Chaudhuri, P. & Loh, W.-Y. (2002). Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, 8, 561-576.

[4] Frost, J. (2012). How to Identify the Distribution of Your Data using Minitab. https://www.scribd.com/doc/84506538/Body-Fat-Data-for-Identifying-Distribution-in-Minitab.

[5] Hao, L. & Naiman, D. Q. (2007). Quantile Regression.

[6] Iwueze, S. I., Nwogu, E. C., Ohakwe, J. & Ajaraogu, J. C. (2011). Uses of the Buys-Ballot Table in Time Series Analysis. Applied Mathematics Journal, (2), 633-645.

[7] Koenker, R. (2005). Quantile Regression. Econometric Society Monograph Series, Cambridge University Press. (6)6.

[8] Koenker, R. & Bassett, G. (1978). Regression Quantiles. Econometrica, 46(1), 33-50.

[9] Koenker, R. & D'Orey, V. (1987). Algorithm AS229: Computing regression quantiles. Applied Statistics, 36, 383-393.

[10] Koenker, R. & Machado, J. A. (1999). Goodness of fit and related inference processes for quantile regression. Journal of Econometrics, 93, 327-344.

[11] Lee, B.-J. & Lee, M. J. (2006). Quantile regression analysis of wage determinants in the Korean labor market. The Journal of the Korean Economy, 7, 1-31.

[12] Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.

[13] McMillen, D. P. (2013). Quantile Regression for Spatial Data. Springer Briefs in Regional Science.

[14] Meinshausen, N. (2006). Quantile Regression Forests. Journal of Machine Learning Research, 7, 983-999.

[15] Deng, W.-S., Lin, Y.-C. & Gong, J. (2012). A smooth coefficient quantile regression approach to the social capital–economic growth nexus. Economic Modelling.

[16] Young, T. M., Shaffer, L. B., Guess, F. M., Bensmail, H. & Leon, R. V. (2008). A comparison of multiple linear regression and quantile regression for modeling the internal bond of medium density fiberboard. Forest Products Journal, 58(4).

Published with license by Science and Education Publishing, Copyright © 2019 Onyegbuchulem B.O., Nwakuya M.T, Nwabueze J.C and Otu Archibong Otu

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
