The air quality index is a numerical measure computed to determine the air quality of various geographical locations. Human activities, industrial functioning and climate conditions are some of the significant factors causing variations in the air quality index. Many methods have been proposed in the literature and applied to develop models for investigating the changing behaviour of the air quality index. Among them, the regression model is a statistical tool with established properties that is frequently recommended for static data. This paper considers estimation of a multiple linear regression model based on the information pertaining to the air quality index recorded at a monitoring station located in Chennai, India. The significance of the model for the data is detailed and residual analysis is carried out for testing the validity of the fitted model.
Analysis of air quality is essential for all locations on the globe, as it varies over time due to the presence of chemical pollutants in the atmosphere and changing meteorological factors. The National Ambient Air Quality Standards (NAAQS) list, in general, twelve chemical pollutants that affect the air quality viz., sulphur dioxide, nitrogen dioxide, particulate matter (size < 2.5 µm and size < 10 µm), ozone, lead, carbon monoxide, ammonia, benzene, benzo(a)pyrene, arsenic and nickel. Similarly, meteorological factors like wind direction, wind speed, relative humidity and solar radiation cause changes in the air quality. Humans are susceptible to many serious diseases caused by air pollution, including respiratory infections, heart disease and lung cancer. Air quality can be assessed numerically by the air quality index (AQI).
The Central Pollution Control Board (CPCB) of India categorised the air quality of a geographical location, based on AQI, into six classes viz., good (0 - 50), satisfactory (51 - 100), moderate (101-200), poor (201-300), very poor (301-400) and severe (401-500). Estimation of models for determining the AQI corresponding to various levels of the chemical pollutants or for future time period is an integral part of AQI analysis.
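The CPCB bands listed above translate directly into a lookup. The following is a minimal sketch; the function name `aqi_category` and the handling of fractional values that fall between two integer bands (assigned here to the higher band) are assumptions made for illustration and are not part of the CPCB definition.

```python
# CPCB AQI bands as listed in the text: (upper bound of the band, category).
CPCB_BANDS = [
    (50, "good"),
    (100, "satisfactory"),
    (200, "moderate"),
    (300, "poor"),
    (400, "very poor"),
    (500, "severe"),
]


def aqi_category(aqi: float) -> str:
    """Return the CPCB category for an AQI value between 0 and 500."""
    if aqi < 0 or aqi > 500:
        raise ValueError("AQI is expected to lie between 0 and 500")
    for upper, label in CPCB_BANDS:
        # Assumption: a fractional value such as 50.5 goes to the higher band.
        if aqi <= upper:
            return label
    raise AssertionError("unreachable: all values up to 500 are covered")


print(aqi_category(45))    # good
print(aqi_category(150))   # moderate
print(aqi_category(401))   # severe
```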
Kumar and Goyal 1 proposed prediction of AQI for four seasons viz., summer, monsoon, post-monsoon and winter of Delhi applying principal component regression (PCR) method. They found that PCR performs better for winter compared to other seasons. Kumar and Goyal 2 introduced an approach for forecasting the air quality index for Delhi, which integrates PCR and ARIMA models. They recommended the integrated PCR and ARIMA models for forecast accuracy. Ganesh et al. 3 have conducted a case study to predict AQI for Delhi and Houston, applying support vector regression model along with training algorithms for batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. They have observed that multiple linear regression model with mini-batch gradient descent algorithm outperforms other linear regression models in predicting AQI. Recently, Abdullah et al. 4 have discussed finding stepwise multiple linear regression model to predict the PM10 concentration in the air. They have created three distinct prediction hours to formulate deterministic predictions for PM10.
Madan et al. 5 evaluated the performance of twenty different models of machine learning (ML) algorithms. They found that reinforcement and neural network models perform relatively better than the other ML algorithms. Kumar and Pande 6 have pointed out that AQI monitoring and forecasting have become essential and challenging especially for urban areas. Several ML models have been used to predict AQI.
Apart from forecasting accuracy, the fitted models are expected to possess some desirable properties. It is well known that statistical model building methods have been developed on a basis of verifiable theoretical reasoning. The appropriateness of a model can be investigated through the properties of the estimates concerned and through model adequacy tests. The main objective of this paper is to construct, by applying statistical procedures, a multiple regression model for AQI that is adequate to study the influence of the chemical pollutants upon the changes in AQI. In this respect, a multiple regression model is constructed for AQI based on the information collected from the CPCB monitoring station located in Velachery, Chennai.
Section 2 describes the data and assesses the presence and strength of relationship of AQI with chemical pollutants and meteorological factors. The procedure to be followed for estimation of the model is also discussed. Results pertaining to estimation of the model and testing adequacy of the model are analysed in Section 3. Findings are summarized in Section 4.
The CPCB of India has an air quality monitoring station at Velachery, which is one of the largest commercial and residential areas in the southern part of Chennai, India.
The station monitors a wider area of the region, including approximately 6.17 sq. km of moderately dense forest, 1.301634 sq. km of green cover of the Indian Institute of Technology, Guindy and 55 sq. km of marshland at Pallikaranai 7. Figure 1 displays the satellite picture of Velachery.
The CPCB monitoring station collects and maintains the daily record of nine predominant air pollutants of the region viz., Particulate Matter (PM2.5, µg/m3), Nitrogen Monoxide (NO, µg/m3), Nitrogen Dioxide (NO2, µg/m3), Nitrogen Oxides (NOx, µg/m3), Sulphur Dioxide (SO2, µg/m3), Carbon Monoxide (CO, mg/m3), Ozone (O3, µg/m3), Benzene (µg/m3) and Toluene (µg/m3). In addition, meteorological readings such as wind direction (WD), wind speed (WS), relative humidity (RH) and solar radiation (SR) are also measured.
Information about the above chemical pollutants and meteorological factors for a period of 813 days, from 01-01-2018 to 23-03-2020, is considered in this study. Information about some of these characteristics is missing for 117 days. Also, the values observed on 39 days are extreme in magnitude. According to Fox 8, data containing missing values and outliers fail to capture the essential characteristics of the variables under study. In environment related data, missing values can occur for several reasons, including manual data input methods, equipment failure and inaccurate measurements. Though many algorithms have been introduced to replace missing values, none of them has been proved preferable to the others 9. When missing data are not handled appropriately, they can cause bias and lead to invalid conclusions. Here, the listwise deletion method is applied to remove the records with missing values 10.
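Listwise deletion, as described above, discards a day entirely when any of its variables is missing. A minimal sketch follows; the records, the column names and the helper `listwise_delete` are hypothetical and serve only to illustrate the rule.

```python
# Hypothetical daily records; None marks a reading that was not recorded.
records = [
    {"AQI": 60.0, "PM2.5": 34.0, "CO": 0.8},
    {"AQI": None, "PM2.5": 40.0, "CO": 0.9},
    {"AQI": 75.0, "PM2.5": None, "CO": 1.1},
    {"AQI": 52.0, "PM2.5": 30.0, "CO": 0.7},
]


def listwise_delete(rows):
    """Keep only the rows in which every variable has a recorded value."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = listwise_delete(records)
print(len(complete))  # 2
```

The price of this simplicity is that a single missing reading removes all the other readings of that day as well.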
Outliers in the data increase the error variance, reduce the power of statistical tests and lead to biased estimates. Therefore, the outlier detection process is an essential aspect of data analysis. As pointed out by Ghorbani 11, the Mahalanobis distance can be used as a tool to detect outliers; it is applied here, and the detected outlying observations have been dropped from the data. After removal of the missing values and outliers, information about the air pollutants and meteorological factors is available for 657 days.
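The Mahalanobis screening mentioned above can be sketched as follows. The source does not state the cut-off that was used, so the chi-square 0.999 quantile below is an assumption, as are the synthetic data and the function name `mahalanobis_sq`.

```python
import numpy as np


def mahalanobis_sq(data: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the sample mean."""
    centered = data - data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    # Row-wise quadratic form: d_i^2 = (x_i - mean)^T S^{-1} (x_i - mean)
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))   # synthetic, illustrative data
data[0] = [8.0, -8.0]              # one planted extreme observation

d2 = mahalanobis_sq(data)

# Assumed cut-off: 0.999 quantile of the chi-square distribution with
# 2 degrees of freedom, which has the closed form -2 * ln(1 - p).
cutoff = -2.0 * np.log(1.0 - 0.999)
outliers = np.flatnonzero(d2 > cutoff)
cleaned = np.delete(data, outliers, axis=0)
```

With more than two variables the closed-form quantile no longer applies, and the cut-off would have to be taken from chi-square tables or a statistics library.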
2.2. Characteristics of the Study Variables
The descriptive statistical measures calculated for the above-mentioned data are presented in Table 1. The AQI ranges from good (21.5) to moderate (184.8), with an average at the satisfactory level (60.6). It is important to note that the air quality of Velachery was not poor on any day of the study period. The average concentrations of PM2.5, NO, NO2, NOx, SO2, CO and O3 are 34.1, 8.0, 14.0, 18.8, 4.8, 0.8 and 29.4 respectively, and the corresponding minimum concentrations are 4.6, 0.3, 1.2, 3.4, 0.9, 0.0 and 4.3. These are smaller than the daily thresholds prescribed by the National Ambient Air Quality Standards (NAAQS, 2009) guideline 12. The maximum concentration levels of PM2.5 (123.3 µg/m3) and O3 (100.4 µg/m3) are larger than the NAAQS 2009 guideline daily thresholds of 60 µg/m3 and 100 µg/m3 respectively.
Since the AQI is determined from the records of the chemical pollutants, it is obvious that the AQI depends upon the levels of the chemical pollutants. However, the strength of the relationship may vary from one pollutant to another. Koutsoyiannis 13 mentioned that correlation analysis determines the presence and strength of association among the study variables.
The chemical pollutants NO, NO2, NOx, Benzene and Toluene and the meteorological factors are found to have very poor, in other words practically no, correlation with AQI. The correlation of AQI with PM2.5 is high (0.85), and it is moderate with CO, O3 and SO2 (0.28, 0.34 and 0.21 respectively). Figure 2 exhibits the level of correlation of each chemical pollutant with AQI. Since this study attempts to determine a regression model for AQI with appropriate regressors, only the information about PM2.5, CO, O3 and SO2 is considered to fit the model.
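The screening of regressors by their correlation with AQI can be illustrated with a small sketch. The readings below are made up, and the 0.2 threshold is an assumption chosen for the illustration; the paper reports the correlations but not a formal selection rule.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical readings for five days.
aqi = [40.0, 55.0, 63.0, 80.0, 95.0]
candidates = {
    "PM2.5": [18.0, 30.0, 33.0, 47.0, 60.0],
    "WS": [2.0, 1.0, 3.0, 3.0, 1.0],
}

correlations = {name: pearson_r(values, aqi) for name, values in candidates.items()}
# Assumed rule: keep a variable when |r| is at least 0.2.
selected = [name for name, r in correlations.items() if abs(r) >= 0.2]
print(selected)  # ['PM2.5']
```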
A linear regression model can be fitted only when the response variable has a linear or approximately linear relationship with the regressors 14. Also, the response variable is required to follow a normal distribution. If required, a suitable transformation may be applied to bring the distribution of the response variable closer to normal. When data have a large standard deviation compared to their mean, a logarithmic transformation has the effect of dampening variability, reducing asymmetry and removing heteroscedasticity 15. In this study, the logarithmic transformation (log(x)) is applied to AQI and to the regressors PM2.5, CO, O3 and SO2.
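The asymmetry-reducing effect of the logarithmic transformation can be seen on a small constructed sample. This is only a sketch: the sample is artificial (exponentiated normal quantiles), the natural logarithm is assumed because the base is not stated in the text, and `skewness` is a simple moment-based helper rather than the bias-corrected version reported by statistical packages.

```python
import math
from statistics import NormalDist


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Artificial right-skewed sample: exponentiated standard normal quantiles.
n = 200
z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
raw = [math.exp(v) for v in z]
logged = [math.log(v) for v in raw]  # natural logarithm assumed

print(skewness(raw) > skewness(logged))  # True
```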
The diagonal cells in the pair plot of log (AQI) and the four pollutants in Figure 3 display the histogram of the respective variable. The off-diagonal cells display the scatter diagram of each pair of study variables. It can be noted from the histogram displayed in the first diagonal cell that the distribution of log (AQI) may be approximately normal. Also, log (AQI) has an approximately linear relationship with CO, PM2.5 and O3. A wider scatter of the values (log (AQI), SO2) can be observed in Figure 3. However, it is assumed that log (AQI) has an approximate, though possibly weak, linear relationship with SO2. Also, the scatter diagrams corresponding to each pair of the four chemical pollutants indicate that the regressors may be uncorrelated.
The general form of the multiple linear regression model is given by

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$

where $Y$ represents the response variable, $X_1, X_2, \dots, X_k$ are the regressors, $\beta_1, \beta_2, \dots, \beta_k$ are the regression coefficients, $\beta_0$ is the intercept term, and $\varepsilon$ denotes the error component.

Estimation of the model parameters $\beta_0, \beta_1, \dots, \beta_k$, for given information about $Y$ and the $X$'s, is termed fitting or estimation of the model. When $\varepsilon$ is distributed according to a normal distribution with zero mean and constant variance, estimators of the model parameters can be obtained applying the maximum likelihood (ML) method or using the ordinary least squares (OLS) method. It is further assumed that the values of the residuals can be computed from the estimated model using

$$e_i = Y_i - \hat{Y}_i, \quad i = 1, 2, \dots, n$$

where $\hat{Y}_i$ is the estimate of $Y$ for the $i$-th sample.
Here, it is proposed to obtain estimators for the model parameters, based on the information on log (AQI), log (PM2.5), log (CO), log (O3) and log (SO2) for n = 657 days. Then, on successful investigation of model appropriateness, residual analysis will be performed towards analysing the validity of the model.
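The estimation step proposed above can be sketched in a few lines. The paper's estimates were obtained with SPSS; the NumPy version below is only an illustration on synthetic data, and the coefficient values, the helper names `fit_ols` and `residuals`, and the noise level are all made up.

```python
import numpy as np


def fit_ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return the OLS estimates [b0, b1, ..., bk] for y = b0 + X b + e."""
    design = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def residuals(X: np.ndarray, y: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Observed response minus fitted response for every sample."""
    design = np.column_stack([np.ones(len(y)), X])
    return y - design @ beta


rng = np.random.default_rng(1)
X = rng.uniform(0.5, 2.0, size=(50, 4))              # stand-ins for four log regressors
true_beta = np.array([1.0, 0.5, 0.2, 0.1, 0.05])     # made-up parameter values
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.01, size=50)

beta_hat = fit_ols(X, y)
e = residuals(X, y, beta_hat)
```

Because the design matrix contains an intercept column, the OLS residuals sum to zero up to rounding error, which is a convenient sanity check on any fit.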
The OLS estimates of the model parameters are obtained using SPSS version 20. The ANOVA table displaying the results of testing the overall significance of the fitted model is exhibited in Table 2. The measures calculated for studying the overall fit of the model are presented in Table 3. The OLS estimates of the model parameters and the significance test results for each of the four chemical pollutants are displayed in Table 4.
The estimated multiple linear regression model for AQI of Velachery is
![]() |
The overall significance of this fitted model can be observed from Table 2 through the F-statistic value of 594.4 and the corresponding p-value, which is reported as 0.0 to the displayed precision. These values indicate that the estimated model is significant for studying Velachery's AQI data. Also, the adjusted R2 value shows that 78.3% of the variation in the values of log (AQI) is explained by variations in the levels of log (PM2.5), log (CO), log (O3) and log (SO2). Standard errors of the OLS estimates of the coefficients of log (CO), log (PM2.5), log (O3) and log (SO2) can be noted from Table 4 as 0.007, 0.001, 0.001 and 0.005 respectively. These small standard errors point out that the corresponding OLS estimates are precise, which can also be observed from the corresponding 95% confidence limits. The p-values calculated for testing the significance of each of these chemical pollutants are 0.00, except for SO2 (0.002), which also ensures the relevance of these components in estimating log (AQI) from this fitted model. It may also be recalled that the OLS estimates of the model parameters possess the desirable properties explained in Koutsoyiannis 13.
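The reported F-statistic and adjusted R2 can be cross-checked against each other. For an OLS fit with an intercept, k regressors and n observations, F = (R2 / k) / ((1 - R2) / (n - k - 1)), so R2 can be recovered from F. The sketch below uses only figures quoted in the text (F = 594.4, n = 657, k = 4); the function names are illustrative.

```python
def r2_from_f(f_stat: float, n: int, k: int) -> float:
    """Recover R^2 from the overall F-statistic of an OLS model with intercept."""
    return k * f_stat / (k * f_stat + (n - k - 1))


def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 penalises R^2 for the number of regressors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


r2 = r2_from_f(594.4, n=657, k=4)
adj = adjusted_r2(r2, n=657, k=4)
print(round(r2, 3), round(adj, 3))  # 0.785 0.783
```

The recovered adjusted R2 of about 0.783 agrees with the 78.3% quoted above, so the two reported measures are mutually consistent.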
Residuals are calculated using the estimated model for five days and are presented in Table 5.
The above computations show that the magnitudes of the residuals are very small.
The residuals are computed for all the 657 days as

$$e_i = \log(\mathrm{AQI}_i) - \widehat{\log(\mathrm{AQI})}_i, \quad i = 1, 2, \dots, 657$$

and their descriptive statistical measures are presented in Table 6. The residuals vary from -0.1405 to 0.1191, a narrow range of 0.2596. The quartiles are equidistant. The mean is approximately zero, with a standard deviation of 0.0387. The differences among the mean, median and mode are marginal. The coefficients of skewness and kurtosis are -0.445 and 0.809 respectively. These numerical observations suggest that the distribution of the residuals can be taken as a normal distribution with zero mean. This diagnosis is supported by the histogram and the P-P plot drawn for the residuals and presented in Figure 4. Moreover, the Anderson-Darling test statistic value is 0.4918 with a p-value of 0.218, which is not significant at the 5% level, so the hypothesis of normality is not rejected. This indicates that the residual values fit a normal distribution with zero mean. Thus, the estimated model can be regarded as satisfying the assumptions considered for model construction.
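The Anderson-Darling statistic used above can be computed directly from its definition. The sketch below is a plain implementation for the case where the mean and standard deviation are estimated from the sample; it returns only the statistic (not a p-value), the two test samples are artificial, and the paper's value of 0.4918 cannot be reproduced here because the residuals themselves are not available.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(values):
    """A^2 statistic for normality, with mean and s.d. estimated from the sample."""
    n = len(values)
    mu, sd = mean(values), stdev(values)
    std_normal = NormalDist()
    # Normal CDF evaluated at the standardised, ordered observations.
    cdf = [std_normal.cdf((v - mu) / sd) for v in sorted(values)]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


n = 100
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]   # normal quantiles
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]            # exponential quantiles

a_normal = anderson_darling_normal(normal_like)
a_skewed = anderson_darling_normal(skewed)
print(a_normal < a_skewed)  # True
```

A small statistic indicates close agreement with the normal distribution, whereas the strongly skewed sample yields a much larger value.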
It can be noted further from the estimated model that the estimate of AQI corresponding to the absence of the four chemical pollutants can be determined as
![]() |
![]() |
Also, the contributions of the chemical pollutants in determining the AQI are not equal. The role of CO in estimating AQI is relatively larger. According to the fitted model, a one-unit change in log (CO) contributes 7.4% of a unit change in log (AQI), while the corresponding contributions are 5.8% for log (PM2.5), 1.6% for log (SO2) and 0.7% for log (O3).
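Because both the response and the regressors are log-transformed, each slope has a standard interpretation as an elasticity. The following is a generic sketch with symbolic coefficients; the ordering of the regressors and the labels $\beta_1, \dots, \beta_4$ are assumptions, and no estimated values are implied.

$$\log(\mathrm{AQI}) = \beta_0 + \beta_1 \log(\mathrm{CO}) + \beta_2 \log(\mathrm{PM}_{2.5}) + \beta_3 \log(\mathrm{O}_3) + \beta_4 \log(\mathrm{SO}_2) + \varepsilon$$

$$\beta_j = \frac{\partial \log(\mathrm{AQI})}{\partial \log(X_j)} \approx \frac{\Delta \mathrm{AQI} / \mathrm{AQI}}{\Delta X_j / X_j}$$

Under this reading, a 1% change in the concentration $X_j$, with the other pollutants held fixed, is associated with approximately a $\beta_j$% change in AQI, whichever base of logarithm is used, provided the same base is applied to both sides.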
AQI is a numerical indicator of the air quality of the region concerned. In general, air quality is affected by some chemical pollutants and meteorological factors, which may vary with the environment and geographical location of the region. Identifying such factors can be useful for planning preventive steps in that region. A statistical model constructed for AQI can be used for studying the influence of such factors. In this work, a statistical model is constructed for AQI by applying the regression modelling procedure to the information obtained from the CPCB monitoring station in Velachery, Chennai. The influence of individual chemical pollutants in the atmosphere and of meteorological factors, which determine Velachery's air quality, is examined. It is found from the data that meteorological factors do not have a considerable impact on the air quality of Velachery. Analysis of the information about the chemical pollutants showed that CO, PM2.5, O3 and SO2 have a stronger association with the AQI of the region than the other five predominant chemical pollutants viz., NO, NO2, NOx, Benzene and Toluene. A multiple linear regression model is fitted to log (AQI) applying the OLS method. The statistical hypothesis tests support the model's fit and the significance of CO, PM2.5, O3 and SO2 in determining AQI. Residual values computed for some sets of observations made on these four chemical pollutants are very small, close to zero. The fitted model is found adequate to analyze Velachery's AQI based on the chemical pollutants CO, PM2.5, O3 and SO2. These four pollutants are dominant in the increasing tendency of AQI in Velachery.
[1] Abdullah, S., Napi, N. N. L. M., Ahmed, A. N., Mansor, W. N. W., Mansor, A. A., Ismail, M., Ramly, Z. T. A. (2020). Development of Multiple Linear Regression for Particulate Matter (PM10) Forecasting during Episodic Transboundary Haze Event in Malaysia. Atmosphere, 11(3), 289.
[2] Chatterjee, S., Hadi, A. S. (2006). Regression Analysis by Example. John Wiley & Sons.
[3] Fox, J. (2011). Regression Diagnostics: An Introduction. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-079. Newbury Park, CA: Sage.
[4] Ganesh, S. S., Modali, S. H., Palreddy, S. R., Arulmozhivarman, P. (2017). Forecasting Air Quality Index using Regression Models: A case study on Delhi and Houston. International Conference on Trends in Electronics and Informatics (ICEI). IEEE. 248-254.
[5] Gargava, P., Shukla, V. K., Darbari, T. (2021). National Ambient Air Quality Status and Trends 2019. Central Pollution Control Board, Ministry of Environment, Forest and Climate Change, Government of India. https://cpcb.nic.in/upload/NAAQS_2019.pdf. Accessed 25th June.
[6] Ghorbani, H. (2019). Mahalanobis Distance and its Application for Detecting Multivariate Outliers. Facta Univ Ser Math Inform, 34(3), 583-95.
[7] Kang, H. (2013). The Prevention and Handling of the Missing Data. Korean Journal of Anesthesiology, 64(5), 402-406.
[8] Koutsoyiannis, A. (1977). Theory of Econometrics. 2nd edition, Palgrave MacMillan.
[9] Koushik, Janardhan. “From 50 Sq Km to Just Three in 30 Years: Chennai’s Pallikaranai Marsh Is Just about to Vanish.” The Indian Express, (20 Aug. 2019), indianexpress.com/article/cities/chennai/chennai-pallikaranai-marshland-report-madras-high-court-5919329.
[10] Kumar, A., Goyal, P. (2011). Forecasting of air quality in Delhi using principal component regression technique. Atmospheric Pollution Research, 2(4), 436-444.
[11] Kumar, A., Goyal, P. (2011a). Forecasting of daily air quality index in Delhi. Science of the Total Environment, 409(24), 5517-5523.
[12] Kumar, K., Pande, B. P. (2022). Air pollution prediction with machine learning: a case study of Indian cities. International Journal of Environmental Science and Technology, 1-16.
[13] Madan, T., Sagar, S., Virmani, D. (2020). Air Quality Prediction using Machine Learning Algorithms–A Review. Second International Conference on Advances in Computing, Communication Control and Networking (ICACCCN). 140-145. IEEE.
[14] Mohamed Noor, N., Zainudin, M. L. (2009). A Review: Missing Value in Environmental Data Sets. Second International Conference and Workshops on Basic and Applied Sciences & Regional Annual Fundamental Science Seminar.
[15] Osborne, J. W., Waters, E. (2003). Four Assumptions of Multiple Regression that Researchers Should Always Test. Practical Assessment, Research and Evaluation, 8(2). 1-5.
Published with license by Science and Education Publishing, Copyright © 2022 A. Loganathan, P. Sumithra and V. Deneshkumar
This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit
https://creativecommons.org/licenses/by/4.0/