Application of Spatio-Temporal Clustering in Forecasting Optimization of Geo-Referenced Time Series

Step 1: The differencing order d is chosen by successive KPSS unit-root tests ^[12] for stationarity, applied to the original data or to the seasonally differenced data of each time series in the cluster. The null hypothesis is that the time series is stationary around a deterministic trend, while the alternative hypothesis is that the series contains a unit root and must be differenced. The selection of the best ARIMA parameters for a single time series is described in ^[9]. Following the interpretation reported in ^[9], if a stable seasonal pattern is selected (i.e., the null hypothesis is not rejected), then d is selected on the original data; otherwise d is determined on the seasonally differenced data. In contrast to ^[9], d is determined by repeating the KPSS test for each time series in the cluster. If even one of them is non-stationary, d is increased by 1, and so on, until every time series in the cluster becomes stationary.
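The cluster-wide choice of d in Step 1 can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation: the names `choose_cluster_d` and `toy_stationarity_check` are hypothetical, the cap `max_d` is an assumption, and the toy check only stands in for a real KPSS test, which would be supplied through the `is_stationary` argument.

```python
from typing import Callable, List, Sequence


def difference(series: Sequence[float]) -> List[float]:
    """First-order differencing: x[t] - x[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def choose_cluster_d(
    cluster: Sequence[Sequence[float]],
    is_stationary: Callable[[Sequence[float]], bool],
    max_d: int = 2,  # assumed upper bound on the differencing order
) -> int:
    """Return the smallest d for which every series in the cluster passes."""
    current = [list(s) for s in cluster]
    for d in range(max_d):
        # A single non-stationary series forces one more differencing pass.
        if all(is_stationary(s) for s in current):
            return d
        current = [difference(s) for s in current]
    return max_d


def toy_stationarity_check(series: Sequence[float], tol: float = 0.5) -> bool:
    """Toy stand-in for KPSS (NOT a KPSS test): compares the two half means."""
    half = len(series) // 2
    first, second = series[:half], series[half:]
    return abs(sum(first) / len(first) - sum(second) / len(second)) <= tol


cluster = [[1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7]]
print(choose_cluster_d(cluster, toy_stationarity_check))
```

The second series has a linear trend, so the whole cluster is differenced once even though the first series is already stationary.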

Step 2: The stepwise algorithm that traverses the model space θ=(p, q) is described in ^[9]. It is a three-step procedure. As in ^[9], the algorithm uses a stepwise search over combinations of the values of p and q. In STClu-Arima this process refers to all time series in the cluster. Step 2.1) The best initial model is selected by searching over small values of p and q, trying all possible combinations of 0, 1 and 2 for each time series in the cluster. As in ^[9], the best initial model for each time series can be selected via the AICc information criterion (the lower the AICc, the better the model). The AICc is defined as follows:

where k is the number of parameters in θ; n is the length of the time series in the cluster; L*(·) is the maximum likelihood estimate of θ on the initial states for each time series in the cluster; and l is the number of time series in the cluster C. STClu-Arima selects as the best model the one that provides the minimal average AICc over the time series in the cluster. Step 2.2) Variations on the current model are considered by changing the current parameters p and/or q by ±1. A variation becomes the new current model in STClu-Arima if it has a lower average AICc. Step 2.3) Repeat Step 2.2) until no model with a lower average AICc can be found.
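The search in Step 2 can be illustrated with a short Python sketch. Two assumptions are made here: the criterion is taken to be the standard small-sample corrected AIC, averaged over the l series of the cluster (all of length n), and the model fitting itself is hidden behind a hypothetical `score(p, q)` callable that returns the average AICc of a candidate. The bound `max_order` is also an assumption.

```python
from typing import Callable, Dict, Sequence, Tuple


def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Standard corrected AIC: -2 log L + 2k + 2k(k+1)/(n-k-1)."""
    return -2.0 * log_likelihood + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)


def average_aicc(log_likelihoods: Sequence[float], k: int, n: int) -> float:
    """Mean AICc over the series of one cluster (assumed equal length n)."""
    return sum(aicc(ll, k, n) for ll in log_likelihoods) / len(log_likelihoods)


def stepwise_search(
    score: Callable[[int, int], float], max_order: int = 5
) -> Tuple[Tuple[int, int], float]:
    """Stepwise search over (p, q) minimising a cluster-average criterion."""
    cache: Dict[Tuple[int, int], float] = {}

    def cached(p: int, q: int) -> float:
        if (p, q) not in cache:
            cache[(p, q)] = score(p, q)
        return cache[(p, q)]

    # Initial model: all combinations of 0, 1 and 2.
    best = min(((p, q) for p in range(3) for q in range(3)),
               key=lambda pq: cached(*pq))
    # Vary p and/or q by +-1 until no neighbour has a lower score.
    improved = True
    while improved:
        improved = False
        p, q = best
        for dp in (-1, 0, 1):
            for dq in (-1, 0, 1):
                cand = (p + dp, q + dq)
                if cand == best or min(cand) < 0 or max(cand) > max_order:
                    continue
                if cached(*cand) < cached(*best):
                    best, improved = cand, True
    return best, cached(*best)


# Hypothetical score surface with its minimum at (p, q) = (4, 1).
print(stepwise_search(lambda p, q: (p - 4) ** 2 + (q - 1) ** 2))
```

In practice `score` would fit the candidate ARIMA(p, d, q) to every series in the cluster and return `average_aicc` of the resulting log-likelihoods.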

The model with the best estimated parameters p, d, q is fitted to all time series in the cluster by least-squares regression. The model coefficients φ and σ are output and allow point forecasts to be produced for the testing time series as many steps ahead as required.
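As a deliberately simplified illustration of fitting one model to a whole cluster, the sketch below handles only the ARIMA(1,0,0) case without intercept. Pooling the lagged pairs of all series into a single least-squares problem is an assumption about how the shared fit could be realised; the function names are hypothetical.

```python
from typing import List, Sequence


def fit_pooled_ar1(cluster: Sequence[Sequence[float]]) -> float:
    """Least-squares AR(1) coefficient shared by all series in the cluster.

    Minimises the sum over all series of (x[t] - phi * x[t-1])^2.
    """
    num = 0.0
    den = 0.0
    for series in cluster:
        for prev, cur in zip(series, series[1:]):
            num += prev * cur
            den += prev * prev
    return num / den


def forecast_ar1(phi: float, last_value: float, steps: int) -> List[float]:
    """Point forecasts for `steps` horizons ahead: x[n+h] = phi^h * x[n]."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = phi * value
        forecasts.append(value)
    return forecasts


cluster = [[8.0, 4.0, 2.0, 1.0], [16.0, 8.0, 4.0, 2.0]]
phi = fit_pooled_ar1(cluster)
print(phi, forecast_ar1(phi, cluster[0][-1], 3))
```

Each series is forecast from its own last observation, while the coefficient is estimated once for the cluster.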

4. Case Studies and Applications

The first goal of our experimentation is to analyze the accuracy and efficiency of the proposed algorithm on massive data and to identify practical applications. The second is to test the hypotheses that motivate this work:

• a traditional k-means clustering algorithm, applied with a spatio-temporal distance measure, finds the spatio-temporal correlation in the clustered data;

• for all time series in a cluster, the predictive accuracy of the regression models that forecast the future data of the stations improves, while the run-time computation cost decreases.

To collect experimental evidence for testing these hypotheses, we designed an experiment consisting of two phases. The first step of our algorithm calculates the spatio-temporal distance measure. On the resulting spatio-temporal distance matrix we then applied the well-known k-means clustering, thereby grouping the geo-referenced time series that are correlated in time and space.
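The first phase can be pictured with the following Python sketch. The distance used here is purely illustrative: a convex combination (weight `alpha`, an assumed parameter) of normalised Euclidean distances between station coordinates and between series values, which is not necessarily the measure used by STClu-Arima. Running k-means on the rows of the distance matrix, with a deterministic farthest-point initialisation, is likewise an assumption.

```python
import math
from typing import List, Sequence


def euclid(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def st_distance_matrix(coords, series, alpha: float = 0.5) -> List[List[float]]:
    """Illustrative spatio-temporal distance: weighted, normalised mix."""
    n = len(coords)
    spatial = [[euclid(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    temporal = [[euclid(series[i], series[j]) for j in range(n)] for i in range(n)]
    s_max = max(max(row) for row in spatial) or 1.0   # avoid division by zero
    t_max = max(max(row) for row in temporal) or 1.0
    return [[alpha * spatial[i][j] / s_max + (1 - alpha) * temporal[i][j] / t_max
             for j in range(n)] for i in range(n)]


def kmeans_rows(rows: Sequence[Sequence[float]], k: int, iters: int = 100) -> List[int]:
    """Lloyd's k-means on matrix rows with farthest-point initialisation."""
    centers = [list(rows[0])]
    while len(centers) < k:
        far = max(rows, key=lambda r: min(euclid(r, c) for c in centers))
        centers.append(list(far))
    labels: List[int] = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: euclid(r, centers[c])) for r in rows]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [rows[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Four hypothetical stations: two close pairs with similar series.
coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
series = [[1, 2, 3], [1, 2, 4], [9, 8, 7], [9, 8, 6]]
dist = st_distance_matrix(coords, series)
print(kmeans_rows(dist, k=2))
```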

In the second step of our algorithm, we applied our STClu-Arima regression technique to predict future values of the time series. The function STClu-Arima is implemented in R and operates on vectors representing the values of each time series in the cluster, as explained in Section 3.2.

To evaluate the second step of our algorithm, we compare the results (RMSE and computation run time) of a prediction model that does not take the spatio-temporal correlation (spatio-temporal distance measure) into account with those of the model we created, which does take the spatio-temporal dependence into account.
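The two quantities compared in this evaluation, RMSE and run time, can be computed as in the minimal Python sketch below. The helper names are hypothetical, and `time.perf_counter` is simply one reasonable way of measuring wall-clock time.

```python
import math
import time
from typing import Any, Callable, Sequence, Tuple


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and forecast values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical test values for one station and forecasts from one model.
observed = [1.0, 1.0, 1.0, 1.0]
forecast = [3.0, 3.0, 3.0, 3.0]
error, seconds = timed(rmse, observed, forecast)
print(error, seconds)
```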

The proposed model has applications in modeling spatio-temporal distributions in several scientific disciplines, for better forecasting in environmental sciences, climate prediction and meteorology. This paper presents a new approach, evaluated on experimental Earth science data sets, that reduces run-time computation costs and improves the accuracy of the prediction algorithm.

This paper also explores and analyzes geophysical distributed time-series data collected by sensors spread over the Earth's surface, in order to find an adequate and more accurate model for forecasting time series. It takes into consideration not just one dimension (space or time) but both spatial and temporal information, with the aim of improving predictive ability with respect to existing models that ignore such information.

4.1. Data Description

Figure 2. 2(a) Spatial position of Eco-Texas sensor stations and respective time series representing: 2(b) Temperature, 2(c) Wind-Speed and 2(d) Ozone

For experimentation and evaluation of our algorithm, we considered five groups of data collected via three sensor networks: Eco-Texas, Eastern-Wind and SAC (South-American-Temperature).

The first three data sets come from Eco-Texas and contain measurements of Temperature, Wind-Speed and Ozone acquired from the 26 sensor stations installed in Texas and collected by the Texas Commission on Environmental Quality (TCEQ) in the period May 5-19, 2009. The data are hourly measurements (http://www.tceq.state.tx.us/) of temperature (Temperature range [0,89]), wind speed (Wind Speed range [0.3,29.5]) and ozone (Ozone range [48,105]). For these data sets, we took the period May 5-18, 2009 as the training set and May 19, 2009 as the testing set.

The fourth data set consists of wind speed values (Wind Speed range [0.12,30.4]) from the 1326 Eastern-Wind stations, which measure at 80 meters above sea level in the eastern region of the United States. The values were acquired every 30 minutes starting from January 1, 2004, 0:00 (www.ropbox.com). For this data set, we used the wind speed measurements of 144 periods (1-4 January) as the training set and the last 48 intervals (5 January) as the testing set. Figure 3 shows the spatial position of the sensors and the respective time series.

The fifth data set of our experimentation refers to 6477 sensors installed in South America, called South America Air Climate (SAC), which collect monthly measurements of air temperature (144 snapshots, 12 years). We applied our STClu-Arima algorithm to a subset of these sensors (900 sensors). For this data set, we used the temperature measurements from January 1999 to December 2010 as the training set and the last 12 months (January to December 2011) as the testing set.

4.2. Experimental Methodology

To determine the optimal number of clusters, we used the average silhouette width, a method for interpreting and validating clustered data. The technique provides a concise graphical representation of how well each object lies within its cluster.

The average silhouette width measures how tightly the data in each cluster are grouped and how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of k is used in the k-means algorithm, some of the clusters will typically display much narrower silhouettes than the rest. Thus, silhouette plots and averages may be used to determine the natural number of clusters within a data set.

For the Eco-Texas data sets, we applied k-means clustering to all 26 sensors on the basis of the calculated spatio-temporal distance measure and tried all possible solutions, beginning with k=2, k=3, and so on up to k=25 clusters. Then, for each solution, we calculated the average silhouette width.

For the Eastern-Wind and SAC data sets, selecting the optimal number of clusters was much more difficult because these two data sets have 1326 and 900 sensor stations, respectively; we searched for the optimal number of clusters from k=10 to k=100 in steps of 10.

We selected as the optimal number of clusters the value at which the average silhouette width reached a local maximum.
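The selection rule can be sketched in Python as follows. The silhouette computation follows the usual definition on a precomputed distance matrix; the function names and the example widths are hypothetical, and treating a boundary value as a local maximum when it exceeds its single neighbour is an assumption.

```python
from typing import Dict, List, Sequence


def average_silhouette_width(dist: Sequence[Sequence[float]],
                             labels: Sequence[int]) -> float:
    """Average silhouette width from a distance matrix and cluster labels."""
    clusters: Dict[int, List[int]] = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def local_maxima(width_by_k: Dict[int, float]) -> List[int]:
    """Values of k whose width exceeds that of both neighbouring k."""
    ks = sorted(width_by_k)
    picks = []
    for idx, k in enumerate(ks):
        left = width_by_k[ks[idx - 1]] if idx > 0 else float("-inf")
        right = width_by_k[ks[idx + 1]] if idx + 1 < len(ks) else float("-inf")
        if width_by_k[k] > left and width_by_k[k] > right:
            picks.append(k)
    return picks


# Four points on a line at 0, 1, 10 and 11.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(x - y) for y in points] for x in points]
print(average_silhouette_width(dist, [0, 0, 1, 1]))
print(local_maxima({2: 0.30, 3: 0.50, 4: 0.40, 5: 0.45, 6: 0.20}))
```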

Figure 3. Spatial position of Eastern-Wind dataset sensors and respective time series

Figure 4. Spatial position of SAC sensors and respective time series

In the second step of our algorithm, we applied STClu-Arima to all time series in each cluster and calculated the RMSE. We then compared the RMSE of STClu-Arima with that of auto.ARIMA.

To assess the accuracy of STClu-Arima against auto.ARIMA (the prediction quality and efficiency of the proposed learning model), we applied the Wilcoxon signed rank test for the selected number of clusters on all five data sets. We compared the STClu-Arima models with "auto.ARIMA", which learns a separate ARIMA model for each station, choosing the best parameters according to the minimal AICc. For all data sets, STClu-Arima uses the spatial positions of the transmitting sensor stations (latitude and longitude), while auto.ARIMA does not.
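The rank sums behind the Wilcoxon signed rank test can be computed as in the Python sketch below. The sign convention is an assumption: differences are taken as the auto.ARIMA squared residual minus the STClu-Arima squared residual, so that a larger positive rank sum favours STClu-Arima. Zero differences are dropped and tied absolute differences receive average ranks, as is conventional; no p-value is computed here.

```python
from typing import Sequence, Tuple


def wilcoxon_rank_sums(sq_res_auto: Sequence[float],
                       sq_res_stclu: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, W-) for paired squared residuals of two models."""
    # Zero differences carry no sign information and are discarded.
    diffs = [a - s for a, s in zip(sq_res_auto, sq_res_stclu) if a != s]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the absolute differences are tied.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical squared residuals on five test points of one station.
auto = [4.0, 9.0, 1.0, 16.0, 2.0]
stclu = [1.0, 4.0, 2.0, 16.0, 0.5]
print(wilcoxon_rank_sums(auto, stclu))
```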

4.3. Discussions of the Results

Table 1 reports the results of our experiments. The first column shows the name of the data set, while the second column shows the average silhouette width for the selected number of clusters. Columns 3-7 report the comparative analysis of STClu-Arima vs auto.ARIMA on the testing data set, based on statistical significance tests (pairwise Wilcoxon signed rank tests) that compare the squared residuals of the paired test time series. These columns give the number of stations where STClu-Arima performs (statistically) better (columns 3-4), worse (columns 5-6) or equal (column 7) compared with auto.ARIMA. (+) counts the stations where STClu-Arima performs better than auto.ARIMA (i.e. WT+ > WT-), (-) counts the stations where auto.ARIMA performs better than STClu-Arima (i.e. WT+ < WT-), and (=) counts the stations where both algorithms perform equally well (i.e. WT+ = WT-). (++) and (--) indicate the cases in which H0 (the hypothesis of equal performance) is rejected at the 0.05 significance level.
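The per-station symbols described above could be tallied as in this Python sketch. It assumes that each station contributes a triple (WT+, WT-, p-value) obtained elsewhere, and that a significant result is counted under the doubled symbol only; whether Table 1 also counts such stations under the single symbol is not stated, so this is an assumption. All names and example numbers are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify(w_plus: float, w_minus: float, p_value: float,
             alpha: float = 0.05) -> str:
    """Map one station's test outcome to '+', '++', '-', '--' or '='."""
    if w_plus == w_minus:
        return "="
    sign = "+" if w_plus > w_minus else "-"
    # A doubled symbol marks rejection of equal performance at level alpha.
    return sign * 2 if p_value < alpha else sign


def tally(results: Iterable[Tuple[float, float, float]]) -> Counter:
    """Count stations per symbol, as in one row of a comparison table."""
    return Counter(classify(wp, wm, p) for wp, wm, p in results)


stations = [
    (9.0, 1.0, 0.03),   # STClu-Arima better, significant
    (6.0, 4.0, 0.40),   # STClu-Arima better, not significant
    (2.0, 8.0, 0.01),   # auto.ARIMA better, significant
    (5.0, 5.0, 1.00),   # equal rank sums
    (3.0, 7.0, 0.20),   # auto.ARIMA better, not significant
]
print(tally(stations))
```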

Figure 6(a) shows the RMSE of STClu-Arima versus auto.ARIMA for Eco-Texas(Wind), while Figure 5(a) shows the spatial location of the clustered sensor stations that measure similar spatio-temporal values.

Figures 6(b) and 6(c) report the RMSE comparison for the Eco-Texas(Temperature) and Eco-Texas(Ozone) data sets.

The fourth row of Table 1 reports the results obtained for the Eastern-Wind data set with 30 clusters, while Figure 7(a) shows the RMSE of STClu-Arima vs auto.ARIMA for the same data set. Figure 5(b) shows the spatially clustered Eastern-Wind sensor stations.

The fifth row of Table 1 reports the results of the pairwise Wilcoxon signed rank test, based on RMSE, for the SAC data set with 90 clusters. Figure 7(b) shows the RMSE of STClu-Arima vs auto.ARIMA for the same data set, while Figure 5(c) shows the spatially clustered stations.

We also measured the run time (in seconds) of both steps of our algorithm for all five data sets.

Table 1. auto.ARIMA against STClu-Arima

Table 2. Computation run time

Table 2 reports the run time of all performed operations:

• the run time for performing auto.ARIMA (column 2);

• the run time for computing the spatio-temporal distance matrix (column 3);

• the run time for applying spatio-temporal clustering (column 4);

• the run time for computing the STClu-Arima algorithm (column 5).

As the baseline, we use the run time of auto.ARIMA on the training time series, which calculates the best forecasting parameters for each series independently of the spatio-temporal correlation between the time series.

The run times spent learning the prediction models with auto.ARIMA and STClu-Arima are reported for all data sets in Table 2: Eco-Texas (Wind) in row 2, Eco-Texas (Temperature) in row 3, Eco-Texas (Ozone) in row 4, Eastern-Wind with 30 clusters in row 5, and SAC in row 6.

The column AA-Tra shows the run time of auto.ARIMA on the training set; the column STDist shows the run time for calculating the spatio-temporal distance matrix; the column STClu shows the run time for applying k-means clustering; and the last column, STClu-Arima, gives the run time for applying the STClu-Arima algorithm.

The presented procedure shows that, by taking the spatio-temporal distance matrix into account and clustering stations that measured similar data in space and time, we obtained better results when optimizing the choice of the forecasting parameters for the STClu-Arima model. The number of stations where STClu-Arima performs better than or equal to auto.ARIMA is always greater than the number of stations where auto.ARIMA outperforms STClu-Arima.

Figure 5. (a) Spatially clustered sensor stations of Eco-Texas(Wind) in 7 clusters that measure similar spatio-temporal values. (b) Spatially clustered sensor stations of Eastern-Wind in 30 clusters. (c) Spatially clustered sensor stations of SAC in 90 clusters

Figure 6. (a) RMSE for auto.ARIMA vs STClu-Arima for Eco-Texas (Temperature) for 5 clusters. (b) RMSE for auto.ARIMA vs STClu-Arima for Eco-Texas (Wind) for 7 clusters. (c) RMSE for auto.ARIMA vs STClu-Arima for Eco-Texas (Ozone) for 4 clusters

Figure 7. (a) RMSE auto.ARIMA vs STClu-Arima for Eastern-Wind data set for 30 clusters. (b) RMSE auto.ARIMA vs STClu-Arima for SAC data set for 90 clusters
