Time Series as Supervised Learning: Rolling Statistics

Hello everyone, and welcome.  This is the second post in a series about how to reframe a time series forecasting problem as a supervised learning problem using RapidMiner.  In the first post, I briefly discussed reasons why you would want to do this, the concept of windowing and how it works, and creating lag features as the most basic way to accomplish the reframing.  If you are new to these concepts, I suggest reading that post first.

In this post, I will talk about a second way to create features for this reframing: calculating what are called “rolling statistics” on the lag features used in each window.  Last time, I said that the “Time Series” extension needed to be downloaded to gain access to the operators and data used in the tutorial, and this is still the case if you haven’t upgraded to RapidMiner 9.  However, if you have upgraded, the time series operators and data are included by default, and the data can be found in the Repository Panel (outlined below in red):


Process Overview

As can be seen above, the basic flow of the process is to retrieve the data and send it into the “Process Windows” operator.  Inside that operator, an “Extract Aggregates” operator calculates the “rolling statistics” (such as the sum, mean, standard deviation, etc.) from the lag variables captured by the “Process Windows” operator.  A collection of all the windows is then passed to the “Append” operator, which appends the windows together.  Finally, the data is modeled using the “Linear Regression” operator inside the “Cross Validation” operator, the model is applied to the test data with the “Apply Model” operator, and the performance of the model is checked using the “Performance” operator.


The “Process Windows” Operator

Going from top to bottom, first choose “single” for the “attribute filter type” and then set the “time series attribute” parameter to “Lake surface level / feet”, since this is the attribute that contains the values we need to transform.  After that, make sure that the “has indices” box is checked (outlined above in red) and that the attribute “Date” is chosen for the “indices attribute”.  This step is optional, since the resulting “id” attribute is not used at all during modeling; it just helps with keeping track of the examples.

Next are “window size” and “step size”.  The number chosen for “window size” decides how many prior values are used to create each window, and, as in the last post, I chose to use only two prior values.  The “step size” parameter determines where each new window starts.  In our case, since “step size” is set to 1, the first window contains the first and second examples, the next window the second and third examples, the next the third and fourth, and so on until the end of the data.

After that, make sure to check the “create horizon (labels)” box (outlined above in blue) so that the labeled data needed for training is generated, set the “horizon width” to 1 (since we want to predict one value), and set the “horizon offset” to 0 (so that the value immediately after each window is used as the label).  Finally, if you chose to create the “id” attribute by checking the “has indices” box, make sure to also check the “add last index in window attribute” box (outlined above in green).  If this isn’t done, the “id” attribute will be removed during the next step.
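To make these parameters concrete, here is a minimal pure-Python sketch of the windowing logic.  The dates and values below are made up for illustration (the actual post uses the lake surface level data), and this is just a stand-in for what the “Process Windows” operator does internally:

```python
# Toy stand-in for the lake data: made-up dates and values.
dates  = ["1990-01", "1990-02", "1990-03", "1990-04", "1990-05"]
values = [580.1, 580.3, 580.2, 580.5, 580.4]

window_size    = 2  # two prior values per window
step_size      = 1  # each new window starts one example later
horizon_width  = 1  # predict a single value
horizon_offset = 0  # label is the value immediately after the window

windows = []
start = 0
while start + window_size + horizon_offset + horizon_width <= len(values):
    end = start + window_size
    windows.append({
        "id": dates[end - 1],                   # "add last index in window attribute"
        "lags": values[start:end],              # the prior values in this window
        "label": values[end + horizon_offset],  # the horizon value to predict
    })
    start += step_size

for w in windows:
    print(w)
```

With these settings the first window holds examples one and two and is labeled with example three, exactly as described above; raising “step size” would make consecutive windows skip examples instead of overlapping.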


Calculating the Rolling Statistics

Within the “Process Windows” operator is the “Extract Aggregates” operator, and this operator calculates “rolling statistics” from the lag variables created by the “Process Windows” operator.  For this demonstration I simply calculated the sum, mean, min, max, and standard deviation.  Depending on the size of the window, the first quartile, mode, and third quartile could also be calculated.  However, since there are only two values in each window, these three statistics would all be the same number, so there is no point in calculating them.
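As a rough stand-alone illustration (not RapidMiner code), the aggregates for a single two-value window can be computed with Python’s standard library; the window values here are made up:

```python
import statistics

# One made-up two-value window, like those produced by the windowing step.
window = [580.1, 580.3]

aggs = {
    "sum":  sum(window),
    "mean": statistics.mean(window),
    "min":  min(window),
    "max":  max(window),
    "std":  statistics.stdev(window),  # sample standard deviation
}
print(aggs)
# Quartiles and mode are skipped here: with only two values per window
# they would carry no extra information, as noted in the post.
```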

One final note on this operator: as discussed previously, if you want to keep the “id” attribute created by the “Process Windows” operator, the “add time series name” parameter needs to be checked (outlined above in red).  If this box is not checked, an “id” attribute that includes the name of the time series will be added (outlined below in red), and the “id” label will be removed from the “Last Date in window” attribute (outlined below in blue):

If this box is checked, the name of the time series will instead be prefixed to the names of each “rolling statistic” attribute (outlined below in red), and the “Last Date in window” attribute will keep its “id” label (outlined below in blue):


Appending the Windows

For the next step, there are no parameters to set, but I want to briefly discuss why the “Append” operator is necessary.  As shown above in the first picture, after the data comes out of the “Process Windows” operator, it has been split up into individual ExampleSets of one row each, corresponding to the individual windows.  As a result, the “Append” operator is needed to merge the individual windows together by stacking them on top of one another (shown in the second picture).  Now the data is ready for modeling.
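The idea can be sketched in plain Python: each window arrives as its own one-row table, and appending simply stacks those rows into a single table.  The attribute names and values below are made up for illustration:

```python
# Each window is its own one-row "table" (a list with a single dict).
window_tables = [
    [{"lag_1": 580.1, "lag_0": 580.3, "label": 580.2}],
    [{"lag_1": 580.3, "lag_0": 580.2, "label": 580.5}],
    [{"lag_1": 580.2, "lag_0": 580.5, "label": 580.4}],
]

# "Append": stack the one-row tables on top of one another.
appended = []
for table in window_tables:
    appended.extend(table)

print(len(appended))  # one row per window
```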


Modeling the Data and Plotting the Results

The modeling of the data, and the evaluation and plotting of the results, is performed in the same way as last time.  The “windowed” data is passed into the “Cross Validation” operator, with all parameters left at their defaults (making this a 10-fold cross validation with shuffle sampling).  Inside, I used a “Linear Regression” operator with default parameters to train on the training data, the “Apply Model” operator to apply the trained model to the test data, and the “Performance” operator to check the performance of the model.  By default, the root mean squared error (RMSE) is returned, but any number of performance criteria can be calculated by checking their boxes.  These results can be evaluated by viewing the raw performance numbers in the “Results” window, and they can be plotted using an “Observed vs Predicted” or “Residual” plot.
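For readers curious about the mechanics outside RapidMiner, here is a stdlib-only sketch of shuffled 10-fold cross validation with a simple one-feature linear model and RMSE.  The data is made up, and this is an illustrative stand-in, not the operators’ actual implementation:

```python
import math
import random

random.seed(0)
# Toy examples: feature x, label roughly 0.8*x + 5 plus uniform noise.
data = [(float(x), 0.8 * x + 5 + random.uniform(-0.5, 0.5)) for x in range(30)]

def fit_ols(rows):
    """Ordinary least squares for y = a*x + b on (x, y) pairs."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    a = sxy / sxx
    return a, my - a * mx

def rmse(model, rows):
    a, b = model
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in rows) / len(rows))

# 10-fold cross validation with shuffle sampling, like the operator's default.
rows = data[:]
random.shuffle(rows)
k = 10
folds = [rows[i::k] for i in range(k)]
scores = []
for i in range(k):
    test = folds[i]
    train = [r for j, fold in enumerate(folds) if j != i for r in fold]
    scores.append(rmse(fit_ols(train), test))

print("average RMSE:", sum(scores) / k)
```

Each fold is held out once while the model is trained on the rest, and the reported performance is the average RMSE across the ten folds, which is what the “Cross Validation” operator delivers to its performance port.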



In summary, the main operators used to turn a time series problem into a supervised learning problem are the “Process Windows” and “Extract Aggregates” operators, and this post focused on how to use them together to calculate “rolling statistics” from lag variables.  With the “Process Windows” operator, the “window size” determines how many prior values are used to create each window, the “step size” determines where each new window starts, the “horizon width” determines how far into the future you want to predict, and the “horizon offset” determines where the horizon begins.  The “rolling statistics” are then calculated using the “Extract Aggregates” operator, and the individual ExampleSets created for each window are appended together using the “Append” operator.  After that, the data can be used with any of the standard machine learning algorithms to make predictions, and the results can be evaluated by looking at the performance criteria numbers and plots.
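The steps summarized above can be sketched end-to-end in a few lines of Python, again on a made-up series standing in for the lake data:

```python
import statistics

# Made-up series in place of the lake surface level data.
series = [580.1, 580.3, 580.2, 580.5, 580.4, 580.6, 580.3, 580.7]
window_size, horizon = 2, 1

# "Process Windows" + "Extract Aggregates" + "Append" in one pass:
rows = []
for start in range(len(series) - window_size - horizon + 1):
    w = series[start:start + window_size]
    rows.append({
        "sum": sum(w), "mean": statistics.mean(w),
        "min": min(w), "max": max(w), "std": statistics.stdev(w),
        "label": series[start + window_size + horizon - 1],
    })

print(len(rows), "windows ready for any supervised learner")
```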
