Time Series as Supervised Learning: Lag Variables
Hello everyone. This post is about how to reframe a time series forecasting problem as a supervised learning problem using RapidMiner. The data that I work with in this tutorial is the “Lake Huron” dataset provided in the “data sets” folder found inside the “Time Series Extension Samples” folder. However, this extension, and those folders, are not provided by RapidMiner by default, so if you would like to follow along with this post, and you haven’t already installed the “Time Series” extension, please find it in the Marketplace and install it from there.
Why would you want to reframe a time series forecasting problem as a supervised learning problem? The answer to that depends on your circumstances, but one reason could be that you have tried traditional time series approaches, such as an ARIMA model, and those haven’t produced the results you want. Also, perhaps you came into machine learning having never learned how to perform traditional time series forecasting, but you need to deal with data where time is an essential component of the analysis. Either way, after reframing the problem you will have access to all the standard linear and nonlinear machine learning algorithms. However, that also means you need to calculate information that can be used as inputs into the standard machine learning algorithms for them to learn from. While there are a few different ways to create these features, this post will focus on the most basic kind called lag features.
As can be seen above, the basic flow of the process is to retrieve the data and then use the “Windowing” operator to create the lag features. After that, the data is modeled using the “Linear Regression” operator inside of the “Cross Validation” operator, the model is applied to the test data with the “Apply Model” operator, and the performance of the model is checked using the “Performance” operator.
The Lake Huron dataset is a simple univariate time series dataset that consists of simply the water surface level of the lake (in feet) recorded once every year on the same date. However, this data cannot be used as is with traditional machine learning models used for supervised learning because these models require input attributes that can be learned from and are associated with the values you want to predict. As a result, these attributes must be created, and the basic way to do this is to use the windowing technique.
For example, let’s say we want to use the previous two years of water level data to predict the next year’s water level. In the example shown above, that means that the two water level data points captured by the red line are used to predict the data point circled in blue, and the two water level data points captured by the green line are used to predict the data point circled in purple. This continues until all the data has been transformed, and the data that comes out of the “Windowing” operator looks like the example below.
The “Windowing” Operator
Now, let’s look at how to perform this transformation in RapidMiner using the “Windowing” operator. Going from top to bottom, first set the “time series attribute” parameter to the name of the attribute that contains the values you want to transform (in our case, this is the “Lake surface level / feet” attribute). Then make sure “has indices” is checked, and choose “Date” for the “indices attribute”. Next is “window size”, which decides how many prior values go into each window. As in the example given earlier, I chose to use only two prior values. Next is “step size”, which determines where each new window starts. In our case, since the “step size” value is set to 1, the first window will contain the first and second examples, the next window the second and third examples, the next window the third and fourth examples, and so on until the end of the data. After that, make sure to check the “create horizon (labels)” box so that the labeled data needed for training is generated, set the “horizon width” to 1 (since we want to predict one value), and set the “horizon offset” to 0 (so that the value immediately after each window is used as the label).
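For readers who think more easily in code, the windowing logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the “Windowing” operator is implemented. The function name `make_windows` is hypothetical, its parameter names simply mirror the operator’s parameters, and the water levels are made-up values rather than the actual Lake Huron records.

```python
def make_windows(series, window_size=2, step_size=1,
                 horizon_width=1, horizon_offset=0):
    """Slide a window over `series` and return (features, labels).

    Hypothetical helper: each feature row holds `window_size` consecutive
    values, and each label row holds the `horizon_width` values that start
    `horizon_offset` positions after the end of the window.
    """
    features, labels = [], []
    # Total number of values one window plus its horizon needs.
    span = window_size + horizon_offset + horizon_width
    for start in range(0, len(series) - span + 1, step_size):
        window_end = start + window_size
        horizon_start = window_end + horizon_offset
        features.append(series[start:window_end])
        labels.append(series[horizon_start:horizon_start + horizon_width])
    return features, labels


# Made-up yearly water levels in feet (illustrative only).
levels = [580.0, 581.0, 580.5, 579.5, 580.2]

features, labels = make_windows(levels)
for window, horizon in zip(features, labels):
    print(window, "->", horizon)
```

With a window size of 2, a step size of 1, a horizon width of 1, and a horizon offset of 0, every pair of consecutive values becomes one row of inputs, and the value that immediately follows the pair becomes that row’s label.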
Modeling the Data
Next, the “windowed” data is passed into the “Cross Validation” operator, and I simply left all the parameters at their defaults (making this a 10-fold cross validation with shuffled sampling). Inside, I used a “Linear Regression” operator with default parameters to train on the training data, the “Apply Model” operator to apply the trained model to the test data, and the “Performance” operator to check the error of the model. By default, the “root mean squared error” (RMSE) is returned, but I chose to have the “absolute error” (also called “mean absolute error” or MAE) returned as well.
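As a rough point of comparison outside RapidMiner, the same train/apply/evaluate loop can be sketched with NumPy. Several things here are assumptions rather than a reproduction of the process above: the function name `k_fold_linear_regression` is hypothetical, the model is plain ordinary least squares (which is not necessarily identical to the “Linear Regression” operator’s defaults), the errors are pooled across folds before RMSE and MAE are computed (a tool may instead average per-fold scores), and the series is synthetic.

```python
import numpy as np


def k_fold_linear_regression(X, y, n_folds=10, seed=0):
    """Shuffled k-fold cross validation of a least squares model.

    Hypothetical helper: returns (rmse, mae) computed over the pooled
    test-fold errors.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))          # shuffle the row order
    folds = np.array_split(indices, n_folds)   # roughly equal test folds
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(indices, test_idx)
        # Prepend a column of ones so the model learns an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        errors.append(y[test_idx] - A_test @ coef)
    errors = np.concatenate(errors)
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))
    return rmse, mae


# Synthetic series (not the Lake Huron data): a slow wave plus noise.
rng = np.random.default_rng(0)
series = 580.0 + np.sin(np.arange(60) / 3.0) + rng.normal(0.0, 0.2, 60)

# Two lag values as inputs, the following value as the label.
X = np.column_stack([series[:-2], series[1:-1]])
y = series[2:]

rmse, mae = k_fold_linear_regression(X, y, n_folds=10)
print("RMSE:", rmse, "MAE:", mae)
```

Because the rows are shuffled before they are split, some training folds contain observations that come later in time than the test rows. That matches the shuffled sampling described above, but it is worth keeping in mind when interpreting the scores of any time series model.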
Evaluating the Results
After the entire process has finished running (which should only take a few seconds), the output that shows up in the “Results” view will be the performance of the model, the ExampleSet containing the model’s predictions, and the model itself. The first thing to look at is the set of performance metrics, which gives a quick overview of how well the model performed.
The first metric shown is the RMSE, which tells you, on average, how far off each prediction was from the actual value in terms of the original unit of measure. In this case, the unit of measure was feet, so the model was able to predict the surface level of the lake within less than a foot (~8.05 inches). Whether this is good or bad depends on how accurate you need the model to be, and that is where domain knowledge of the problem at hand comes into play.
Next, we have MAE, and it is very closely related to the RMSE in that it also tells you, on average, how far off each prediction was from the actual value in terms of the original unit of measure. As can be seen, however, the scores are different, with the MAE being lower than the RMSE. This is because the errors calculated by the RMSE are squared before they are averaged, which leads to the RMSE giving a relatively high weight to large errors. This is good if large errors are bad for a given problem and should be penalized, so which one is better will depend on each individual situation.
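To make the difference between the two metrics concrete, here is a small Python sketch using made-up numbers (they are not results from the Lake Huron model). Both sets of predictions have the same MAE, but the set containing one large error has a higher RMSE. The helper names `rmse` and `mae` are just illustrative.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error: square the errors, average, take the root."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, predicted):
    """Mean absolute error: average the absolute errors."""
    absolute = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)


observed = [10.0, 10.0, 10.0, 10.0]
even_errors = [11.0, 9.0, 11.0, 9.0]      # every prediction is off by 1
one_big_error = [10.0, 10.0, 10.0, 14.0]  # three perfect, one off by 4

print(mae(observed, even_errors), rmse(observed, even_errors))      # 1.0 1.0
print(mae(observed, one_big_error), rmse(observed, one_big_error))  # 1.0 2.0
```

The total amount of error is the same in both cases, so the MAE does not change. The RMSE doubles in the second case because the single error of 4 is squared to 16 before averaging.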
Plotting the Results
Finally, using the “Charts” feature I plotted two different kinds of plots that are commonly used to get a visual feel for the performance of the model. The first one (pictured above) plots the observed values against the predicted values. First, make sure that the “Chart style” is set to “Scatter Multiple” (circled in red), and then make sure that the “prediction(Lake surface level / feet + 1 (horizon))” attribute is on the x-axis (circled in blue) and that the “Lake surface level / feet + 1 (horizon)” attribute is on the y-axis (circled in green). In addition, to plot the red line that represents perfect predictions (also called a “1:1” line), plot the “prediction(Lake surface level / feet + 1 (horizon))” attribute on the y-axis, and then click on the “Points and Lines…” button (circled in purple). This will show the “Points and Lines” window seen below.
In this window, make sure the “Points” box for the predictions dimension is unchecked and make sure the “Lines” box is checked (circled in red). This will produce the red line seen in the chart. The lower the RMSE and MAE, the more tightly clustered the plotted points will be around the red line.
Another kind of plot that can reveal useful information about the error in the model is a residual plot. However, it requires information that will need to be created using a “Generate Attributes” operator, so let’s look at how to do that now.
To create a residual plot in RapidMiner, first connect the “Cross Validation” operator’s test result set output port (labeled “tes”) to the “Generate Attributes” operator’s example set input port (labeled “exa”). Next, click on the “Generate Attributes” operator to show the “Parameters” window shown on the right and click on the “Edit List” button (circled in red) to show the “Edit Parameter List: function descriptions” window shown at the bottom left. In this window, you want to create a “residuals” attribute by subtracting the predicted values from the observed values (Residuals = Observed – Predicted). In addition, I created a “perfect-fit” attribute so that a reference line could be drawn on the plot. The names of the attributes should be typed into the “attribute name” boxes (circled in blue), and the functions used to create the attributes’ values should be typed into the “function expressions” boxes (circled in green).
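The calculation itself is simple enough to sketch in Python. The observed and predicted values below are made up for illustration, and the sketch assumes that the “perfect-fit” attribute is a constant 0, since a residual of zero corresponds to a perfect prediction.

```python
# Made-up observed and predicted water levels in feet (illustrative only).
observed = [580.5, 579.5, 580.25]
predicted = [580.25, 580.0, 580.25]

# Residuals = Observed - Predicted, one value per example.
residuals = [o - p for o, p in zip(observed, predicted)]

# Assumption: the reference line sits at zero for every example.
perfect_fit = [0.0] * len(residuals)

for r, z in zip(residuals, perfect_fit):
    print(r, z)
```

A positive residual means the model predicted too low a value for that example, and a negative residual means it predicted too high a value.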
The way to create this chart is the same as the previous chart. The only difference is that the “residuals” and “perfect-fit” attributes should be plotted on the y-axis instead. The plot shouldn’t show any clear error patterns and the data should be clustered as closely as possible around the “perfect-fit” line. This residual plot is a pretty good example of what a healthy residual plot should look like.
In summary, the main operator used to turn a time series problem into a supervised learning problem is the “Windowing” operator, and this post focused on how to use that operator to create lag features. The “window size” determines how many prior values are used to create each window, the “step size” determines where each new window starts, the “horizon width” determines how many future values are used as the label, and the “horizon offset” determines how far after the window the horizon begins. After that, the data can be used with any of the standard machine learning algorithms to make predictions. Once the data has been modeled and predictions made, the error of the model can be checked using the RMSE and MAE scores provided by the “Performance” operator, and it can be visually inspected using an “Observed vs Predicted” plot and a “Residual” plot.