Time Series as Supervised Learning: TSFRESH for Classification
Hello everyone, and welcome. This is the third post in a series on how to reframe a time series forecasting problem as a supervised learning problem using RapidMiner. The first two posts showed how to do this by creating “lag features” and “rolling statistics” using only RapidMiner operators, and they focused on regression-based machine learning problems. This post discusses how to use a Python package called TSFRESH, run from the “Execute Python” operator, to achieve the same reframing for a classification-based machine learning problem.
Let’s get started!
TSFRESH, which stands for “Time Series Feature extraction based on scalable hypothesis tests”, is a Python package for time series analysis that contains feature extraction methods and a feature selection algorithm. Currently, it automatically extracts 64 features that describe both basic and complex characteristics of a time series (such as the number of peaks, average value, maximum value, time reversal symmetry statistic, etc.), and those features can be used to build regression- or classification-based machine learning models. More technical users can also add custom features, but that is beyond the scope of this post.
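To make the idea of a “feature” concrete, here is a minimal sketch that computes three of the simpler characteristics mentioned above by hand with NumPy. The function name `basic_features` and the peak definition (a point strictly larger than both of its neighbours) are assumptions made for illustration; TSFRESH's own calculators are more general and may define peaks differently.

```python
import numpy as np


def basic_features(x):
    """Compute a few simple descriptive features of one time series."""
    x = np.asarray(x, dtype=float)
    # Simplified peak definition (an assumption for this sketch):
    # a point is a peak if it is strictly larger than both neighbours.
    interior = x[1:-1]
    peaks = int(np.sum((interior > x[:-2]) & (interior > x[2:])))
    return {
        "mean": float(x.mean()),
        "maximum": float(x.max()),
        "number_peaks": peaks,
    }


# Hypothetical sensor readings for a single robot.
signal = [0, 2, 1, 3, 1, 0, 4, 2]
print(basic_features(signal))
```

Each time series is reduced to a handful of numbers like these, and those numbers become the columns a classifier is trained on.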
There are two related datasets used in this post. The first dataset contains time series data recorded by 6 sensors from 88 robots. The second dataset contains information on whether a failure was reported by the robot or not. If you would like to follow along in RapidMiner, the first dataset can be downloaded here, and the second here.
The data passed to the TSFRESH function discussed in this post can be in one of three formats. The default is what’s called a “flat dataframe”, where each time series has its own column. The data used in this post is in this format.
The “id” column (outlined in red) is mandatory since it tells TSFRESH which entity each time series belongs to, and, in our case, each “id” identifies an individual robot. The “time” column (outlined in blue) is optional but strongly recommended, and functions as the column for TSFRESH to sort the time series on. If this column is not included, TSFRESH will assume that the dataframe is already sorted in increasing order. The rest of the columns (outlined in green) contain the individual time series, and, in our case, they are the data obtained from each robot’s 6 sensors.
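As a rough illustration of the layout described above, the following sketch builds a tiny flat dataframe with pandas. The values and the column names `sensor_1` and `sensor_2` are made up for this example and are not taken from the robot dataset.

```python
import pandas as pd

# Two hypothetical robots (id 1 and 2), three time steps each, two sensors.
timeseries = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 2],   # which robot each row belongs to
    "time":     [2, 0, 1, 0, 1, 2],   # deliberately out of order for robot 1
    "sensor_1": [10, 11, 12, 3, 5, 4],
    "sensor_2": [0, 1, -1, 2, 2, 1],
})

# Sorting by id and time shows the ordering that the "time" column makes possible.
ordered = timeseries.sort_values(["id", "time"]).reset_index(drop=True)
print(ordered)
```

Without the “time” column there would be no way to recover the correct order for robot 1, which is why supplying it is the safer choice.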
Regardless of the input dataframe’s format, the unfiltered output dataframe is always in the same form, where each column contains the data for a feature, and each row the data for an id:
In our case, this means that, unfiltered, each column contains the data for one of the 64 features calculated for each of the 6 sensors, and each row contains the data for one of the 88 robots. However, the function discussed in this post will perform filtering to exclude irrelevant features, so many of those calculated features will be removed.
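Taking the numbers above at face value, the unfiltered result for our data would have 88 rows and 64 × 6 = 384 feature columns. The sketch below imitates that shape on a toy scale with a plain pandas group-by. It is only a stand-in for TSFRESH, using two hand-picked statistics and made-up sensor names, to show the “one row per id, one column per sensor and feature” layout.

```python
import pandas as pd

# Hypothetical flat dataframe: 2 robots, 3 time steps, 2 sensors.
timeseries = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 2],
    "time":     [0, 1, 2, 0, 1, 2],
    "sensor_1": [1.0, 2.0, 3.0, 4.0, 4.0, 4.0],
    "sensor_2": [0.0, 1.0, 0.0, 2.0, 5.0, 2.0],
})

# One row per id, one column per (sensor, statistic) pair.
features = timeseries.groupby("id")[["sensor_1", "sensor_2"]].agg(["mean", "max"])
# Flatten the two-level column index into single names such as "sensor_1__mean".
features.columns = [f"{sensor}__{stat}" for sensor, stat in features.columns]

print(features.shape)   # 2 ids, 2 sensors x 2 statistics
print(features)
```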
First, the “timeseries” and “failures” data files are read in. Next, the data is passed into the “Execute Python” operator, which runs code that calls a TSFRESH function. After that, the data is joined together, and roles are set inside of the “Data Preparation” subprocess. Finally, the data is modelled inside of the “Cross Validation” operator using the “Logistic Regression” operator, the model is applied to test data using the “Apply Model” operator, and the performance of the model is checked using the “Performance (Binomial Classification)” operator.
The most important function used is the “extract_relevant_features” function from TSFRESH (outlined above in red), and the output of this function is assigned the name “extracted_features”. This function is labelled as a “convenience” function within TSFRESH because it extracts the features, imputes/removes missing data, and filters out irrelevant features in a single call. There are 22 parameters that can be modified, but 3 must be specified at a minimum:
– timeseries_container (DataFrame): The dataframe with the time series data.
– y (pandas.Series or numpy.array): The target vector.
– column_id (str): The name of the id column to group by.
The “column_sort” parameter is technically optional, but highly recommended. This parameter is used to specify the column that contains values which can be used to sort the time series data (e.g. time stamps). If you omit this parameter, the dataframe is assumed to be already sorted in increasing order.
The “y” parameter (“failures” in our case) must be a pandas.Series or numpy.array to be used in the function. Since it is currently a pandas.DataFrame, it needs to be converted before it can be used (outlined above in blue), and the converted pandas.Series is named “failures_tmp”. The two parameters of the pandas.Series constructor that need to be set are “data” and “index”, and the inputs need to be “array-like”. This can be done by using the “.values” attribute of a dataframe column. Set the “data” parameter to the “no_failure” column’s values, and then set the “index” parameter to the “id” column’s values.
The “extract_relevant_features” function uses the index of the pandas.Series to match up the correct rows when filtering out irrelevant features. If you don’t set the “index” parameter to the “id” column’s values (or if you make “y” a numpy.array, which has no index), it is assumed that “y” has the same order and length as the “timeseries_container” so that the rows correspond to each other by position.
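The conversion described above can be sketched as follows. The three-row “failures” dataframe is made up for this example, with the rows deliberately not in id order, to show why setting the index matters.

```python
import pandas as pd

# Hypothetical "failures" data; note the rows are not sorted by id.
failures = pd.DataFrame({
    "id":         [3, 1, 2],
    "no_failure": [True, True, False],
})

# Build the target vector: values from "no_failure", index from "id".
failures_tmp = pd.Series(
    data=failures["no_failure"].values,
    index=failures["id"].values,
)

# Because the index holds the ids, values can be looked up by id
# regardless of the row order in the original dataframe.
aligned = failures_tmp.loc[[1, 2, 3]]
print(aligned)
```

Had the index been left at its default (0, 1, 2), the only way to line the targets up with the robots would be by position, which silently goes wrong whenever the two tables are ordered differently.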
Next, the original “id” column in the “timeseries” dataframe becomes the index of the “extracted_features” dataframe. However, indexes cannot be assigned special roles in RapidMiner, so the index of the “extracted_features” dataframe needs to be turned back into a normal column. To do this, the dataframe’s “reset_index” method is used, and all the parameters can be left at their defaults.
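Here is a small sketch of what resetting the index does, using a made-up one-feature dataframe whose index is named “id”. One thing worth checking in your own run: if the index has no name, pandas calls the new column “index” rather than “id”.

```python
import pandas as pd

# Hypothetical feature table with the robot id as the index.
extracted_features = pd.DataFrame(
    {"sensor_1__mean": [2.0, 4.0]},
    index=pd.Index([1, 2], name="id"),
)

# Move the index back into an ordinary column; defaults are sufficient.
extracted_features = extracted_features.reset_index()
print(list(extracted_features.columns))  # ['id', 'sensor_1__mean']
```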
Finally, make sure to return the new “extracted_features” dataframe and the original “failures” dataframe.
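Since the original script is only shown as a screenshot, the following is a reconstruction of what the “Execute Python” code might look like, pieced together from the steps described above. It is a sketch, not the exact script: the column names “id”, “time” and “no_failure” come from the text, while the defensive rename at the end and the placement of the import inside the function are additions of mine.

```python
import pandas as pd


def rm_main(timeseries, failures):
    # Imported here so the script still loads on machines without tsfresh.
    from tsfresh import extract_relevant_features

    # Target vector indexed by robot id, so rows are matched by id
    # rather than by position.
    failures_tmp = pd.Series(
        data=failures["no_failure"].values,
        index=failures["id"].values,
    )

    # Extract, impute and filter features in one call.
    extracted_features = extract_relevant_features(
        timeseries,
        failures_tmp,
        column_id="id",
        column_sort="time",
    )

    # Turn the id index back into a regular column for RapidMiner.
    extracted_features = extracted_features.reset_index()
    # Assumption: if the index came back unnamed, the new column is
    # called "index"; rename it so the later join on "id" still works.
    extracted_features = extracted_features.rename(columns={"index": "id"})

    return extracted_features, failures
```

The two returned dataframes become the two output ports of the operator, in the order they are returned.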
Now that the features have been extracted, the two datasets need to be joined into a single ExampleSet, since the “Cross Validation” operator requires one. Leave all of the parameters at their defaults and click on the “Edit List” button (outlined in red) to bring up the window shown in the second picture. In that window, choose “id” as the key attribute for both dataframes.
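For readers who think in pandas, the join above corresponds roughly to a merge on “id”. The sketch below uses made-up data and assumes an inner join, which I believe matches the default join type in RapidMiner, though that is worth confirming in your version.

```python
import pandas as pd

# Hypothetical outputs of the Execute Python step.
extracted_features = pd.DataFrame({
    "id": [1, 2, 3],
    "sensor_1__mean": [2.0, 4.0, 1.5],
})
failures = pd.DataFrame({
    "id": [3, 1, 2],
    "no_failure": [True, True, False],
})

# Match rows on the shared "id" key; row order in either table does not matter.
joined = extracted_features.merge(failures, on="id", how="inner")
print(joined)
```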
The final step is to set the “id” and “label” roles. First, choose “id” for the attribute name and “id” as the target role, and then click on the “Edit List” button (outlined in red) to open a new window where additional roles can be set. In this window, choose “no_failure” as the attribute name and give it the target role “label”.
For the “Cross Validation” operator, all the parameters are left at their default settings, making it a ten-fold cross validation with “stratified sampling”. With stratified sampling, random subsets of the data are created, but the class distribution (defined by the label attribute) of each subset is the same as that of the whole ExampleSet. In this case, that means that each subset will contain the same distribution of “true” and “false” values as the whole ExampleSet. Inside of this operator, the “Logistic Regression” operator with default parameters is used to build a model, but any classification modeling operator would work (e.g. Decision Tree, Random Forest, Neural Network, etc.). The trained model is then applied to testing data with the “Apply Model” operator, and the performance of the model is checked using the “Performance (Binomial Classification)” operator.
In summary, the “timeseries” and “failures” ExampleSets are fed into the “Execute Python” operator where the TSFRESH “extract_relevant_features” function is used to calculate statistics for each of the 88 robots across all 6 sensors. After that, the returned ExampleSets are moved into the “Data Preparation” subprocess where they are joined together, and then the roles “id” and “label” are given to the “id” and “no_failure” columns. Finally, the data is modelled inside of the “Cross Validation” operator using the “Logistic Regression” operator, the model is applied to test data using the “Apply Model” operator, and the performance of the model is checked using the “Performance (Binomial Classification)” operator.
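To show what stratified sampling means in practice, here is a minimal NumPy sketch that assigns examples to ten folds while keeping the class ratio the same in each. This is not how RapidMiner implements it internally; the function `stratified_folds` and the 20/60 label split are hypothetical and chosen only so the numbers divide evenly.

```python
import numpy as np


def stratified_folds(labels, n_folds=10, seed=0):
    """Return a fold number (0..n_folds-1) per example, balanced within each class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal the shuffled members of this class out to the folds in turn.
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds


# Hypothetical label column: 20 "true" and 60 "false" examples.
labels = np.array([True] * 20 + [False] * 60)
folds = stratified_folds(labels)

for k in range(10):
    in_fold = labels[folds == k]
    print(k, int(in_fold.sum()), int((~in_fold).sum()))
```

Every fold ends up with the same 1:3 ratio of “true” to “false” as the full dataset, so no fold is trained or tested on a lopsided sample.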