COMPANY

PAGE TOP

BLOG

Hadley Wickham’s “Tidy Data” in RapidMiner (Part 5)

Hello everyone, and welcome back!  In my last post, I wrote about a more complicated example for the second most common problem discussed by Hadley Wickham, where multiple variables are stored in one column.  In this post, I will move on to the third most common problem, where variables are stored in both rows and columns.  The data used for this, like the data used as an example by Hadley Wickham in his paper, comes from the Global Historical Climatology Network.  The data is weather data from a single weather station located in Death Valley, CA, USA.  If you would like to follow along in RapidMiner as you read, click here to download the “messy” data.

Problem 3: Multiple Values are Stored in Both Rows and Columns

First let’s get a look at the “messy” version of the data I started with, and the “tidy” version of the data that will result from the cleaning process.

Click to Open

Click to Open

The “day” attributes (circled in green) in the “messy” data need to be turned into values and combined with the values in the “year” and “month” attributes (circled in red).  This will form the “date” attribute in the “tidy” dataset (circled in yellow).  In addition, the “TMAX” and “TMIN” values that are stored in the “element” attribute (circled in blue) in the “messy” data need to be turned into their own attributes called “TMAX” and “TMIN” in the “tidy” data (circled in blue).

Now, let’s look at an overview of the process and summarize what will need to be done.

Click to Open

Click to Open

The two processes shown above are the main process and the subprocess contained within it.  The main process contains four operators and one subprocess, and that subprocess contains two more operators.  Let’s look at each operator individually and go over each part step-by-step.

Click to Open

First, the De-Pivot operator.  Click on the operator to show the “Parameters” box on the right.  Then, click on the “Edit List” button for the “attribute name” parameter.  This will open the “Edit Parameter List: attribute name” box shown on the left.  The attributes you want to select are the “day” attributes that were circled in green up above in the “messy” data example.  To select them, enter a regex into the “attributes” box.  The regex I entered finds all strings that contain the letter “d” followed by any digit “[0-9]” and a plus sign quantifier “+” that tells the regex to find “one or more of the preceding element”.  For more help when crafting your regex, click in the button circled in blue.  This will bring up the “Edit Regular Expression” box shown below.  Here, you can see the possible strings to search for in the “Item Shortcuts” box (circled in red), and the strings that your regex captures in the “Matched Items” box (circled in blue).

Click to Open

The values contained within these captured “day” attributes will become their own column, so the name of this new column (which I named “values”) must be typed into the “attribute name” box.  The name of the new column that will contain the “day” attributes as values (which I named “day”) must be typed into the “index attribute” box.  The data that comes out of the “De-Pivot” operator will look like this:

Click to Open

This data will then be fed into the “Create Date Attribute” subprocess, where it will be further transformed by the “Generate Attributes” and “Select Attributes” operators.  Clicking on the “Generate Attributes” operator will bring up the “Parameters” box shown below on the right.

Click to Open

Click on the “Edit List” button (circled in red) to bring up the “Edit Parameter List: function descriptions” window shown above on the left side.  Enter the name you want for the new attribute you will create in the “attribute name” box, and then enter in the functions you want to use to create the attribute in the “function expression” box.  For help with crafting your function, click on the calculator button (circled in blue).  This will bring up the “Edit Expression: function expressions” window shown below.

Click to Open

Click to Open

The “concat” function I use concatenates the “year”, “month”, and “day” attributes together with two hyphens.  The “cut” function modifies the “day” attribute before the concatenation occurs by cutting out the letter “d” from all the values contained within that attribute.  After that, the data (example below) is fed into the “Select Attributes” operator.

Click to Open

Click to Open

Clicking on the operator will bring up the “Parameters” shown above on the right.  Starting with the “attribute filter type” parameter, click on the down arrow (circled in red) to bring up the options.  There are multiple options to choose from (such as all, single, regular expression, etc.…), but since there are only a few attributes to drop, doing it by hand with the “subset” option is the one I chose.  To choose which attributes to drop, click on the “Select Attributes” button (circled in blue).  This will show the “Select Attributes: attributes” window shown above on the left.  Click on the attributes you want to drop from the “Attributes” box on the left side (in this case, “day”, “month”, and “year”), and move them over to the “Selected Attributes” box on the right side by using the left and right arrows (circled in green).  The data that comes out of this operator can be seen below.

Click to Open

Click to Open

Next, the data goes into the “Reorder Attributes” operator.  First, you need to specify the “sort mode”, so click on the down arrow (circled in red) and choose “user specified”.  Next, click on the “Define Order” box (circled in blue) to show the “Select Ordering Rules: attribute ordering” window shown above on the left.  Move the columns you need from the “Attributes” box on the left to the “Attribute Ordering” box on the right by using the right arrow button (circled in green), then put them in the correct order by clicking on an attribute and moving it up or down with the up and down arrow buttons (circled in purple).  Now the data should look like the example shown below.

Click to Open

Click to Open

At this point, the data is almost “tidy”, with the last step being to create the “TMAX” and “TMIN” attributes.  To do this, the “Pivot” operator is used.  Click on the down arrow (circled in red) and set the “group attribute” parameter to “date”, and then click in the down arrow (circled in blue) and set the “index attribute” parameter to “element”.  Since data is only being shifted around, and no calculations are being made, uncheck the “consider weights” box circled in green.  Everything else should be left as is.  The data should now look like the example shown below.

Click to Open

Click to Open

The last thing to do is to Rename the attributes.  After pivoting the Data, the names of the new Attributes become a combination of the old name “values” and the two values from the old “element” attribute “TMAX” and “TMIN”.  To change the names, click in the down arrow (circled in red) and choose the old name from the “old name” box, then enter in the new name in the “new name” box (circled in blue).  To rename the next and final attribute, click on the “Edit List” button (circled in green), and repeat the same process in the “Edit Parameter List: rename additional attribute” window that opens (shown above in the left side).  After that, click run, and the “tidy” data (shown below) will appear in the “Results” tab.

Click to Open

In summary, the “day” attributes were de-pivoted using the de-pivot operator and turned into the “day” and “values” attributes as a result.  The “date” attribute was then created by concatenating the “year”, “month”, and “day” attributes together, and then those three were dropped by using the “Select Attributes” operator.  Next, the data was reordered and the “element” attribute was pivoted to create the “values_TMAX” and “values_TMIN” attributes.  Finally, those attributes were then renamed to “TMAX” and “TMIN”.

In the next post, I will cover the fourth problem discussed by Hadley Wickham, where multiple types of observational units are stored in the same table.

See you then!

$$ \begin{aligned} \newcommand\argmin{\mathop{\rm arg~min}\limits} \boldsymbol{\beta}_{\text{ridge}} & = \argmin_{\boldsymbol{\beta} \in \mathcal{R^p}} \biggl[ ||\boldsymbol{y}-\boldsymbol{X\beta}||^2 + \lambda ||\boldsymbol{\beta}||^2 \biggr] \\ & = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I_{p+1}})^{-1}\boldsymbol{X}^T\boldsymbol{y} \end{aligned} $$
PAGE TOP