COMPANY

PAGE TOP

BLOG

Hadley Wickham’s “Tidy Data” in RapidMiner (Part 2)

Hello everyone, and welcome back!  In the last post, I introduced Hadley Wickham, his concept of “Tidy Data”, and how to solve the first of the five most common problems with “messy” data.  However, I only gave a general overview of the process and left out details that some people might need to use the De-Pivot operator effectively.  As a result, this post will go more in depth on the parameters of the De-Pivot operator and the regex I wrote.

Click to Open

First, let’s review the “attribute name” and “index attribute” parameters.  After clicking on the “Edit List” button (circled in orange), the “Edit Parameter List: attribute name” window seen below will appear.

Click to Open

The regex entered into the “attributes” box will capture the “fare range” columns from the “messy” table (circled in red below).  The name of each column will become the values of a column called “Fare Range” in the “tidy” table.  To name this new column, enter it into the “index attribute” box (circled in green above and below).  The values of the “fare range” columns are put into one column (circled in yellow below), and the value entered into the “attribute name” box becomes the name of that new column (circled in blue above and below).

Click to Open

Click to Open

Next, checking the “create nominal index” box is important if you want the values in the “Fare Range” column to be the actual fare ranges.  If you don’t check the box, then the values will become numbers instead.

Finally, let’s talk about the regex I used and how it captures the columns I wanted it to capture.  The regex I wrote contains a set of characters to search for, and this set is contained inside the brackets “[]”.  The set contains a less than sign “<”, a greater than sign “>”, a British Pound sign “£”, a hyphen “-“, and a range of any digit “0-9”.  The set is then followed an plus sign “+”.  The plus sign signifies searching for “one or more matches” for all characters in the set.  If you were to write out what the regex is trying to do, it would read something like “search for all strings that contain one or more of the characters in the preceding set”.

Click to Open

In addition, one thing you may have noticed was the “backslash” (\) just before the hyphen.  This is needed because of the special meaning of a hyphen when used inside of a set.  In regexes, hyphen’s are used to make ranges of characters to search for.  For example, “[0-9]” for all digits, “[a-z]” for all lowercase letters, or “[A-Z]” for all uppercase letters.  The backslash tells the regex that the hyphen is not being used to make a range.  That way, you can search for the hyphen as an actual character in a string.

As a side note, there are twelve characters that have special meanings in regexes:  the backslash “\”, the caret “^”, the dollar sign “$”, the period or dot “.”, the vertical bar or pipe symbol “|”, the question mark “?”, the asterisk or star “*”, the plus sign “+”, the opening parenthesis “(“, the closing parenthesis “)”, the opening square bracket “[“, and the opening curly brace “{“.  Anytime you want to search for one of those special characters, a backslash must be used to tell the regex that you want to ignore the characters’ special meaning.

To wrap up, I hope this tutorial has helped you understand how to use the De-Pivot operator to turn the “messy” pivot table into a “tidy” table.  If you aren’t very familiar with regexes, then that will likely be the trickiest part of using this operator.  Regexes are a very powerful tool and can be used in many RapidMiner operators to search for and edit strings.  Sometimes, like with the De-Pivot operator, a regex is the only option available.  So, the sooner you dive in and learn how to craft good regexes the better.

In the next post, I will take on the second problem presented by Hadley.  The first step in the process will be just like this one, involving the “De-Pivot” operator, but a little extra work will be required to make the data completely “tidy”.

See you then!

$$ \begin{aligned} \newcommand\argmin{\mathop{\rm arg~min}\limits} \boldsymbol{\beta}_{\text{ridge}} & = \argmin_{\boldsymbol{\beta} \in \mathcal{R^p}} \biggl[ ||\boldsymbol{y}-\boldsymbol{X\beta}||^2 + \lambda ||\boldsymbol{\beta}||^2 \biggr] \\ & = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I_{p+1}})^{-1}\boldsymbol{X}^T\boldsymbol{y} \end{aligned} $$
PAGE TOP