COMPANY

PAGE TOP

BLOG

Hadley Wickham’s “Tidy Data” in RapidMiner (Part 4)

Hello everyone, and welcome back!  As I mentioned last time, this post will be an extension of the last one.  In that post, I discussed how to solve the second of the five most common problems discussed by Hadley Wickham, where multiple variables are stored in one column.  This time, however, I will use a more complicated example and use two new operators in place of the “Split” and “Rename” operators used last time.  If you need a review of the parts I skip over in this post, please go back and read the last one.  Also, if you would like to follow along in RapidMiner as you read, click here to download the “messy” data used for this post.

The reason why this example is more complicated than the last one is because of how the original “Sex_Age” data is formatted.  Let’s look at the data from the original less complicated example after it comes out of the “De-Pivot” operator.

Click to Open

The elements that we want to split into two separate columns are the “Sex” and “Age” elements.  To take the first “Sex_Age” data point as an example (“F_0-14”), the “Sex” element is the “F” and the “Age” element is the “0-14”.  To separate these elements, I used the “Split” operator, which uses a regex to find and split the elements by the underscore “_” that separates them.  However, using a regex to split different elements of a string requires a character or whitespace that can be used to split on.  If the string you want to split doesn’t have a character to split on, the “Split” operator can’t be used.

Click to Open

The data for this more complicated example (shown above) is such an example.  The “Sex” and “Age” elements have no separating characters to split on, so a different approach was needed.  Let’s look at the overall process for this more complicated example.

Click to Open

As you can see, the “Split” and “Rename” operators used last time have been replaced by the “Generate Attributes” and “Select Attributes” operators.  Let’s look at these two operators in more depth, starting first with the “Generate Attributes” operator.

Click to Open

Clicking on the “Generate Attributes” operator will open the parameters window shown above.  To create the attributes you need, click on the “Edit List” button (circled above in red).  This will open the “Edit Parameter List: function descriptions” window shown below.

Click to Open

Within this window, write the names of the attributes you need to create in the “attribute name” boxes, and the functions that will create those attributes in the “function expressions” boxes.  To add or remove entries, click on the “Add Entry” and “Remove Entry” buttons (circled in red).  If you need help crafting any of the “function expressions”, click on one of the calculator buttons on the right (circled in blue) to bring up the “Edit Expressions: function expressions” window shown below.

Click to Open

In this window there is the “Expression” box where you write a function, the “Functions” box where you can choose functions to use, and the “Inputs” box where you can choose attributes, basic constants, data functions, and macros.  The “cut” function shown above is the one I used to create the “Sex” attribute.  It takes in a string to cut (“Nominal text”) and two numbers that represent the start and end location to make the cut at (“Numeric start” and “Numeric length”).  In our case the “Nominal Text” string is the name of the attribute “Sex_Ages”.  Doing this allows all the string values contained within the column to be cut.  Then, the “Numeric start” of 0 and “Numeric length” of 1 captures the “M” or “F” within the string.

Click to Open

Next, I use the “cut” function again to create the “Age” attribute, but I use the “length” function to automatically calculate the end value.  I did this because the lengths of the “Age” elements in “Sex_Ages” range from a length of 3-5 characters.

Click to Open

Finally, I edit the “Age” attribute I just created by using two “if” functions and a “replaceAll” function.  As shown above, the “if” function only allows for one “conditional if, then, else” statement.  So, if multiple conditions are needed, multiple “if” functions must be chained together, with the next “if” function becoming the “else” value of the preceding “if” function.  Finally, the “replaceAll” function is used as the “else” value in the second “if” statement and ends the expression.

Click to Open

The “replaceAll” function takes in a string (in this case the attribute “Age”) and uses a regex to find and edit the remaining strings not covered by the first two “if” statements.  The regex is made up of two “capturing groups” (found inside the two sets of parentheses), and the new replacement string is made from the matches made by the first and second “capturing groups” separated by a hyphen.  The outputted data can be seen below.

Click to Open

Click to Open

That data is then fed into the “Select Attributes” operator.  Here, the no longer needed “Sex_Ages” attribute is dropped by selecting “single” for the “attribute filter type” (circled in red) and selecting “Sex_Ages” as the attribute to filter (circled in blue).  Finally, the “invert selection” box should be checked so that “Sex_Ages” is dropped (circled in green).  If you don’t, “Sex_Ages” will be kept and all other attributes will be dropped.  Here is the outputted data.

Click to Open

At this point, the only operator left is “Reorder Attributes”, and it is used in the same way as the last post.  If you need a review on how it works, please refer to that post.  Here is how the outputted “tidy” data should look like.

Click to Open

In summary, after the data comes out of the “De-Pivot” operator, the “Generate Attributes” operator is used to write expressions that create the “Sex” and “Age” attributes from the original “Sex_Ages” attribute.  The “Sex_Ages” attribute is then dropped using the “Select Attributes” operator, and the “Reorder Attributes” operator puts all the attributes in the correct order.

I hope this tutorial has been helpful, and that you will be able to take what you have learned from this and apply it in your own analysis.  In the next post, I will deal with the third of the five most common problems with messy data discussed by Hadley Wickham, where variables are stored in both rows and columns.

See you then!

$$ \begin{aligned} \newcommand\argmin{\mathop{\rm arg~min}\limits} \boldsymbol{\beta}_{\text{ridge}} & = \argmin_{\boldsymbol{\beta} \in \mathcal{R^p}} \biggl[ ||\boldsymbol{y}-\boldsymbol{X\beta}||^2 + \lambda ||\boldsymbol{\beta}||^2 \biggr] \\ & = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I_{p+1}})^{-1}\boldsymbol{X}^T\boldsymbol{y} \end{aligned} $$
PAGE TOP