COMPANY

PAGE TOP

BLOG

Hadley Wickham’s “Tidy Data” in RapidMiner (Part 6)

Hello everyone, and welcome back!  In my last post, I wrote about the third most common problem discussed by Hadley Wickham, where variables are stored in both rows and columns.  In this post, I will discuss the fourth problem discussed by Hadley Wickham, where multiple types of observational units are stored in the same table.  The data used for this, like the data used as an example by Hadley Wickham in his paper, is data from Billboard’s Hot 100 list.  If you would like to follow along in RapidMiner as you read, the “messy” data can be found here (clicking will begin download).

Problem 4: Multiple Types of Observational Units are Stored in the Same Table

First, let’s get a look at the “messy” version of the data I started with and the “tidy” version of the data that will result from the cleaning process.

“Messy” Data (Click to Open)

“Tidy” Song Information Data (Click to Open)

“Tidy” Rank Information Data (Click to Open)

The “messy” data this time isn’t messy in the same way as the previous posts.  In Hadley Wickham’s original paper, he says that data is tidy once it is in this form:

• Each variable forms a column (attribute)
• Each observation forms a row (example)
• Each type of observational unit forms a table (ExampleSet)

In our case, this data meets the first two criteria, but it fails in the third.  This is because it contains observations on two types of observational units: the song information (circled in red) and its rank information (circled in blue).  That is why the “tidy” data shown consists of two tables.  Why is this important?  Well, in his paper, Hadley Wickham states this:

Datasets often involve values collected at multiple levels, on different types of observational units.  During tidying, each type of observational unit should be stored in its own table.  This is closely related to the idea of database normalization, where each fact is expressed in only one place.  If this is not done, it is possible for inconsistencies to occur.

However, while this is useful for tidying data and eliminating inconsistencies, most data analysis tools can’t work with relational data.  As a result, data that is split up into observational units for tidying will most likely have to be merged back into one table for analysis.

The Process

(Click to Open)

First, let me quickly summarize the overall process (see above), and then I will explain each operator individually.  The “messy” data is multiplied into two separate branches, and then I create an “id” column by using the data in one of the branches.  Those branches are then merged back together so that the “id” column is added to the original dataset.  I then multiply the data again into two separate branches, with the top branch containing the “song information” and the bottom branch containing the “rank information”.  Now, let’s look at each operator individually.

Part 1: Generating IDs

To start, the “messy” data is fed into the “Multiply” operator, and this operator is very simple.  All it does is make copies of the data, and multiple copies can be created and used by connecting an output port to another operator.  In this case, I only make two copies of the data, so only two output ports are used.  The data that comes out of the top branch is used to generate an id.  To do this, I fist use the “Aggregate” operator.

Aggregate Parameters (Click to Open)

First, click on the operator to open the “Parameters” window shown above on the right, and then click on the “Edit List” button (circled in red) to open the “Select Attributes: group by attributes” window shown above on the left.  Next, find and click on the “title” attribute listed on the left in the “Attributes” box, and then move it to the “Selected Attributes” box on the right by using the right arrow button (circled in blue).  Finally, click the “Apply” button.

Generate ID Parameters (Click to Open)

Next, click on the “Generate ID” operator to open the “Parameters” window shown above.  In this case, there aren’t any settings you need to change.  The “create nominal ids” box is unchecked by default, so integer ids are created (1, 2, 3, etc…), but if you check the “create nominal ids” box the id’s will become nominal ids (id_1, id_2, id_3, etc…).  The data that comes out of this operator will look like the example below.

Generate ID Output Data (Click to Open)

Part 2: Joining Datasets

Join Parameters (Click to Open)

Now, let’s move on to the “Join” operator.  First, click on the operator to open the “Parameters” window shown above on the right.  Next, set “join type” to “left” (circled in red), and then make sure to uncheck the “use id attribute as key” parameter (circled in blue).  Finally, click on the “Edit List” button (circled in green) to bring up the “Edit Parameter List: key attributes” window shown above on the left, set the “left key attributes” and “right key attributes” to “title”, and click apply.  An example of the data that comes out of this operator can be seen below.

Join Output Data (Click to Open)

Par 3: Song Information Branch

Next, the data is multiplied again using the “Multiply” operator, and this results in two different branches.  The top branch will result in the “tidy” data that will contain only song information, and this is achieved by using a “Select Attributes” and “Remove Duplicates” operator.

Select Attributes Parameters (Click to Open)

Fist, click on the “Select Attributes operator to open the “Parameters” window shown above on the right, and then set the “attribute filter type” to “subset” (circled in red).  Next, click in the “Select Attributes…” button (circled in blue) to open the “Select Attributes: attributes” window shown above on the left.  Choose “artist”, “id”, and “title” from the “Attributes” box on the left, and then move them to the “Selected Attributes” box on the right using the right arrow key (circled in green).  Finally, click apply.  An example of the data that will come out of this operator can be seen below.

Select Attributes Output Data (Click to Open)

Remove Duplicates Parameters (Click to Open)

Next is the “Remove Duplicates” operator, but there is nothing that needs to be done with its parameters (see above).  There are many different options to choose from with the “attribute filter type” parameter, and extra parameters will show up depending on what you choose, but we want to make sure there aren’t any duplicates for all our attributes.  Also, special attributes can be dropped by checking the “include special attributes” box, but we want to keep our special “id” attribute, so leave it unchecked.  The final “tidy” data for this branch can be seen below.

“Tidy” Song Information Data (Click to Open)

Part 4: Rank Information Branch

The bottom branch that was created by the “Multiply” operator only contains a “Select Attributes” operator.  This branch will result in the “tidy” data that contains only rank information for each song.

Select Attributes Parameters (Click to Open)

The “Select Attributes” operator is used very similarly to before, but this time we are selecting different attributes.  First, click on the operator to open the “Parameters” window shown above on the right, and then set the “attribute filter type” to “subset”.  Next, make sure the “invert selection” box is checked (circled in red).  Doing that will save us work in the next step.  Finally, click on the “Select Attributes” button, move the “artist” and “title” attributes to the right side, and then click “Apply”.  Since we checked the “invert selection” box previously, the selected “artist” and “title” attributes will be dropped while all the others will be kept.  The final “tidy” data for this branch can be seen below.

“Tidy” Rank Information Data (Click to Open)

Conclusion

In summary, the original “messy” data was duplicated using the “Multiply” operator, and the top branch was used to generate an id using the “Aggregate” and “Generate ID” operators.  The two branches were then joined together using the “Join” operator.  The data was then duplicated again using the “Multiply” operator, and the top branch resulted in the “tidy” data containing only song information, while the bottom branch resulted in the “tidy” data containing only rank information.  The top branch created the “song information” data using the “Select Attributes” and “Remove Duplicates” operators, and the bottom branch created the “rank information” data by using a “Select Attributes” operator.

I hope this post has been helpful, and I hope that you will be able to use some of what you learned in your own analysis.  In the next post, I will discuss the fifth and final problem with “messy” data discussed by Hadley Wickham, where a single observational unit is stored in multiple tables.

See you then!

$$ \begin{aligned} \newcommand\argmin{\mathop{\rm arg~min}\limits} \boldsymbol{\beta}_{\text{ridge}} & = \argmin_{\boldsymbol{\beta} \in \mathcal{R^p}} \biggl[ ||\boldsymbol{y}-\boldsymbol{X\beta}||^2 + \lambda ||\boldsymbol{\beta}||^2 \biggr] \\ & = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I_{p+1}})^{-1}\boldsymbol{X}^T\boldsymbol{y} \end{aligned} $$
PAGE TOP