COMPANY

PAGE TOP

BLOG

Hadley Wickham’s “Tidy Data” in RapidMiner (Part 7)

Hello everyone, and welcome back!  This is the final post in this series, and I hope that all the posts have provided you with the help you need to perform any number of data cleaning tasks using RapidMiner.  In the last post, I discussed the fourth problem discussed by Hadley Wickham, where multiple types of observational units are stored in the same table.  In this post, I will discuss the fifth and final problem with “messy” data, where a single observational unit is stored in multiple tables.  The data used for this, like the data used as an example by Hadley Wickham in his paper, is data on baby name provided by the US Social Security Administration.  If you would like to follow along in RapidMiner as you read, the “messy” data can be found here (clicking will begin download).

Problem 5:  A Single Observational Unit is Stored in Multiple Tables

“Messy” Data (Click to Open)

“Tidy” Data (Click to Open)

As shown above, the “messy” data is split up into many different tables.  Once they are read into RapidMiner, they are saved as an IOObjectCollection (circled in red), so these will all need to be combined into one table.  As a result, the “messy” and “tidy” data won’t look that different since the shapes of the data are the same.  The main difference, as shown above in the “tidy” data example, is in the number of examples (circled in blue).  Now, let’s have a look at the overall process.

Main Process (Click to Open)

“Loop Files” Subprocess (Click to Open)

The main process contains a “Loop Files”, “Append”, and “Rename” attribute, and inside the “Loop Files” operator is a “Read CSV” and “Generate Attributes” operator.  In summary, the flow of the process goes like this.  The data files are read in individually using the “Read CSV” operator, and then a new attribute is added to each file using the “Generate Attributes” operator.  This is all done inside of a loop, and all the files will be outputted together as an “IOObjectCollection”.  The individual files are then merged together using the “Append” operator, but three of the attributes have automatically generated names of “att1”, att2”, and “att3”.  As a result, I use the “Rename” operator to give them more appropriate names.  Now, let’s go over each operator individually.

Part 1: Reading in the Files

“Loop Files” Parameters Window (Click to Open)

First, click on the “Loop Files” operator to open the “Parameters” window shown above on the right.  Next, enter in the file path into the “directory” box (circled in red), or click on the file icon to open the “Select Directory” window shown above on the left and choose the correct folder that way.  After that, you need to choose how to read in the files you want.  You can do this by choosing “glob” or “regex” from the “filter type” box (circled in blue), and in my case I decided to choose regex since I am only reading in files from a single directory.  After choosing “regex”, the “filter by regex” box will appear, and this is where you need to enter in the regex you will use to find only the files you need (circled in green).  In our case, the only file we don’t want to read in is a single “README” file called “NationalReadMe”.  Everything else is the “.txt” files we need to read in.  The regex I use is “.+txt”, which roughly means “look for all files that have a name that ends in ‘txt’ and starts with one or more of any character”.  If you are new to regexes, or just need a review, please refer to this post.  Finally, make sure to check the “enable macros” box (circled in orange).  If you aren’t already familiar with macros in RapidMiner, a macro is a global variable that can be used in all parameters of all operators.  In our case, the “Loop Files” operator creates three macros for us:

  • file_name: the name of the current file, without the path
  • file_path: the name of the current file, including the complete path
  • parent_path: the path of the current file, without the filename

In our case, only the “file_name” macro will be needed.  Now, let’s go into the subprocess of the “Loop Files” operator, go over how to use the “Read CSV” and “Generate Attributes” operators, and talk about how we will use the “file_name” macro.

“Read CSV” Parameters Window (Click to Open)

For the “Read CSV” operator, I only changed the value entered into the “column separators” box (circled in red).  By default, the value is a semicolon, but our files use a comma as a column separator, so a comma should be entered in to this box.

As the data is read into RapidMiner it gets sent into the “Generate Attributes” operator.  It is here where the macro “file_name” is used.  When a single type of observational unit is spread out over multiple tables or files, Hadley Wickham states that “These tables and files are often split up by another variable, so that each represents a single year, person, or location”.  In our case, the files have been split up by years, with each file starting with a “yob” and ending with a year (ie. yob1880).  As this is important information, the file name should be turned into an attribute, so this is where the “Generate Attributes” operator comes in.

“Generate Attributes” Parameters Window (Click to Open)

First, click on the “Generate Attributes” operator to open the “Parameters” window shown above on the right, and then click on the “Edit List” button (circled in red) to open the “Edit Parameter List: function descriptions” window shown above on the left.  Next, enter the name you want to call the new attribute in the “attribute name” box (I named it “year_of_birth”), and the function you will use to create the attribute in the “function expressions” box.  For help crafting the function, click on the calculator button (circled in blue) to open the “Edit Expression: function expressions” window shown below.

“Edit Expression: function expressions” Window (Click to Open)

There are three main boxes in this window, the “Expression” box at the top, the “Functions” box at the bottom left, and the “Inputs” box at the bottom right.  A collection of all the functions available in this operator can be found by manually looking through them in the “Functions” box or searching for them in the search bar (circled in blue).  In this case, I used the “cut” function, which takes in a string to cut (“Nominal text”) and two numbers that represent the start and end location to make the cut at (“Numeric start” and “Numeric length”).  The “Nominal Text” string needs to be the name of the current file being read into RapidMiner, so this is where the “file_name” macro will be used.  To find and use this macro, click on the “Macros” tab inside the “Inputs” box, and then click on the “file_name” macro (circled in green).  In RapidMiner, Macros are referenced using the form %{macro_name}, and in this case the actual macro is referenced as %{file_name} inside the “cut” function (circled in red).  The “cut function then cuts each file name at the start position of 3, and then stops at a length 4.  This gets rid of the “yob” at the beginning of each file name and keeps only the year at the end.  An example of the data that comes out of this operator can be seen below.

“Loop Files” Output Data (Click to Open)

Part 2: Appending the Datasets (ExampleSets) and Renaming Columns

“Append” Parameters Window (Click to Open)

As you can see above in the example of the data that comes out of the “Loop Files” operator, all the datasets that have been read into RapidMiner are stored together as an IOObjectCollection.  As a result, all of them need to be joined together into one dataset.  To do this, I used the “Append” operator.  As can be seen above in the “Parameters” window, there is only a single parameter called “data management” available.  I simply left it as “auto”, but there is also a “memory-optimized” and a “speed-optimized” option that can be used depending on your needs.  This will produce a single dataset with all the original individual datasets stacked on top of each other.  However, the first three columns still have the default names of “att1”, “att2”, and “att3”, so the final step is to rename them with more appropriate names.

“Rename” Parameters Window (Click to Open)

First, click on the “Rename” operator to open the “Parameters” window shown above on the right.  Next, click on the down arrow (circled in red) and choose the “old name” of the first parameter you need to rename, and then type the name you want to change it to in the “new name” box (circled in blue).  Finally, click on the “Edit List” button (circled in green) to open the “Edit Parameter List: rename additional attributes” window shown above on the left, repeat the same renaming process for the remaining two attributes, and click “Apply” at the bottom.  An example of the final “tidy” data that comes out of this operator can be seen below.

“Rename” Output Data (Click to Open)

Conclusion

In summary, all of the files needed are read into RapidMiner using a “Loop Files” operator that contains a “Read CSV” operator and a “Generate Attributes” operator.  As each file is read in, the “Loop Files” operator records the name of the file in a macro called “file_name”, and then a new attribute is created from the “file_name” macro using a “Generate Attributes” operator.  All the files are then joined together using the “Append” operator, and the three attributes that were automatically given the names “att1”, “att2”, and “att3” are renamed using the “Rename” operator.

One final note regarding this post is that this was a simple example of how to fix this problem.  As Hadley Wickham states in his paper, “As long as the format for individual records is consistent, this is an easy problem to fix…” and briefly discusses how it can get more complicated by saying this:

A more complicated situation occurs when the dataset structure changes over time. For example, the datasets may contain different variables, the same variables with different names, different file formats, or different conventions for missing values. This may require you to tidy each file individually (or, if you are lucky, in small groups) and then combine them once tidied.

Obviously, a complicated situation like that would require a much more complicated process in RapidMiner, and considerable more time and effort will be required than what is shown in this post for this simple example.  Like with all the other posts I’ve made in this series, these examples are only meant to get you started, but I hope they have helped you with whatever data project you are working on.

$$ \begin{aligned} \newcommand\argmin{\mathop{\rm arg~min}\limits} \boldsymbol{\beta}_{\text{ridge}} & = \argmin_{\boldsymbol{\beta} \in \mathcal{R^p}} \biggl[ ||\boldsymbol{y}-\boldsymbol{X\beta}||^2 + \lambda ||\boldsymbol{\beta}||^2 \biggr] \\ & = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I_{p+1}})^{-1}\boldsymbol{X}^T\boldsymbol{y} \end{aligned} $$
PAGE TOP