Tutorial and Assignment 2 - Data Cleaning
In this tutorial you will use Google Refine to clean a dataset. We will perform some cleaning together in class. You will also learn how to use R to prepare your data for better analysis with Tableau.
You should submit the completed assignment to us before 23:00 on Monday, September 28th (details below).
Getting Started
Install Google Refine. You can install download and install from here:
http://openrefine.org/download.html
You should install the version called "Google Refine 2.5 - Stable version" at the top of the page. (In case you're confused about the name - the product was originally created by Google, but in the past year it has been open-sourced - hence the new name "Open Refine". However, we will still be using the last Google-branded version, since it is more reliable than the current open-source builds.)
The documentation for Google Refine / Open Refine is available here.
There are also a set of nice introductory tutorials available on YouTube: Part 1, Part 2, Part 3
Here are helpful pointers to the Open Refine Expression Language
Files
universityData.csv - A file containing sample data we will use in the tutorial.
Assignment
The following image shows the rides in the DinoFun World:
Optical Character Recognition has been applied to the map. It extracted the following text file of attraction names, park area names, attraction numbers, and attraction categories:
Your assignment is to load the txt file into Google Refine and process itto remove any formatting problems, including:
- removing empty rows and rows containing only words with uppercase letters. These upper case letter words are mostly the names of park areas. These should be in a separate column.
- removing any extra whitespace and miscellaneous characters (quotes, square braces, etc.) from your columns if there are any
- creating a new column that contains only the attraction names and another one that contains attraction numbers
- fixing some OCR errors in attraction names (you can do this by hand)
- creating a new column that includes the attraction category, removing the category names in the rows
The dataset you were given is quite small - so in theory you could complete the assignment by manual cleaning. However - this will not give you any points for the assignment. Except for a few minor manual cleaning steps you should be using built-in operations of Open Refine such as creating new columns based on existing columns, looking at facets and filtering, transforming cells, etc. (Keep a typed list of the operations you performed, and submit it along with the cleaned data files.)
Your final result should look similar to this (or better)
In class you will not have learned everything you need to know to complete the assignment. I expect that you will watch the tutorials and read documentation. In addition, here is a little help for you in case you would like to use regular expressions:
You might find the following commands useful for the assignment:
isNotNull(value.match(/^[^a-z]*$/))
will return if a given string in a cell is uppercase or consists only of numbers or other special characters
isNull(value.match(/.*[0-9].*/))
will return if a string doesn't contain a digit
if(isNull(value.match(/.*[0-9].*/)), value, "")
returns the value of a cell if it doesn't contain a cell or an empty string otherwise
When you are done with each file, export it as a new CSV.
Once you've finished, you should also extract JSON scripts containing all of the operations you performed on each of the files. (Select Extract at the top of the Undo/Redo tab. Then copy and paste the JSON script into a new file in your text editor.)
Submitting the Assignment
WHAT - You should submit a single ZIP file called "YOUR_NAME-Assignment2.zip" via email. It should contain:
- A text file named "YOUR_NAME-Assignment2.txt" containing (a) your name (b) a list containing a short description for each of the main operations you performed (no need to write down when you fixed a typo).
- A CSV file named "YOUR_NAME-Assignment2-OCR.csv" containing the cleaneddata.
- A JSON file named "YOUR_NAME-Assignment2-OCR.json" containing the operations you used to clean the data.
WHERE - You should email the file to petra.isenberg@inria.fr with the subject VA-Assignment2.
WHEN - Remember that Assignment 2 is due before "23:00 on Monday, February 29.'''