Tutorial and Assignment 2 - Data Cleaning
In this tutorial you will use Google Refine to clean a dataset. We will perform some cleaning together in class.
Getting Started
Install Google Refine. You can download and install it from here:
http://openrefine.org/download.html
You should install the version called "OpenRefine 2.6-rc2 Release Candidate 2" at the top of the page.
The documentation for Google Refine / Open Refine is available here.
There is also a set of nice introductory tutorials available on YouTube: Part 1, Part 2, Part 3
Here are helpful pointers to the Open Refine Expression Language
Files
universityData.csv - A file containing sample data we will use in the tutorial.
Assignment
For your assignment you will be working on a slightly expanded version of the csv file you created in the last assignment. This file contains a few more columns:
- Deduped.author.name: a column that lists author names that have been manually cleaned using Jigsaw
- OCR.Title: a title extracted from the pdf of each paper using a data extraction tool called grobid
- OCR.Authors: this field contains authors extracted from the paper pdfs using grobid. It differs from the authors and deduped authors columns in that it includes the full first names of the authors. Like the other two columns, it is ordered as lastname, firstname.
The last two columns are not actually important for this assignment.
For the assignment load the following data file into OpenRefine:
Use the following settings when creating your project (or your work may not be graded correctly):
Your task
Create two new csv files:
File 1 should contain data in this form:
Paper.DOI | Deduped Author Name |
That means that if you have a paper with three authors, such as: 10.0001.0001, Isenberg,P.;Dragicevic,P.;Fekete,J.D. the file should look like this:
10.0001.0001 | Isenberg,P. |
10.0001.0001 | Dragicevic,P. |
10.0001.0001 | Fekete,J.D. |
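The assignment itself is meant to be done in OpenRefine, but the target shape of File 1 can be illustrated with a short Python sketch. It assumes the author names in one cell are separated by semicolons, as in the example above. The function name explode_authors and the in-memory sample are hypothetical and only show the expected "one row per DOI and author" layout.

```python
import csv
import io

def explode_authors(rows, sep=";"):
    """Yield one (doi, author) pair per author in a separator-delimited cell."""
    for doi, authors in rows:
        for name in authors.split(sep):
            name = name.strip()
            if name:  # skip empty fragments, e.g. after a trailing separator
                yield doi, name

# Hypothetical sample mirroring the example in the text (assumed ";" separator).
sample = [("10.0001.0001", "Isenberg,P.;Dragicevic,P.;Fekete,J.D.")]
long_rows = list(explode_authors(sample))

# Write the File 1 layout to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Paper.DOI", "Deduped Author Name"])
writer.writerows(long_rows)
file1_text = buffer.getvalue()
print(file1_text)
```

Because the author names themselves contain commas, the csv module quotes them on output. Keep this in mind when inspecting an exported CSV file in a text editor.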
File 2 should contain:
Deduped Author Name | Affiliation |
File 2 should not contain rows with empty data. On File 2, also perform at least 3 different types of cleaning operations as we practiced in class (or any others you may want to apply). These cleaning operations should be performed on multiple cells at once (single-cell cleaning does not count).
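As a rough illustration of what "no empty rows" and "bulk cleaning" mean for File 2, here is a minimal Python sketch. The three cleaning steps (trimming, collapsing whitespace, removing trailing punctuation) are assumptions chosen for the example and are not necessarily the operations practiced in class. The function names and the affiliation strings are made up.

```python
import re

def clean_affiliation(value):
    """Apply three bulk cleaning steps to one affiliation string."""
    value = value.strip()                # 1. trim leading/trailing whitespace
    value = re.sub(r"\s+", " ", value)   # 2. collapse runs of whitespace
    value = value.rstrip(".,;")          # 3. drop trailing punctuation
    return value

def build_file2(rows):
    """Return cleaned (author, affiliation) pairs, skipping rows with empty data."""
    result = []
    for author, affiliation in rows:
        author = author.strip()
        affiliation = clean_affiliation(affiliation)
        if author and affiliation:  # keep only rows where both cells are filled
            result.append((author, affiliation))
    return result

# Hypothetical sample rows; the affiliations are placeholders, not real data.
sample = [
    ("Isenberg,P.", "  Example   University. "),
    ("Dragicevic,P.", "   "),
    ("Fekete,J.D.", "Sample Institute;"),
]
file2_rows = build_file2(sample)
print(file2_rows)
```

Each step is applied to every row in one pass, which is the same idea as applying a transformation to a whole column rather than editing single cells by hand.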
Submitting the Assignment
WHAT - You should submit a single ZIP file called "YOUR_LASTNAME-Assignment2.zip" via email. It should contain:
- Two CSV files named "YOUR_LAST_NAME-Assignment2-File-#.csv" containing the cleaned data.
- Two JSON files named "YOUR_LAST_NAME-Assignment2-File-#.json" containing the operations you used to clean the data.
- A txt file called "YOUR_LAST_NAME-explanation.txt" explaining the cleaning operations you performed.
WHERE - You should email the file to petra.isenberg@inria.fr with the subject VA-Assignment2.
WHEN - Remember that Assignment 2 is due before 23:00 on Wednesday, September 28th.