Assignment
For your project you should now have data to start analyzing, but you might have to work on it to get it to a manageable size or into a format you can analyze. You might also have to tidy and clean it up.
Your tasks
Your first task is to transform your data into something ready for analysis, that is, something that looks like one or multiple tidy data tables. This task might involve the following steps:
- If your data is huge (more than a few hundred MB), reduce it in size (e.g. filter out some years, remove unnecessary columns, empty rows, etc.). OpenRefine might not be able to help you with this task and you might have to do this in R, Python, or another language.
- If you were given unstructured data (tweets, media files, ...) then extract data from them that you might need for analysis. This could require running a sentiment analysis or extracting word counts, metadata, etc.
- If you already have a data table, make sure it's in a tidy format and that your data is clean.
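If you need to shrink a large CSV file before loading it into OpenRefine, the size-reduction step above could be sketched in Python roughly as follows. This is only a minimal sketch: the function name `reduce_csv`, the column names, and the cut-off year are hypothetical placeholders, and dropping rows with an unreadable year is an assumption you may not want for your own data.

```python
import csv
import io


def reduce_csv(src, dst, keep_columns, year_column, min_year):
    """Copy rows from src to dst, keeping only some columns and recent years.

    Rows are streamed one at a time, so the whole file never has to fit
    in memory. Returns the number of rows written.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=keep_columns, lineterminator="\n")
    writer.writeheader()
    written = 0
    for row in reader:
        # Drop rows in which every cell is empty or whitespace.
        if not any(isinstance(v, str) and v.strip() for v in row.values()):
            continue
        # Drop rows whose year is missing, unparsable, or too old
        # (an assumption; you may prefer to keep and inspect such rows).
        try:
            year = int(row[year_column])
        except (TypeError, ValueError):
            continue
        if year < min_year:
            continue
        writer.writerow({column: row[column] for column in keep_columns})
        written += 1
    return written


if __name__ == "__main__":
    # Tiny in-memory stand-in for a large file; with real data, pass files
    # opened with open(path, newline="") instead.
    raw = "year,city,count,notes\n2009,Paris,3,old\n,,,\n2015,Lyon,7,ok\n"
    out = io.StringIO()
    reduce_csv(io.StringIO(raw), out, ["year", "city", "count"], "year", 2010)
    print(out.getvalue())
```

Because the rows are streamed, the same function should work on files much larger than your available memory. Remember to note in your report which rows and columns you removed and why.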
Your second task is to do some data cleaning. Take your transformed data and load it into OpenRefine. Here, inspect the data as we learned in the tutorial and correct data errors. Keep track of the types of possible data errors you found and what changes you made to the dataset. Also save your operations in a .json file so you can reuse them in case you need to change your dataset again for your project (for example if you are given an updated dataset later).
FAQ
My data does not or cannot contain errors because of how I obtained the data. What should I do?
If your dataset does not or cannot contain errors, then for Task 2 look instead at the distributions of the data variables and at potential outliers, and check whether the data contains what you need for analysis. In your report, describe what you see and include some pictures of the distributions for the respective data facets.
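OpenRefine's facets are the tool we used in the tutorial for this. If you would like to double-check a numeric column outside OpenRefine, a rough Python sketch could look like the one below. The function name `summarize` is hypothetical, and the 1.5 x IQR rule used to flag outliers is just one common convention, not a requirement of the assignment.

```python
import statistics


def summarize(values):
    """Return a five-number summary and the values flagged as outliers.

    Needs at least two values. Outliers are values more than 1.5 times the
    interquartile range below the first or above the third quartile.
    """
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "outliers": [v for v in ordered if v < low or v > high],
    }


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    print(summarize([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

A flagged value is not necessarily an error. Describe in your report whether it looks plausible for your data.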
My dataset contains unstructured data and I cannot load it into a program for cleaning.
If you received unstructured data such as tweets, extract metadata for each tweet that you can turn into a tabular format. For example, tweets contain metadata on users, locations, likes, ...
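As an illustration, turning tweets into a table could be sketched in Python as follows. The field names used here (`user`, `location`, `likes`, `text`) are hypothetical and will almost certainly differ from the ones in your actual data, so treat this as a sketch to adapt rather than a ready-made script.

```python
import csv
import io

# Columns of the resulting table; one row per tweet.
FIELDS = ["user", "location", "likes", "word_count"]


def tweet_to_row(tweet):
    """Extract a flat row of metadata from one tweet (given as a dict)."""
    text = tweet.get("text", "")
    return {
        "user": tweet.get("user", ""),
        "location": tweet.get("location", ""),
        "likes": tweet.get("likes", 0),
        # A simple derived variable: number of whitespace-separated words.
        "word_count": len(text.split()),
    }


def tweets_to_csv(tweets, dst):
    """Write one CSV row per tweet to the open text file dst."""
    writer = csv.DictWriter(dst, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet_to_row(tweet))


if __name__ == "__main__":
    # Made-up tweets purely for illustration.
    tweets = [
        {"user": "alice", "location": "Paris", "likes": 3, "text": "hello data world"},
        {"user": "bob", "text": "no location here"},
    ]
    out = io.StringIO()
    tweets_to_csv(tweets, out)
    print(out.getvalue())
```

The resulting CSV file can then be loaded into OpenRefine for Task 2.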
I received more than one dataset that I want to use. What should I do?
In the report you submit, describe all of the datasets; you only need to clean one of them. (However, it may be in your best interest to clean all datasets you need for later stages of the project.)
Submitting the Assignment
WHAT - You should submit a report about your data transformations called "YOUR_LASTNAMEs-Assignment1.pdf" via email. It should contain the following content:
- Your names, your topic, and a rough idea on the types of questions/tasks the data provider gave you
- Not more than one page explaining the data you received. Mention its size, what format it has (data tables, graphs, text, etc.), and potential variables contained in the data
- Roughly one page explaining the types of data transformations you had to perform and what the dataset looked like before and after. Show a table with your final data structure (no need to print out all observations, but describe the general structure).
- Roughly one page in which you detail the types of errors in the data you uncovered and how you fixed them. Feel free to add some screenshots. If you find no errors in the data then instead show some distributions of the data values for the different variables in your data.
WHERE - You should email the file to petra.isenberg@inria.fr with the subject InfoVis-Assignment 1.
WHEN - Remember that Assignment 1 is due before the end of the day on December 15th.