Data Collection

You can’t make visualizations without data, and data coming from the real world is notoriously messy. In this assignment you’re going to practice the basics of accessing data from potentially messy sources, and of parsing, cleaning, and manipulating that data. You should cite all resources you utilize and submit individual write-ups.

Choosing a dataset

Choosing a good dataset is an important aspect of this assignment. Throughout this course we will be working on a data analysis and visualization challenge on the topic of Vis4Good - that is, visualization that makes an impact on society. For this reason, write down 3 topics of societal importance that are of interest to you. Topics might be related to:

  • Drug use
  • Obesity
  • Gender inequality
  • Immigration

...

Go out and find a data source for 1 (!) of your topics that you can either a) scrape or b) retrieve using an API. Datasets that are already formatted as CSVs ready for download do not count for this assignment. Make sure to choose data that is of interest to you, as you might continue working on it in later assignments.

Advice for Choosing Datasets

  • I advise against using unstructured data from social media streams such as Twitter unless you are very experienced with it. If you do use such data, you will need to transform it into CSV files that hold the fields you are interested in.

Scraping Data From the Web

For the main part of this assignment, you will write the necessary code to request data from a web-based source you found and output the data to a CSV file.
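Below is a minimal sketch of what such a scraper could look like in Python, assuming the requests and beautifulsoup4 packages and a Wikipedia-style HTML table. The URL and the table's CSS class are placeholders; replace them with whatever fits the source you chose.

    import csv
    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL -- substitute the page you chose for your topic.
    URL = "https://en.wikipedia.org/wiki/List_of_countries_by_obesity_rate"

    def fetch_table_rows(url):
        """Download the page and yield each table row as a list of cell strings."""
        response = requests.get(url, headers={"User-Agent": "course-assignment-script"})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        table = soup.find("table", class_="wikitable")  # first data table on the page
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
            if cells:
                yield cells

    def save_csv(rows, path):
        """Write rows to a CSV file; the first row serves as the header."""
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)

    if __name__ == "__main__":
        save_csv(fetch_table_rows(URL), "scraped_data.csv")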

If you like, you may adapt the code you wrote in class. If you choose a data source with an API, you may be able to avoid needing to manually extract data from a web page.
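If you go the API route, the skeleton is usually even simpler: request JSON, pick out the fields you need, and write them to a CSV. The sketch below uses the World Bank indicator API purely as an illustration; check your own API's documentation for its actual URL structure and response format.

    import csv
    import requests

    def fetch_indicator(indicator="SP.POP.TOTL"):
        """Fetch one World Bank indicator; the API returns [metadata, records]."""
        url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator}"
        response = requests.get(url, params={"format": "json", "per_page": 500})
        response.raise_for_status()
        metadata, records = response.json()
        return records

    def save_csv(records, path):
        """Write selected fields of each record to a CSV with a header row."""
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["country", "year", "value"])
            writer.writeheader()
            for r in records:
                writer.writerow({"country": r["country"]["value"],
                                 "year": r["date"],
                                 "value": r["value"]})

    if __name__ == "__main__":
        save_csv(fetch_indicator(), "api_data.csv")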

The grading rubric for this portion of the assignment is as follows:

Baseline
  1. Code executes without errors on the student-specified data source.
  2. Data is saved as a properly formatted CSV.
  3. Code is readable and has appropriate comments.

Average
  4. The outputted CSV has headers indicating the variable names.
  5. A meaningful index variable appears as the first column.
  6. Code is organized into well-documented functions, and uses external libraries (if appropriate) to avoid redundant work.

Advanced
  7. The functions can be run successfully on a different (professor-specified) data source from the same domain (e.g. another Wikipedia table, a different table from the Census). Demonstrate this.
  8. Clear usage documentation is provided.

Responsible Scraping

Before you scrape data from the web, please make sure that you read the website’s Terms of Service (ToS). Some websites don’t allow web scraping of their content at all. Many (like Wikipedia) allow scraping, provided it is not disruptive to other users. For example, the Wikimedia Foundation’s Terms of Use (https://foundation.wikimedia.org/wiki/Terms_of_Use/en) specify that “automated uses” like web scraping are not allowed if they are abusive or disruptive of the services. However, scraping a small amount of information for academic purposes hardly meets this criterion.

Basic Rules

  • Check if there is an API. If there is, use it. It will make your life easier.
  • Try not to scrape too much in a short time. This can bog down the servers, and may result in you being banned from the website (see the sketch after this list).
  • Never scrape anything that is not public.
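As a sketch of what polite scraping can look like in practice, the snippet below consults a site's robots.txt before fetching and pauses between requests. The URLs are placeholders for your own source.

    import time
    import urllib.robotparser
    import requests

    # Ask the site which paths automated clients may fetch.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()

    urls = ["https://en.wikipedia.org/wiki/Obesity"]  # pages you plan to fetch
    for url in urls:
        if not rp.can_fetch("*", url):
            print(f"robots.txt disallows {url}; skipping")
            continue
        response = requests.get(url)
        # ... parse response.text here ...
        time.sleep(1)  # be polite: roughly one request per second at most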

Data Cleaning

Your next task is to transform your data into something ready for analysis - that is, something that looks like one or multiple tidy data tables. This task might involve the following steps:

  1. If your data is huge (> a few hundred MB), reduce it in size (e.g. filter out some years, remove unnecessary columns and empty rows). OpenRefine might not be able to help you with this task, and you might have to do it in Python or another language (see the sketch after this list).
  2. If you already have a data table, make sure it is in a tidy format (see last lecture).
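Here is a hedged sketch of both steps in pandas, assuming a wide CSV with a "country" column and one column per year; all file and column names are illustrative, so substitute your own.

    import pandas as pd

    df = pd.read_csv("raw_data.csv")

    # Step 1: reduce size -- drop columns you don't need and fully empty rows.
    df = df.drop(columns=["notes", "source"], errors="ignore")
    df = df.dropna(how="all")

    # Step 2: tidy -- if each year is its own column ("2010", "2011", ...),
    # melt them into one (country, year, value) row per observation.
    year_cols = [c for c in df.columns if c.isdigit()]
    tidy = df.melt(id_vars=["country"], value_vars=year_cols,
                   var_name="year", value_name="value")

    # Filter out the years you do not need, then save.
    tidy = tidy[tidy["year"].astype(int) >= 2010]
    tidy.to_csv("tidy_data.csv", index=False)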

Your second task is to do some data cleaning. Take your transformed data and load it into OpenRefine. Here, inspect the data as we learned in the tutorial and correct data errors. Keep track of the types of data errors you found and the changes you made to the dataset. Also save your operations in a .json file so you can reuse them in case you need to change your dataset again.

FAQ

My data does not or cannot contain errors because of how I obtained the data. What should I do?

If your dataset does not or cannot contain errors, then for Task 2 instead look at the distributions of the data variables and at potential outliers, and check whether the data contains what you need for analysis. In your report, describe what you see and include some pictures of the distributions (don't compute descriptive statistics for your data yet; we will get to that later).
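A minimal sketch for producing such pictures with pandas and matplotlib, assuming a tidy CSV with a numeric column named "value" (substitute your own file and column names):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("tidy_data.csv")

    # A histogram shows the overall distribution; a box plot flags outliers
    # as points beyond the whiskers.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    df["value"].plot.hist(bins=30, ax=ax1, title="Distribution of value")
    df["value"].plot.box(ax=ax2, title="Outlier check")
    fig.savefig("distributions.png", dpi=150)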

My dataset contains unstructured data and I cannot load it into a program for cleaning

If you collected unstructured data such as tweets, then extract metadata for each tweet that you can turn into a tabular format. For example, tweets contain metadata on users, locations, likes, ... (see the sketch below).
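One possible way to do this in Python, assuming you saved the tweets as a JSON list of objects in a file named tweets.json; the field names below are illustrative and differ between API versions, so adapt them to your data.

    import json
    import pandas as pd

    with open("tweets.json", encoding="utf-8") as f:
        tweets = json.load(f)

    # json_normalize flattens nested fields such as user.location into columns.
    df = pd.json_normalize(tweets)
    columns = ["id", "created_at", "user.screen_name",
               "user.location", "favorite_count", "text"]
    df[[c for c in columns if c in df.columns]].to_csv("tweets.csv", index=False)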

I collected more than one dataset that I want to use. What should I do?

For the assignment submission, just describe the results for one of the datasets.

Submitting the Assignment


WHAT - To complete the assignment you should:

  1. Submit a single zip file called "YOUR_LASTNAME-Assignment-2.zip" via email.
  2. In the zip file add the code & documentation for Part 1 (scraping) but do not add any scraped data. Make sure I know how to execute your code.
  3. In the zip file also add a 1 page report about any cleaning you performed on the data or outlier/distribution detection if you had to do that instead.

WHERE - You should email the file to petra.isenberg@inria.fr with the subject VA-Assignment-2.

WHEN - Assignment 2 is due before 23:00 on Oct 7th.

Acknowledgements

This assignment borrows from a similar one run by my colleague Jordan Crouser for his SDS235: Visual Analytics class.