Tutorial 1 -- Building a Webscraper
In this tutorial you will build a basic R web scraper to download and process data that you will use to help solve the challenge over the next few weeks. We will build the first part of the scraper together in class, and you will complete the second part on your own.
You should submit the completed assignment to us before 23:00 on Wednesday, September 21 (details below).
Getting Started
- Install R from the R project website or from one of its mirrors
- Install RStudio from its website
Files
- Tutorial #1 Example Script: ScrapingTutorial
- Assignment #1 Starter Code: Tutorial1_scraper_assignment2.R (updated)
In-Class Practice - "Wikipedia" Data
Your first task is to scrape and parse a Wikipedia page.
To get you started, we have provided an initial script that can pull data from the pages and save it to a CSV file. Download the Tutorial 1 Example Script above. Then we will go over how to:
- Modify the R script and use it to download Wikipedia pages and save them as HTML files on your local machine.
- Load and parse the HTML files and use rvest to extract the records.
- Save the data to a CSV file, using a comma as the delimiter.
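The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the provided example script: the function names (`download_page`, `parse_first_table`, `save_records`) are hypothetical, and it assumes the records of interest sit in the first `<table>` element of the page, which may not hold for the page used in class.

```r
library(rvest)

# Step 1: download a page and save the raw HTML to a local file.
# (download_page is a hypothetical helper name, not part of rvest.)
download_page <- function(url, dest_file) {
  download.file(url, destfile = dest_file, mode = "wb", quiet = TRUE)
  invisible(dest_file)
}

# Step 2: load the saved HTML file and extract the first table as a data frame.
# Assumption: the records live in the first <table> on the page.
parse_first_table <- function(html_file) {
  page <- read_html(html_file)
  first_table <- html_element(page, "table")
  html_table(first_table)
}

# Step 3: save the records to a comma-delimited CSV file.
save_records <- function(records, csv_file) {
  write.csv(records, csv_file, row.names = FALSE)
  invisible(csv_file)
}

# Example call sequence (the URL is a placeholder; substitute the page from class):
# download_page("https://example.org/some_page", "page.html")
# records <- parse_first_table("page.html")
# save_records(records, "records.csv")
```

Saving the HTML locally before parsing means you can rerun and debug the parsing step without downloading the page again.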
On Your Own - "Publication" Data
Once you have completed the "Wikipedia" dataset, modify your scraper to work with the "publication" data that we will be using throughout the course. Download the Assignment #1 Starter Code above and extract the data from this URL:
https://www.lri.fr/~isenberg/VA/vispubdata/
Make sure that all data fields are meaningfully separated so that we can proceed with analyzing the data in the following classes.
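One way to keep fields cleanly separated is to normalize whitespace in every field, convert numeric fields to numbers, and rely on the quoting that `write.csv` applies to strings, so that commas inside a title or author list do not split a column. The sketch below uses made-up records: the column names (`title`, `authors`, `year`) and the helper names are assumptions for illustration and do not reflect the actual structure of the publication page.

```r
# Hypothetical records standing in for scraped data; the column names are assumptions.
records <- data.frame(
  title   = c("  Visual Analytics,   Revisited ", "A Study of Charts"),
  authors = c("Doe, J.; Roe, R.", "Poe, E."),
  year    = c("2014", " 2015"),
  stringsAsFactors = FALSE
)

# Collapse runs of whitespace to one space and trim both ends.
clean_field <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}

# Clean every column, then store the year as an integer rather than text.
clean_records <- function(df) {
  df[] <- lapply(df, clean_field)
  df$year <- as.integer(df$year)
  df
}

cleaned <- clean_records(records)

# write.csv quotes character fields by default, so embedded commas stay inside one field.
csv_file <- tempfile(fileext = ".csv")
write.csv(cleaned, csv_file, row.names = FALSE)

# Read the file back to confirm the column count survived the round trip.
check <- read.csv(csv_file, stringsAsFactors = FALSE)
stopifnot(ncol(check) == ncol(cleaned))
```

Reading the CSV back in and checking the number of columns is a quick sanity test that no field was split by a stray delimiter.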
Submitting The Assignment
- WHAT - You should submit a single ZIP file called "{YOUR_LASTNAME}-Assignment1.zip" via email. This should contain:
  - One CSV file containing all of the publication data you extracted.
  - A file containing the R script you used to download, parse, and save the data. This code must be clearly commented.
- WHERE - You should email the file to petra.isenberg@inria.fr
- WHEN - Remember that Assignment 1 is due before 23:00 on Wednesday, September 21.