Tutorial 1 -- Building a Webscraper

In this tutorial you will build a basic web scraper in R to download and process data. We will build the first part of the scraper together in class, and you will complete the second part on your own.

Getting Started

If you haven't already done so:

Files

In-Class Practice - "Wikipedia" Data

Your first task is to scrape and parse a Wikipedia page:
To get you started, we have provided an initial script that can pull data from the pages and save it to a CSV file. Download the Tutorial 1 Example Script above. Then we will go over how to:

  • Modify the R script and use it to download Wikipedia pages and save them as HTML files on your local machine.
  • Load and parse the HTML files and use rvest to extract the records.
  • Save the data to a CSV file using a comma as the delimiter (the default setting for write.csv).
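
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the provided example script: the function names save_page, parse_page and save_records are hypothetical, and the "table.wikitable" selector is an assumption about how the target page marks up its tables. To keep the sketch runnable offline, it parses a small stand-in HTML file instead of downloading a real page.

```r
library(rvest)

# Step 1: download a page and keep a local copy as an HTML file.
# Nothing is fetched until this function is called with a real URL.
save_page <- function(url, dest) {
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  invisible(dest)
}

# Step 2: load a saved HTML file and extract the first table with the
# class "wikitable" as a data frame (the class name is an assumption).
parse_page <- function(path) {
  doc <- read_html(path)
  tbl <- html_element(doc, "table.wikitable")
  as.data.frame(html_table(tbl))
}

# Step 3: write the records to a comma-delimited CSV file.
save_records <- function(records, path) {
  write.csv(records, path, row.names = FALSE)
}

# Offline demonstration with a small stand-in page (made-up content).
sample_html <- c(
  "<html><body>",
  "<table class='wikitable'>",
  "<tr><th>Country</th><th>Capital</th></tr>",
  "<tr><td>France</td><td>Paris</td></tr>",
  "<tr><td>Japan</td><td>Tokyo</td></tr>",
  "</table>",
  "</body></html>"
)
html_path <- tempfile(fileext = ".html")
writeLines(sample_html, html_path)

records <- parse_page(html_path)
csv_path <- tempfile(fileext = ".csv")
save_records(records, csv_path)
print(read.csv(csv_path))
```

For a real page, you would first call save_page() with the page URL and a local file name, then pass that file name to parse_page(). Saving the HTML locally first means you can rerun the parsing step many times without requesting the page again.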

On Your Own - "Publication" Data

Once you have completed the "Wikipedia" dataset, you can modify your scraper to practice with a version of the publication dataset that we will use throughout the course. Download the Exercise #2 Starter Code from above and extract the data from this URL:
https://www.lri.fr/~isenberg/VA/vispubdata/

Make sure that all data fields are meaningfully separated, with each field in its own column, so that we can proceed with analyzing the data in the following classes.
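
One way to check field separation is sketched below, using base R only. The records, the column names (title, year, authors) and the ";" separator between author names are all hypothetical; the actual fields depend on what the page contains. The sketch splits a packed multi-value field and then round-trips the data through a CSV file to confirm that the columns survive, including a title that itself contains a comma.

```r
# Hypothetical records; the real column names depend on the page you scrape.
pubs <- data.frame(
  title   = c("Visual Analytics, Revisited", "Graph Drawing"),
  year    = c(2012L, 2015L),
  authors = c("A. Author; B. Writer", "C. Researcher"),
  stringsAsFactors = FALSE
)

# Split the packed author string into one vector of names per record.
author_list <- lapply(strsplit(pubs$authors, ";", fixed = TRUE), trimws)
pubs$n_authors <- lengths(author_list)

# Round-trip through CSV to confirm that the columns survive intact.
# write.csv quotes character fields by default, so the comma inside the
# first title does not split it into two columns.
csv_path <- tempfile(fileext = ".csv")
write.csv(pubs, csv_path, row.names = FALSE)
check <- read.csv(csv_path, stringsAsFactors = FALSE)
stopifnot(ncol(check) == ncol(pubs), identical(check$title, pubs$title))
```

If the stopifnot() call raises an error, a field was split or merged somewhere between scraping and saving, which is worth fixing before the analysis steps.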

Assignment

Head over to the Assignment1 page.