Tutorial -- Building a Web Scraper
In this tutorial you build a basic Python web scraper to download and process data. We will build the first part of the scraper together in class, and you will complete the second part on your own.
Getting Started
If you haven't already done so:
- Install Python 3.6+
- Install the following packages: pandas, beautifulsoup4 (imported as bs4), requests
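If you use pip, all three packages can be installed in one step. Note that the package for BeautifulSoup is published on PyPI as beautifulsoup4 (the exact command may differ if you use conda or a virtual environment):

```shell
pip install pandas beautifulsoup4 requests
```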
Files
- Exercise #1 Example Script: ScrapingTutorial
- Exercise #2 Starter Code: Scraping Exercise
In-Class Practice - "Wikipedia" Data
Your first task is to scrape and parse a Wikipedia page:
To get you started, we have provided an initial script that pulls data from the pages and saves it to a CSV file. Download the Exercise #1 example script (ScrapingTutorial) above. Then we will go over how to:
- Modify the script to download the Wikipedia pages and save them as HTML files on your local machine.
- Load the saved HTML files and use BeautifulSoup to parse them and extract the records.
- Save the extracted data to a CSV file using a comma as the delimiter (the default when writing pandas DataFrames to CSV).
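The three steps above can be sketched as small helper functions. This is a minimal illustration, not the provided tutorial script: the URL, file paths, and the choice of extracting `<h2>` headings are placeholder assumptions you would replace with the fields you actually want:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup


def download_page(url, path):
    """Step 1: fetch a page and save the raw HTML to a local file."""
    resp = requests.get(url)
    resp.raise_for_status()  # stop early on HTTP errors
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)


def parse_records(path):
    """Step 2: load a saved HTML file and extract records with BeautifulSoup.

    As a placeholder, this pulls the text of every <h2> heading; a real
    scraper would select the tags and attributes that hold your data.
    """
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all("h2")]


def save_csv(records, path):
    """Step 3: write records to CSV (a comma delimiter is the pandas default)."""
    pd.DataFrame({"heading": records}).to_csv(path, index=False)
```

Keeping the download and the parse as separate steps means you only hit the server once; you can then re-run the extraction on the saved HTML files as often as you like while debugging.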
On Your Own - "Publication" Data
Once you have finished with the Wikipedia data, modify your scraper to practice on a version of the publication data set that I have used in previous versions of this course. Download the Exercise #2 Starter Code above and extract the data from this URL:
https://www.lri.fr/~isenberg/VA/vispubdata/
Make sure that all data fields are meaningfully separated and that the data is ready for analysis.
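If the publication records on that page sit in an HTML table (an assumption you should verify in your browser's developer tools first), one way to get cleanly separated fields is to walk the table rows with BeautifulSoup and build a DataFrame, using the header cells as column names:

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(html):
    """Parse the first <table> in a page into a DataFrame.

    Assumes the table has a header row of <th> cells followed by
    data rows of <td> cells; adjust the selectors if the page differs.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    columns = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip the header row, which has no <td> cells
            rows.append(cells)
    return pd.DataFrame(rows, columns=columns)
```

Because each cell becomes its own DataFrame value, writing the result with `to_csv` keeps every field in its own column, which is exactly the "meaningfully separated" requirement above.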
Assignment
Head over to the AssignmentCollection page to see how scraping might be part of your next assignment.