Tutorial 1 - Building a Web Scraper
In this tutorial you will build a basic Python web scraper to download and process the "loyalty card" and "credit card" data that you will use to help solve the challenge over the next few weeks. We will build part of the scraper together in class, and you will complete the second part on your own.
You should submit the completed assignment to us before 23:00 on Monday, September 29th (details below).
UPDATE: If you want to re-run the assignment, please change the data urls to: https://www.lri.fr/~isenberg/VA/loyalty and https://www.lri.fr/~isenberg/VA/credit
Getting Started
Install Python 3 (latest version) and the package beautifulsoup
- Download the most recent release of Python 3 from https://www.python.org/downloads/
- Open the disk image and run the Python 3 installer
- On Mac: Python.mpkg
- On Windows: e.g. python-3.4.1.msi - default settings are fine
Install Beautiful Soup 4
- Open the terminal (Mac) or open a command prompt (Windows)
- On Windows, navigate to where you installed Python, e.g. by typing
cd c:\Python34
then type
cd Scripts
- Use the Python package manager, "pip" (included with Python), to install Beautiful Soup by typing:
pip install beautifulsoup4
- If you already had another version of Python installed on your machine before, you may need to force your machine to use the current version of pip by instead typing
pip3.4 install beautifulsoup4
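If you are not sure which Python installation pip installed the package into, a quick check from inside Python can help. The following is a minimal sketch: the helper name is_installed is our own, and bs4 is the module name that the beautifulsoup4 package is imported under. Run it with the same Python 3 you plan to use for the tutorial.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the running interpreter can find the named module."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter is running and whether it can see Beautiful Soup.
print("Python executable:", sys.executable)
print("Python version:", sys.version.split()[0])
print("bs4 available:", is_installed("bs4"))
```

If the last line reports False, the package was probably installed for a different Python version, and the version-specific pip command above is worth trying.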
Optional on Windows:
If you like to work with Visual Studio, install the free Python Tools for Visual Studio from here: http://pytools.codeplex.com/ They give you very nice debugging help when working with Python.
Files
Tutorial #1 Example Scripts: tutorial1_scraper_examples.py
Assignment #1 Starter Code: tutorial1_scraper_assignment.py
In-Class Assignment - "Loyalty Card" Data
Your first task is to scrape and parse the "loyalty card" dataset:
https://www.lri.fr/~wjwillett/temp/Kronos/loyalty/. *updated
To get you started, we have provided an initial Python script that can pull data from the pages and save it to a CSV file.
We will go over how to:
- Modify the Python script and use it to download pages containing the loyalty card records from the website and save them as HTML files to your local machine. (Hint: There are multiple pages of records.)
- Load and parse the HTML files and use BeautifulSoup to extract the records
- Save data to a CSV file.
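The three steps above can be sketched roughly as follows. This is not the provided starter script, only an illustration under assumptions: the function names are hypothetical, and extract_records assumes the records sit in the rows of an HTML table (tr and td tags), which you should check against the actual page source before relying on it.

```python
import csv
import urllib.request
from pathlib import Path


def download_page(url, destination):
    """Fetch one page and save its raw HTML to a local file."""
    with urllib.request.urlopen(url) as response:
        html = response.read()
    Path(destination).write_bytes(html)


def extract_records(html):
    """Return one list of cell strings per table row that has <td> cells.

    Assumption: each record is one table row. Header rows that only
    contain <th> cells are skipped.
    """
    # Imported here so the other helpers work before Beautiful Soup is installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if cells:
            records.append(cells)
    return records


def save_csv(records, destination):
    """Write the records to a CSV file, one record per line."""
    with open(destination, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(records)
```

A typical run would call download_page once for each page of records, read each saved HTML file back in, pass its contents to extract_records, collect the results in one list, and finally hand that list to save_csv. Saving the HTML locally first means you can rerun the parsing step without downloading the pages again.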
On Your Own - "Credit Card" Data
Once you have completed the "loyalty card" dataset, you should then modify your scraper to work with the "credit card" data:
https://www.lri.fr/~wjwillett/temp/Kronos/credit/. *updated
Submitting The Assignment
WHAT - Before the deadline, you should submit by email a single ZIP file called "{YOUR_NAME}-Assignment1.zip" containing:
- Two CSV files - one containing all of the credit card records and the other containing all of the loyalty card data.
- A file containing all of the Python scripts you used to download, parse, and save the data. This code must be clearly commented.
WHERE - You should email the file to wesley.willett@inria.fr
WHEN - Remember that Assignment 1 is due before 23:00 on Monday, September 29th.
Python Tips and Tricks
How to get the Windows console to display unicode characters
- Click the icon in the top left corner of the console window, select "Properties", go to the "Fonts" tab, then select the "Lucida Console" font. Press OK.
- In your console, type "chcp 437"
- Now type python
- Now type print("\u03A9") and you should see a pretty omega symbol
Note: All of this can be avoided if you use the console that comes with Visual Studio's Python tools.
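To see why the code page matters, you can ask Python whether a character can be encoded in a given console encoding. This is a small sketch with a hypothetical helper name (can_display). It only checks the encoding, not whether the console font has a glyph for the character, which is what the font step above takes care of.

```python
import sys


def can_display(text, encoding=None):
    """Return True if text can be encoded in the given (or current) encoding."""
    # Fall back to the console's encoding, or plain ASCII if none is reported.
    encoding = encoding or sys.stdout.encoding or "ascii"
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


omega = "\u03A9"
print("cp437:", can_display(omega, "cp437"))
print("ascii:", can_display(omega, "ascii"))
print("current console:", can_display(omega))
```

Code page 437 includes the omega character, which is why the print call works after "chcp 437". Characters that are missing from the active code page can still fail to print.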