Tutorial 1 - Building a Web Scraper

In this tutorial you will build a basic Python web scraper to download and process the "loyalty card" and "credit card" data that you will use to help solve the challenge over the next few weeks. We will build the first part of the scraper together in class, and you will complete the second part on your own.

You should submit the completed assignment to us before 23:00 on Monday, September 29th (details below).

UPDATE: If you want to re-run the assignment, please change the data URLs to: https://www.lri.fr/~isenberg/VA/loyalty and https://www.lri.fr/~isenberg/VA/credit

Getting Started

Install Python 3 (latest version) and the Beautiful Soup package

  1. Download the most recent release of Python 3 from https://www.python.org/downloads/
  2. Open the disk image and run the Python 3 installer
    • On Mac: Python.mpkg
    • On Windows: e.g. python-3.4.1.msi - default settings are fine

Install Beautiful Soup 4

  1. Open the terminal (Mac) or a command prompt (Windows)
    • On Windows, navigate to where you installed Python (e.g. by typing cd c:\Python34), then type cd Scripts.
  2. Use the Python package manager, "pip" (included with Python), to install Beautiful Soup by typing: pip install beautifulsoup4
    • If you already had another version of Python installed on your machine, you may need to force your machine to use the current version of pip by typing pip3.4 install beautifulsoup4 instead
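
To confirm that both installs worked, a quick check along these lines can help. It is only a sketch: the helper name check_setup is made up for this example, and it only tests whether the interpreter you are running can find the bs4 package, not which version you have.

```python
import importlib.util
import sys


def check_setup():
    """Return (python_ok, bs4_installed) for the interpreter running this script."""
    python_ok = sys.version_info.major == 3
    # find_spec returns None when the package cannot be found.
    bs4_installed = importlib.util.find_spec("bs4") is not None
    return python_ok, bs4_installed


python_ok, bs4_installed = check_setup()
print("Running Python", sys.version.split()[0])
print("Python 3:", "yes" if python_ok else "no")
if bs4_installed:
    print("Beautiful Soup 4: installed")
else:
    print("Beautiful Soup 4: not found, try: pip install beautifulsoup4")
```

If Beautiful Soup is reported as not found even though pip finished without errors, pip probably installed it for a different Python version than the one running the script.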

Optional on Windows:
If you like to work with Visual Studio, install the free Python Tools for Visual Studio from http://pytools.codeplex.com/. They provide very helpful debugging support when working with Python.

Files

Tutorial #1 Example Scripts: tutorial1_scraper_examples.py

Assignment #1 Starter Code: tutorial1_scraper_assignment.py

In-Class Assignment - "Loyalty Card" Data

Your first task is to scrape and parse the "loyalty card" dataset:

https://www.lri.fr/~wjwillett/temp/Kronos/loyalty/ *updated

To get you started, we have provided an initial Python script that can pull data from the pages and save it to a CSV file.

We will go over how to:

  • Modify the Python script and use it to download pages containing the loyalty card records from the website and save them as HTML files to your local machine. (Hint: There are multiple pages of records.)
  • Load and parse the HTML files and use BeautifulSoup to extract the records.
  • Save data to a CSV file.
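
The three steps above can be sketched roughly as follows. This is a minimal sketch, not the provided starter script. It assumes that each record is a table row (<tr>) whose fields are <td> cells, which you should check against the real pages. The function names and the sample HTML are made up for illustration.

```python
import csv
import io
import urllib.request

from bs4 import BeautifulSoup


def download_page(url, filename):
    """Step 1: fetch one page and save the raw HTML to a local file."""
    with urllib.request.urlopen(url) as response:
        html = response.read()  # bytes, exactly as served
    with open(filename, "wb") as f:
        f.write(html)


def parse_records(html):
    """Step 2: return one list of cell strings per table row.

    ASSUMPTION: records are <tr> rows made of <td> cells. Inspect the
    real pages and adjust the tag names if they are laid out differently.
    """
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # rows without <td> cells (e.g. <th> headers) are skipped
            records.append(cells)
    return records


def write_csv(records, out):
    """Step 3: write the records to an open text file as CSV."""
    csv.writer(out).writerows(records)


# Made-up sample page so the sketch runs without network access.
SAMPLE_HTML = """
<table>
  <tr><th>date</th><th>location</th><th>price</th></tr>
  <tr><td>01/06/2014</td><td>Cafe A</td><td>4.17</td></tr>
  <tr><td>01/06/2014</td><td>Cafe B</td><td>12.50</td></tr>
</table>
"""

records = parse_records(SAMPLE_HTML)
buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

When writing to a real file rather than an in-memory buffer, open it with open("loyalty.csv", "w", newline="", encoding="utf-8") so that the csv module controls the line endings. Because there are multiple pages of records, download_page would be called once per page, and the parsed records from all pages collected into one list before saving.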

On Your Own - "Credit Card" Data

Once you have finished scraping the "loyalty card" dataset, modify your scraper to work with the "credit card" data:

https://www.lri.fr/~wjwillett/temp/Kronos/credit/ *updated
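
One possible way to keep a single scraper for both datasets is to collect everything that differs between them in one place. The sketch below is only a suggestion: the base URLs are the ones listed in this tutorial (swap in the URLs from the UPDATE note if you are re-running the assignment), while the output file names, the page-numbering scheme, and the helper names are hypothetical. The credit card pages may also have different fields from the loyalty card pages, so the parsing step may need adjusting too.

```python
# Base URLs are the ones listed in this tutorial; the CSV file names
# are hypothetical choices.
DATASETS = {
    "loyalty": {
        "url": "https://www.lri.fr/~wjwillett/temp/Kronos/loyalty/",
        "csv": "loyalty.csv",
    },
    "credit": {
        "url": "https://www.lri.fr/~wjwillett/temp/Kronos/credit/",
        "csv": "credit.csv",
    },
}


def html_filename(dataset, page_number):
    """Local file name for one downloaded page (hypothetical naming scheme)."""
    return "{}_page_{:03d}.html".format(dataset, page_number)


def plan(dataset):
    """Describe what a scraper run for `dataset` would read and write."""
    settings = DATASETS[dataset]
    return {
        "start_url": settings["url"],
        "first_html_file": html_filename(dataset, 1),
        "csv_file": settings["csv"],
    }


for name in DATASETS:
    print(name, plan(name))
```

With this layout, switching datasets means changing one argument rather than editing URLs and file names scattered through the script.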

Submitting The Assignment

WHAT - Before the deadline, you should submit a single ZIP file called "{YOUR_NAME}-Assignment1.zip" via email. It should contain:

  1. Two CSV files - one containing all of the credit card records and the other containing all of the loyalty card data.
  2. A file containing all of the Python scripts you used to download, parse, and save the data. This code must be clearly commented.
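
If you would like to build the ZIP file from Python rather than by hand, the standard zipfile module can do it. This is an optional sketch: the helper name build_submission and the CSV file names are assumptions, and the demonstration uses empty placeholder files in a temporary folder instead of your real results.

```python
import os
import tempfile
import zipfile


def build_submission(your_name, files, out_dir="."):
    """Bundle `files` into {your_name}-Assignment1.zip and return its path."""
    zip_path = os.path.join(out_dir, "{}-Assignment1.zip".format(your_name))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store each file under its base name, without local folders.
            archive.write(path, arcname=os.path.basename(path))
    return zip_path


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    names = ["loyalty.csv", "credit.csv", "tutorial1_scraper_assignment.py"]
    paths = []
    for name in names:
        path = os.path.join(folder, name)
        open(path, "w").close()
        paths.append(path)
    zip_path = build_submission("YOUR_NAME", paths, out_dir=folder)
    with zipfile.ZipFile(zip_path) as archive:
        archived = sorted(archive.namelist())
    zip_name = os.path.basename(zip_path)

print(zip_name, archived)
```

For a real submission, call build_submission with your own name and the paths to your two CSV files and your script.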

WHERE - You should email the file to wesley.willett@inria.fr

WHEN - Remember that Assignment 1 is due before 23:00 on Monday, September 29th.

Python Tips and Tricks

How to get the Windows console to display Unicode characters

  • Click the icon in the top-left corner of the console window, select "Properties", go to the "Fonts" tab, then select the "Lucida Console" font. Press OK.
  • In your console, type "chcp 437"
  • Now type python
  • Now type print("\u03A9") and you should see a pretty omega symbol

Note: All of this can be avoided if you use the console that comes with Visual Studio's Python tools.
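
If printing still fails with a UnicodeEncodeError, it can help to check which encoding Python thinks the console uses. The following is a small diagnostic sketch; the helper name can_encode is made up for this example.

```python
import sys

OMEGA = "\u03A9"  # Greek capital letter omega


def can_encode(text, encoding):
    """Return True if `text` can be written using `encoding`."""
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False


# sys.stdout.encoding can be None when output is redirected.
console_encoding = sys.stdout.encoding or "ascii"
if can_encode(OMEGA, console_encoding):
    print(OMEGA)
else:
    print("Console encoding", console_encoding, "cannot display omega")
```

Code page 437 includes the omega character, which is why the chcp 437 step above makes the print call work.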