Exploratory Analysis
By now you will have identified a dataset of interest as well as a couple of candidate research questions. Remember that for the remainder of all assignments leading towards your final project:
- you can but don't have to answer more than one exploratory question
- you can but don't have to use more than one dataset
Whether you choose multiple questions and/or multiple datasets depends on the granularity of your question and the size/completeness of the individual datasets you found.
In this assignment you will perform an exploratory analysis with your team to better understand your data and how it helps you to answer your questions. Your final submission is a report consisting of captioned visualizations that convey key insights gained during your analysis.
Exploratory Analysis
You will perform the exploratory analysis in one or more tools such as Tableau, R (with ggplot), or Python (with altair or matplotlib). An exploratory analysis generally consists of two phases, the first of which you should already have completed for the last assignment:
Phase 1
Get an overview of the shape and structure of your dataset. Identify errors and quality problems with the data. Look at the distributions of variables where appropriate and check whether the values are plausible. If you find new errors, you might have to repeat cleaning steps or even search for updated data.
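As a rough illustration of such a Phase 1 check, here is a minimal Python sketch that uses only the standard library. The sample data, the column names, and the helper profile_columns are hypothetical and only serve as an example; in practice you would read your own file and would probably use the tool you chose for the assignment (Tableau, R, or a Python data library).

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample data; in practice you would open your own CSV file.
SAMPLE_CSV = """city,year,population
Paris,2020,2148000
Lyon,2020,516000
Marseille,2020,
Paris,2021,2139000
Lyon,2021,-1
"""


def profile_columns(rows):
    """Return, per column, the number of missing and distinct values,
    plus min/median/max for numeric columns or the most frequent
    values for text columns."""
    profile = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v not in ("", None)]
        summary = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            # Not every value parses as a number: treat the column as text.
            summary["top_values"] = Counter(present).most_common(3)
        else:
            if numbers:
                summary.update(
                    min=min(numbers),
                    median=statistics.median(numbers),
                    max=max(numbers),
                )
        profile[col] = summary
    return profile


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = profile_columns(rows)
print(f"{len(rows)} rows, {len(profile)} columns")
for col, summary in profile.items():
    print(col, summary)
```

In this made-up sample the summary would point you to one missing population value and to an implausible minimum of -1, both of which are the kind of quality problem you should investigate before moving on.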
Phase 2
Start doing an analysis related to your research question(s). During the analysis take note of any new questions that arise or modifications you might still have to make to your initial research question(s). For each question you have, start by creating a visualization that might provide a useful (potentially partial) answer, then update the visualization by adding additional variables, sorting the axes, etc. The visualizations should help you check your assumptions, expose anything unexpected in the data, and get a better sense of the data you are dealing with. Repeat this process for each question, generate multiple visualizations for each, and feel free to add new questions and/or revise your original questions as necessary (keep track!).
In this phase you will potentially create a lot of visualizations. Doing so is good practice. For the report, focus on those visualizations that show something interesting about your data.
For grading the report we will look at the breadth of the question/s explored (e.g. did you look at multiple parts related to your question or stick to just one aspect?), the depth of your analysis (e.g. did you only look at superficial, obvious aspects of your question or try to dig deeper?), the clarity of the visualizations created (but since we haven't yet covered what "effective" visualizations are, this will not be graded), and how clearly you communicated your findings in captions and summary (see below).
Your deliverables
WHAT - To complete the assignment you should produce a report that looks roughly as follows:
- A title page with your project topic and team member names & emails
- A section entitled "Data" - here add a short description of your dataset/s - what are they about and where do they come from. Keep this section short. It is meant to serve as a reminder for the person grading your assignment.
- A section entitled "Research Questions" - here make two bulleted lists listing a) the research questions you set out to analyze in this assignment (that is, the questions as they were stated at the end of the last assignment) and b) any research questions you may have added or questions you may have modified
- A section entitled "Discoveries & Insights": Here add 10 (or more) visualizations created with a visualization tool and add descriptive captions for each image. Choose visualizations that show your most important insights, such as surprises or issues as well as visualizations that give answers towards your analysis questions. Each visualization should come with:
- a title
- a caption of 1-4 sentences describing what you learned from the visualization. Provide sufficient detail for each caption so that someone who isn't familiar with your project or data can understand what you learned.
- optional: add annotations to the data where it makes sense to highlight specific aspects you want to draw attention to
- A section entitled "Summary" where you summarize the most important lessons learned from your exploratory analysis
WHERE - You should email the file to petra.isenberg@inria.fr with the subject VA-Assignment-5.
WHEN - Assignment 5 is due before 23:00 on Nov 4th, so you have roughly 3 weeks to complete it.
FAQ
I want to do a specific analysis / visualization but can't figure out how to create it
Since you have about 3 weeks to complete this assignment, I expect you to go out and learn how to create the graphics you need from online material. Resources you might find useful:
- https://www.tableau.com/learn/training
- https://www.perceptualedge.com/blog/?p=2080, some careful inspiration on charts you can consider
- If you would like to learn more about EDA (exploratory data analysis) in R, load swirl with library(swirl) and run install_from_swirl("Exploratory_Data_Analysis"). I recommend learning ggplot before base graphics as it generates better visualizations right away - see a similar opinion expressed here.
I have graph data, what should I do?
In case of graph data you can still extract some metadata that you can explore in tools such as Tableau that don't do graphs per se. If you would like to see your network structure there are some other tools you can experiment with:
- NodeXL
- Vistorian: partly from our research lab. Use at your own risk.
- https://rawgraphs.io/
- https://gephi.org/
- https://cytoscape.org/
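As one example of the metadata extraction mentioned above, the following Python sketch computes the degree of each node from an edge list and writes the result as a plain table that a tool such as Tableau could import. The edge list, the function names, and the choice of degree as the metadata of interest are assumptions made for illustration; your own graph data may call for other measures.

```python
import csv
import io
from collections import Counter

# Hypothetical undirected edge list (source, target); replace with your own data.
EDGES = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]


def node_degrees(edges):
    """Count how many edge endpoints touch each node in an undirected edge list."""
    degrees = Counter()
    for source, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    return degrees


def degrees_to_csv(degrees):
    """Serialize the degree table as CSV text, highest degree first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["node", "degree"])
    # Sort by descending degree, then by node name for a stable order.
    for node, degree in sorted(degrees.items(), key=lambda item: (-item[1], item[0])):
        writer.writerow([node, degree])
    return buffer.getvalue()


degrees = node_degrees(EDGES)
table = degrees_to_csv(degrees)
print(table)
```

To use the table elsewhere, write the returned text to a file and load that file like any other tabular dataset.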