Data Briefs Submitted to the InfoVis 2018/2019 Class


1) Visualizing Fronts of Performance in Athletics

The data are track and field performances gathered and published by the International Association of Athletics Federations (IAAF). Whatever the scientific or societal domain performance measures come from (e.g., psychology, management, sports, or medicine), they must be recognized as special random variables that violate the basic assumptions of classical statistics. The performance data tabulated by the IAAF are both remarkably rigorous (being based on strictly standardized measurement procedures) and abundant (emanating from many competitors practicing many disciplines over several decades). While ignoring the tails of the distributions (e.g., all times beyond 11 s in the 100 meters), the tables provide full information on the best performances, called "records" in sports. The hope, among other things, is to learn about the shape of the fronts of distributions, which, unlike tails, cannot be convex.

Domain: Athletics and, more generally, human performance.

Intended Audience

Anyone interested in probability theory and statistics.

No special skills are required, just a little courage to apply (under my supervision) non-conventional statistical ideas.

Information on the Data

Data Details: Performance records in track and field athletics, including runs, jumps, and throws.

Data Collection: IAAF

Data Example: https://www.iaaf.org/home

Interesting Challenges and Questions About the Data

Challenges: Test with appropriate aggregations and visualizations the hypothesis that performance distributions, unlike distributions of usual measures, have a convex tail on one side and a non-convex front on the other side.
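As a first, naive probe of the hypothesis stated in the Challenges (not the theory behind it, which is not detailed in this brief), one could bin the performances and look at the sign of the second differences of the bin counts on each side of the mode. The sketch below assumes equally spaced bins; the function names and the toy counts are hypothetical and only illustrate the idea.

```python
def second_differences(counts):
    """Discrete second derivative of a sequence of equally spaced bin counts."""
    return [counts[i - 1] - 2 * counts[i] + counts[i + 1]
            for i in range(1, len(counts) - 1)]


def is_convex(counts):
    """True if every second difference is non-negative (a convex profile)."""
    return all(d >= 0 for d in second_differences(counts))


# Toy counts, ordered from the mode outwards (invented for illustration).
tail_counts = [16, 8, 4, 2, 1]      # decays like a classic convex tail
front_counts = [16, 15, 12, 7, 0]   # drops off ever faster toward the record

print(is_convex(tail_counts))   # True
print(is_convex(front_counts))  # False
```

Real counts are noisy, so some smoothing or pooling across seasons would probably be needed before reading anything into the signs.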

Questions: Sorry, but I can't answer for lack of space to explain the theory from which the predictions to be tested are derived.

Current State of Analysis: I have started to investigate 100m and long jump data.

Additional Material

2) Carnivorous Plants in the Wild

Locations where carnivorous plants grow in the wild, based on pictures taken and posted to social media.

Domain: plant biology

Intended Audience

plant biologists, worldwide group of carnivorous plant enthusiasts

programming language does not matter much (e.g., Python), but the goal should be a nice stand-alone interactive visualization tool

Information on the Data

Data Details: The data contains locations where carnivorous plants (potentially) grow in the wild. The data includes geographic locations, elevation (based on location lookup), names of the plants or the genus, name of the region (non-systematic), date of picture, login of social media account, location duplicates, IDs to find social media URLs

Data Collection: The data was collected manually by searching Panoramio and Flickr for specific keywords and then visually checking whether each image shows a plant and whether the location is believable.

Data Example: no, private data

Interesting Challenges and Questions About the Data

Challenges: the data is dirty; it is not consistent (it reflects only where people travel and where they have access to cameras/Internet); and it could contain deliberately incorrect location information (to protect the habitat). Goals: derive geographic, elevation-based, and genus-based visualizations; compare the visualizations with published distribution maps; anonymize location data in geographic visualizations.
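One simple way to approach the anonymization goal is to snap each location to the centre of a coarse grid cell before plotting it. The sketch below is only an illustration under that assumption; the function name, the cell size, and the sample coordinates are hypothetical and not part of the dataset.

```python
import math


def coarsen(lat, lon, cell_deg=0.5):
    """Snap a coordinate to the centre of its grid cell (cell size in degrees)."""
    def snap(value):
        return math.floor(value / cell_deg) * cell_deg + cell_deg / 2
    return snap(lat), snap(lon)


# Invented sample coordinates, not taken from the private dataset.
print(coarsen(48.8566, 2.3522))             # (48.75, 2.25)
print(coarsen(-33.9, 18.4, cell_deg=1.0))   # (-33.5, 18.5)
```

A cell measured in degrees covers a different ground distance at different latitudes, so a metric grid may be preferable if a uniform level of protection is required.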

Questions: where do the plants grow, when do people find them, what are outliers, what data is likely incorrect, can this data be connected to/extended with other information, ...

Current State of Analysis: some simple graphs

Additional Material

https://en.wikipedia.org/wiki/Carnivorous_plant

3) MCF

soccer championship results and attendance

Domain: sport, soccer

Intended Audience

soccer clubs for instance, TV channels, journalists, etc.

understanding soccer championship rules

Information on the Data

Data Details: soccer Ligue 1 results, rankings, number of points, and number of spectators in the stadium for every game

Data Collection: mostly LFP and RSSSF

Data Example: e.g., http://vernier.frederic.free.fr/Infovis/gapChart/data/Ligue1/2017Aff.csv and http://vernier.frederic.free.fr/Infovis/gapChart/data/Ligue1/2017.csv

Interesting Challenges and Questions About the Data

Challenges: What makes spectators come to see a soccer game?

Questions: Is there a relationship between some teams and their crowds?

Current State of Analysis: http://vernier.frederic.free.fr/Infovis/gapChart/

Additional Material

http://charlesperin.net/projects/gapchart

4) Safety Incidents

This dataset comes from the Canadian National Energy Board and contains information about every reported safety incident (spills, fires, injuries, etc.) on inter-provincial oil and gas pipelines that has occurred in the last 10 years. The data contains a large number of different attributes for each incident type, which makes them challenging to visualize together. We would like to explore how to create simpler visualizations of pipeline incidents that make this complexity understandable and accessible to members of the public.

Domain: Energy, Safety, Regulation

Intended Audience

Members of the public (Canadian and international) who would like to understand the safety/risk associated with oil and gas pipelines.

General visualization skills are sufficient.

Information on the Data

Data Details: The data table contains information about ~1200 pipeline incidents that have occurred in Canada since 2008. It includes data about the location of each incident, its cause, and its impact.

Data Collection: The data has been collected, cleaned, and publicly shared by the Canadian National Energy Board (NEB) as part of their open data initiatives.

Data Example: https://www.dropbox.com/s/aabbbyscnzbd685/NEB%20Pipeline-Incidents%20%282008-01%20to%202018-06%29.csv?dl=0

Interesting Challenges and Questions About the Data

Challenges: This data contains a lot of heterogeneous information about each incident. We would like to explore new visualization designs that allow viewers to clearly see trends and relationships between incidents. It may also be

Questions: What is the impact of typical incidents? Are there trends in the data or relationships between types of incidents?

Current State of Analysis: This data has already been visualized as part of several open-data projects with the NEB, but we would like groups to explore alternative visualization designs.

Additional Material

http://www.neb-one.gc.ca/

5) Visualisation and Analysis of data collected on the SWING line of Synchrotron Soleil

The available data correspond to experimental analyses collected at the Synchrotron Soleil on the beam line SWING (https://www.synchrotron-soleil.fr/en/beamlines/swing). The data are Small-Angle X-ray Scattering (SAXS) spectra organized in sequences that monitor an in vitro digestion process of plant proteins (rapeseed proteins). The problem here is to design a visualization that helps biologists and physicists analyze such complex sequences of spectra.

Domain: Physics and biology. Physics of decomposition of food during digestion.

Intended Audience

The requested analysis/visualization interface should be developed for physicists and biologists. This software may be applied to other datasets with similar data corresponding to other dynamic processes. Such a visualization interface may be of general interest to the international SAXS/SANS community, which is quite large.

• Programming skills: being able to build interfaces between different pieces of software and to work with existing software.
• Interest in large facility devices (Synchrotron Soleil).
• Interest in complex biological processes.

Information on the Data

Data Details: The data are Small-Angle X-ray Scattering (SAXS) spectra organized in sequences. They come from monitoring an in vitro digestion process of plant proteins (rapeseed proteins).

Data Collection: Experimental data were collected at the Synchrotron Soleil on the beam line SWING (https://www.synchrotron-soleil.fr/en/beamlines/swing).

Data Example: Will be sent on demand.

Interesting Challenges and Questions About the Data

Challenges: Researchers currently rely on two pieces of software. FoxTrot (distributed by Synchrotron Soleil) processes the raw data after acquisition (image data in “.nxs” format) and transforms them into curves (diffraction spectra in text format). These curves are then "fitted" to various theoretical models using SasView, a freeware tool of the SAXS/SANS community that already includes some visualisations. A dedicated common visualization interface would facilitate the interpretation of the results (the spectra sequences and the parameters of the fitted models): we are dealing with a complex dynamic process (digestion of plant proteins) and with complex mathematical models (plant proteins and enzymes). The issues thus concern not only the visualisation of raw and processed data, but also the management of the processing pipeline (when, and for which files, to use FoxTrot and SasView). For instance, we need to subtract background spectra acquired at regular intervals during the experiment, which may differ from one sequence of acquisitions to another. We then need to fit the spectra using different possible mathematical models. Interpretations are built by observing the sequence of spectra and the sequence of model parameters (digestion is a very complex process, and various phenomena occur in competition or in synergy). It is important to be able to easily revisit some processing steps, to automate processing on sequences of spectra, and to see the "deformations" of models in an easy and understandable way: compare full sequences, organise them differently, and compare the initial and final states of two or more different sequences (corresponding to mixtures of proteins, for instance).

Questions: An interface dedicated to sequences of SAXS/SANS spectra, with clear monitoring of the processing pipelines, would be a precious help for the exploitation of large datasets. A versatile way of visualizing the results of fittings (the evolution of the parameters of the fitted models) would also be a huge improvement.

Current State of Analysis: Analyses have been performed using the available software (FoxTrot and SasView), and the results have then been displayed using MATLAB plotting, in a rather “manual” way. A first step of the project will be to observe how we proceed manually, in order to design a possible global interface.
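The background-subtraction step mentioned in the Challenges could be automated along the following lines. This is a minimal sketch, assuming each spectrum is a list of intensities sampled on a common grid and that the most recent background acquired before a spectrum is the one to subtract; the actual pairing rule may differ (interpolating between two backgrounds, for example), and the function name and data layout are hypothetical, not part of FoxTrot or SasView.

```python
from bisect import bisect_right


def subtract_background(spectra, backgrounds):
    """Subtract from each spectrum the latest background acquired before it.

    Both arguments are lists of (time, intensities) pairs sorted by time.
    If no background precedes a spectrum, the first background is used.
    """
    bg_times = [t for t, _ in backgrounds]
    corrected = []
    for t, intensities in spectra:
        idx = max(bisect_right(bg_times, t) - 1, 0)
        _, bg = backgrounds[idx]
        corrected.append((t, [i - b for i, b in zip(intensities, bg)]))
    return corrected


# Invented toy values: two backgrounds, two spectra of two points each.
backgrounds = [(0.0, [1.0, 1.0]), (10.0, [2.0, 2.0])]
spectra = [(5.0, [5.0, 4.0]), (12.0, [5.0, 4.0])]
print(subtract_background(spectra, backgrounds))
# [(5.0, [4.0, 3.0]), (12.0, [3.0, 2.0])]
```

Keeping the pairing rule in one small function makes it easy to revisit the choice later and rerun a whole sequence.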

Additional Material

http://Evelyne-lutton.fr/PosterInfoGestBOUE-29mars2017.pdf

6) Eye Tracking Visualization

Visualizing eye tracking data helps to understand where participants look when they are exposed to a stimulus. This makes it possible to make statements about the strategies participants apply and to find groups of participants with similar strategies.

Domain: Eye tracking, Visualization

Intended Audience

Typically, eye tracking experts use visualizations of eye tracking data. They are not necessarily experts in the domain of visualization and data analysis.

A basic understanding of eye tracking experiments and eye tracking data is helpful but not necessary.

Information on the Data

Data Details: eye tracking data from an evaluation of different visualizations

Data Collection: Eye tracking study we conducted at the University of Stuttgart

Data Example: via Email

Interesting Challenges and Questions About the Data

Challenges: A large amount of data is usually collected during an eye tracking experiment. It is therefore important to explore the data visually and to reduce its volume using common techniques such as aggregation, filtering, and pattern retrieval.
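As an example of the aggregation mentioned in the Challenges, gaze samples can be binned into a coarse grid to obtain attention counts per screen region. The sketch below assumes gaze samples are available as (x, y) pixel coordinates; the function name, cell size, and sample points are hypothetical.

```python
from collections import Counter


def gaze_heatmap(points, cell=50):
    """Count gaze samples per square cell of `cell` pixels; keys are (column, row)."""
    return Counter((int(x // cell), int(y // cell)) for x, y in points)


# Invented gaze samples in pixel coordinates.
samples = [(10, 10), (15, 12), (120, 40)]
print(gaze_heatmap(samples))  # Counter({(0, 0): 2, (2, 0): 1})
```

The resulting counts can be drawn as a heatmap over the stimulus or compared between participants.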

Questions: What are common patterns in the data, which strategies do participants use, are there common strategies among participants, can we classify participants into groups?

Current State of Analysis: So far the data has not been analyzed in detail. A project on eye tracking visualizations in general can be found here: http://www.rtgct.fbeck.com/

Additional Material

7) Transition Analysis

Our data shows how people transition from one method of completing a task to a more efficient one, in the context of command selection and the transition from menus to keyboard shortcuts. We are seeking to understand which factors can motivate users to start this transition and how long it takes until they switch entirely to the new method.

Domain: mixed data, multi-series, annotation

Intended Audience

The audience is researchers who are interested in exploring and understanding data so that they can find patterns of behavior and formulate hypotheses.

We expect users to know web technologies such as HTML/CSS/JavaScript and perhaps Python. A background in machine learning and AI is a plus.

Information on the Data

Data Details: This dataset contains information about how users' behavior changes in terms of performance and strategy choice while they select commands. In total it includes 42 different users, and each user had to choose 14 commands multiple times, so the dataset contains 588 time series in total.
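The count of 588 follows from one time series per (user, command) pair, i.e. 42 × 14. A minimal sketch of that grouping is shown below; the record layout (user, command, trial, value) is an assumption made for illustration and may not match the columns of the actual CSV file.

```python
from collections import defaultdict


def build_series(rows):
    """Group (user, command, trial, value) records into one series per (user, command)."""
    series = defaultdict(list)
    for user, command, trial, value in rows:
        series[(user, command)].append((trial, value))
    for key in series:
        series[key].sort()  # order each series by trial number
    return dict(series)


# Invented records, not taken from the study data.
rows = [
    ("u1", "copy", 2, 1.4),
    ("u1", "copy", 1, 2.0),
    ("u1", "paste", 1, 1.9),
    ("u2", "copy", 1, 2.2),
]
series = build_series(rows)
print(len(series))   # 3
print(42 * 14)       # 588
```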

Data Collection: The data is from a study conducted by Grossman et al., 2007. In this study, they compared three alternative menu designs that promote keyboard shortcuts.

Data Example: https://www.dropbox.com/s/2no3xgug92p6z5k/tovi_data_processed.csv?dl=0

Interesting Challenges and Questions About the Data

Challenges: This dataset contains multiple time series of different lengths, and they include mixed data (i.e., discrete and continuous). We foresee that visualizing the relationships between the time series can be challenging.

Questions: We are interested in understanding how users' behavior is affected by different factors, e.g., frequency of command appearance, menu design, the distance between two consecutive appearances of the same command, etc.

Current State of Analysis:

Additional Material

8) Ocean Flow Data

A flow dataset from a climate research lab. My goal is to visualize it correctly in augmented reality. Climate data can be very complex, requiring many experts to look at it.

Domain: Climate science

Intended Audience

Climate experts.

Do not know yet. For my PhD I am doing everything in C++ (performance requirements).

Information on the Data

Data Details: This data describes the state of the ocean (temperature, velocity, pressure, etc.).

Data Collection: It comes from another laboratory in Germany. I do not know more yet (I have not asked about it).

Data Example: https://swift.dkrz.de/v1/dkrz_8656c91ce0734327b6dc867fc5b6b068/Kitware/Agulhas_10.nc?temp_url_sig=459fcd5b7cdbdad4d95ba29525861621c0649700&temp_url_expires=2019-11-07T13:29:18Z (readable with ParaView using the CDI netCDF Reader)

Interesting Challenges and Questions About the Data

Challenges: I would like to see if we can visualize the data properly in 3D using stereoscopic technology (like HoloLens). Issues such as occlusion can easily arise.

Questions: What kinds of structures are the experts able to see? Can they easily get an impression of the data without using analytics tools?

Current State of Analysis: I just received the dataset; it still has to be investigated further.

Additional Material

9) Police Department Incident Reports 2018

This dataset includes the incidents reported to the Police Department of the city of San Francisco. It contains information about the incidents, the time at which the incidents were reported to the police, and the police’s reactions/resolutions. We are interested in seeing not only when, where, and how these crimes happened in the city, but also the trends and potential of the crimes, the reasons behind them, the relationships among the unsolved cases, how the incident times and the report times (dual timeline) influence the cases, and so on.

Domain: Public Safety, Socrata, Crime

Intended Audience

Police Department, journalists, detectives

no specific requirement.

Information on the Data

Data Details: This dataset includes police incident reports filed by officers and by individuals through self-service online reporting for non-emergency cases. The dataset can be regarded as two interrelated parts: the incidents and the police’s reactions (reports). The information about the incidents contains temporal information (date/time), categories, geographic information (intersections, longitude, latitude, ...), the incident ID, and a description. The information about the police’s responses also contains temporal data, the report type, the police district, the Analysis Neighborhood, and the resolution of the incident at the time of the report.

Data Collection: https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/ This dataset is hosted by the city of San Francisco on its open data platform, which is updated as new data comes in. San Francisco's data can also be explored on Kaggle through the San Francisco organization page.

Data Example: https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783/data

Interesting Challenges and Questions About the Data

Challenges:
1. Multivariable dual timeline: the dataset can be regarded as two interrelated parts, the incidents and the reports. Both parts can be placed on a timeline and have other properties (locations, categories, descriptions). How can we visualise these two classes of temporal data? Showing all these variables on a single timeline would produce a complicated graph, yet what we often need is precisely to see the relationships between these different properties.
2. Diverging information: in order to find the trends and the potential of the crimes, we need to find a compelling way to show the relationships, or the trends of convergence and divergence, among the events (or certain properties of the events).

Questions:
1. The basic facts: when, where, and how these crimes happened in the city. Journalists, for their part, may be interested in the police's reactions.
2. Incidents are usually reported days (or even a month) after they occur. Does something happen during these intervals? If people reported on time, could some incidents have been avoided?

Current State of Analysis: I'm working on a dataset about the turnovers and position changes of French municipal councillors.
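The reporting interval raised in the second question can be computed directly from the two timestamps of each record. The sketch below assumes both timestamps are available as ISO 8601 strings; the function name and sample values are hypothetical, and the actual column names and formats should be checked against the dataset.

```python
from datetime import datetime


def reporting_delay_days(incident_iso, report_iso):
    """Delay between an incident and its report, in days (may be fractional)."""
    incident = datetime.fromisoformat(incident_iso)
    report = datetime.fromisoformat(report_iso)
    return (report - incident).total_seconds() / 86400


# Invented timestamps for illustration.
print(reporting_delay_days("2018-03-01T12:00:00", "2018-03-04T12:00:00"))  # 3.0
```

A histogram of these delays per incident category would be one possible starting point for the dual-timeline view.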

Additional Material

We also have the dataset of the previous year, in a slightly different structure: https://www.kaggle.com/san-francisco/sf-police-calls-for-service-and-incidents

10) Representation of multi-modal networks

The data covers the four main public transport modes in Paris, which are usually shown separately or grouped as Metro/Tram/RER on one side and buses on the other. A representation of all networks together could help users grasp the possibilities offered by switching from one network to another when moving around Paris.

Domain: Public Transport, Network, Graph

Intended Audience

Network controller / public transport users

Unity skills could reduce the model creation time

Information on the Data

Data Details: GTFS data about the RATP network in Paris, including buses, metro, RER, and tram. It includes stop positions, arrival times at each station, routes followed, ...

Data Collection: Data from RATP Open Data

Data Example: https://data.ratp.fr/explore/dataset/offre-transport-de-la-ratp-format-gtfs/information/

Interesting Challenges and Questions About the Data

Challenges: Show a visualization of the data that uses all three dimensions and makes the different networks understandable. The data covers four main public transport modes, which often work together through transfers from one mode to another, but these transfers are not represented in the data. Furthermore, the density of each network is very different, and simply showing every network on the same layer would not give a good understanding. A further challenge would be to include the temporal data to increase the accuracy of the visualization.

Questions: What are my options to go from station A to station B? Which areas have poor public transport density?
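Since the transfers between modes are not given explicitly, one possible approximation is to treat stops of different modes that lie within walking distance of each other as candidate transfers. The sketch below assumes each stop has already been tagged with its mode (the `mode` key is hypothetical and would have to be derived from the routes serving the stop); `stop_id`, `stop_lat`, and `stop_lon` are standard GTFS field names, and the 150 m threshold is an arbitrary choice.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    r = 6371000.0  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def candidate_transfers(stops, max_m=150.0):
    """Pairs of stop_ids of different modes that lie within max_m metres."""
    pairs = []
    for i, a in enumerate(stops):
        for b in stops[i + 1:]:
            close = haversine_m(a["stop_lat"], a["stop_lon"],
                                b["stop_lat"], b["stop_lon"]) <= max_m
            if a["mode"] != b["mode"] and close:
                pairs.append((a["stop_id"], b["stop_id"]))
    return pairs


# Invented stops: M1 and B1 are about 111 m apart, B2 is far away.
stops = [
    {"stop_id": "M1", "mode": "metro", "stop_lat": 48.850, "stop_lon": 2.350},
    {"stop_id": "B1", "mode": "bus", "stop_lat": 48.851, "stop_lon": 2.350},
    {"stop_id": "B2", "mode": "bus", "stop_lat": 48.900, "stop_lon": 2.350},
]
print(candidate_transfers(stops))  # [('M1', 'B1')]
```

The pairwise loop is quadratic in the number of stops, so a spatial index would likely be needed for the full network.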

Current State of Analysis: Working on visualization of weighted graphs, using the case of Paris metro maps, in AR on top of a Wall-sized display

Additional Material

11) UFO Sightings

I think we can learn about human psychology by observing the trends in this data, for example how people probably influence each other.

Domain: UFO

Intended Audience

Ufologists, probably, but I do not really know.

I do not think that they need particular skills.

Information on the Data

Data Details: This dataset contains over 80,000 reports of UFO sightings over the last century. There are two versions of this dataset: scrubbed and complete. The complete data includes entries where the location of the sighting was not found or blank (0.8146%) or where the time is erroneous or blank (8.0237%). Since the reports date back to the 20th century, some older data might be obscured. The data contains the city, state, time, description, and duration of each sighting.

Data Collection: The data was compiled from reports dating back to the 20th century.

Data Example: https://www.kaggle.com/NUFORC/ufo-sightings

Interesting Challenges and Questions About the Data

Challenges: There are no real challenges here; I just think that the idea of observing how people are influenced by others is interesting. I like the idea of finding trends in UFO sightings and trying to understand what they could be related to.

Questions: I do not really have any, but here are some inspirations given by the owner of the database:
- What areas of the country are most likely to have UFO sightings?
- Are there any trends in UFO sightings over time? Do they tend to be clustered or seasonal?
- Do clusters of UFO sightings correlate with landmarks, such as airports or government research centers?
- What are the most common UFO descriptions?

Current State of Analysis: I have never worked on the data.

Additional Material

https://www.kaggle.com/NUFORC/ufo-sightings

12) Smart home data sets

The data set is about an IoT smart home, where inhabitants' activities are observed through IoT sensors.

Domain: IoT smart home

Intended Audience

For everyone

Basic machine learning and JavaScript.

Information on the Data

Data Details: IoT data sets that represent inhabitants' daily living routines. The datasets provide the groundwork for identifying hidden patterns in inhabitants' daily routines. Later, machine learning approaches can be applied to IoT home automation to provide assisted living environments for elderly people.

Data Collection: http://casas.wsu.edu/datasets/

Data Example: http://casas.wsu.edu/datasets/

Interesting Challenges and Questions About the Data

Challenges: Visualization of random forest model to audit how a specific decision is made.
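To audit how a specific decision is made, a visualization needs, for each tree, the sequence of tests a sample passes through and the resulting vote. The sketch below illustrates that bookkeeping on a toy forest written as nested dictionaries; this representation, the feature names, and the labels are all hypothetical, and a real project would extract the same information from the trained model of whichever library is used.

```python
from collections import Counter


def trace(tree, sample):
    """Follow one tree for a sample; return (list of tests taken, predicted label)."""
    path = []
    node = tree
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        go_left = sample[feature] <= threshold
        path.append(f"{feature} {'<=' if go_left else '>'} {threshold}")
        node = node["left"] if go_left else node["right"]
    return path, node["label"]


def audit(forest, sample):
    """Return the majority-vote label and the per-tree decision traces."""
    traces = [trace(tree, sample) for tree in forest]
    votes = Counter(label for _, label in traces)
    return votes.most_common(1)[0][0], traces


# Toy forest with invented features and labels.
forest = [
    {"feature": "hour", "threshold": 9,
     "left": {"label": "breakfast"}, "right": {"label": "other"}},
    {"feature": "motion_kitchen", "threshold": 1,
     "left": {"label": "other"},
     "right": {"feature": "hour", "threshold": 10,
               "left": {"label": "breakfast"}, "right": {"label": "cooking"}}},
]
label, traces = audit(forest, {"hour": 7, "motion_kitchen": 3})
print(label)           # breakfast
for path, vote in traces:
    print(path, vote)
```

Each trace can then be drawn as a highlighted path in its tree, with the vote tally shown alongside.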

Questions: I want to visualize a random forest model.

Current State of Analysis:

Additional Material

13) KDD Cup 1998

This dataset is available at https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1998+Data

The data, its authors, and its donors are described at the link.

Description: PROJECT OVERVIEW: A Fund Raising Net Return Prediction Model

Domain: Donation/Marketing

Intended Audience

Space for creativity

General programming skills

Information on the Data

Data Details: The data characterizes a donation process conducted by mail with a special gift configuration. Based on this gift configuration, people donated a certain amount of dollars to the non-profit organization in charge of the process. The goal is to maximize the net income and identify a model to improve the cost effectiveness of future direct marketing efforts. https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1998+Data

Data Collection: https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1998+Data

Data Example: The data can be found at the link below; please select the file cup98lrn.zip: https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup98-mld/epsilon_mirror/

Interesting Challenges and Questions About the Data

Challenges: Handle mixed attributes, with target outputs in both binary and real-valued formats. The data has missing values and unbalanced outputs.

Questions: 1. Get an overview of the data in a convenient manner. 2. Unveil different patterns that are hidden in the data.

Current State of Analysis:

Additional Material

14) Intrachromosomal contact matrices

Domain is genomics.

Domain: genomics, molecular biology, DNA

Intended Audience

N/A

Some programming skill. Domain knowledge about biology would be a huge plus.

Information on the Data

Data Details: The data tells us about the interaction between different parts of a chromosome.

Data Collection: NCBI - GEO (Gene Expression Omnibus)

Data Example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525

Interesting Challenges and Questions About the Data

Challenges: Mapping areas of the matrix to their 3D positions.

Questions: Determine and show interesting regions of the matrix. Maybe use the matrix to navigate in 3D.

Current State of Analysis:

Additional Material

This is the file in the link above: GSE63525_GM12878_combined_intrachromosomal_contact_matrices.tar.gz