ETC5521 Tutorial 6
Exploring bivariate dependencies
🎯 Objectives
These are exercises in making scatterplots and their variations to examine the association between two variables, to explore association matrices and networks, and to conduct inference on associations.
🔧 Preparation
- The reading for this week is Wilke (2019) Ch 12 Visualizing associations.
- Complete the weekly quiz before the deadline!
- Install the following R packages if you do not have them already:
install.packages(c("nullabor", "tidygraph", "ggraph", "plotly"))
- Open your RStudio Project for this unit (the one you created in week 1, ETC5521). Create a .qmd document for this week's activities.
📥 Exercises
Exercise 1: Olympics
The lecture showed that the Athletics category contains too many different types of athletics to be a useful group for studying height and weight. Another variable, Event, contains more specific information.
# Read-in data
data(oly12, package = "VGAMdata")
- Tabulate Event for just the Sport category Athletics, and decide which new categories to create.
- Create the new categories in steps, creating a new binary variable for each. The function str_detect is useful for searching for text patterns in a string. It also helps to know about regular expressions when working with strings like this. Two sites are great for learning them: Regex puzzles, and Information and testing board.
- Make several plots to explore the association between height and weight for the different athletic categories, e.g. scatterplots faceted by sex and event type, with and without free scales; linear models for the different subsets, overlaid on the same plot; 2D density plots faceted by sex and event type, with free scales.
- List what you learned about body types across the different athletics types and sexes.
- If one were to use visual inference to check for a different relationship between height and weight across sports, how would you generate null data? Do it, and test your lineup with others in the class.
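The category and null-data steps above can be sketched on a toy data set. This is a minimal sketch, not the intended solution: the data frame `athletics`, its values, the category names `throw` and `sprint`, and the regular expressions are all assumptions made for illustration. Shuffling the category labels is one possible way to generate null data: if the relationship between height and weight were the same across event types, the labels would be exchangeable.

```r
library(stringr)

# Toy stand-in for the Athletics rows of oly12; all values are invented
athletics <- data.frame(
  Event  = c("Men's 100m", "Men's Shot Put", "Women's Marathon",
             "Women's 200m", "Men's Discus Throw", "Women's 10,000m"),
  Height = c(1.85, 1.93, 1.65, 1.70, 1.98, 1.62),
  Weight = c(80, 125, 50, 60, 118, 48)
)

# One binary variable per new category, each built from a text pattern.
# The word boundaries (\\b) stop "100m" from matching inside longer distances.
athletics$throw  <- str_detect(athletics$Event, "Put|Throw")
athletics$sprint <- str_detect(athletics$Event, "\\b(100|200)m\\b")

# Null data: shuffle the category labels while keeping each
# Height/Weight pair intact, so any label effect is broken
set.seed(2024)
null_data <- athletics
null_data$throw <- sample(athletics$throw)
```

With the nullabor package, `lineup(null_permute("throw"), athletics)` should embed the real data among permuted copies in the same way, ready for faceting by `.sample`.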
Exercise 2: Exploring associations
Download the bike.rda file from the geomnet software site. This network is a summary of the bike trips taken by customers of the bike sharing company Capital Bikeshare during the second quarter of 2015. Only trips between stations in the vicinity of Rockville, MD, are included. The data is organized as a list of two datasets, vertices (stations) and edges (trips between stations), as follows:
A list of two data frames:
- trips: the trips data set consists of four variables of length 53:
  - Start.station: Station where bike trip starts
  - End.station: Station where bike trip ends
  - n: Number of trips between the two stations
  - minlength: Duration of shortest trip between the two stations (in seconds). Only pairs of stations whose shortest trip lasted no more than 15 minutes are included.
- stations: the vertices data set consists of five variables with information on 21 stations:
  - id: Station ID number
  - name: Station name
  - lat: Latitude of station location
  - long: Longitude of station location
  - nbDocks: Number of bike docks at the station
Imagine you are the bike company, and you are interested in learning how to manage keeping bikes in places where people will use them.
a. Make some summaries of the bike station data.
b. Make some summaries of the trips data.
c. Make an interactive heatmap
The aim is to understand the number of trips from one station to another, that is, where bikes are typically rented from and where they are left. To do this, you need a complete association matrix, with an entry for every pair of start and end stations, including pairs with no recorded trips.
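One way to complete the association matrix can be sketched in base R. This is a minimal sketch under stated assumptions: the toy `trips` data frame below reuses the column names described above, but its station names and counts are invented, and filling missing pairs with zero is an assumption about what "no recorded trip" should mean.

```r
# Toy stand-in for the trips data; station names and counts are invented
trips <- data.frame(
  Start.station = c("A", "A", "B", "C"),
  End.station   = c("B", "C", "A", "C"),
  n             = c(5, 2, 7, 1)
)

# Use one shared set of levels so rows and columns cover every station
stations <- sort(union(trips$Start.station, trips$End.station))
trips$Start.station <- factor(trips$Start.station, levels = stations)
trips$End.station   <- factor(trips$End.station, levels = stations)

# xtabs() sums n within each start/end pair and gives unobserved pairs a 0
trip_matrix <- xtabs(n ~ Start.station + End.station, data = trips)

# Long form: one row per start/end pair, including the zero-trip pairs
trips_complete <- as.data.frame(trip_matrix, responseName = "n")
```

The long form suits a tile-based heatmap, while the matrix form can be passed to functions that expect a matrix of values.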
d. Represent the association as an interactive network
You can generate a layout based on the number of trips, or you can use the geographic location of the bike stations.
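The geographic layout option can be sketched with tidygraph and ggraph. This is a minimal, static sketch under stated assumptions: the toy `stations` and `trips` data frames and their values are invented, and the edge columns have been renamed to `from` and `to`, which is one of the forms `tbl_graph()` accepts.

```r
library(tidygraph)
library(ggraph)

# Toy stand-ins for the stations and trips data; all values are invented
stations <- data.frame(
  name = c("A", "B", "C"),
  lat  = c(39.08, 39.10, 39.09),
  long = c(-77.15, -77.14, -77.12)
)
trips <- data.frame(
  from = c("A", "A", "B"),
  to   = c("B", "C", "C"),
  n    = c(5, 2, 7)
)

# Character from/to values are matched against the node column named by node_key
bike_graph <- tbl_graph(nodes = stations, edges = trips,
                        directed = TRUE, node_key = "name")

# Manual layout: place each station at its longitude and latitude
p <- ggraph(bike_graph, layout = "manual", x = long, y = lat) +
  geom_edge_link(aes(edge_width = n), alpha = 0.5) +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), vjust = -1)
```

Swapping `layout = "manual"` for a force-directed layout such as `"fr"` would position stations by their connections instead of their coordinates. Making the plot interactive is a separate step that this sketch does not cover.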
👌 Finishing up
Make sure you say thanks and good-bye to your tutor. This is also a time to report what you enjoyed and what you found difficult.