ETC5521 Tutorial 6

Exploring bivariate dependencies

Author

Prof. Di Cook

🎯 Objectives

These are exercises in making scatterplots and variations to examine association between two variables, to explore association matrices and networks, and conduct inference on associations.

🔧 Preparation

The reading for this week is Wilke (2019) Ch 12 Visualizing associations. - Complete the weekly quiz, before the deadline! - Install the following R-packages if you do not have them already:

install.packages(c("nullabor", "tidygraph", "ggraph", "plotly"))

Open your RStudio Project for this unit, (the one you created in week 1, ETC5521). Create a .qmd document for this weeks activities.

📥 Exercises

Exercise 1: Olympics

We have seen from the lecture that the Athletics category has too many different types of athletics in it for it to be a useful group for studying height and weight. There is another variable called Event which contains more specific information.

# Read-in data
data(oly12, package = "VGAMdata")

Tabulate Event for just the Sport category Athletics, and decide which new categories to create.

Solution

World Athletics, the sport’s governing body, defines athletics in six disciplines: track and field, road running, race walking, cross country running, mountain running, and trail running wikipedia.

That’s not so helpful! Track and field should be two different groups. I suggest running (short, middle and long distance), throwing, jumping, walking, and Decathlon (men) or Heptathlon (women)

Create the new categories, in steps, creating a new binary variable for each. The function str_detect is useful for searching for text patterns in a string. It also helps to know about regular expressions to work with strings like this. And there are two sites, which are great for learning: Regex puzzles, Information and testing board

Solution

# Give each athlete an id as unique identifier to facilitate relational joins
oly12 <- oly12 %>%
  mutate(id = row_number(), .before = Name)

# For athletes with > 1 event, separate each event into a row
oly12_ath <- oly12 %>%
  filter(Sport == "Athletics") %>%
  separate_rows(Event, sep = ", ")

# Determine athlete types into 7 categories
oly12_ath <- oly12_ath %>%
  mutate(
    Ath_type = case_when(
      # 100m, 110m Hurdles, 200m, 400m, 400m hurdles, 800m, 4 x 100m relay, 4 x 400m relay,
      str_detect(Event, "[1248]00m|Hurdles") ~ "Short distance",
      # 1500m, 3000m Steeplechase, 5000m
      str_detect(Event, "1500m|5000m|Steeplechase") ~ "Middle distance",
      # 10,000m, Marathon
      str_detect(Event, ",000m|Marathon") ~ "Long distance",
      # 20km Race walk, Men's 50km Race walk
      str_detect(Event, "Walk") ~ "Walking",
      # discus throw, hammer throw, javelin throw, shot put,
      str_detect(Event, "Throw|Put") ~ "Throwing",
      # high jump, long jump, triple jump, pole vault
      str_detect(Event, "Jump|Pole Vault") ~ "Jumping",
      # decathlon (men) or heptathlon (women)
      str_detect(Event, "Decathlon|Heptathlon") ~ "Decathlon/Heptathlon"
    )
  )

# Remove rows with > 1 of the same athlete type
oly12_ath <- oly12_ath %>%
  select(-Event) %>%
  distinct()

# Add events back to each athlete
oly12_ath <- oly12_ath %>%
  left_join(
    select(.data = oly12, c(Event, id)),
    by = "id"
  )

Make several plots to explore the association between height and weight for the different athletic categories, eg scatterplots faceted by sex and event type, with/without free scales, linear models for the different subsets, overlaid on the same plot, 2D density plots faceted by sex and event type, with free scales.

Solution

library(colorspace)
ggplot(data = oly12_ath, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.4) +
  facet_grid(Sex ~ Ath_type)

ggplot(data = oly12_ath, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.4) +
  facet_grid(Sex ~ Ath_type, scales="free")

ggplot(oly12_ath, aes(x = Height, y = Weight, colour = Ath_type)) +
  geom_smooth(method = "lm", se = F) +
  scale_colour_discrete_qualitative() +
  facet_wrap(~Sex)

ggplot(data = oly12_ath, aes(x = Height, y = Weight)) +
  geom_density2d(alpha = 0.4) +
  facet_grid(Sex ~ Ath_type, scales="free")

It is important to separate the sexes. Faceting by sport and sex and examining the scatterplots of height and weight in each is appropriate. Making a plot of just the regression models, faceted by sex can be useful also. A density plot is not so useful here because there isn’t a lot of difference in variance between the groups.

List what you learned about body types across the different athletics types and sexes.

Solution

Here are some examples:

From the scatter plots we learn that
- there are some heavy runners, especially in the shorter distances
- the throwers are generally taller and much heavier.
- female walkers tend to be pretty small.
- long distance runners are light!
Decathlon/Heptathlon athletes are usually quite heavy; which makes sense as they have to be all-rounded
- if they’re too light, they may not do well in throwing events
- if they’re too heavy, they may not do well in running or jump events
The comparisons between groups is easier from the models.
- throwers are heavy!
- long/middle distance runners and walkers are relatively light.

If one were use visual inference to check for a different relationship between height and weight across sports how would you generate null data? Do it, and test your lineup with others in the class.

Solution

You would permute the Ath_type variable, keeping sex fixed. You could still facet by sex for the lineup, and use a smaller number of nulls so it’s not too over-whelming to read. Or you could separately make a full lineup of 20 for males and females separately.

library(nullabor)
set.seed(141)
ggplot(lineup(null_permute("Ath_type"),
              true=oly12_ath, n=6), 
       aes(x = Height, 
           y = Weight, 
           colour = Ath_type)) +
  geom_smooth(method = "lm", se = F, 
              fullrange=TRUE) +
  scale_colour_discrete_qualitative() +
  facet_grid(Sex~.sample) +
  theme(legend.position="none",
        axis.text = element_blank(),
        axis.title = element_blank())

If there was no difference between the event types, the the lines would all be a bit varied in slope but intersect in a single point. (Why?) The actual data has several differences: some events have the same slope but are shifted vertically, some have different slopes and are shifted. You don’t see this in the null plots.

Should you re-do the nulls by forcing the intercepts to be the same in the data sample? If so, how would you do this?

Exercise 2: Exploring associations

Download the bike.rda file from the geomnet software site. This network is a summary of the bike trips taken by customers of the bike sharing company Capital Bikeshare () during the second quarter of 2015. Only trips between stations in the vicinity of Rockville, MD, are included. The data is organized as a list of two datasets, vertices (stations) and edges (trips between stations), as follows:

A list of two data frames:

trips: the trips data set consists of four variables of length 53:
- Start.station: Station where bike trip starts
- End.station: Station where bike trip ends
- n: Number of trips between the two stations
- minlength: Duration of shortest trip between the two stations (in seconds). Only those stations are included, if the shortest trip between them lasted not more than 15 minutes.
stations: the vertices data set consists of five variables with information on 21 stations:
- id: Station ID number
- name: Station name
- lat: Latitude of station location
- long: Longitude of station location
- nbDocks: Number of bike docks at the station

Imagine you are the bike company, and you are interested in learning how to manage keeping bikes in places where people will use them.

a. Make some summaries of the bike station data.

b. Make some summary of the trips data.

c. Make an interactive heatmap

This is to understand the number of trips from one station to another. This will be examining where bikes typically are rented from and where they are left. You need to make sure you have a complete association matrix in order to do this.

d. Represent the association as an interactive network

You can generate a layout based on the number of trips, or you can use the geographic location of the bike stations.

Solution

Code

load(here::here("data/bikes.rda"))

ggplot(bikes$stations, 
       aes(x=long, y=lat)) +
  geom_point()

Code

ggplot(bikes$stations, 
       aes(x=fct_reorder(name, nbDocks), y=nbDocks)) +
  geom_col() +
  xlab("") +
  coord_flip()

Code

trips <- bikes$trips |> 
  tibble()

full_list <- expand_grid(Start.station = unique(trips$Start.station), 
                         End.station = unique(trips$End.station)) |> 
  left_join(trips,
            by = c("Start.station" = "Start.station",
                   "End.station" = "End.station")) |> 
  mutate(n = replace_na(n, 0),
         minlength = replace_na(n, 0))

bike_level <- full_list |> 
  group_by(Start.station) |> 
  summarise(n = sum(n)) |> 
  arrange(desc(n)) |> 
  pull(Start.station)

p <- full_list |> 
  mutate(Start.station = factor(Start.station, levels = bike_level),
         End.station = factor(End.station, levels = bike_level)) |> 
  ggplot( 
    aes(x = Start.station,
        y = End.station, fill = n)) +
  geom_tile() +
  scale_fill_continuous_sequential(palette = "YlGnBu") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "none")

ggplotly(p, width = 600, height = 600)

Code

stations <- bikes$stations |> 
  tibble() |> 
  select(-id)
  
# graph data structure
bikes_graph <- tbl_graph(nodes = stations, edges = trips)

# kk layout
bikes_graph |> 
  ggraph(layout = "kk") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point(aes(size = nbDocks)) +
  geom_node_label(aes(label = name), size = 1.5, repel = TRUE,
                  label.padding = 0.15) +
  theme(aspect.ratio=1,
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_rect(fill=NA, 
                                        colour="black"),
        legend.position = "none")

Code

# Spatial layout
bikes_graph |> 
  ggraph(x = stations$long, y = stations$lat) +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point(aes(size = nbDocks)) +
  geom_node_label(aes(label = name), size = 1.5, repel = TRUE,
                  label.padding = 0.15) +
  theme(aspect.ratio=1,
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_rect(fill=NA, 
                                        colour="black"),
        legend.position = "none")

Code

# interactive graph
interactive_graph <- bikes_graph |> 
  ggraph(x = stations$long, y = stations$lat) +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_point_interactive(aes(x = x, y = y,
                             size = nbDocks,
                             tooltip = name,
                             data_id = name)) +
  theme(aspect.ratio=1,
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_rect(fill=NA, 
                                        colour="black"),
        legend.position = "none")

girafe(ggobj = interactive_graph,
       options = list(
         opts_hover(css = "fill:lightblue;stroke:grey;stroke-width:0.5px"),
         opts_zoom(min = 0.5, max = 3)
       ))

Code

# p2 <- ggplot() +
#   geom_point(data=bikes_nodes, 
#              aes(x=x, y=y, label=name)) +
#   geom_segment(data=bikes_edges, 
#                aes(x=x, y=y, xend=xend, yend=yend),
#                linewidth = 0.3) +
#   scale_color_viridis_c() +
#   theme(aspect.ratio=1,
#         axis.text = element_blank(),
#         axis.title = element_blank(),
#         axis.ticks = element_blank(),
#         panel.background = element_rect(fill=NA, 
#                                         colour="black"),
#         legend.position = "none")
# ggplotly(p2, tooltip = "label", width=900, height=600)

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.