ETC5521: Diving Deeply into Data Exploration

Making comparisons between groups and strata

Professor Di Cook

Department of Econometrics and Business Statistics

At the heart of quantitative reasoning is a single question: Compared to what?


-Edward Tufte

Making comparisons

  • Groups defined by strata labelled in categorical variables
  • Observations in strata, same or different?
  • Is there a baseline, or normal value?
  • What are the dependencies in the way the data was collected?
  • Are multiple samples recorded for the same individual, or recorded on different individuals?

How would you answer these questions?

  • Are housing prices increasing more in Sydney or Melbourne?
  • Is the quoted price of the unit/apartment I might buy reasonable, or is it too high?
  • Are you more at risk of MPox in Australia or Germany?
  • Is the Alfred or Epworth hospital better for giving birth?
  • It’s hot and dry today, is the risk of bushfires too high to go hiking?

Comparing strata

Case study: Melbourne’s daily maximum temperature (1/2)

Melbourne’s daily maximum temperature from 1970 to 2020.

What are the strata in temporal data?

  • How are the temperatures different across months?
  • What about the temperature within a month?
df9 <- read_csv(here::here("data", "melb_temp.csv")) |>
  janitor::clean_names() |>
  rename(temp = maximum_temperature_degree_c) |>
  filter(!is.na(temp)) |>
  dplyr::select(year, month, day, temp)
skimr::skim(df9)
── Data Summary ────────────────────────
                           Values
Name                       df9   
Number of rows             18310 
Number of columns          4     
_______________________          
Column type frequency:           
  character                2     
  numeric                  2     
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────
  skim_variable n_missing complete_rate min max empty
1 month                 0             1   2   2     0
2 day                   0             1   2   2     0
  n_unique whitespace
1       12          0
2       31          0

── Variable type: numeric ──────────────────────────────────
  skim_variable n_missing complete_rate   mean    sd     p0
1 year                  0             1 1995.  14.5  1970  
2 temp                  0             1   19.9  6.48    5.7
     p25    p50    p75   p100 hist 
1 1983   1995   2008   2020   ▇▇▇▇▇
2   14.8   18.6   23.6   46.8 ▃▇▃▁▁
ggplot(df9, aes(x=month, y=temp)) +
  geom_violin(draw_quantiles=c(0.25, 0.5, 0.75), fill= "#56B4E9") +
  labs(x = "month", y = "max daily temp (°C)") +
  theme(aspect.ratio=0.5)

Case study: Melbourne’s daily maximum temperature (2/2)


Why can we make the comparison across months?

Because it is the same location, and same years, for each month subset.


Is some variation in temperature each month due to changing climate?

How would you check this?

Code
df9 |>
  group_by(year, month) |>
  summarise(temp = mean(temp)) |>
  ggplot(aes(x=year, y=temp)) +
  geom_point(alpha=0.5) +
  geom_smooth(se=F) + 
  facet_wrap(~month, ncol=4, scales="free_y") +
  scale_x_continuous("year", breaks=seq(1970, 2020, 20)) +
  ylab("max daily temp (°C)") +
  theme(aspect.ratio=0.7)

What is scales="free_y" for?

Case study: olive oils (1/4)

  • The olive oil data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils.
  • There are 9 collection areas, 4 from southern Italy (North and South Apulia, Calabria, Sicily), two from Sardinia (Inland and Coastal) and 3 from northern Italy (Umbria, East and West Liguria).
data(olives, package = "classifly")
df2 <- olives |>
  mutate(Region = factor(Region, labels = c("South", "Sardinia", "North")))

skimr::skim(df2)
── Data Summary ────────────────────────
                           Values
Name                       df2   
Number of rows             572   
Number of columns          10    
_______________________          
Column type frequency:           
  factor                   2     
  numeric                  8     
________________________         
Group variables            None  

── Variable type: factor ───────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 Region                0             1 FALSE          3
2 Area                  0             1 FALSE          9
  top_counts                         
1 Sou: 323, Nor: 151, Sar: 98        
2 Sou: 206, Inl: 65, Cal: 56, Umb: 51

── Variable type: numeric ──────────────────────────────────
  skim_variable n_missing complete_rate   mean    sd   p0
1 palmitic              0             1 1232.  169.   610
2 palmitoleic           0             1  126.   52.5   15
3 stearic               0             1  229.   36.7  152
4 oleic                 0             1 7312.  406.  6300
5 linoleic              0             1  981.  243.   448
6 linolenic             0             1   31.9  13.0    0
7 arachidic             0             1   58.1  22.0    0
8 eicosenoic            0             1   16.3  14.1    1
     p25   p50    p75 p100 hist 
1 1095   1201  1360   1753 ▁▂▇▆▁
2   87.8  110   169.   280 ▂▇▅▃▁
3  205    223   249    375 ▂▇▃▁▁
4 7000   7302. 7680   8410 ▁▇▇▇▁
5  771.  1030  1181.  1470 ▃▅▃▇▃
6   26     33    40.2   74 ▂▅▇▂▁
7   50     61    70    105 ▂▁▇▇▂
8    2     17    28     58 ▇▃▅▂▁
g1 <-
  df2 |>
  mutate(Area = fct_reorder(Area, palmitic)) |>
  ggplot(aes(Area, palmitic, color = Region)) +
  geom_boxplot() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  guides(color = FALSE, x = guide_axis(n.dodge = 2)) +
  theme(aspect.ratio=0.5)

g2 <- ggplot(df2, aes(Region, palmitic, color = Region)) +
  geom_boxplot() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  guides(color = FALSE) +
  theme(axis.text = element_blank())

g3 <- ggplot(df2, aes(palmitic, color = Region)) +
  geom_density() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  guides(color = FALSE) +
  theme(axis.text = element_blank())

g4 <- ggplot(df2, aes(palmitic, color = Region)) +
  stat_ecdf() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  guides(color = FALSE) +
  theme(axis.text = element_blank())

g5 <- g2 + g3 + g4 + plot_layout(ncol=3)
  
g1 + g5 + plot_layout(ncol=1, heights=c(2,1),
  guides = "collect")

Case study: olive oils (2/4)

Colour is generally good to differentiate strata but if there are too many categories then it becomes hard to compare.

ggplot(olives, aes(palmitoleic, palmitic, color = Area)) +
  geom_point() +
  scale_color_discrete_divergingx(palette="Zissou 1") 

Case study: olive oils (3/4)

It can be hard to compare across plots, because we need to remember what the previous pattern was when focusing on the new cell.

ggplot(olives, aes(palmitoleic, palmitic, color = Area)) +
  geom_point() +
  facet_wrap(~Area) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  guides(color = FALSE) 

Case study: olive oils (4/4)

Comparison to all, by putting a shadow of all the data underneath the subset in each cell.

ggplot(olives, aes(palmitoleic, palmitic)) +
  geom_point(data = dplyr::select(olives, -Area), color = "gray") +
  geom_point(aes(color = Area), size=2) +
  facet_wrap(~Area) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  guides(color = FALSE)

Strata from quantitative variable

The coplot divides the numerical variable into chunks, and facets by these. The chunks traditionally we overlapping.




Becker, Cleveland and Shyu, (1996); Cleveland (1993)

Code
# Sizing of figure is difficult, so save it
library(ggcleveland)
olives_sard <- df2 |>
  filter(Region == "Sardinia")
p <- gg_coplot(olives_sard,
  x=arachidic, y=oleic, 
  faceting = linoleic,
  number_bins = 6, 
  overlap = 1/4) +
  theme(aspect.ratio=0.5)

Famous example: trade

  • The export from England to the East Indies and the import to England from the East Indies in millions of pounds (A).
  • Import and export figures are easier to compare by plotting the difference like in (B).
  • Relative difference may be more of an interest: (C) plots the relative difference with respect to the average of export and import values.
  • The red area correspond to War of the Spanish Succession (1701-14), Seven Years’ War (1756-63) and the American Revolutionary War (1775-83).
data(EastIndiesTrade, package = "GDAdata")
skimr::skim(EastIndiesTrade)
── Data Summary ────────────────────────
                           Values         
Name                       EastIndiesTrade
Number of rows             81             
Number of columns          3              
_______________________                   
Column type frequency:                    
  numeric                  3              
________________________                  
Group variables            None           

── Variable type: numeric ──────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd   p0
1 Year                  0             1 1740   23.5 1700
2 Exports               0             1  518. 421.   100
3 Imports               0             1 1005. 320.   460
   p25  p50  p75 p100 hist 
1 1720 1740 1760 1780 ▇▇▇▇▇
2  145  370  840 1395 ▇▂▃▂▂
3  835  975 1000 1550 ▃▃▇▁▅
g1 <- ggplot(EastIndiesTrade, aes(Year, Exports)) +
  annotate("rect",
    xmin = 1701, xmax = 1714,
    ymin = -Inf, ymax = Inf,
    fill = "red", alpha = 0.3
  ) +
  annotate("rect",
    xmin = 1756, xmax = 1763,
    ymin = -Inf, ymax = Inf,
    fill = "red", alpha = 0.3
  ) +
  annotate("rect",
    xmin = 1775, xmax = 1780,
    ymin = -Inf, ymax = Inf,
    fill = "red", alpha = 0.3
  ) +
  geom_line(color = "#339933", size = 2) +
  geom_line(aes(Year, Imports), color = "red", size = 2) +
  geom_ribbon(aes(ymin = Exports, ymax = Imports), fill = "gray") +
  labs(y = "<span style='color:#339933'>Export</span>/<span style='color:red'>Import</span>", tag = "(A)") +
  theme(aspect.ratio=0.7, axis.title.y = ggtext::element_markdown())

g2 <- ggplot(EastIndiesTrade, aes(Year, Imports - Exports)) +
  annotate("rect",
    xmin = 1701, xmax = 1714,
    ymin = -Inf, ymax = Inf,
    fill = "red", alpha = 0.3
  ) +
  annotate("rect",
    xmin = 1756, xmax = 1763,
    ymin = -Inf, ymax = Inf,
    fill = "red", alpha = 0.3
  ) +
  annotate("rect",
    xmin = 1775, xmax = 1780,
    ymin = -Inf, ymax = Inf,
    fill = "red", alpha = 0.3
  ) +
  geom_line(size = 2) +
  labs(tag = "(B)") +
  theme(aspect.ratio=0.7)

g3 <- ggplot(EastIndiesTrade, aes(Year, (Imports - Exports) / (Exports + Imports) * 2)) +
  annotate("rect",
    xmin = 1701, xmax = 1714,
    ymin = -Inf, ymax = Inf,
    fill = "red", alpha = 0.3
  ) +
  annotate("rect",
    xmin = 1756, xmax = 1763,
    ymin = -Inf, ymax = Inf,
    fill = "red", alpha = 0.3
  ) +
  annotate("rect",
    xmin = 1775, xmax = 1780,
    ymin = -Inf, ymax = Inf,
    fill = "red", alpha = 0.3
  ) +
  geom_line(color = "#001a66", size = 2) +
  labs(y = "Relative difference", tag = "(C)") +
  theme(aspect.ratio=0.7)

g1 + g1 + g2 + g3 + plot_layout(ncol=2)

Paired samples

Pairing adjusts for individual differences

If we were wanting to measure the effect of incorporating a data analytics competition on student learning which is the best design?

METHOD A

  • Divide students into two groups. Make sure that each group has similar types of students, so both groups are as similar as possible.
  • One group gets an extra traditional assignment, and the other group participates in a data competition.
  • Each student takes an exam on the content being taught.
  • The scores are compared using side-by-side boxplots and a two-sample permutation test

METHOD B

  • Each student takes an exam on the content being taught. We’ll call this their BEFORE score.
  • Divide students into two groups. Make sure that each group has similar types of students, so both groups are as similar as possible.
  • One group gets an extra traditional assignment, and the other group participates in a data competition.
  • Each student takes an exam on the content being taught. We’ll call this their AFTER score.
  • The difference between the before and after scores are compared using side-by-side boxplots and a two-sample permutation test

What other modifications to the design can you think of?

Case study: choropleth vs hexagon tile (1/3)

The goal is to demonstrate that the hexagon tile map is better than the choropleth for communicating disease incidence across Australia.

The choropleth fills geographic regions (LGAs, SA2s, …) with colour corresponding to the thyroid cancer relative difference from the overall mean. The hexagons, are also filled this way.

Kobakian et al

Case study: choropleth vs hexagon tile (2/3)

  • Each participant can only see the (same) data once.
  • Need to test for different types of spatial patterns.
  • Need to repeat measure each type of pattern, and each participant.

Pairing is done on the data set. Four different data sets used for each pattern.

Trial 1

Participant 1

Participant 2

Trial 2

Participant 1

Participant 2

Case study: choropleth vs hexagon tile (3/3)

Ignore the pairing

Code
hstudy <- read_csv("https://raw.githubusercontent.com/srkobakian/experiment/master/data/DAT_HexmapPilotData_V1_20191115.csv")
hstudy |>
  filter(trend == "three cities") |>
  ggplot(aes(x=detect)) + geom_bar() + facet_wrap(~type, ncol=2)

Looks like detection rate about 50-50 for hexagon tile map, which is better than almost zero for choropleth map.

Account for the pairing

Code
hstudy |>
  filter(trend == "three cities") |>
  select(type, replicate, detect) |>
  group_by(type, replicate) |>
  summarise(pdetect = length(detect[detect == 1])/length(detect)) |>
  ggplot(aes(x=type, y=pdetect)) +
    geom_point() +
    geom_line(aes(group=replicate)) +
    ylim(c(0,1)) +
    xlab("") +
    ylab("Proportion detected")

For each data set, the hexagon tile map performed better.

Normalising

Am I short?

I am 165 cms tall.

Code
ggplot(df22, aes(x=height)) +
  geom_histogram(breaks = seq(132.5, 205, 5),
    colour="white") +
  geom_vline(xintercept = 165, colour="#D55E00",
    linewidth=2)

But, there are strata in humans, so compared to what would be better?

Code
ggplot(df22, aes(x=height)) +
  geom_histogram(breaks = seq(137.5, 205, 5),
    colour="white") +
  geom_vline(xintercept = 165, colour="#D55E00",
    linewidth=2) +
  facet_wrap(~sex, ncol=3, scales="free_y")

Nope, I’m average height.

Normalising

Within each strata convert values to a z-score.

\[ z = \frac{x-\bar{x}}{s} \]

df22 <- df22 |>
  group_by(sex) |>
  mutate(zscore = (height -
    mean(height))/sd(height))

  • females: \(\bar{x}=\) 164.43, \(s=\) 6.41
  • males: \(\bar{x}=\) 174.23, \(s=\) 8.71
  • unknown: \(\bar{x}=\) 165.74, \(s=\) 18.15

My z-score is 0.09.


Rob’s height is 170 cms. His z-score is -0.49.

I am relatively TALLER than Rob.

Baselines

Relative to a baseline

Code
data(anorexia, package="MASS")
ggplot(data=anorexia, 
 aes(x=Prewt, y=Postwt, 
    colour=Treat)) + 
 coord_equal() +
 xlim(c(70, 110)) + ylim(c(70, 110)) +
 xlab("Pre-treatment weight (lbs)") +  
 ylab("Post-treatment weight (lbs)") +
 geom_abline(intercept=0, slope=1,  
   colour="grey80", linewidth=1.25) + 
 geom_density2d() + 
 geom_point(size=3) +
 facet_grid(.~Treat) +
 theme(legend.position = "none")

  • Primary comparison is before treatment weight, and after treatment weight.
  • Three different treatments.

Unwin, Hofmann and Cook (2013)

Code
ggplot(data=anorexia, 
  aes(x=Prewt, colour=Treat,
    y=(Postwt-Prewt)/Prewt*100)) + 
  xlab("Pre-treatment weight (lbs)") +  
  ylab("Percent increase in weight") +
  geom_hline(yintercept=0, linewidth=1.25, 
    colour="grey80") + 
  geom_point(size=3) +   
  facet_grid(.~Treat) +
 theme(legend.position = "none")

  • Compute the difference
  • Compare difference relative to before weight
  • Before weight is used as the baseline

Proportions

Case study: tuberculosis (1/3)

Code
tb_oz |>
  ggplot(aes(x=year, y=count, fill=sex)) +
    geom_col(position="fill") +
    scale_fill_discrete_divergingx(palette = "Zissou 1") +
    facet_wrap(~age, ncol=6) +
    xlab("") + ylab("proportion")

Primary comparison is sex, relative to yearly trend.

Case study: tuberculosis (2/3)

Code
tb_oz |>
  ggplot(aes(x=year, y=count, fill=age)) +
    geom_col(position="fill") +
    scale_fill_discrete_divergingx(palette = "Zissou 1") +
    facet_wrap(~sex, ncol=2) +
    xlab("") + ylab("proportion")

Primary comparison is age, relative to yearly trend.

Case study: tuberculosis (3/3)

Code
tb_oz |>
  ggplot(aes(x=year, y=count, fill=age)) +
    geom_col() +
    scale_fill_discrete_divergingx(palette = "Zissou 1") +
    facet_grid(sex~age, scales="free") +
    xlab("") + ylab("count") +
    theme(legend.position = "none")

Primary comparison is year trend, separately for age and sex.

Inference

Bootstrap confidence intervals

  • Confidence intervals show what might happen to estimates with different samples,
  • with the same dependence structure.
  • Sample the current sample, but don’t change anything else.
  • Reason for sampling with replacement, is to keep sample size the same - we know that variance decreases with smaller sample size.


For choropleth vs hexagon tiles, sample participants with replacement.

hstudy |> filter(trend == "three cities") |> count(type, replicate)
# A tibble: 8 × 3
  type      replicate     n
  <chr>         <dbl> <int>
1 Geography         9    10
2 Geography        10    10
3 Geography        11    11
4 Geography        12    11
5 Hexagons          9    11
6 Hexagons         10    11
7 Hexagons         11    10
8 Hexagons         12    10
hstudy_sub <- hstudy |>
  filter(trend == "three cities") |>
  select(id, type, replicate, detect) 

# Function to compute proportions
prop_func <- function(df) {
  df_smry <- df |>
    group_by(type, replicate) |>
    summarise(pdetect = length(detect[detect == 1])/length(detect)) |> 
    ungroup() |>
    pivot_wider(names_from = c(type, replicate),
               values_from = pdetect)
  df_smry
}

nboots <- 100
set.seed(1023)
bsamps <- tibble(samp="0", prop_func(hstudy_sub))
for (i in 1:nboots) {
  samp_id <- sort(sample(unique(hstudy_sub$id),
    replace=TRUE))
  hs_b <- NULL
  for (j in samp_id) {
    x <- hstudy_sub |>
      filter(id == j)
    hs_b <- bind_rows(hs_b, x)
  }
  bsamps <- bind_rows(bsamps,
    tibble(samp=as.character(i), prop_func(hs_b)))
}

bsamps_long <- bsamps |>
  pivot_longer(Geography_9:Hexagons_12, 
    names_to = "treatments", 
    values_to = "pdetect") |> 
  separate(treatments, into=c("type", "replicate"))

ggplot() +
    geom_line(data=filter(bsamps_long, samp != "0"),
      aes(x=type, 
          y=pdetect, 
          group=samp),
     linewidth=0.5, alpha=0.6, colour="grey70") +
    geom_line(data=filter(bsamps_long, 
                     samp == "0"),
      aes(x=type, 
          y=pdetect, 
          group=samp),
     linewidth=2) +
    facet_wrap(~replicate) +
    ylim(c(0,1)) +
    xlab("") +
    ylab("Proportion detected")

Lineups

  • Lineups show what might happen to estimates with null samples, where (by construction) there is no relationship.
  • Thus you need to break dependence structure.


For choropleth vs hexagon tiles, randomise the type of plot each participant received. This breaks any dependence between type and detection rate.

Code
n_nulls <- 11
set.seed(1110)
lsamps <- tibble(samp="0", prop_func(hstudy_sub))
for (i in 1:n_nulls) {

  hs_b <- hstudy_sub |>
    group_by(id) |>
    mutate(type = sample(type)) |>
    ungroup()
  lsamps <- bind_rows(lsamps,
    tibble(samp=as.character(i), prop_func(hs_b)))
}

lsamps_long <- lsamps |>
  pivot_longer(Geography_9:Hexagons_12, 
    names_to = "treatments", 
    values_to = "pdetect") |> 
  separate(treatments, into=c("type", "replicate"))

lsamps_long |> ggplot() +
    geom_line(aes(x=type, 
          y=pdetect, 
          group=replicate)) +
    facet_wrap(~samp, ncol=4) +
    ylim(c(0,1)) +
    xlab("") +
    ylab("Proportion detected")

Take-aways

  • In comparison to what is especially important for exploring observational data.
  • Avoid reporting spurious associations by accounting for dependencies correctly.
  • Adjust for individual variation, by
    • pairing (or multiple repeated measures)
    • relative to a baseline
    • on a standard scale
  • Determining in comparison to what can be hard.

Resources