ETC5521: Diving Deeply into Data Exploration

Exploring bivariate dependencies

Professor Di Cook

Department of Econometrics and Business Statistics

The world is full of obvious things which nobody by any chance observes



Quote from Arthur Conan Doyle, The Hound of the Baskervilles

The story of the galloping horse

Galloping horses throughout history were drawn with all four legs out.

Baronet (1794); Derby d'Epsom (1821)



Read more in Lankester: The Problem of the Galloping Horse

The story of the galloping horse

With the birth of photography, and in particular motion photography, Muybridge was able to show that all four legs are never fully extended simultaneously.


Source: wikimedia


Source: wikimedia

An evolution in seeing the world (1/3)



Mrs Robinson says "Hills are more interesting than that. Usually you can see valleys and shadows."

So I put some shadows on.



Mrs Robinson looks horrified!

Go outside and take a look at the hills in the distance



An evolution in seeing the world (2/3)

Drawing lemons


Is something missing?



My sketch



Notice the yellow reflection(s) and shine on the skin?

An evolution in seeing the world (3/3)

Drawing trees



Does it look like a tree? What is not realistic?

Source: https://toppng.com

Tree foliage has lots of different colours


Philosophical reflection


You, me, we humans have a tendency to

  • only see what other people have done or said,
  • not what we can see for ourselves,
  • or impose beliefs, like "trees are green".


When you look at data, you might discover that there is a different story, or many different stories.

Try to see with fresh eyes

Outline

  • The humble but powerful scatterplot
  • Additions and variations
  • Transformations to linearity
  • (Robust) numerical measures of association
  • Simpson’s paradox
  • Making null samples to test for association
  • Imputing missings

The scatterplot

Scatterplots are the natural plot to make to explore association between two continuous (quantitative or numeric) variables.

They are not just for linear relationships but are useful for examining nonlinear patterns, clustering and outliers.

We also can think about scatterplots in terms of statistical distributions: if a histogram shows a marginal distribution, a scatterplot allows us to examine the bivariate distribution of a sample.

History

  • Descartes provided the Cartesian coordinate system in the 17th century, with perpendicular lines indicating two axes.
  • It wasn’t until 1832 that the scatterplot appeared, when John Frederick Herschel plotted position and time of double stars.
  • Kopf argues that "The scatter plot, by contrast, proved more useful for scientists", though it is clearly useful in economics today.

http://www.datavis.ca/milestones/

Language and terminology

Are the words “correlation” and “association” interchangeable?

In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related. Wikipedia



If the relationship is not linear, call it "association", and avoid "correlated".

Features of a pair of continuous variables (1/3)

Feature Example Description
positive trend Low value corresponds to low value, and high to high.
negative trend Low value corresponds to high value, and high to low.
no trend No relationship
strong Very little variation around the trend
moderate Variation around the trend is almost as much as the trend
weak A lot of variation making it hard to see any trend
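These trend and strength features can be mimicked by simulation. A minimal sketch (all variable names and noise levels are illustrative, not from the lecture data):

```r
library(tidyverse)
set.seed(42)
n <- 100
x <- runif(n)
d <- tibble(
  x = rep(x, 3),
  y = c(x + rnorm(n, sd = 0.05),   # strong positive trend: little variation around trend
        -x + rnorm(n, sd = 0.3),   # moderate negative trend: noise comparable to trend
        rnorm(n, sd = 0.5)),       # no trend: pure noise
  feature = rep(c("strong positive", "moderate negative", "no trend"), each = n)
)
ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~feature)
```

Adjusting the `sd` of the noise relative to the slope moves a panel between "strong", "moderate" and "weak".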

Features of a pair of continuous variables (2/3)

Feature Example Description
linear form The shape is linear
nonlinear form The shape is more of a curve
nonlinear form The shape is more of a curve
outliers There are one or more points that do not fit the pattern of the others
clusters The observations group into multiple clumps
gaps There is a gap, or gaps, but it's not clumped

Features of a pair of continuous variables (3/3)

Feature Example Description
barrier There is a combination of the variables which appears impossible
l-shape When one variable changes the other is approximately constant
discreteness One or both variables take only a few distinct values, so observations fall in a striped pattern
heteroskedastic Variation is different in different areas, maybe depends on value of x variable
weighted If observations have an associated weight, reflect in scatterplot, e.g. bubble chart

Additional considerations (Unwin, 2015):

  • causation: one variable has a direct influence on the other variable, in some way. For example, people who are taller tend to weigh more. The dependent variable is conventionally on the y axis. It’s not generally possible to tell from the plot that the relationship is causal, which typically needs to be argued from other sources of information.
  • association: variables may be related to one another, but through a different variable, e.g. ice cream sales are positively correlated with beach drownings; this is most likely a temperature relationship.
  • conditional relationships: the relationship between variables is conditionally dependent on another, such as income against age likely has a different relationship depending on retired or not.

Famous data examples

Famous scatterplot examples

Anscombe’s quartet

All four sets of Anscombe's quartet have the same means, standard deviations and correlation: \(\bar{x}\) = 9, \(\bar{y}\) = 7.5, \(s_x\) = 3.3, \(s_y\) = 2, \(r\) = 0.82.


Numerical statistics are the same, for very different association.
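Anscombe's data ships with base R as `anscombe`, so the matching summaries are easy to verify; each column of the result below corresponds to one of the four sets:

```r
# Compute the summary statistics for each (x_i, y_i) pair in Anscombe's quartet
data(anscombe)
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    r = cor(x, y))
}) |> round(2)
# all four columns agree to two decimal places
```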

Datasaurus dozen

Similarly, all 13 sets of the datasaurus dozen have the same means, standard deviations and correlation: \(\bar{x}\) = 54, \(\bar{y}\) = 48, \(s_x\) = 17, \(s_y\) = 27, \(r\) = -0.06.

Scatterplot case studies

Case study: Olympics

  • Note: Warning message: Removed 1346 rows containing missing values (geom_point)
  • Features:
    • linear relationship (expected, more than?)
    • outliers
    • discretization
  • Substantial overplotting, >10000 athletes.
  • What is interesting? Are there some sport(s) where you would expect specific relationships?
skimr::skim(oly12)
Data summary
Name oly12
Number of rows 10384
Number of columns 14
_______________________
Column type frequency:
Date 1
factor 6
numeric 7
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
DOB 6192 0.4 1947-06-01 1997-07-09 1986-09-11 2149

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Name 0 1 FALSE 10366 Lei: 3, Lin: 3, Ale: 2, Hao: 2
Country 0 1 FALSE 205 Gre: 523, Uni: 518, Rus: 414, Aus: 399
Sex 0 1 FALSE 2 M: 5756, F: 4628
PlaceOB 0 1 FALSE 4108 emp: 2690, Seo: 57, Bud: 54, Mos: 50
Sport 0 1 FALSE 42 Ath: 2119, Swi: 907, Foo: 596, Row: 524
Event 0 1 FALSE 763 Men: 336, Wom: 260, Wom: 210, Men: 206

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1.00 26.07 5.44 13.0 22.0 25.0 29.0 71.0 ▆▇▁▁▁
Height 561 0.95 1.77 0.11 1.3 1.7 1.8 1.9 2.2 ▁▃▇▃▁
Weight 1280 0.88 72.85 16.07 36.0 61.0 70.0 81.0 218.0 ▇▆▁▁▁
Gold 0 1.00 0.02 0.14 0.0 0.0 0.0 0.0 2.0 ▇▁▁▁▁
Silver 0 1.00 0.02 0.13 0.0 0.0 0.0 0.0 2.0 ▇▁▁▁▁
Bronze 0 1.00 0.02 0.14 0.0 0.0 0.0 0.0 2.0 ▇▁▁▁▁
Total 0 1.00 0.05 0.25 0.0 0.0 0.0 0.0 5.0 ▇▁▁▁▁
ggplot(oly12, aes(x = Height, y = Weight, label = Sport)) +
  geom_point()

🧩 Try this

Interactivity can be a useful tool for exploring relationships.

Cut and paste the code into your R console, and mouse over the resulting plot to examine the sport of each athlete.

library(tidyverse) 
library(plotly) 
data(oly12, package = "VGAMdata") 
p <- ggplot(oly12, aes(x = Height, y = Weight, label = Sport)) + 
  geom_point() 
ggplotly(p) 

How many athletes in the different sports?

Categories need re-working:

  • so many different events grouped into athletics
  • cycling split among many categories

Consolidate factor levels

There are several cycling events that are reasonable to combine into one category. Similarly for gymnastics and athletics.


oly12 <- oly12 |>
  mutate(Sport = as.character(Sport)) |>
  mutate(Sport = ifelse(grepl("Cycling", Sport), 
    "Cycling", Sport
  )) |> 
  mutate(Sport = ifelse(grepl("Gymnastics", Sport),
    "Gymnastics", Sport
  )) |>
  mutate(Sport = ifelse(grepl("Athletics", Sport),
    "Athletics", Sport
  )) |>
  mutate(Sport = as.factor(Sport))

Drill down by sport

What is interesting?

  • Some sports have no data for height, weight
  • The positive association between height and weight is visible across sports
  • Nonlinear in wrestling?
  • An outlier in judo, and football, and archery
  • Maybe flatter among swimmers
  • Taller in basketball, volleyball and handball
  • Shorter in athletics, weightlifting and wrestling
  • Little variance in tennis players
  • There’s a lot to digest

Refine plots to make comparisons easier:

  • Remove sports with missings
  • Overlay model to assess linear relationships
  • Drill down into other strata, eg male/female athletes
  • Compare in small chunks, one group against the rest
ggplot(oly12, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.5) + 
  facet_wrap(~Sport, ncol = 8) +
  theme(aspect.ratio = 1) 



Note: alpha transparency, and aspect ratio

Remove missings, explore difference by sex

Note: Because the focus is now on the shape of the male vs female association within each sport, let the facet scales vary independently.

  • Athletics category should have been broken into several more categories like track, field: a shot-putter has a very different physique to a sprinter.
  • Generally, clustering of male/female athletes
  • Outliers: a tall skinny male archer, a medium height very light female athletics athlete, tall light female weightlifter, tall light male volleyballer
  • Canoe slalom athletes, divers, cyclists are tiny.
oly12 |>
  filter(!(Sport %in% c("Boxing", "Gymnastics", "Synchronised Swimming", "Taekwondo", "Trampoline"))) |>
  mutate(Sport = fct_drop(Sport)) |>
  ggplot(aes(x = Height, y = Weight, colour = Sex)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~Sport, ncol = 7, scales = "free") +
  scale_colour_brewer("", palette = "Dark2") +
  theme(aspect.ratio = 1, axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Common ways to augment scatterplots

Modification Example Purpose
alpha-blend alleviate overplotting to examine density at centre
model overlay focus on the trend
model + data trend plus variation
density overall distribution, variation and clustering
filled density high density locations in distribution (modes), variation and clustering
colour relationship with conditioning and lurking variables

Comparing association



  • Weightlifters are much heavier relative to height
  • Swimmers are leaner relative to height
  • Tennis players are a bit mixed, shorter tend to be heavier, taller tend to be lighter
oly12 |>
  filter(Sport %in% c(
    "Swimming", "Archery", "Basketball",
    "Handball", "Hockey", "Tennis",
    "Weightlifting", "Wrestling"
  )) |>
  filter(Sex == "F") |>
  mutate(Sport = fct_drop(Sport), Sex = fct_drop(Sex)) |>
  ggplot(aes(x = Height, y = Weight, colour = Sport)) +
  geom_smooth(method = "lm", se = FALSE) + 
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  theme(
    legend.title = element_blank(),
    legend.position = "bottom",
    legend.direction = "horizontal"
  )

Comparing spread



  • Modern pentathlon athletes have a uniformly tight relationship between height and weight
  • Shooters are quite varied in body type
oly12 |>
  filter(Sport %in% c("Shooting", "Modern Pentathlon", "Basketball")) |> 
  filter(Sex == "F") |>
  mutate(Sport = fct_drop(Sport), Sex = fct_drop(Sex)) |>
  ggplot(aes(x = Height, y = Weight, colour = Sport)) +
  geom_density2d() + 
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  theme(
    legend.title = element_blank(),
    legend.position = "bottom",
    legend.direction = "horizontal"
  )

Case study: Olympics

We learned that the association between height and weight differs across strata defined by categorical variables: sport, sex, and possibly country and age, too.

Some of the association may be due to unmeasured variables, for example, “Athletics” is masking different body types in throwing vs running. This is a lurking variable.



If you were just given the Height and Weight in this data could you have detected the presence of conditional relationships?

It may appear as multimodality.

Can you see conditional dependencies?



There is barely a hint of multimodality.

It’s not easy to detect the presence of the additional variable, and thus accurately describe the relationship between height and weight among Olympic athletes.

p1 <- oly12 |>
  filter(Sport == "Athletics") |>
  ggplot(aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.2, size = 4) 
p2 <- oly12 |>
  filter(Sport == "Athletics") |>
  ggplot(aes(x = Height, y = Weight)) +
  geom_density2d_filled() +
  theme(legend.position = "none")
p3 <- oly12 |>
  filter(Sport == "Athletics") |>
  ggplot(aes(x = Height, y = Weight)) +
  geom_density2d(binwidth = 0.01) 
p4 <- oly12 |>
  filter(Sport == "Athletics") |>
  ggplot(aes(x = Height, y = Weight)) +
  geom_density2d(binwidth = 0.001, color = "white", size = 0.2) +
  geom_density2d_filled(binwidth = 0.001) +
  theme(legend.position = "none")
grid.arrange(p1, p3, p2, p4, ncol = 2)

Numerical measures of association

Correlation

  • Correlation between variables \(x_1\) and \(x_2\), with \(n\) observations in each.

\[r = \frac{\sum_{i=1}^n (x_{i1}-\bar{x}_1)(x_{i2}-\bar{x}_2)}{\sqrt{\sum_{i=1}^n(x_{i1}-\bar{x}_1)^2\sum_{i=1}^n(x_{i2}-\bar{x}_2)^2}} = \frac{\mbox{covariance}(x_1, x_2)}{s_{x_1}s_{x_2}}\]

  • Test for statistical significance, whether population correlation could be 0 based on observed \(r\), using a \(t_{n-2}\) distribution:

\[t=\frac{r}{\sqrt{1-r^2}}\sqrt{n-2}\]
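These two formulas can be checked by computing \(r\) and \(t\) directly and comparing to the built-ins (simulated data is used here, since `d1` is constructed elsewhere in the slides):

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)

# correlation from the definition
r <- sum((x1 - mean(x1)) * (x2 - mean(x2))) /
  sqrt(sum((x1 - mean(x1))^2) * sum((x2 - mean(x2))^2))

# t statistic for H0: population correlation is 0, on n - 2 df
t_stat <- r / sqrt(1 - r^2) * sqrt(n - 2)

# both match cor() and cor.test()
all.equal(r, cor(x1, x2))
all.equal(t_stat, unname(cor.test(x1, x2)$statistic))
```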

cor(d1$x, d1$y)
[1] 0.52
cor.test(d1$x, d1$y)

    Pearson's product-moment correlation

data:  d1$x and d1$y
t = 9, df = 198, p-value = 2e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.41 0.62
sample estimates:
 cor 
0.52 

Problems with correlation (1/2)

cor(d2$x, d2$y)
[1] -0.05
cor.test(d2$x, d2$y)

    Pearson's product-moment correlation

data:  d2$x and d2$y
t = -0.7, df = 198, p-value = 0.5
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.187  0.089
sample estimates:
  cor 
-0.05 


It does not summarise non-linear associations.

Problems with correlation (2/2)

All observations

$estimate
cor 
0.3 

$statistic
  t 
4.4 

$p.value
[1] 1.6e-05

Without outlier

$estimate
   cor 
-0.012 

$statistic
    t 
-0.17 

$p.value
[1] 0.87


It is affected by extreme values.

Perceiving correlation

Let’s play a game: Guess the correlation!


Generally, people don't do very well at this task. Typically people under-estimate \(r\) from scatterplots, particularly when it is around 0.4-0.7: the perceived variation in a scatterplot does not change linearly with \(r\).

When someone says the correlation is 0.5 it sounds impressive. BUT when someone shows you a scatterplot of data that has correlation 0.5, you will say that's a weak relationship.

set.seed(7777)
# Same eight panels as before, generated with a helper function
# instead of repeating the plotting block for each correlation
corr_plot <- function(r) {
  vc <- matrix(c(1, r, r, 1), ncol = 2, byrow = TRUE)
  d <- as_tibble(rmvnorm(500, sigma = vc))
  ggplot(d, aes(x = V1, y = V2)) +
    geom_point() +
    theme_void() +
    theme(
      aspect.ratio = 1,
      plot.background = element_rect(fill = "gray90")
    )
}
plots <- lapply(c(0, 0.4, 0.6, 0.8, -0.2, -0.5, -0.7, -0.9), corr_plot)
grid.arrange(grobs = plots, ncol = 4)

Robust correlation measures (1/2)

  • Spearman (based on ranks)
    • Sort each variable, and return rank (of actual value)
    • Compute correlation between ranks of each variable
set.seed(60)
df <- tibble(
  x = c(round(rnorm(5), 1), 10),
  y = c(round(rnorm(5), 1), 10)
) |>
  mutate(xr = rank(x), yr = rank(y))
df
# A tibble: 6 × 4
      x     y    xr    yr
  <dbl> <dbl> <dbl> <dbl>
1   0.7  -1.7     5     1
2   0.5   1.1     4     5
3  -0.6   0.3     2     3
4  -0.2  -0.9     3     2
5  -1.7   0.4     1     4
6  10    10       6     6
cor(df$x, df$y)
[1] 0.94
cor(df$xr, df$yr)
[1] 0.2
cor(df$x, df$y, method = "spearman")
[1] 0.2

Robust correlation measures (2/2)

  • Kendall \(\tau\) (based on comparing pairs of observations)
    • For all pairs of observations \((x_i, y_i), (x_j, y_j)\), count the pair as concordant if \(x_i < x_j, y_i < y_j\) or \(x_i > x_j, y_i > y_j\), and as discordant if \(x_i < x_j, y_i > y_j\) or \(x_i > x_j, y_i < y_j\).

\[\tau = \frac{n_c-n_d}{\frac12 n(n-1)}\]

cor(df$x, df$y)
[1] 0.94
cor(df$x, df$y, method = "kendall")
[1] 0.067
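The pair-counting definition can be implemented directly; on tie-free data this O(n²) sketch matches `cor(..., method = "kendall")` (function name is illustrative):

```r
# Kendall's tau from the definition: (n_c - n_d) / (n(n-1)/2)
kendall_tau <- function(x, y) {
  n <- length(x)
  nc <- 0
  nd <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
      if (s > 0) nc <- nc + 1  # concordant pair
      if (s < 0) nd <- nd + 1  # discordant pair
    }
  }
  (nc - nd) / (n * (n - 1) / 2)
}

set.seed(60)
x <- rnorm(10)
y <- rnorm(10)
all.equal(kendall_tau(x, y), cor(x, y, method = "kendall"))
```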

Comparison of correlation measures

sample corr spearman kendall
0.52 0.512 0.355
-0.05 -0.087 -0.073
0.30 -0.023 -0.014



The robust calculations correct the outlier problem, but none of these measures captures the nonlinear association.

Transformations

for skewness, heteroskedasticity and linearising relationships, and to emphasize association

Circle of transformations for linearising

Remember the power ladder:

-1, 0, 1/3, 1/2, 1, 2, 3, 4


  1. Look at the shape of the relationship.
  2. Imagine this to be a number plane and, depending on which quadrant the shape falls in, transform \(x\) or \(y\) up or down the ladder: (+,+) both up; (+,-) \(x\) up, \(y\) down; (-,-) both down; (-,+) \(x\) down, \(y\) up.


If there is heteroskedasticity, try transforming \(y\); it may or may not help.
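A quick sketch of moving down the ladder (simulated data, parameters illustrative): exponential growth is curved and heteroskedastic on the raw scale, and becomes linear after logging \(y\):

```r
library(tidyverse)
set.seed(1)
d <- tibble(
  x = runif(200, 0, 4),
  y = exp(x + rnorm(200, sd = 0.2))  # curved, fanning out as x grows
)
p1 <- ggplot(d, aes(x = x, y = y)) + geom_point()       # nonlinear, heteroskedastic
p2 <- ggplot(d, aes(x = x, y = log(y))) + geom_point()  # linear after transforming y

# the linear measure improves on the transformed scale
c(raw = cor(d$x, d$y), logged = cor(d$x, log(d$y)))
```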

Scatterplot case studies

Case study: Soils (1/4)

Interplay between skewness and association

Data is from a soil chemical analysis of a farm field in Iowa. Is there a relationship between Yield and Boron?


You can get a marginal plot of each variable added to the scatterplot using ggMarginal. This is useful for assessing the skewness in each variable.


Boron is right-skewed and Yield is left-skewed. With skewed marginal distributions it is hard to assess the relationship between the two variables. First, make a transformation to fix the skewness.

Case study: Soils (2/4)



p <- ggplot(
  baker,
  aes(x = B, y = Corn97BU^2)
) + 
  geom_point() +
  xlab("log Boron (ppm)") +
  ylab("Corn Yield^2 (bushels)") +
  scale_x_log10() 
ggMarginal(p, type = "density") 

Case study: Soils (3/4)


Lurking variable?



p <- ggplot(
  baker,
  aes(x = Fe, y = Corn97BU^2)
) +
  geom_density2d(colour = "orange") +
  geom_point() +
  xlab("Iron (ppm)") + 
  ylab("Corn Yield^2 (bushels)")
ggMarginal(p, type = "density")

Case study: Soils (4/4)

Colour high (>5200 ppm) calcium values

ggplot(baker, aes(
  x = Fe, y = Corn97BU^2,
  colour = ifelse(Ca > 5200, 
    "high", "low"
  )
)) + 
  geom_point() +
  xlab("Iron (ppm)") +
  ylab("Corn Yield^2 (bushels)") +
  scale_colour_brewer("", palette = "Dark2") +
  theme(
    aspect.ratio = 1,
    legend.position = "bottom",
    legend.direction = "horizontal"
  )

If calcium levels in the soil are high, yield is consistently high. If calcium levels are low, then there is a positive relationship between yield and iron, with higher iron leading to higher yields.

Case study: COVID-19




Bubble plots map the size of the point to another variable.

This bubble plot here shows total count of COVID-19 incidence (as of Aug 30, 2020) for every county in the USA, inspired by the New York Times coverage.

load(here("data/nyt_covid.rda"))
usa <- map_data("state")
ggplot() +
  geom_polygon(
    data = usa,
    aes(x = long, y = lat, group = group),
    fill = "grey90", colour = "white"
  ) +
  geom_point(
    data = nyt_county_total,
    aes(x = lon, y = lat, size = cases),
    colour = "red", shape = 1
  ) +
  geom_point(
    data = nyt_county_total,
    aes(x = lon, y = lat, size = cases),
    colour = "red", fill = "red", alpha = 0.1, shape = 16
  ) +
  scale_size("", range = c(1, 30)) +
  theme_map() +
  theme(legend.position = "none")

Scales matter





Where has COVID-19 hit the hardest?

Where are there more people?


This plot tells you NOTHING except where the population centres are in the USA.


To understand relative incidence/risk, report COVID numbers relative to the population, for example, the number of cases per 100,000 people.
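The standardisation is a one-line mutate; a sketch assuming a data frame with `cases` and `population` columns (the counties and numbers below are hypothetical):

```r
library(tidyverse)
counties <- tibble(
  county = c("A", "B"),
  cases = c(5000, 800),
  population = c(1000000, 20000)
)
# cases per 100,000 people
counties |>
  mutate(rate_per_100k = cases / population * 1e5)
# county B has fewer cases but a far higher rate: 4000 vs 500 per 100,000
```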

Beyond quantitative variables

When variables are not quantitative

What do you do if the variables are not continuous/quantitative?

Type of variable determines the appropriate mapping.

  • Continuous and categorical: side-by-side boxplots, side-by-side density plots
  • Both categorical: faceted bar charts, stacked bar charts, mosaic plots, double decker plots



Stay tuned!

Paradoxes

Simpson's paradox

When there is an additional variable which, if used for conditioning, changes the association between the original variables, you have a paradox.

Simpson’s paradox: famous example (1/2)


Did Berkeley discriminate against female applicants?

Example from Unwin (2015)

Simpson’s paradox: famous example (2/2)

Based on separately examining each department, there is no evidence of discrimination against female applicants.

Example from Unwin (2015)
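The Berkeley data ships with base R as `UCBAdmissions`, so the paradox can be reproduced directly:

```r
# Overall admission rate by gender, collapsing over departments
overall <- apply(UCBAdmissions, c(1, 2), sum)
round(overall["Admitted", ] / colSums(overall), 2)
# roughly 0.45 for males vs 0.30 for females: males appear favoured overall

# Admission rate by gender within each department
round(prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ], 2)
# within most departments the female rate is similar or higher:
# the aggregate difference is driven by which departments women applied to
```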

Always examine the associations in each strata

Is what you see really association?

Checking association with visual inference

ggplot(
  lineup(null_permute("Corn97BU"), baker, n = 12),
  aes(x = B, y = Corn97BU)
) +
  geom_point() +
  facet_wrap(~.sample, ncol = 4)

11 of the panels have had the association broken by permuting one variable. There is no association in these data sets, and hence plots. Does the data plot stand out as being different from the null (no association) plots?

data(oly12, package = "VGAMdata")
oly12_sub <- oly12 |>
  filter(Sport %in% c(
    "Swimming", "Archery",
    "Hockey", "Tennis"
  )) |>
  filter(Sex == "F") |>
  mutate(Sport = fct_drop(Sport), Sex = fct_drop(Sex))

ggplot(
  lineup(null_permute("Sport"), oly12_sub, n = 12),
  aes(x = Height, y = Weight, colour = Sport)
) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_brewer("", palette = "Dark2") +
  facet_wrap(~.sample, ncol = 4) +
  theme(legend.position = "none")

11 of the panels have had the association broken by permuting the Sport label. There is no difference in the association between weight and height across sports in these data sets, and hence plots. Does the data plot stand out as being different from the null (no association difference between sports) plots?

Handling and imputing missings (1/2)

Check if missings on one variable are related to distribution of the other variable.

Code
ggplot(oceanbuoys,
       aes(x = air_temp_c,
           y = humidity)) +
     geom_miss_point()

  • Missings plotted in the margins.
  • Missings on humidity only occur for lower values of air temperature.

Imputing missings, at least for humidity, requires using the air temperature values.

But the clustering is due to year

Code
ggplot(oceanbuoys,
       aes(x = air_temp_c,
           y = humidity)) +
     geom_miss_point() +
     facet_wrap(~year, ncol=2) +
     theme(legend.position = "none")

Handling and imputing missings (2/2)

Use the mean of complete cases to impute the missings

Code
ocean_imp_yr_mean <- oceanbuoys |>
  select(air_temp_c, humidity, year) |>
  bind_shadow() |>
  group_by(year) |>
  impute_mean_at(vars(air_temp_c, humidity)) |>
  ungroup() |>
  add_label_shadow()
  
ggplot(ocean_imp_yr_mean,
       aes(x = air_temp_c,
           y = humidity,
           colour = any_missing)) + 
  geom_miss_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  theme(legend.title = element_blank(),
        legend.position = "none")

Impute by sampling donor values from similar complete records (hot-deck imputation, via VIM::hotdeck), separately for each year.

Code
ocean_imp_yr_sim <- nabular(oceanbuoys) |>
  select(air_temp_c, humidity, year) |>
  bind_shadow() |>
  add_label_shadow()

# Need to operate on each subset
ocean_imp_yr_sim_93 <- ocean_imp_yr_sim |>
  filter(year == 1993)
ocean_imp_yr_sim_97 <- ocean_imp_yr_sim |>
  filter(year == 1997)

ocean_imp_yr_sim_93 <- VIM::hotdeck(ocean_imp_yr_sim_93) 
ocean_imp_yr_sim_97 <- VIM::hotdeck(ocean_imp_yr_sim_97) 
  
ocean_imp_yr_sim <- bind_rows(ocean_imp_yr_sim_93, ocean_imp_yr_sim_97)  

ggplot(ocean_imp_yr_sim,
       aes(x = air_temp_c,
           y = humidity,
           colour = any_missing)) + 
  geom_miss_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  theme(legend.title = element_blank(),
        legend.position = "none")

Resources

  • Unwin (2015) Graphical Data Analysis with R
  • Graphics using ggplot2
  • Wilke (2019) Fundamentals of Data Visualization https://clauswilke.com/dataviz/
  • Friendly and Denis “Milestones in History of Thematic Cartography, Statistical Graphics and Data Visualisation” available at http://www.datavis.ca/milestones/
  • Tierney et al (2023) Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.