ETC5521 Tutorial 5

Working with a single variable, making transformations, detecting outliers, using robust statistics

Author

Prof. Di Cook

🎯 Objectives

These are exercises in making plots of one variable to learn about distributions and data patterns, and in applying randomisation methods to check what we see.

🔧 Preparation

  • The reading for this week is Wilke (2019) Ch 6 Visualizing Amounts and Ch 7 Visualizing Distributions.
  • Complete the weekly quiz before the deadline!
  • Make sure you have this list of R packages installed:

install.packages(c("ggplot2movies", "bayesm",  "ggbeeswarm", "patchwork", "nullabor"))
  • Open your RStudio Project for this unit (the one you created in week 1, ETC5521). Create a .qmd document for this week’s activities.
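
The exercise code below assumes the tidyverse and the packages above are loaded; a minimal setup sketch for your .qmd:

library(tidyverse)       # dplyr, ggplot2, tidyr, forcats
library(ggplot2movies)   # movies data
library(bayesm)          # Scotch data
library(ggbeeswarm)      # geom_quasirandom()
library(patchwork)       # composing multiple plots
library(nullabor)        # lineup(), null_dist(), pvisual()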

📥 Exercises

Exercise 1: What are the common lengths of movies?

Load the movies dataset in the ggplot2movies package and answer the following questions based on it.

  1. How many observations are in the data?

There are 58788 observations.
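
A quick way to check this (assuming the setup above):

data(movies)
nrow(movies)  # 58788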

  2. Draw a histogram with an appropriate binwidth that shows the peaks at 7 minutes and 90 minutes. Draw another set of histograms to show whether these peaks existed both before and after 1980.
data(movies)
movies |>
  mutate(
    after1980 = ifelse(year > 1980,
      "After 1980",
      "1980 or before"
    ),
    copy = FALSE
  ) |>
  bind_rows(mutate(movies, copy = TRUE, after1980 = "All")) |>
  ggplot(aes(length)) +
  geom_histogram(binwidth = 1, fill = "yellow", color = "black") +
  scale_x_continuous(
    breaks = c(0, 7, 30, 60, 90, 120, 150, 180),
    limits = c(0, 180)
  ) +
  facet_grid(after1980 ~ .)

  3. The variable Short indicates whether the film was classified as a short film (1) or not (0). Draw plots to investigate what rule was used to define a film as “short” and whether the films have been consistently classified.
movies |>
  group_by(Short) |>
  summarise(x = list(summary(length))) |>
  unnest_wider(x)
# A tibble: 2 × 7
  Short Min.        `1st Qu.`   Median      Mean        `3rd Qu.`   Max.       
  <int> <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[1d]>
1     0 1           85          93          95          103         5220       
2     1 1            7          10          14           19          240       

The maximum length for a film classified as short is 240 minutes.
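
For reference, a sketch of an alternative summary that returns plain numeric columns rather than summary() tables:

movies |>
  group_by(Short) |>
  summarise(
    min = min(length),
    q1 = quantile(length, 0.25),
    median = median(length),
    mean = mean(length),
    q3 = quantile(length, 0.75),
    max = max(length)
  )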

movies |>
  mutate(short = factor(Short, labels = c("Long", "Short"))) |>
  ggplot(aes(y=length, x=short)) +
  geom_quasirandom() +
  scale_y_log10(
    limits = c(1, 240),
    breaks = c(1, 7, 10, 15, 20, 30, 45, 50, 70, 90, 110, 240)
  ) +
  labs(y = "") +
  coord_flip()

From the graph, the majority of films classified as short are under 50 minutes, while those classified as long tend to be longer than 50 minutes. There are clear cases of mislabelling, e.g. a one-minute film classified as “not short”.

On further detective work, the original source of the data defines a short film as: “Any theatrical film or made-for-video title with a running time of less than 45 minutes, i.e., 44 minutes or less, or any TV series or TV movie with a running time of less than 22 minutes, i.e. 21 minutes or less. (A ‘half-hour’ television program should not be listed as a Short.)” Given this is an objective measure based on the length of the film, any film that is 45 minutes or longer should be classified as long, and anything shorter as short.
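
As a rough check of how consistently the 45-minute rule was applied, one could tabulate the recorded classification against the rule (a sketch; the separate 22-minute rule for TV titles is ignored here):

movies |>
  mutate(rule_short = length < 45) |>   # TRUE if the 45-minute rule says "short"
  count(Short, rule_short)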

  4. How would you use the lineup protocol to determine if the periodic peaks could happen by chance? What would be the null hypothesis? Make your lineup. Does the data plot stand out? Compute the \(p\)-value if 5 out of 12 people picked the data plot as the most different one in the lineup. Comment on the results. (Note: It might be most useful to assess this only for movies between 50-150 minutes long.)
movies %>%
  filter(between(length, 50, 150)) %>%
  select(length) %>%
  # 9 plots in total: the data plot plus 8 nulls simulated from a normal
  # distribution with the sample mean and standard deviation
  lineup(null_dist("length", "norm"), true=., n=9) %>%
  ggplot(aes(x=length)) +
    geom_histogram(binwidth = 1) +
    scale_x_continuous(
      breaks = c(0, 7, 30, 60, 90, 120, 150, 180),
      limits = c(0, 180)
    ) +
  facet_wrap(~.sample, ncol=3, scales="free") +
  theme(axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank())

Simulate samples from a normal distribution using the sample mean and standard deviation as the parameters.

\(H_0\): The movie lengths are a sample from a normal distribution, i.e. there are no periodic peaks.

With 5 out of 12 observers picking the data plot, I would expect the \(p\)-value to be very small, close to 0, as the data plot is very clearly different from the nulls. This suggests that the multimodality would not be seen in samples from a normal distribution, and that the pattern is present because movie lengths are commonly cut to specific numbers of minutes.
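
Assuming nullabor is loaded, the \(p\)-value for 5 detections out of 12 observers, with 9 plots in the lineup, can be computed with pvisual():

pvisual(5, 12, m = 9)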

Exercise 2: What is the market for different brands of whisky?

The Scotch data set in the bayesm package was collated from a survey of scotch drinkers, recording the brands they consumed. Take a quick look at the data, and rearrange it to look like:

# A tibble: 21 × 2
   brand                      count
   <chr>                      <int>
 1 Chivas Regal                 806
 2 Dewar's White Label          517
 3 Johnnie Walker Black Label   502
 4 J&B                          458
 5 Johnnie Walker Red Label     424
 6 Other Brands                 414
 7 Glenlivet                    354
 8 Cutty Sark                   339
 9 Glenfiddich                  334
10 Pinch (Haig)                 117
11 Clan MacGregor               103
12 Ballantine                    99
13 Macallan                      95
14 Passport                      82
15 Black & White                 81
16 Scoresby Rare                 79
17 Grant's                       74
18 Ushers                        67
19 White Horse                   62
20 Knockando                     47
21 Singleton                     31
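
The code that builds scotch_consumption is not shown; one possible sketch, assuming Scotch has one 0/1 indicator column per brand (the brand labels would still need tidying to match the table above):

data(Scotch)
scotch_consumption <- Scotch |>
  pivot_longer(everything(), names_to = "brand", values_to = "consumed") |>
  filter(consumed == 1) |>
  count(brand, name = "count", sort = TRUE)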
  1. Produce a barplot of the number of respondents per brand. What ordering of the brands do you think is the best? What is interesting about the distribution of counts?
scotch_consumption |>
  mutate(
    brand = fct_reorder(brand, count)) |>
  ggplot(aes(x=count, y=brand)) +
    geom_col() +
    ylab("")

  2. There are 20 named brands and one category that is labelled as Other Brands. Produce a barplot that reduces the number of categories by selecting a criterion to lump certain brands into the Other Brands category.
scotch_consumption |>
  mutate(
    brand = ifelse(count > 200, brand, "Other Brands"),
    brand = fct_reorder(brand, count),
    brand = fct_relevel(brand, "Other Brands")
  ) |>
  ggplot(aes(count, brand)) +
  geom_col()

I’ve chosen the cut-off to be 200 as there is a gap in the counts between brands with more than 200 respondents and those with fewer. This reduces the display to 8 named brands plus Other Brands, which is more manageable for comparison.

  3. Think about what an uninteresting pattern might be for this data, and formulate an appropriate null hypothesis.

The least interesting pattern would probably be if all the bars had similar heights, meaning that all brands are consumed equally.

This leads to a null hypothesis of \(H_0: p_k = 1/K\) for all brands \(k = 1, \dots, K\), i.e. all brands have the same proportion of consumers.

  4. If you were to test whether this sample is consistent with a sample from a multinomial distribution, where all whiskeys are equally popular, how would you generate null samples? Make the lineup for testing this.

The following code might help:

# Subset the data, and anonymise brand names
scotch_consumption_sub <- scotch_consumption |>
    mutate(
    brand = ifelse(count > 200, brand, "Other Brands")
  ) |>
  filter(brand != "Other Brands") |>
  mutate(brand = as.character(factor(brand, labels=c(1:8)))) 

set.seed(220)
sim <- rmultinom(n=9,
   size=sum(scotch_consumption_sub$count),
   prob=rep(1/8, 8))
sim <- t(sim)
colnames(sim) <- as.character(c(1:8))
sim <- sim |>
  as_tibble() |>
  mutate(.sample=1:9)
sim <- sim |>
  pivot_longer(cols=`1`:`8`, names_to = "brand", values_to = "count")
scotch_lineup <- bind_rows(sim,
  bind_cols(.sample=10, scotch_consumption_sub))

# Randomise .sample  to hide data plot
scotch_lineup$.sample <- rep(sample(1:10, 10), rep(8, 10))
  
# Make the lineup
ggplot(scotch_lineup, aes(x=brand, y=count)) +
  geom_col() +
  facet_wrap(~.sample, scales="free", ncol=5) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())

We need to simulate samples from a multinomial distribution for the null sets.

If the data were a sample from a multinomial distribution with equal probabilities, it would mean that all of these brands are typically consumed in equal quantities.

\(H_0:\) The counts for the brands are consistent with a sample from a multinomial distribution with equal probabilities.

Ideally the bars are sorted from highest to lowest in each plot. This is tricky to do with facetting. The code below will do the sorting and re-draw the lineup.

# Order categories in all samples from highest to lowest: TRICKY
p <- vector("list", 10)
for (i in unique(scotch_lineup$.sample)) {
  d <- scotch_lineup |> 
    filter(.sample == i)
  d <- d |>
    mutate(brand = fct_reorder(brand, count))
  p[[i]] <- ggplot(d, aes(y=brand, x=count)) +
    geom_col() +
    ggtitle(i) +
    theme(axis.text = element_blank(),
        axis.title = element_blank())
}

p[[1]] + p[[2]] + p[[3]] + p[[4]] + p[[5]] +
  p[[6]] + p[[7]] + p[[8]] + p[[9]] + p[[10]] + 
  plot_layout(ncol=5)

  5. Suppose you show your lineup to five people who have not seen the data, and three of them report the data plot as the most different plot. Compute the \(p\)-value. What would these results tell you about the typical consumption of the different whiskey brands?
pvisual(3, 5, 10)
     x simulated  binom
[1,] 3     0.015 0.0086

People would likely report that they chose the plot because one bar is much taller than the others.

With a \(p\)-value of about 0.009, we reject the null hypothesis: the data are not consistent with a sample from a multinomial distribution with equal probabilities, at least in the sense that one brand is consumed much more frequently than the others.

  6. This analysis ignored structure in the data: survey participants could report consuming more than one brand. Have a discussion about what complications this might introduce for the analysis that we have just done. What might be an alternative way to compute the “counts” that takes these multiple responses into account? What else might we want to learn about survey participant responses?

By ignoring the number of responses per participant we have made the assumption that all responses are independent of each other.

One alternative way to compute the counts would be to convert each participant’s responses into fractions of their total responses. For example, participant 1,

Scotch[1,]
  Chivas.Regal Dewar.s.White.Label Johnnie.Walker.Black.Label J...B
1            1                   0                          0     0
  Johnnie.Walker.Red.Label Other.Brands Glenlivet Cutty.Sark Glenfiddich
1                        1            1         1          0           1
  Pinch..Haig. Clan.MacGregor Ballantine Macallan Passport Black...White
1            0              0          0        0        0             0
  Scoresby.Rare Grants Ushers White.Horse Knockando the.Singleton
1             0      0      0           0         0             0

reports consuming Chivas.Regal, Johnnie.Walker.Red.Label, Other.Brands, Glenlivet, and Glenfiddich. Each 1 would then be converted to 1/5.

The reason for re-analysing this way is to weight participants’ responses equally, so that a participant who reported many brands contributes the same total amount to the overall counts as a participant who reported very few.
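
A sketch of this re-weighting, assuming Scotch is the 0/1 indicator data frame (the names id, weight and weighted_count are just illustrative):

scotch_weighted <- Scotch |>
  mutate(id = row_number()) |>
  pivot_longer(-id, names_to = "brand", values_to = "consumed") |>
  group_by(id) |>
  mutate(weight = consumed / sum(consumed)) |>  # each respondent contributes 1 in total
  group_by(brand) |>
  summarise(weighted_count = sum(weight))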

Other questions that might be of interest are:

  • Do some respondents drink a variety of brands while others drink only a few?
  • Do some brands get consumed together more often than others?

You might come up with some other ideas!
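
As a starting point for the first question, one could tabulate how many brands each respondent reported (a sketch):

tibble(n_brands = rowSums(Scotch)) |>
  ggplot(aes(x = n_brands)) +
  geom_bar()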

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.