ETC5521 Tutorial 6

Working with a single variable, making transformations, detecting outliers, using robust statistics

Author

Prof. Di Cook

Published

26 August 2024

🎯 Objectives

These are exercises in making plots of one variable and what can be learned about the distributions and the data patterns and problems.

🔧 Preparation

The reading for this week is Wilke (2019) Ch 6 Visualizing Amounts; Ch 7 Visualizing distributions. - Complete the weekly quiz, before the deadline! - Make sure you have this list of R packages installed:

install.packages(c("ggplot2movies", "bayesm", "flexmix",  "ggbeeswarm", "mixtools", "lvplot", "patchwork", "nullabor"))

Open your RStudio Project for this unit, (the one you created in week 1, ETC5521). Create a .qmd document for this weeks activities.

📥 Exercises

Exercise 1: Understanding the velocity of galaxies

Load the galaxies data in the MASS package and answer the following questions based on this dataset.

data(galaxies, package = "MASS")

You can access documentation of the data (if available) using the help function specifying the package name in the argument.

help(galaxies, package = "MASS")

What does the data contain? And what is the data source?

Based on the description in the R Help for the data, what would be an appropriate null distribution of this data?

How many observations are there?

If the data is multimodal, which of the following displays do you think would be the best? Which would not be at all useful?

histogram
boxplot
density plot
violin plot
jittered dot plot
letter value plot

Make these plots for the data. Experiment with different binwidths for the histogram and different bandwiths for the density plot. Were you right in your thinking about which would be the best?

Fit your best mixture model to the data, and simulate 19 nulls to make a lineup. Did you do a good job in matching the distribution, ie does the data plot stand out or not? (Extra bonus: What is the model that you have created? Can you make a plot to show how it looks relative to the observed data?)

This code might be helpful to get you started. This code generates a jittered dotplot, but you can use your preferred type from part e.

# Fit a mixture model
library(mixtools)
galaxies_fit <- normalmixEM(galaxies, k=3)

set.seed(1138)
galaxies_sim1 <- rnormmix(n=length(galaxies), 
              lambda=galaxies_fit$lambda, 
              mu=galaxies_fit$mu,
              sigma=galaxies_fit$sigma)

# Plot your data
ggplot(tibble(galaxies_sim1), aes(x=galaxies_sim1)) +
  geom_quasirandom(aes(x=1, y=galaxies_sim1)) + 
  coord_flip() +
  theme(
    aspect.ratio = 0.7,
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )

# Generate null plots and make a lineup
galaxies_null <- tibble(.sample=1, galaxies=galaxies_sim1)
for (i in 2:19) {
  gsim <- rnormmix(n=length(galaxies), 
              lambda=galaxies_fit$lambda, 
              mu=galaxies_fit$mu,
              sigma=galaxies_fit$sigma)
  galaxies_null <- bind_rows(galaxies_null,
                             tibble(.sample=i, galaxies=gsim))
}
galaxies_null <- bind_rows(galaxies_null,
                             tibble(.sample=20,
                                    galaxies=galaxies))
# Randomise .sample  to hide data plot
galaxies_null$.sample <- rep(sample(1:20, 20), rep(82, 20))
ggplot(tibble(galaxies_null), aes(x=galaxies)) +
  geom_quasirandom(aes(x=1, y=galaxies)) + 
  facet_wrap(~.sample, ncol=5) +
  coord_flip() +
  theme(
    aspect.ratio = 0.7,
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )

Exercise 2: What are the common lengths of movies?

Load the movies dataset in the ggplot2movies package and answer the following questions based on it.

How many observations are in the data?

Draw a histogram with an appropriate binwidth that shows the peaks at 7 minutes and 90 minutes. Draw another set of histograms to show whether these peaks existed both before and after 1980.

The variable Short indicates whether the film was classified as a short film (1) or not (0). Draw plots to investigate what rules was used to define a film as “short” and whether the films have been consistently classified.

How would you use the lineup protocol to determine if the periodic peaks could happen by chance? What would be the null hypothesis? Make your lineup. Does the data plot stand out? Compute the \(p\)-value, if 5 out of 12 people picked the data plot as the most different one in the lineup. Comment on the results. (Note: It might be most useful to assess this only for movies between 50-150 minutes long.)

Exercise 3: What is the market for different brands of whisky?

The Scotch data set in bayesm package was collated from a survey on scotch drinkers, recording the brand they consumed. Take a quick look at the data, and rearrange it to look like:

# A tibble: 21 × 2
   brand                      count
   <chr>                      <int>
 1 Ballantine                    99
 2 Black & White                 81
 3 Chivas Regal                 806
 4 Clan MacGregor               103
 5 Cutty Sark                   339
 6 Dewar's White Label          517
 7 Glenfiddich                  334
 8 Glenlivet                    354
 9 Grant's                       74
10 J&B                          458
11 Johnnie Walker Black Label   502
12 Johnnie Walker Red Label     424
13 Knockando                     47
14 Macallan                      95
15 Other Brands                 414
16 Passport                      82
17 Pinch (Haig)                 117
18 Scoresby Rare                 79
19 Ushers                        67
20 White Horse                   62
21 Singleton                     31

Produce a barplot of the number of respondents per brand. What ordering of the brands do you think is the best? What is interesting about the distribution of counts?

There are 20 named brands and one category that is labelled as Other.brands. Produce a barplot that you think best reduces the number of categories by selecting a criteria to lump certain brands to the Other category.

Think about what a not interesting pattern might be for this data, and formulate an appropriate null hypothesis.

If you were to test whether this sample were consistent with a sample from a multinomial distribution, where all whiskeys were equally popular, how would to generate null samples? Make the lineup for testing this.

The following code might help:

# Subset the data, and anonymmise brand name
scotch_consumption_sub <- scotch_consumption |>
    mutate(
    brand = ifelse(count > 200, brand, "Other Brands")
  ) |>
  filter(brand != "Other Brands") |>
  mutate(brand = as.character(factor(brand, labels=c(1:8)))) 

set.seed(220)
sim <- rmultinom(n=9,
   size=sum(scotch_consumption_sub$count),
   prob=rep(1/8, 8))
sim <- t(sim)
colnames(sim) <- as.character(c(1:8))
sim <- sim |>
  as_tibble() |>
  mutate(.sample=1:9)
sim <- sim |>
  pivot_longer(cols=`1`:`8`, names_to = "brand", values_to = "count")
scotch_lineup <- bind_rows(sim,
  bind_cols(.sample=10, scotch_consumption_sub))

# Randomise .sample  to hide data plot
scotch_lineup$.sample <- rep(sample(1:10, 10), rep(8, 10))
  
# Make the lineup
ggplot(scotch_lineup, aes(x=brand, y=count)) +
  geom_col() +
  facet_wrap(~.sample, scales="free", ncol=5) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())

Suppose you show your lineup to five of people who have not seen the data, and three of them report the data plot as the most different plot. Compute the \(p\)-value. What would these results tell you about the typical consumption of the different whiskey brands?

This analysis ignored structure in the data, that survey participants could report consuming more than one brand. Have a discussion about what complications this might introduce for the analysis that we have just done. What might be an alternative way to compute the “counts” that takes this multiple responses into account? What else might we want to learn about survey participant responses?

Exercise 4: What is the best transformation to make?

For each of the variables in the data, which-transform.csv, decide on an appropriate transformation to make the distribution more symmetric for five of the variables and remove discreteness on one variable.

Your tutor has suggestions.

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.