ETC5521 Tutorial 5
Working with a single variable, making transformations, detecting outliers, using robust statistics
🎯 Objectives
These exercises practise making plots of one variable, learning what they reveal about distributions and data patterns, and applying randomisation methods to check what we see.
🔧 Preparation
- The reading for this week is Wilke (2019) Ch 6 Visualizing Amounts; Ch 7 Visualizing distributions.
- Complete the weekly quiz, before the deadline!
- Make sure you have this list of R packages installed:
install.packages(c("ggplot2movies", "bayesm", "ggbeeswarm", "patchwork", "nullabor"))
- Open your RStudio Project for this unit (the one you created in week 1, ETC5521). Create a .qmd document for this week's activities.
📥 Exercises
Exercise 1: What are the common lengths of movies?
Load the movies dataset in the ggplot2movies package and answer the following questions based on it.
- How many observations are in the data?
- Draw a histogram with an appropriate binwidth that shows the peaks at 7 minutes and 90 minutes. Draw another set of histograms to show whether these peaks existed both before and after 1980.
- The variable Short indicates whether the film was classified as a short film (1) or not (0). Draw plots to investigate what rule was used to define a film as “short” and whether the films have been consistently classified.
- How would you use the lineup protocol to determine if the periodic peaks could happen by chance? What would be the null hypothesis? Make your lineup. Does the data plot stand out? Compute the \(p\)-value, if 5 out of 12 people picked the data plot as the most different one in the lineup. Comment on the results. (Note: It might be most useful to assess this only for movies between 50-150 minutes long.)
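A minimal sketch of the histogram steps above, assuming the ggplot2 and ggplot2movies packages are installed. The 180-minute cut-off, the one-minute binwidth and the `era` column are illustrative choices, not part of the exercise, so try other values before settling on an answer.

```r
library(ggplot2)
library(ggplot2movies)  # provides the movies data frame

# Number of observations (rows) in the data
n_obs <- nrow(movies)

# Assumption: drop very long films so the main body of the distribution is visible
movies_sub <- subset(movies, length <= 180)

# Hypothetical helper column splitting films at 1980
movies_sub$era <- ifelse(movies_sub$year < 1980, "Before 1980", "1980 onwards")

# One-minute bins keep narrow peaks from being averaged away
p_all <- ggplot(movies_sub, aes(x = length)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Length (minutes)", y = "Number of films")

# The same histogram for each era, with free y scales so shapes can be compared
p_era <- p_all + facet_wrap(~era, ncol = 1, scales = "free_y")

print(p_all)
print(p_era)
```

Replacing `~era` with `~Short` in the facet gives one possible starting point for the question about how short films were classified.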
Exercise 2: What is the market for different brands of whisky?
The Scotch data set in the bayesm package was collated from a survey of scotch drinkers, recording the brands they consumed. Take a quick look at the data, and rearrange it to look like:
# A tibble: 21 × 2
   brand                      count
   <chr>                      <int>
 1 Chivas Regal                 806
 2 Dewar's White Label          517
 3 Johnnie Walker Black Label   502
 4 J&B                          458
 5 Johnnie Walker Red Label     424
 6 Other Brands                 414
 7 Glenlivet                    354
 8 Cutty Sark                   339
 9 Glenfiddich                  334
10 Pinch (Haig)                 117
11 Clan MacGregor               103
12 Ballantine                    99
13 Macallan                      95
14 Passport                      82
15 Black & White                 81
16 Scoresby Rare                 79
17 Grant's                       74
18 Ushers                        67
19 White Horse                   62
20 Knockando                     47
21 Singleton                     31
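One way to get to this shape, sketched on a small made-up indicator table rather than the real data. It assumes one row per respondent and one 0/1 column per brand; the `toy` data frame and its values are invented for illustration, and the column names in the real data will differ.

```r
library(dplyr)
library(tidyr)

# Made-up survey: one row per respondent, 1 = consumed the brand, 0 = did not
toy <- tibble(
  `Brand A` = c(1, 0, 1, 1),
  `Brand B` = c(0, 0, 1, 0),
  `Brand C` = c(1, 1, 0, 0)
)

toy_counts <- toy |>
  # Stack the brand columns into brand/indicator pairs
  pivot_longer(everything(), names_to = "brand", values_to = "consumed") |>
  # Count the respondents who reported each brand
  group_by(brand) |>
  summarise(count = as.integer(sum(consumed))) |>
  # Most reported brand first
  arrange(desc(count))

toy_counts
```

Because a respondent can report several brands, the counts in a table built this way can sum to more than the number of respondents.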
- Produce a barplot of the number of respondents per brand. What ordering of the brands do you think is the best? What is interesting about the distribution of counts?
- There are 20 named brands and one category that is labelled as Other Brands. Produce a barplot that reduces the number of categories by selecting a criterion to lump certain brands into the Other Brands category.
- Think about what an uninteresting pattern might be for this data, and formulate an appropriate null hypothesis.
- If you were to test whether this sample were consistent with a sample from a multinomial distribution, where all whiskies were equally popular, how would you generate null samples? Make the lineup for testing this.
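For the two barplot questions in the list above, a hedged sketch of ordering bars by count and lumping small brands together. It uses a five-row excerpt of the brand/count table; the threshold of 100 is an arbitrary illustrative criterion, and `scotch_excerpt`, `lumped` and `p_bar` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# A few rows copied from the brand/count table
scotch_excerpt <- tibble(
  brand = c("Chivas Regal", "Other Brands", "Glenlivet", "Macallan", "Singleton"),
  count = c(806L, 414L, 354L, 95L, 31L)
)

# Assumed criterion: brands with fewer than 100 respondents join "Other Brands"
lumped <- scotch_excerpt |>
  mutate(brand = ifelse(count < 100, "Other Brands", brand)) |>
  group_by(brand) |>
  summarise(count = sum(count))

# Order the bars by count so the ranking can be read directly
p_bar <- ggplot(lumped, aes(x = count, y = reorder(brand, count))) +
  geom_col() +
  labs(x = "Number of respondents", y = NULL)

print(p_bar)
```

Other criteria, such as keeping a fixed number of top brands, are equally defensible; the point is to state the rule you chose.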
The following code might help:
# Assumes scotch_consumption is the brand/count table shown above
library(dplyr)
library(tidyr)
library(ggplot2)
# Subset the data, and anonymise brand name
scotch_consumption_sub <- scotch_consumption |>
  mutate(
    brand = ifelse(count > 200, brand, "Other Brands")
  ) |>
  filter(brand != "Other Brands") |>
  mutate(brand = as.character(factor(brand, labels=c(1:8))))
# Simulate 9 null samples where all 8 brands are equally popular
set.seed(220)
sim <- rmultinom(n=9,
  size=sum(scotch_consumption_sub$count),
  prob=rep(1/8, 8))
sim <- t(sim)
colnames(sim) <- as.character(c(1:8))
sim <- sim |>
  as_tibble() |>
  mutate(.sample=1:9)
sim <- sim |>
  pivot_longer(cols=`1`:`8`, names_to = "brand", values_to = "count")
# Add the observed data as the 10th sample
scotch_lineup <- bind_rows(sim,
  bind_cols(.sample=10, scotch_consumption_sub))
# Randomise .sample to hide data plot
scotch_lineup$.sample <- rep(sample(1:10, 10), rep(8, 10))
# Make the lineup
ggplot(scotch_lineup, aes(x=brand, y=count)) +
  geom_col() +
  facet_wrap(~.sample, scales="free", ncol=5) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())
- Suppose you show your lineup to five people who have not seen the data, and three of them report the data plot as the most different plot. Compute the \(p\)-value. What would these results tell you about the typical consumption of the different whisky brands?
- This analysis ignored structure in the data: survey participants could report consuming more than one brand. Discuss what complications this might introduce for the analysis that we have just done. What might be an alternative way to compute the “counts” that takes these multiple responses into account? What else might we want to learn about survey participant responses?
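The \(p\)-value asked for in the lineup question can be approximated with a simple binomial calculation. This is a sketch under stated assumptions: each of the K observers picks independently, and under the null each of the m plots is equally likely to be chosen. `lineup_pvalue` is a hypothetical helper name, and m = 10 matches the lineup built by the code above (9 null plots plus the data plot).

```r
# Probability that at least x of K independent observers pick the data plot
# when each of the m plots is equally likely to be chosen (the null)
lineup_pvalue <- function(x, K, m) {
  # P(X >= x) for X ~ Binomial(K, 1/m)
  pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)
}

# Three of five observers picked the data plot in a lineup of 10 plots
p_scotch <- lineup_pvalue(x = 3, K = 5, m = 10)
p_scotch
```

The same helper applies to the 5-of-12 result in Exercise 1 once the number of plots in that lineup is fixed. The independence assumption is a simplification, since all observers look at the same lineup.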
👌 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.