ETC5521 Tutorial 5

Statistical inference for exploratory methods

Author

Prof. Di Cook

Published

19 August 2024

🎯 Objectives

Refresh thinking about statistical inference.
Learn to apply inference for data plots.

🔧 Preparation

The reading for this week is Wickham et al. (2010) Graphical inference for Infovis. It is a basic introduction to inference for exploratory data analysis, especially for data visualisation. - Complete the weekly quiz, before the deadline! - Make sure you have this list of R packages installed:

install.packages(c("tidyverse", "datarium", "broom", "nullabor"))

Open your RStudio Project for this unit, (the one you created in week 1, ETC5521). Create a .qmd document for this weeks activities.

📥 Exercises

Exercise 1: Skittles experiment

Skittles come in five colors (orange, yellow, red, purple, green) each with their own flavours (orange, lemon, strawberry, grape, green apple). Data was collected by Dr Nick Tierney to explore whether a sample of 3 people could identify the flavour of skittles while blindfolded. You can find the cleaned tidy data here.

How many skittles did each person taste?

A person with loss of taste is called ageusia and a person who has a loss of smell is called anosmia. The loss of taste and loss of smell will not allow you to distinguish flavours in food. What is the probability that a person with ageusia and anosmia will guess the skittle flavour correctly (out of the five flavours) for one skittle?

What is the probability that a person with ageusia and anosmia will guess the skittle flavour correctly for 2 out of 10 skittles, assuming the order of taste does not matter?

Test the null hypothesis that people cannot distinguish the flavours correctly, against the alternative that they can. Assume that the order of tasting does not matter and each person has the same ability to correctly identify the flavours. In conducting your test, define your null and alternate hypothesis, in statistical notation, your assumptions, the test statistics and calculate the $p$-value.

In part (d) we disregarded the order of the tasting and the possible variability in people’s ability to correctly identify the flavour. If in fact these do matter, then how would you construct the test statistic? Is it easy?

Consider the plot below that shows in each tile whether a person guessed correctly by order of their tasting. Suppose that under the null hypothesis, the order of tasting does not matter and people have no ability to distinguish the flavours. Generate a null plot under this null hypothesis.

Based on (f), construct a lineup (using nullabor or otherwise) of 20 plots. Ask your classmate, which plot looks different.

Suppose that you have a response from 100 people based on your line-up from (g) and 76 correctly identified the data plot. What is the $p$-value from this visual inference?

Now consider the plot below. Use the same null data in (g) to construct a lineup based on below visual statistic. Suppose we had 28 people out of 100 who correctly identified the data plot in this lineup. What is the difference in power of visual statistic in (f) and this one?

Exercise 2: Social media marketing

The data marketing in the datarium R-package contains information on sales with advertising budget for three advertising media (youtube, facebook and newspaper). This advertising experiment was repeated 200 times to study the impact of the advertisting media on sales.

data(marketing, package = "datarium")

Study the pairs plot. Which of the advertising medium do you think affects the sales?

Construct a coplot for sales vs advertising budget for facebook conditioned on advertising budget for youtube and newspaper. (You may like to make the intervals non-overlapping to make it easier to plot in ggplot). What do you see in the plot?

Now construct a coplot for sales vs advertising budget for facebook conditioned on advertising budget for youtube alone. Superimpose a linear model on each facet. Is there an interval where the linear model is not a good fit?

Consider the following interaction model (which has the same symbolic model formulae as sales ~ facebook*youtube) for data where the advertising budget for youtube is at least $90,000. Construct a QQ-plot of the residuals. Do you think the errors are normally distributed? Construct a lineup for the QQ-plot assuming that the null distribution is Normally distributed with mean zero and variance as estimated from the model fit.

Exercise 3: IDA skill sprint

Set the timer. You have 15 minutes to discover as many problems as possible in this data, cafe.rda.

A small cafe in the city of Melbourne is interested in determining whether the daily earnings depend on the weather. They compiled data for a period over 2000-2001 to study this question. The data has the following variables:

var	description
dt	Date
wday	Day of the week
revenue	Daily revenue in hundreds, 11=1100
expend	Daily expenses in hundreds
precip	Precipitation in mm
mint	Minimum temperature, Celsius
maxt	Maximum temperature, Celsius
source	Source of the weather data

Your tutor has the list of problems.

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.