ETC5521: Diving Deeply into Data Exploration

Week 1: Introduction

Professor Di Cook

Department of Econometrics and Business Statistics

About this unit

Teaching team 1/2

Di Cook
Distinguished Professor
Monash University

🌐 https://dicook.org/

✉️ ETC5521.Clayton-x@monash.edu

🦣 @visnut@aus.social @visnut.bsky.social

I have a PhD from Rutgers University, NJ, and a Bachelor of Science from University of New England
I am a Fellow of the American Statistical Association, elected member of the the R Foundation and International Statistical Institute, Past-Editor of the Journal of Computational and Graphical Statistics, and the R Journal.
My research is in data visualisation, statistical graphics and computing, with application to sports, ecology and bioinformatics. I likes to develop new methodology and software.
My students work on methods and software that is generally useful for the world. They have been responsible for bringing you the tidyverse suite, knitr, plotly, and many other R packages we regularly use.

Teaching team 2/2

Krisanat Anukarnsakulchularp
Master of Business Analytics
Monash University

🌐 https://github.com/KrisanatA

✉️ ETC5521.Clayton-x@monash.edu

He has a Bachelor of Actuarial Science, Monash University, 2018 - 2021
and a Master of Business Analytics, Monash University | 2022 - 2023.
He has published the R package animbook
and is a first year PhD student at Monash, working on data structures, visualisation and models for spatiotemporal networks.
This is his fourth semester tutoring at Monash, and the only unit working on this semester.

Got a question, or a comment?

✋ 🔡 You can ask directly by unmuting yourself, or typing in the chat, of the live lecture.

💻 If watching the recording, please post questions in the discussion (ED) forum.

I hope you have many questions! 🙋🏻👣

Welcome!

Synopsis

Beyond modelling and prediction, data might have many more stories to tell. Exploring data to uncover patterns and structures, involves both numerical and visual techniques designed to reveal interesting information that may be unexpected. However, an analyst must be cautious not to over-interpret apparent patterns, and to use randomisation tools to assess whether the patterns are real or spurious.

Learning objectives

learn to use modern data exploration tools with many different types of contemporary data to uncover interesting structures, unusual relationships and anomalies.
understand how to map out appropriate analyses for a given data set and description, define what we would expect to see in the data, and whether what we see is contrary to expectations.
be able to compute null samples in order to test apparent patterns, and to interpret the results using computational methods for statistical inference.
critically assess the strength and adequacy of data analysis.

📅 Unit Structure

2 hour lecture 👩‍🏫 Tue 2.00 - 4:00pm, on zoom (see moodle for the link) Class is more fun if you can attend live!
1 hour workshop Tue 4:00 - 5:00pm, on same zoom link. This is based on material during lecture.
1 hour on-campus tutorial 🛠️ Thu 9:00-10:00am, 10:00-11am and 3:00-4:00pm CL_Anc-19.LTB_188 Attendance is expected - this is the chance to practice and get help with assignments from your tutor’s.

📚 Resources

🏡 Course homepage: this is where you find the course materials
(lecture slides, tutorials and tutorial solutions) https://ddde.numbat.space/

🈴 Moodle: this is where you find discussion forum, zoom links, and marks https://learning.monash.edu/course/view.php?id=34784

🧰 GitHub classroom: this is where you will find assignments, but links to each will be available in moodle. https://classroom.github.com/

💯 Assessment Part 1/2

Weekly quizzes (5%) There will be a weekly quiz starting week 2 provided through Moodle. These are a great chance to check your knowledge, and help you prepare for the tutorial and to keep up to date with the weekly course material. Your best 10 scores will be used for your final quiz total.
Exercises 1 (15%), through GitHub classroom, Due: Aug 18, 11:55pm. This is an individual assessment.
Exercises 2 (20%), through GitHub classroom, Due: Sep 1, 11:55pm. This is an individual assessment.
Exercises 3 (20%): through GitHub classroom, Due: Sep 22, 11:55pm. This is an individual assessment.
Project, parts 1 and 2 (20% each), through GitHub classroom, Due: Oct 13, 11:55pm and Nov 3, 11:55pm.

GitHub Classroom

We are going to use GitHub Classroom to distribute assignment templates and keep track of your assignment progress.

Clone the first assignment by clicking on the link given in Moodle.
Once you have accepted it, you will get a cloned copy on your own GitHub account. It is a private repo, which means you and the teaching staff will be the only people with access.
If you need some help getting started, check this information.
The week 1 tutorial will help you get started.

What does it mean to explore data?

https://www.gocomics.com/calvinandhobbes/2015/08/26

A simple example to illustrate “exploratory data analysis” contrasted with a “confirmatory data analysis”

What are the factors that affect tipping behaviour?

In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990.

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers.

library(tidyverse)
tips <- read_csv("http://ggobi.org/book/data/tips.csv")

What is tipping?

When you’re dining at a full-service restaurant
- Tip 20 percent of your full bill.
When you grab a cup of coffee
- Round up or add a dollar if you’re a regular or ordered a complicated drink.
When you have lunch at a food truck
- Drop a few dollars into the tip jar, but a little less than you would at a dine-in spot.
When you use a gift card
- Tip on the total value of the meal, not just what you paid out of pocket.

The basic rules of tipping that everyone should know about

Recommended procedure in the book

Step 1: Develop a model
- Should the response be tip alone and use the total bill as a predictor?
- Should you create a new variable tip rate and use this as the response?
Step 2: Fit the full model with sex, smoker, day, time and size as predictors
Step 3: Refine model: Should some variables should be dropped?
Step 4: Check distribution of residuals
Step 5: Summarise the model, if X=something, what would be the expected tip

Step 1

Calculate tip % as tip/total bill \(\times\) 100

tips <- tips %>%
  mutate(tip_pct = tip/totbill * 100)

Note: Creating new variables (sometimes called feature engineering), is a common step in any data analysis.

Step 2 Fit

Fit the full model with all variables

tips_lm <- tips %>%
  select(tip_pct, sex, smoker, day, time, size) %>%
  lm(tip_pct ~ ., data=.)

Step 2 Model summary

library(broom)
library(kableExtra)
tidy(tips_lm) %>% 
  kable(digits=2) %>% 
  kable_styling()

glance(tips_lm) %>% 
  select(r.squared, statistic, 
         p.value) %>% 
  kable(digits=3)

term	estimate	std.error	statistic	p.value
(Intercept)	20.66	2.49	8.29	0.00
sexM	-0.85	0.83	-1.02	0.31
smokerYes	0.36	0.85	0.43	0.67
daySat	-0.18	1.83	-0.10	0.92
daySun	1.67	1.90	0.88	0.38
dayThu	-1.82	2.32	-0.78	0.43
timeNight	-2.34	2.61	-0.89	0.37
size	-0.96	0.42	-2.28	0.02

r.squared	statistic	p.value
0.042	1.5	0.17

🤔 Which variable(s) would be considered important for predicting tip %?

Step 3: Refine model

tips_lm <- tips %>%
  select(tip_pct, size) %>% 
  lm(tip_pct ~ ., data=.) 
tidy(tips_lm) %>% 
  kable(digits=2) %>% 
  kable_styling()

glance(tips_lm) %>% 
  select(r.squared, statistic, p.value) %>% 
  kable(digits=3)

term	estimate	std.error	statistic	p.value
(Intercept)	18.44	1.12	16.5	0.00
size	-0.92	0.41	-2.2	0.03

r.squared	statistic	p.value
0.02	5	0.026

Model summary

\[\widehat{tip %} = 18.44 - 0.92 \times size\]

As the size of the dining party increases by one person the tip decreases by approximately 1%.

Model assessment

\(R^2 = 0.02\).

This dropped by half from the full model, even though no other variables contributed significantly to the model. It might be a good step to examine interaction terms.

What does \(R^2 = 0.02\) mean?

Model assessment

\(R^2 = 0.02\) means that size explains just 2% of the variance in tip %. This is a very weak model.

And \(R^2 = 0.04\) is also a very weak model.

What do the \(F\) statistic and \(p\)-value mean?

What do the \(t\) statistics and \(p\)-value associated with model coefficients mean?

Overall model significance

Assume that we have a random sample from a population. Assume that the model for the population is

\[ \widehat{tip %} = \beta_0 + \beta_1 sexM + ... + \beta_7 size \] and we have observed

\[ \widehat{tip %} = b_0 + b_1 sexM + ... + b_7 size \] The \(F\) statistic refers to

\[ H_o: \beta_1 = ... = \beta_7 = 0 ~~ vs ~~ H_a: \text{at least one is not 0}\] The \(p\)-value is the probability that we observe the given \(F\) value or larger, computed assuming \(H_o\) is true.

Term significance

Assume that we have a random sample from a population. Assume that the model for the population is

\[ \widehat{tip %} = \beta_0 + \beta_1 sexM + ... + \beta_7 size \] and we have observed

\[ \widehat{tip %} = b_0 + b_1 sexM + ... + b_7 size \]

The \(t\) statistics in the coefficient summary refer to

\[ H_o: \beta_k = 0 ~~ vs ~~ H_a: \beta_k \neq 0 \] The \(p\)-value is the probability that we observe the given \(t\) value or more extreme, computed assuming \(H_o\) is true.

Model diagnostics (MD)

Normally, the final model summary would be accompanied diagnostic plots

observed vs fitted values to check strength and appropriateness of the fit
univariate plot, and normal probability plot, of residuals to check for normality
in the simple final model like this, the observed vs predictor, with model overlaid would be advised to assess the model relative to the variability around the model
when the final model has more terms, using a partial dependence plot to check the relative relationship between the response and predictors would be recommended.

Residual plots

tips_aug <- augment(tips_lm)
ggplot(tips_aug, 
    aes(x=.resid)) + 
  geom_histogram() +
  xlab("residuals")

Residual normal probability plots

ggplot(tips_aug, 
    aes(sample=.resid)) + 
  stat_qq() +
  stat_qq_line() +
  xlab("residuals") +
  theme(aspect.ratio=1)

Fitted vs observed

ggplot(tips_aug, 
    aes(x=.fitted, y=tip_pct)) + 
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  xlab("observed") +
  ylab("fitted")

“Model-in-the-data-space”

ggplot(tips_aug, 
    aes(x=size, y=tip_pct)) + 
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  ylab("tip %")

The fitted model is overlaid on a plot of the data. This is called “model-in-the-data-space” (Wickham et al, 2015).

All the plots on the previous three slides: histogram of residuals, normal probability plot, fitted vs residuals are considered to be “data-in-the-model-space”. Stay tuned for more discussion on this later.

The result of this work might leave us with

a model that could be used to impose a dining/tipping policy in restaurants (see here)

but it should also leave us with an unease that this policy is based on weak support.

Summary

Plots as we have just seen, associated with pursuit of an answer to a specific question may be best grouped into the category of “model diagnostics (MD)”.

There are additional categories of plots for data analysis that include initial data analysis (IDA), descriptive statistics. Stay tuned for more on these.

A separate and big area for plots of data is for communication, where we already know what is in the data and we want to communicate the information as best possible.

When exploring data, we are using data plots to discover things we didn’t already know.

What did this analysis miss?

General strategy for EXPLORING DATA

It’s a good idea to examine the data description, the explanation of the variables, and how the data was collected.

You need to know what type of variables are in the data in order to decide appropriate choice of plots, and calculations to make.

Data description should have information about data collection methods, so that the extent of what we learn from the data might apply to new data.

What does that look like here?

glimpse(tips)

Rows: 244
Columns: 9
$ obs     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
$ totbill <dbl> 17.0, 10.3, 21.0, 23.7, 24.6, 25…
$ tip     <dbl> 1.0, 1.7, 3.5, 3.3, 3.6, 4.7, 2.…
$ sex     <chr> "F", "M", "M", "M", "F", "M", "M…
$ smoker  <chr> "No", "No", "No", "No", "No", "N…
$ day     <chr> "Sun", "Sun", "Sun", "Sun", "Sun…
$ time    <chr> "Night", "Night", "Night", "Nigh…
$ size    <dbl> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2,…
$ tip_pct <dbl> 5.9, 16.1, 16.7, 14.0, 14.7, 18.…

Look at the distribution of quantitative variables tips, total bill.

Examine the distributions across categorical variables.

Examine quantitative variables relative to categorical variables

Distributions of tips

ggplot(tips, 
    aes(x=tip)) + 
  geom_histogram(
    colour="white")

Because, one binwidth is never enough …

Distributions of tips

ggplot(tips, 
    aes(x=tip)) +
  geom_histogram(
    breaks=seq(0.5,10.5,1),  
    colour="white") + 
  scale_x_continuous(
    breaks=seq(0,11,1))

Big fat bins. Tips are skewed, which means most tips are relatively small.

Distributions of tips

ggplot(tips, 
    aes(x=tip)) + 
  geom_histogram(
    breaks=seq(0.5,10.5,0.1), 
    colour="white") +
  scale_x_continuous(
    breaks=seq(0,11,1))

Skinny bins. Tips are multimodal, and occurring at the full dollar and 50c amounts.

We could also look at total bill this way

but I’ve already done this, and we don’t learn anything more about the multiple peaks than waht is learned by plotting tips.

Relationship between tip and total

p <- ggplot(tips, 
    aes(x= totbill, y=tip)) + 
  geom_point() + 
  scale_y_continuous(
    breaks=seq(0,11,1))
p

Why is total on the x axis?

Should we add a guideline?

Add a guideline indicating common practice

p <- p + geom_abline(intercept=0, 
              slope=0.2) + 
  annotate("text", x=45, y=10, 
           label="20% tip") 
p

Most tips less than 20%: Skin flints vs generous diners
A couple of big tips
Banding horizontally is the rounding seen previously

We should examine bar charts and mosaic plots of the categorical variables next

but I’ve already done that, and there’s not too much of interest there.

Relative to categorical variables

p + facet_grid(smoker~sex)

The bigger bills tend to be paid by men (and females that smoke).
Except for three diners, female non-smokers are very consistent tippers, probably around 15-18% though.
The variability in the smokers is much higher than for the non-smokers.

Isn’t this interesting?

Procedure of EDA

We gained a wealth of insight in a short time.
Using nothing but graphical methods we investigated univariate, bivariate, and multivariate relationships.
We found both global features and local detail. We saw that
- tips were rounded; then we saw the obvious
- correlation between the tip and the size of the bill, noting the scarcity of generous tippers; finally we
- discovered differences in the tipping behavior of male and female smokers and non-smokers.

These are unexpected insights were missed from the analysis that focused solely on the primary question.

What can go wrong?

How was data collected?

In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990.

How much can you infer about tipping more broadly?

Tip has a weak but significant relationship with total bill?
Tips have a skewed distribution? (More small tips and fewer large tips?)
Tips tend to be made in nice round numbers.
People generally under-tip?
Smokers are less reliable tippers.

Ways to verify, support or refute generalisations

external information
other studies/samples
good choice of calculations and plots
all the permutations and subsets of measured variables
computational re-sampling methods (we’ll see these soon)

Poor data collection methods affects every analysis, including statistical or computational modeling.

For this waiter and the restaurant manager, there is some useful information. Like what?

Service fee for smokers to ensure consistency?
Assign waiter to variety of party sizes and composition.
Shifts on different days or time of day (not shown).

Words of wisdom

False discovery is the lesser danger when compared to non-discovery. Non-discovery is the failure to identify meaningful structure, and it may result in false or incomplete modeling. In a healthy scientific enterprise, the fear of non-discovery should be at least as great as the fear of false discovery.

Guide

Read the data description to understand the context, and the extent of the data collection.
Understand the types of variables that have been measured.
Brainstorm a set of questions that might be interesting to answer with this data.
For each of your question, write down what you EXPECT to find.
Map out the possible plots and numerical summaries to make, that could be made, but particularly, what needs to be computed in order to answer the questions.

What potential errors might be in the data? Missing values, not recorded data, errors in coding, and think about the strategy to deal with them.
Start on your analysis. Think about the results suggested and whether they match or contradict what you expected.
Use randomisation methods to learn whether what you have observed is possibly spurious.
Communicate findings.

Topics covered

Methods for single, bivariate, multivariate
- numerical variables
- categorical variables
Methods to accommodate temporal and spatial (maybe also networks) context
How to make effective comparisons
Utilising computational methods to assess what you see is “real”

Resources

Cook and Swayne (2007) Interactive and Dynamic Graphics for Data Analysis, Introduction
Donoho (2017) 50 Years of Data Science
Staniak and Biecek (2019) The Landscape of R Packages for Automated Exploratory Data Analysis