Week 1: Introduction
Department of Econometrics and Business Statistics
Di Cook
Distinguished Professor
Monash University
🌐 https://dicook.org/
✉️ ETC5521.Clayton-x@monash.edu
🦣 @visnut@aus.social @visnut.bsky.social
I have a PhD from Rutgers University, NJ, and a Bachelor of Science from the University of New England
I am a Fellow of the American Statistical Association, an elected member of the R Foundation and the International Statistical Institute, and Past-Editor of the Journal of Computational and Graphical Statistics and the R Journal.
My research is in data visualisation, statistical graphics and computing, with applications to sports, ecology and bioinformatics. I like to develop new methodology and software.
My students work on methods and software that are generally useful for the world. They have been responsible for bringing you the tidyverse suite, knitr, plotly, and many other R packages we regularly use.
Krisanat Anukarnsakulchularp
Master of Business Analytics
Monash University
🌐 https://github.com/KrisanatA
✉️ ETC5521.Clayton-x@monash.edu
✋ 🔡 You can ask directly by unmuting yourself, or by typing in the chat, during the live lecture.
💻 If watching the recording, please post questions in the discussion (ED) forum.
I hope you have many questions! 🙋🏻👣
Beyond modelling and prediction, data might have many more stories to tell. Exploring data to uncover patterns and structures involves both numerical and visual techniques designed to reveal interesting information that may be unexpected. However, an analyst must be cautious not to over-interpret apparent patterns, and should use randomisation tools to assess whether the patterns are real or spurious.
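One flavour of such a randomisation tool is a lineup: a plot of the real data is hidden among plots of permuted (null) copies, and if it cannot be picked out, the apparent pattern may well be spurious. A minimal sketch using the nullabor package, where `df`, `x` and `y` are hypothetical names used only for illustration:

```r
library(nullabor)
library(ggplot2)

# Hide the plot of the real data among plots of permuted (null) copies;
# `df`, `x` and `y` are hypothetical names for illustration
d <- lineup(null_permute("y"), true = df, n = 12)

# If the real data panel is not distinguishable from the nulls,
# the pattern may be spurious
ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ .sample)
```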
2 hour lecture 👩🏫 Tue 2:00 - 4:00pm, on zoom (see moodle for the link) Class is more fun if you can attend live!
1 hour workshop Tue 4:00 - 5:00pm, on same zoom link. This is based on material during lecture.
1 hour on-campus tutorial 🛠️ Thu 9:00-10:00am, 10:00-11:00am and 3:00-4:00pm CL_Anc-19.LTB_188 Attendance is expected - this is the chance to practice and get help with assignments from your tutors.
🏡 Course homepage: this is where you find the course materials
(lecture slides, tutorials and tutorial solutions) https://ddde.numbat.space/
🈴 Moodle: this is where you find discussion forum, zoom links, and marks https://learning.monash.edu/course/view.php?id=34784
🧰 GitHub classroom: this is where you will find assignments, but links to each will be available in moodle. https://classroom.github.com/
Weekly quizzes (5%) There will be a weekly quiz starting in week 2, provided through Moodle. These are a great chance to check your knowledge, help you prepare for the tutorial, and keep up to date with the weekly course material. Your best 10 scores will be used for your final quiz total.
Exercises 1 (15%), through GitHub classroom, Due: Aug 18, 11:55pm. This is an individual assessment.
Exercises 2 (20%), through GitHub classroom, Due: Sep 1, 11:55pm. This is an individual assessment.
Exercises 3 (20%): through GitHub classroom, Due: Sep 22, 11:55pm. This is an individual assessment.
Project, parts 1 and 2 (20% each), through GitHub classroom, Due: Oct 13, 11:55pm and Nov 3, 11:55pm.
We are going to use GitHub Classroom to distribute assignment templates and keep track of your assignment progress.
In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990.
Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers.
Should we model the tip alone and use the total bill as a predictor? Or should we compute the tip rate and use this as the response?
Calculate tip % as tip/total bill \(\times\) 100
Note: Creating new variables (sometimes called feature engineering) is a common step in any data analysis.
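A minimal sketch of this step in R, assuming the data frame and column names that appear in the glimpse output later in these slides (`tips`, `tip`, `totbill`):

```r
library(dplyr)

# Add tip % as a new column: tip as a percentage of the total bill
tips <- tips |>
  mutate(tip_pct = tip / totbill * 100)
```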
Fit the full model with all variables
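A sketch of fitting this model in R, assuming the `tips` data frame above; output in the style of the tables below could be produced with `broom::tidy()` and `broom::glance()`:

```r
library(broom)

# Full model: tip % explained by all of the recorded variables
tips_full <- lm(tip_pct ~ sex + smoker + day + time + size, data = tips)

tidy(tips_full)    # coefficient estimates, standard errors, t statistics, p-values
glance(tips_full)  # overall fit: R^2, F statistic, p-value
```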
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 20.66 | 2.49 | 8.29 | 0.00 |
sexM | -0.85 | 0.83 | -1.02 | 0.31 |
smokerYes | 0.36 | 0.85 | 0.43 | 0.67 |
daySat | -0.18 | 1.83 | -0.10 | 0.92 |
daySun | 1.67 | 1.90 | 0.88 | 0.38 |
dayThu | -1.82 | 2.32 | -0.78 | 0.43 |
timeNight | -2.34 | 2.61 | -0.89 | 0.37 |
size | -0.96 | 0.42 | -2.28 | 0.02 |
r.squared | statistic | p.value |
---|---|---|
0.042 | 1.5 | 0.17 |
🤔 Which variable(s) would be considered important for predicting tip %?
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 18.44 | 1.12 | 16.5 | 0.00 |
size | -0.92 | 0.41 | -2.2 | 0.03 |
r.squared | statistic | p.value |
---|---|---|
0.02 | 5 | 0.026 |
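The reduced model reported above could be fit in the same way; a brief sketch, under the same assumptions:

```r
# Keep only the variable that appeared important in the full model
tips_size <- lm(tip_pct ~ size, data = tips)

tidy(tips_size)
glance(tips_size)
```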
\[\widehat{\text{tip \%}} = 18.44 - 0.92 \times size\]
As the size of the dining party increases by one person, the predicted tip % decreases by approximately one percentage point.
\(R^2 = 0.02\).
This dropped by half from the full model, even though none of the other variables contributed significantly to that model. It might be a good next step to examine interaction terms.
What does \(R^2 = 0.02\) mean?
\(R^2 = 0.02\) means that size explains just 2% of the variance in tip %. This is a very weak model.
And \(R^2 = 0.04\) is also a very weak model.
What do the \(F\) statistic and \(p\)-value mean?
What do the \(t\) statistics and \(p\)-value associated with model coefficients mean?
Assume that we have a random sample from a population, and that the model for the population is
\[ \text{tip \%} = \beta_0 + \beta_1 sexM + ... + \beta_7 size + \varepsilon \] and the fitted model from the observed sample is
\[ \widehat{\text{tip \%}} = b_0 + b_1 sexM + ... + b_7 size \] The \(F\) statistic refers to the test of
\[ H_o: \beta_1 = ... = \beta_7 = 0 ~~\text{vs}~~ H_a: \text{at least one is not } 0\] The \(p\)-value is the probability of observing an \(F\) value as large as, or larger than, the one computed from this sample, assuming \(H_o\) is true.
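As a quick check of this definition, the tail probability can be computed directly from the \(F\) distribution. A sketch, where the degrees of freedom (7 and 236) follow from the 7 predictor terms and the 244 observations minus 8 estimated coefficients:

```r
# P(F >= 1.5) for the overall test of the full model
pf(1.5, df1 = 7, df2 = 236, lower.tail = FALSE)
#> roughly 0.17, matching the model summary above
```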
Assume that we have a random sample from a population, and that the model for the population is
\[ \text{tip \%} = \beta_0 + \beta_1 sexM + ... + \beta_7 size + \varepsilon \] and the fitted model from the observed sample is
\[ \widehat{\text{tip \%}} = b_0 + b_1 sexM + ... + b_7 size \]
The \(t\) statistics in the coefficient summary refer to the tests of
\[ H_o: \beta_k = 0 ~~\text{vs}~~ H_a: \beta_k \neq 0 \] The \(p\)-value is the probability of observing a \(t\) value as extreme as, or more extreme than, the one computed from this sample, assuming \(H_o\) is true.
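Similarly, for the size coefficient in the reduced model (\(t = -2.2\) with \(244 - 2 = 242\) residual degrees of freedom), a sketch of the two-sided tail probability:

```r
# Two-sided p-value for the t statistic of the size coefficient
2 * pt(abs(-2.2), df = 242, lower.tail = FALSE)
#> roughly 0.03, matching the coefficient table above
```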
Normally, the final model summary would be accompanied by diagnostic plots.
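A sketch of how such diagnostic plots might be produced with broom and ggplot2, assuming the reduced model object is called `tips_size`:

```r
library(broom)
library(ggplot2)

diag <- augment(tips_size)  # adds .fitted and .resid columns

# Residuals vs fitted values
ggplot(diag, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

# Histogram of residuals
ggplot(diag, aes(x = .resid)) +
  geom_histogram()

# Normal probability plot of residuals
ggplot(diag, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line()
```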
The fitted model is overlaid on a plot of the data. This is called “model-in-the-data-space” (Wickham et al, 2015).
All the plots on the previous three slides (histogram of residuals, normal probability plot, residuals vs fitted values) are considered to be “data-in-the-model-space”. Stay tuned for more discussion on this later.
The result of this work might leave us with
a model that could be used to impose a dining/tipping policy in restaurants (see here)
but it should also leave us with an unease that this policy is based on weak support.
Plots like those we have just seen, associated with the pursuit of an answer to a specific question, are best grouped into the category of “model diagnostics” (MD).
There are additional categories of plots for data analysis, including those used for initial data analysis (IDA) and descriptive statistics. Stay tuned for more on these.
A separate and big area for plots of data is for communication, where we already know what is in the data and we want to communicate the information as best possible.
When exploring data, we are using data plots to discover things we didn’t already know.
It’s a good idea to examine the data description, the explanation of the variables, and how the data was collected.
You need to know what types of variables are in the data in order to choose appropriate plots and calculations.
The data description should include information about the data collection methods, so we know the extent to which what we learn from this data might apply to new data.
What does that look like here?
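A brief sketch, assuming the data has already been read into a data frame called `tips` (the column names shown below come from the data itself):

```r
library(dplyr)

# Overview of the variables and their types
glimpse(tips)
```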
Rows: 244
Columns: 9
$ obs <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
$ totbill <dbl> 17.0, 10.3, 21.0, 23.7, 24.6, 25…
$ tip <dbl> 1.0, 1.7, 3.5, 3.3, 3.6, 4.7, 2.…
$ sex <chr> "F", "M", "M", "M", "F", "M", "M…
$ smoker <chr> "No", "No", "No", "No", "No", "N…
$ day <chr> "Sun", "Sun", "Sun", "Sun", "Sun…
$ time <chr> "Night", "Night", "Night", "Nigh…
$ size <dbl> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2,…
$ tip_pct <dbl> 5.9, 16.1, 16.7, 14.0, 14.7, 18.…
Look at the distributions of the quantitative variables: tip and total bill.
Examine the distributions of the categorical variables.
Examine the quantitative variables relative to the categorical variables (see the sketch below).
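A sketch of plots for these three steps, assuming the same `tips` data frame (binwidths here are arbitrary starting choices):

```r
library(ggplot2)

# Distributions of the quantitative variables
ggplot(tips, aes(x = tip)) + geom_histogram(binwidth = 1)
ggplot(tips, aes(x = totbill)) + geom_histogram(binwidth = 5)

# Counts for the categorical variables
ggplot(tips, aes(x = day)) + geom_bar()
ggplot(tips, aes(x = smoker)) + geom_bar()

# Quantitative variables relative to categorical variables
ggplot(tips, aes(x = day, y = tip)) + geom_boxplot()
```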
but I’ve already done this, and we don’t learn anything more about the multiple peaks than what is learned by plotting tips.
but I’ve already done that, and there’s not too much of interest there.
These unexpected insights were missed by the analysis that focused solely on the primary question.
In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990.
How much can you infer about tipping more broadly?
Poor data collection methods affect every analysis, including statistical or computational modelling.
For this waiter and the restaurant manager, there is some useful information. Like what?
False discovery is the lesser danger when compared to non-discovery. Non-discovery is the failure to identify meaningful structure, and it may result in false or incomplete modeling. In a healthy scientific enterprise, the fear of non-discovery should be at least as great as the fear of false discovery.