Week 1: Introduction
Department of Econometrics and Business Statistics
Di Cook
Distinguished Professor
Monash University
🌐 https://dicook.org/
✉️ ETC5521.Clayton-x@monash.edu
🦣 @visnut@aus.social
I have a PhD from Rutgers University, NJ, and a Bachelor of Science from University of New England
I am a Fellow of the American Statistical Association, an elected member of the R Foundation and the International Statistical Institute, and Past Editor of the Journal of Computational and Graphical Statistics and of the R Journal.
My research is in data visualisation, statistical graphics and computing, with applications to sports, ecology and bioinformatics. I like to develop new methodology and software.
My students always work on methods and software that are generally useful for the world. They have been responsible for bringing you the tidyverse suite, knitr, plotly, and many other R packages we regularly use.
Krisanat Anukarnsakulchularp
Master of Business Analytics
Monash University
🌐 https://github.com/KrisanatA
✉️ ETC5521.Clayton-x@monash.edu
✋ 🔡 You can ask questions directly by unmuting yourself, or by typing in the chat, during the live lecture.
💻 If watching the recording, please post in the discussion (ED) forum.
Beyond modelling and prediction, data might have many more stories to tell. Exploring data to uncover patterns and structures involves both numerical and visual techniques designed to reveal interesting information that may be unexpected. However, an analyst must be cautious not to over-interpret apparent patterns, and should use randomisation tools to assess whether the patterns are real or spurious.
2 hour lecture 👩🏫 Tue 10.00am - noon, on zoom (see moodle for the link) Class is more fun if you can attend live!
2 x 1.5 hour on-campus tutorial 🛠️ Wed 9:30-11:00 and Wed 7:30-9:00pm CL_Anc-19.LTB_134 Attendance is expected - this is the chance to practice what is explained in lecture under your tutor’s guidance.
🏡 Course homepage: this is where you find the course materials
(lecture slides, tutorials and tutorial solutions) https://ddde.numbat.space/
🈴 Moodle: this is where you find discussion forum, zoom links, and marks https://learning.monash.edu/course/view.php?id=18864
🧰 GitHub classroom: this is where you will find assignments, but links to each will be available in moodle. https://classroom.github.com/classrooms/175896553-etc5521-2024-classroom-29a96a
Weekly quizzes (5%) There will be a weekly quiz starting in week 2, provided through Moodle. These are a great chance to check your knowledge, prepare for the tutorial, and keep up to date with the weekly course material. Your best 10 scores will be used for your final quiz total.
Assignment 1 (15%), through GitHub classroom, Due: Aug 5, 11:55pm. This is an individual assessment.
Assignment 2 (20%), through GitHub classroom, Due: Aug 26, 11:55pm. This is an individual assessment.
Assignment 3 (20%): through GitHub classroom, Due: Sep 16, 11:55pm. This is an individual assessment.
Assignment 4, parts 1 and 2 (20% each), through GitHub classroom, Due: Oct 7, 11:55pm and Oct 28, 11:55pm.
We are going to use GitHub Classroom (etc5521 2024: Diving Deeper into Data Exploration) to distribute assignment templates and keep track of your assignment progress.
What’s special about exploring data, in contrast to confirmatory data analysis?
Let’s look at some common definitions of, and quotes about, “exploratory data analysis”.
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends.
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to (1) maximize insight into a data set; (2) uncover underlying structure; (3) extract important variables; (4) detect outliers and anomalies; (5) test underlying assumptions; (6) develop parsimonious models; and (7) determine optimal factor settings.
What is Exploratory Data Analysis (EDA)? (1) How to ensure you are ready to use machine learning algorithms in a project? (2) How to choose the most suitable algorithms for your data set? (3) How to define the feature variables that can potentially be used for machine learning?
EDA is necessary for the next stage of data research. If there was an analogy to exploratory data analysis, it would be that of a painter examining their tools and available time, before deciding on what best to paint.
These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data.
The purpose of doing Exploratory Data Analysis, or EDA, is to find new information in data. What practitioners may not be aware of is that EDA uses visual examination of a dataset to understand and summarise its main characteristics, without a prior hypothesis or reliance upon statistical models.
In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990.
Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers.
Do we model tip alone and use the total bill as a predictor? Or do we compute the tip rate and use this as the response?
Calculate tip % as tip/total bill \(\times\) 100
Note: Creating new variables (sometimes called feature engineering), is a common step in any data analysis.
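As a sketch (assuming the data has been read into a data frame called `tips`, with the column names shown in the `glimpse` output later in this lecture):

```r
library(dplyr)

# Feature engineering: tip as a percentage of the total bill
tips <- tips |>
  mutate(tip_pct = tip / totbill * 100)
```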
Fit the full model with all variables
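One way to fit and summarise this model, shown as a sketch using broom (the object name `tips_lm` is ours):

```r
library(broom)

# Full model: tip % regressed on all recorded predictors
tips_lm <- lm(tip_pct ~ sex + smoker + day + time + size, data = tips)

tidy(tips_lm)    # coefficient table (shown below)
glance(tips_lm)  # overall fit statistics
```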
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 20.66 | 2.49 | 8.29 | 0.00 |
| sexM | -0.85 | 0.83 | -1.02 | 0.31 |
| smokerYes | 0.36 | 0.85 | 0.43 | 0.67 |
| daySat | -0.18 | 1.83 | -0.10 | 0.92 |
| daySun | 1.67 | 1.90 | 0.88 | 0.38 |
| dayThu | -1.82 | 2.32 | -0.78 | 0.43 |
| timeNight | -2.34 | 2.61 | -0.89 | 0.37 |
| size | -0.96 | 0.42 | -2.28 | 0.02 |
| r.squared | statistic | p.value |
|---|---|---|
| 0.042 | 1.5 | 0.17 |
🤔 Which variable(s) would be considered important for predicting tip %?
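Only size has a \(p\)-value below 0.05. A sketch of refitting with this single predictor, continuing the assumed names from before (`tips_lm2` is ours):

```r
library(broom)

# Reduced model: size is the only predictor retained
tips_lm2 <- lm(tip_pct ~ size, data = tips)

tidy(tips_lm2)
glance(tips_lm2)
```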
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 18.44 | 1.12 | 16.5 | 0.00 |
| size | -0.92 | 0.41 | -2.2 | 0.03 |

| r.squared | statistic | p.value |
|---|---|---|
| 0.02 | 5 | 0.026 |
\[\widehat{\text{tip \%}} = 18.44 - 0.92 \times \text{size}\]
As the size of the dining party increases by one person, the tip rate decreases by approximately 1 percentage point.
\(R^2 = 0.02\).
This is half the \(R^2\) of the full model, even though no other variables contributed significantly to that model. It might be a good step to examine interaction terms.
What does \(R^2 = 0.02\) mean?
\(R^2 = 0.02\) means that size explains just 2% of the variance in tip %. This is a very weak model.
And \(R^2 = 0.04\) is also a very weak model.
What do the \(F\) statistic and \(p\)-value mean?
What do the \(t\) statistics and \(p\)-value associated with model coefficients mean?
Assume that we have a random sample from a population. Assume that the model for the population is
\[ \text{tip \%} = \beta_0 + \beta_1 sexM + ... + \beta_7 size + \varepsilon \] and we have observed
\[ \widehat{\text{tip \%}} = b_0 + b_1 sexM + ... + b_7 size \] The \(F\) statistic refers to the test of
\[ H_o: \beta_1 = ... = \beta_7 = 0 ~~ vs ~~ H_a: \text{at least one is not 0}\] The \(p\)-value is the probability that we observe the given \(F\) value or larger, computed assuming \(H_o\) is true.
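As a check on this definition, the tail probability can be computed directly. With 8 estimated coefficients and \(n = 244\), the \(F\) statistic of 1.5 has 7 and 236 degrees of freedom:

```r
# P(F >= 1.5) for an F distribution with 7 and 236 df
pf(1.5, df1 = 7, df2 = 236, lower.tail = FALSE)
# approximately 0.17, matching the model summary above
```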
Assume that we have a random sample from a population. Assume that the model for the population is
\[ \text{tip \%} = \beta_0 + \beta_1 sexM + ... + \beta_7 size + \varepsilon \] and we have observed
\[ \widehat{\text{tip \%}} = b_0 + b_1 sexM + ... + b_7 size \]
The \(t\) statistics in the coefficient summary refer to
\[ H_o: \beta_k = 0 ~~ vs ~~ H_a: \beta_k \neq 0 \] The \(p\)-value is the probability that we observe the given \(t\) value or more extreme, computed assuming \(H_o\) is true.
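The analogous check for a single coefficient, e.g. size in the reduced model, where \(t = -2.2\) on \(244 - 2 = 242\) degrees of freedom:

```r
# Two-sided p-value: P(|T| >= 2.2) for a t distribution with 242 df
2 * pt(abs(-2.2), df = 242, lower.tail = FALSE)
# approximately 0.03, matching the coefficient table above
```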
Normally, the final model summary would be accompanied by diagnostic plots.
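A minimal sketch of these diagnostics, assuming the fitted model object `tips_lm2` from earlier:

```r
library(broom)
library(ggplot2)

aug <- augment(tips_lm2)  # adds .fitted and .resid columns

# Fitted values vs residuals
ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

# Histogram of residuals
ggplot(aug, aes(x = .resid)) +
  geom_histogram()

# Normal probability plot of residuals
ggplot(aug, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line()
```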
The fitted model is overlaid on a plot of the data. This is called “model-in-the-data-space” (Wickham et al, 2015).
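A sketch of such a plot for the reduced model (assumed names as before):

```r
library(ggplot2)

# Model-in-the-data-space: the fitted line drawn over the raw data
ggplot(tips, aes(x = size, y = tip_pct)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```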
All the plots on the previous three slides (histogram of residuals, normal probability plot, fitted vs residuals) are considered to be “data-in-the-model-space”. Stay tuned for more discussion on this later.
The result of this work might leave us with
a model that could be used to impose a dining/tipping policy in restaurants (see here)
but it should also leave us with an unease that this policy is based on weak support.
Plots like those we have just seen, associated with the pursuit of an answer to a specific question, are best grouped into the category of “model diagnostics” (MD).
There are additional categories of plots for data analysis, including initial data analysis (IDA) and descriptive statistics. Stay tuned for more on these.
A separate and big area for plots of data is for communication, where we already know what is in the data and we want to communicate the information as best possible.
When exploring data, we are using data plots to discover things we didn’t already know.
It’s a good idea to examine the data description, the explanation of the variables, and how the data was collected.
What does that look like here?
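A sketch, assuming the data frame `tips` as before:

```r
library(dplyr)

glimpse(tips)
```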
Rows: 244
Columns: 9
$ obs <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
$ totbill <dbl> 17.0, 10.3, 21.0, 23.7, 24.6, 25…
$ tip <dbl> 1.0, 1.7, 3.5, 3.3, 3.6, 4.7, 2.…
$ sex <chr> "F", "M", "M", "M", "F", "M", "M…
$ smoker <chr> "No", "No", "No", "No", "No", "N…
$ day <chr> "Sun", "Sun", "Sun", "Sun", "Sun…
$ time <chr> "Night", "Night", "Night", "Nigh…
$ size <dbl> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2,…
$ tip_pct <dbl> 5.9, 16.1, 16.7, 14.0, 14.7, 18.…
Look at the distribution of quantitative variables tips, total bill.
Examine the distributions across categorical variables.
Examine quantitative variables relative to categorical variables (sketches of these plots are given below).
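Sketches of these first exploratory plots, using the assumed names from the `glimpse` output:

```r
library(ggplot2)

# Distribution of tips: a fine binwidth exposes the multiple peaks
ggplot(tips, aes(x = tip)) +
  geom_histogram(binwidth = 0.1)

# Distribution of total bill
ggplot(tips, aes(x = totbill)) +
  geom_histogram(binwidth = 2)

# Counts across a categorical variable
ggplot(tips, aes(x = day)) +
  geom_bar()

# A quantitative variable relative to categorical variables
ggplot(tips, aes(x = day, y = tip)) +
  geom_boxplot() +
  facet_wrap(~time)
```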
but I’ve already done this, and we don’t learn anything more about the multiple peaks than what is learned by plotting tips.
but I’ve already done that, and there’s not too much of interest there.
These unexpected insights were missed by the analysis that focused solely on the primary question.
In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990.
How much can you infer about tipping more broadly?
Poor data collection methods affect every analysis, including statistical or computational modelling.
For this waiter and the restaurant manager, there is some useful information. Like what?
False discovery is the lesser danger when compared to non-discovery. Non-discovery is the failure to identify meaningful structure, and it may result in false or incomplete modeling. In a healthy scientific enterprise, the fear of non-discovery should be at least as great as the fear of false discovery.