ETC5521: Diving Deeply into Data Exploration

Sculpting data using models, checking assumptions, co-dependency and performing diagnostics

Professor Di Cook

Department of Econometrics and Business Statistics

Outline

  • Different types of model fitting
  • Decomposing data from model
    • fitted
    • residual
  • Diagnostic calculations
    • anomalies
    • leverage
    • influence

Models can be used to re-focus the view of data

Different types of model fitting

The basic form for fitting a model with data (response \(Y\) and predictors \(X\)) is:

\[ Y = f(X) + \varepsilon \]

and \(X\) could include multiple variables, \(X = (X_{1}, X_{2}, \dots, X_{p})\) where \(p\) is the number of variables. We have a sample of \(n\) observations, \(y_i, x_{i1}, \dots, x_{ip}, ~~~ i=1, \dots, n\).

  • In a parametric model, the form of \(f\) is specified, e.g. \(\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2\), and one would estimate the parameters \(\beta_0, \beta_1, \beta_2, \beta_3\).
    • Frequentist fitting assumes that parameters are fixed values.
    • In a Bayesian framework, the parameters are assumed to have a distribution, e.g. Gaussian.
  • In a non-parametric model, the form of \(f\) is NOT specified but is estimated from the data. It may not have a specific functional form, typically needs more data, and imposes fewer assumptions. It can also be done in a Bayesian framework.
  • Different types of variables can change the model specification, e.g. binary or categorical \(Y\), or temporal or spatial context.
  • Different model products, e.g. fitted values or residuals, after the fit change the lens with which we view the data.

Parametric regression

Specification

Specify the

  • functional form, e.g. a form with linear and quadratic terms

\[f(X) = \beta_0 + \beta_1 X + \beta_2 X^2\]

  • distribution of errors, e.g. 

\[\varepsilon \sim N(0, \sigma^2)\]

Fitting results in:

  • fitted values, \(\widehat{y}\) (sharpening)
  • residuals, \(e = y-\widehat{y}\) (what did we miss)
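
The code in these slides assumes the following packages are loaded (the list is inferred from the functions used throughout; the original setup chunk is not shown):

Code
library(tidyverse)   # dplyr, tidyr, ggplot2, readr, purrr, tibble
library(lubridate)   # dmy()
library(broom)       # tidy(), glance(), augment()
library(patchwork)   # plot_layout() for combining ggplots
library(nullabor)    # lineup(), null_dist()
library(ggbeeswarm)  # geom_quasirandom()
library(colorspace)  # scale_color_discrete_divergingx()
library(mgcv)        # gam(), s()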
Code
mtcars_fit <- lm(mpg~poly(hp, 2), data=mtcars)
mtcars_all <- augment(mtcars_fit, mtcars)
p1 <- ggplot(mtcars_all, aes(x=hp, y=mpg)) +
  geom_point() +
  ylim(c(10, 35)) +
  ggtitle("data")
p2 <- ggplot(mtcars_all, aes(x=hp, y=.fitted)) +
  geom_point() +
  ylab("f") +
  ylim(c(10, 35)) +
  ggtitle("model")
p1 + p2 + plot_layout(ncol=2)

Code
mtcars_coefs <- tidy(mtcars_fit)
mtcars_stats <- glance(mtcars_fit)
mtcars_coefs
# A tibble: 3 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      20.1     0.544     36.9  6.15e-26
2 poly(hp, 2)1    -26.0     3.08      -8.46 2.51e- 9
3 poly(hp, 2)2     13.2     3.08       4.27 1.89e- 4
Code
mtcars_stats
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic      p.value    df
      <dbl>         <dbl> <dbl>     <dbl>        <dbl> <dbl>
1     0.756         0.739  3.08      45.0      1.30e-9     2
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>

Diagnostics (1/3)

Residuals, \(e = y-\widehat{y}\) (what doesn’t the fitted model see?)

  • Should be consistent with a sample from the specified error model
  • Should have no relationship with the explanatory variables (or the fitted values)
Code
set.seed(1026)
ggplot(lineup(null_dist(".resid", "norm"), n=12, true=mtcars_all), 
       aes(x=.resid)) +
  geom_histogram(binwidth=1, colour="white") +
  facet_wrap(~.sample, ncol=4) +
  theme(axis.title = element_blank(),
        axis.text = element_blank())

Code
p3 <- ggplot(mtcars_all, aes(x=.resid)) +
  geom_histogram(binwidth=1, colour="white") 
p4 <- ggplot(mtcars_all, aes(sample=.resid)) +
  geom_qq_line() +
  geom_qq() 
p5 <- ggplot(mtcars_all, aes(x=1, y=.resid)) +
  geom_quasirandom() 
p6 <- ggplot(mtcars_all, aes(x=1, y=.resid)) +
  geom_violin(fill="#D93F00", colour="white", 
              draw_quantiles=c(0.25, 0.5, 0.75)) 

p3 + p4 + p5 + p6 + plot_layout(ncol=2)

Code
set.seed(1058)
ggplot(lineup(null_dist(".resid", "norm"), n=12, true=mtcars_all), 
       aes(x=hp, y=.resid)) +
  geom_point() +
  facet_wrap(~.sample, ncol=4) +
  theme(axis.title = element_blank(),
        axis.text = element_blank())

Code
p7 <- ggplot(mtcars_all, aes(x=hp, y=.resid)) +
  geom_point() +
  ylab("e") +
  ylim(c(-3, 3)) +
  ggtitle("residuals")
mt_full_fit <- tibble(hp = seq(50, 340, 10))
mt_full_fit <- mt_full_fit |>
                mutate(mpg = predict(mtcars_fit, mt_full_fit))
p8 <- ggplot(mtcars_all, aes(x=hp, y=mpg)) +
  geom_point() +
  geom_line(data=mt_full_fit, colour="#D93F00", linewidth=2) +
  ylim(c(10, 35)) +
  ggtitle("data+f")

p7 + p8 + plot_layout(ncol=2)

Diagnostics (2/3)

Leverage

  • The matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) is referred to as the hat matrix.
  • The \(i\)-th diagonal element of \(\mathbf{H}\), \(h_{ii}\), is called the leverage of the \(i\)-th observation.
  • Leverages are always between zero and one, \[0 \leq h_{ii} \leq 1.\]
  • Notice that leverages are not dependent on the response!
  • Points with high leverage can exert a lot of influence on the parameter estimates

Studentized residuals

  • In order to obtain residuals with equal variance, many texts recommend using the studentised residuals \[e_i^* = \dfrac{e_i} {\widehat{\sigma} \sqrt{1 - h_{ii}}}\] for diagnostic checks.

Cook’s distance

  • Cook’s distance, \(D\), is another measure of influence: \[\begin{eqnarray*} D_i &=& \dfrac{(\widehat{\boldsymbol{\beta}}- \widehat{\boldsymbol{\beta}}_{[-i]})^\top Var(\widehat{\boldsymbol{\beta}})^{-1}(\widehat{\boldsymbol{\beta}}- \widehat{\boldsymbol{\beta}}_{[-i]})}{p}\\ &=&\frac{e_i^2 h_{ii}}{(1-h_{ii})^2p\widehat\sigma^2}, \end{eqnarray*}\] where \(p\) is the number of elements in \(\boldsymbol{\beta}\), and \(\widehat{\boldsymbol{\beta}}_{[-i]}\) is the least squares estimate obtained by fitting the model without the \(i\)-th data point \((\boldsymbol{x}_i,Y_i)\).
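
These quantities are returned by broom::augment() as .hat, .std.resid and .cooksd (used on the next slide). As a minimal check of the formulas, not part of the original slides, they can be computed directly from the quadratic fit mtcars_fit and compared with base R’s built-in diagnostic functions:

Code
# Hat matrix and leverages from the design matrix
X <- model.matrix(mtcars_fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h <- diag(H)
all.equal(h, hatvalues(mtcars_fit))

# Studentised residuals: e / (sigma_hat * sqrt(1 - h))
e <- residuals(mtcars_fit)
sigma_hat <- summary(mtcars_fit)$sigma
all.equal(e / (sigma_hat * sqrt(1 - h)), rstandard(mtcars_fit))

# Cook's distance: e^2 h / ((1 - h)^2 p sigma_hat^2)
p <- length(coef(mtcars_fit))
all.equal(e^2 * h / ((1 - h)^2 * p * sigma_hat^2), cooks.distance(mtcars_fit))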

Diagnostics (3/3)

Code
mtcars_all <- mtcars_all |>
  mutate(id = 1:n())
p10 <- ggplot(mtcars_all, aes(x=id, y=.hat, label=id)) +
  geom_point() +
  labs(title="leverage")
p11 <- ggplot(mtcars_all, aes(x=id, y=.cooksd, label=id)) +
  geom_point() +
  labs(title="cooks d") 
p12 <- ggplot(mtcars_all, aes(x=.resid, y=.std.resid, label=id)) +
  geom_point() +
  labs(title="studentised residuals")

Simulation

Generate response values for un-collected predictor values

mt_full_fit <- tibble(hp = seq(50, 340, 10))
mt_full_fit <- mt_full_fit |>
                mutate(mpg = predict(mtcars_fit, mt_full_fit))

Simulate new samples

Code
set.seed(1249)
mt_sample1 <- tibble(hp = runif(nrow(mtcars), 50, 340),
                     e = rnorm(nrow(mtcars), sd = mtcars_stats$sigma))
mt_sample1 <- mt_sample1 |>
                mutate(mpg = predict(mtcars_fit, mt_sample1) + e,
                       type = "sim") |>
  select(hp, mpg, type)
mtcars_sub <- mtcars |>
  select(hp, mpg) |>
  mutate(type = "data")
mt_sample1 <- bind_rows(mt_sample1, mtcars_sub)

ggplot(mt_sample1, aes(x=hp, y=mpg, colour=type)) +
  geom_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  ylim(c(10, 35)) 

Code
set.seed(1250)
mt_sample2 <- tibble(hp = runif(nrow(mtcars), 50, 340),
                     e = rnorm(nrow(mtcars), sd = mtcars_stats$sigma))
mt_sample2 <- mt_sample2 |>
                mutate(mpg = predict(mtcars_fit, mt_sample2) + e,
                       type = "sim") |>
  select(hp, mpg, type)
mtcars_sub <- mtcars |>
  select(hp, mpg) |>
  mutate(type = "data")
mt_sample2 <- bind_rows(mt_sample2, mtcars_sub)

ggplot(mt_sample2, aes(x=hp, y=mpg, colour=type)) +
  geom_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  ylim(c(10, 35)) 

Code
set.seed(1350)
mt_sample3 <- tibble(hp = runif(nrow(mtcars), 50, 340),
                     e = rnorm(nrow(mtcars), sd = mtcars_stats$sigma))
mt_sample3 <- mt_sample3 |>
                mutate(mpg = predict(mtcars_fit, mt_sample3) + e,
                       type = "sim") |>
  select(hp, mpg, type)
mtcars_sub <- mtcars |>
  select(hp, mpg) |>
  mutate(type = "data")
mt_sample3 <- bind_rows(mt_sample3, mtcars_sub)

ggplot(mt_sample3, aes(x=hp, y=mpg, colour=type)) +
  geom_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  ylim(c(10, 35)) 

What can go wrong with a parametric model?

Wrong specification

Specify a functional form that has only a linear term

\[f(X) = \beta_0 + \beta_1 X\]

Code
mtcars_fit2 <- lm(mpg~hp, data=mtcars)
mtcars_all2 <- augment(mtcars_fit2, mtcars)
sp1 <- ggplot(mtcars_all2, aes(x=hp, y=mpg)) +
  geom_point() +
  geom_point(aes(x=hp, y=.fitted), colour="#D93F00") + 
  ylim(c(10, 35)) +
  ggtitle("data+model")
sp2 <- ggplot(mtcars_all2, aes(x=hp, y=.resid)) +
  geom_hline(yintercept=0, colour="grey60") +
  geom_point() +
  ylab("e") +
  ggtitle("residuals")
sp1 + sp2 + plot_layout(ncol=2)

Polynomial

# A tibble: 3 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      20.1     0.544     36.9  6.15e-26
2 poly(hp, 2)1    -26.0     3.08      -8.46 2.51e- 9
3 poly(hp, 2)2     13.2     3.08       4.27 1.89e- 4
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic      p.value    df
      <dbl>         <dbl> <dbl>     <dbl>        <dbl> <dbl>
1     0.756         0.739  3.08      45.0      1.30e-9     2
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>

Linear

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  30.1       1.63       18.4  6.64e-18
2 hp           -0.0682    0.0101     -6.74 1.79e- 7
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic     p.value    df
      <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>
1     0.602         0.589  3.86      45.5 0.000000179     1
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>

Extrapolating

Generating response values for un-collected predictor values OUTSIDE the domain of the collected data can produce HALLUCINATIONS.

Code
mtcars_fit3 <- lm(mpg~poly(hp, 5), data=mtcars)
mtcars_all3 <- augment(mtcars_fit3, mtcars)
mtcars_stats3 <- glance(mtcars_fit3)

set.seed(129)
mt_extrap <- tibble(hp = seq(250, 450, 10),
                    e = rnorm(length(hp), sd=mtcars_stats3$sigma))
mt_extrap <- mt_extrap |>
                mutate(mpg = predict(mtcars_fit3, mt_extrap) + e) |>
  mutate(type = "new") |>
  select(-e)
mt_extrap <- bind_rows(mt_extrap, mtcars_sub)
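
# A sketch of how the extrapolated values might be displayed, mirroring
# the earlier simulation plots (plot style assumed, not from the original slides)
ggplot(mt_extrap, aes(x=hp, y=mpg, colour=type)) +
  geom_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1")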

Multiple variables

Missing terms

Code
mtcars_fit4 <- lm(mpg~poly(hp, 2)+am, data=mtcars)
mtcars_all4 <- augment(mtcars_fit4, mtcars)
mtcars_coefs4 <- tidy(mtcars_fit4)
mtcars_stats4 <- glance(mtcars_fit4)
mtcars_fit5 <- lm(mpg~poly(hp, 2)*am, data=mtcars)
mtcars_all5 <- augment(mtcars_fit5, mtcars)
mtcars_coefs5 <- tidy(mtcars_fit5)
mtcars_stats5 <- glance(mtcars_fit5)
ip1 <- ggplot(mtcars_all4, aes(x=hp, y=.fitted, colour=am)) +
  geom_point() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  ggtitle("No interaction") +
  theme(legend.position = "None")
ip2 <- ggplot(mtcars_all5, aes(x=hp, y=.fitted, colour=am)) +
  geom_point() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  ggtitle("With interaction") +
  theme(legend.position = "None")
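
# Display the two fits side by side and print the coefficient tables
# shown below (layout assumed; patchwork as elsewhere in these slides)
ip1 + ip2 + plot_layout(ncol=2)
mtcars_coefs4
mtcars_coefs5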

# A tibble: 4 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     18.6      0.669     27.8  6.37e-22
2 poly(hp, 2)1   -23.5      2.79      -8.43 3.58e- 9
3 poly(hp, 2)2     7.88     3.14       2.51 1.80e- 2
4 ammanual         3.75     1.16       3.22 3.20e- 3
# A tibble: 6 × 5
  term                 estimate std.error statistic  p.value
  <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)             18.4      0.784    23.5   5.05e-19
2 poly(hp, 2)1           -20.4      5.02     -4.07  3.93e- 4
3 poly(hp, 2)2             7.02     7.09      0.990 3.32e- 1
4 ammanual                 3.55     1.22      2.92  7.11e- 3
5 poly(hp, 2)1:ammanu…    -5.97     6.36     -0.940 3.56e- 1
6 poly(hp, 2)2:ammanu…     2.93     8.12      0.361 7.21e- 1

Non-parametric model

Smoothing splines

We’ve seen loess, which fits a linear model in a sliding window over the predictor, where the span controls the size of the window.

Code
mtcars_fit6 <- loess(mpg~hp, data=mtcars, degree=1, span=0.2)
mtcars_fit7 <- loess(mpg~hp, data=mtcars, degree=1, span=0.5)
mtcars_lo <- mtcars |>
  mutate(fit6 = mtcars_fit6$fitted,
         fit7 = mtcars_fit7$fitted)
mtcars_lop <- tibble(hp = seq(50, 340, 5))
mtcars_lop <- mtcars_lop |>
                mutate(mpg6 = predict(mtcars_fit6, mtcars_lop),
                       mpg7 = predict(mtcars_fit7, mtcars_lop))

ggplot(data=mtcars_lo, aes(x=hp, y=mpg)) +
  geom_point() +
  geom_line(data=mtcars_lop, aes(x=hp, y=mpg6), colour="#D93F00") +
  geom_line(data=mtcars_lop, aes(x=hp, y=mpg7), colour="#027EB6") 

Smoothing splines provide a more advanced technique, with more stability.

Code
mtcars_fit8 <- gam(mpg~s(hp, bs="cr", k=3, m=1), data=mtcars, 
                   knots=list(seq(90, 300, length=3)))
mtcars_fit9 <- gam(mpg~s(hp, bs="cr", k=10, m=1), data=mtcars, 
                   knots=list(seq(90, 300, length=3)))
mtcars_gam <- mtcars |>
  mutate(fit8 = mtcars_fit8$fitted.values,
         fit9 = mtcars_fit9$fitted.values)
mtcars_gamp <- tibble(hp = seq(50, 450, 5))
mtcars_gamp <- mtcars_gamp |>
                mutate(mpg8 = predict(mtcars_fit8, mtcars_gamp),
                       mpg9 = predict(mtcars_fit9, mtcars_gamp))

ggplot(data=mtcars_gam, aes(x=hp, y=mpg)) +
  geom_point() +
  geom_line(data=mtcars_gamp, aes(x=hp, y=mpg8), colour="#D93F00") +
  geom_line(data=mtcars_gamp, aes(x=hp, y=mpg9), colour="#027EB6") 

And they can be used to fit non-linear models with multiple predictors, as in the sketch below.
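
For example, a minimal sketch (not from the original slides) of a GAM with smooth terms for two predictors, again using mgcv:

Code
# Smooth terms for both hp and wt; default basis dimensions assumed
mtcars_fit10 <- gam(mpg ~ s(hp) + s(wt), data = mtcars)
summary(mtcars_fit10)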

Logistic regression

  • Not all parametric models assume normally distributed errors, or even a continuous response.
  • Logistic regression models the relationship between a set of explanatory variables \((x_{i1}, ..., x_{ik})\) and a set of binary outcomes \(Y_i\) for \(i = 1, ..., n\).
  • We assume that \(Y_i \sim B(1, p_i)\equiv Bernoulli(p_i)\) and the model is given by

\[\text{logit}(p_i) = \text{ln}\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1x_{i1} + ... + \beta_k x_{ik}.\]

  • Taking the exponential of both sides and rearranging we get \[p_i = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1x_{i1} + ... + \beta_k x_{ik})}}.\]

  • The function \(f(p) = \text{ln}\left(\dfrac{p}{1 - p}\right)\) is called the logit function, continuous with range \((-\infty, \infty)\), and if \(p\) is the probability of an event, \(f(p)\) is the log of the odds.
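
A quick numeric check of the logit and inverse-logit relationship, using base R’s qlogis() and plogis() (not part of the original slides):

Code
pr <- 0.8
qlogis(pr)          # logit: log(0.8 / 0.2), approximately 1.386
plogis(qlogis(pr))  # inverse logit recovers 0.8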

Code
mtcars_ordered <- mtcars |>
  mutate(ami = as.numeric(am)-1) |>
  arrange(mpg)
mtcars_am_fit1 <- loess(ami~mpg, data=mtcars_ordered, degree=0, span=0.7)
mtcars_am_lo <- mtcars_ordered |>
  mutate(fit1 = mtcars_am_fit1$fitted)
lg1 <- ggplot(mtcars_ordered, aes(x=mpg, y=ami)) +
  geom_point() +
  geom_point(data=mtcars_am_lo, aes(x=mpg, y=fit1), colour="#D93F00") +
  geom_line(data=mtcars_am_lo, aes(x=mpg, y=fit1), colour="#D93F00") +
  ylab("automatic") +
  ggtitle("Loess")

mtcars_am_fit2 <- glm(ami~mpg, data=mtcars_ordered, family = binomial(link = "logit"))
mtcars_am_lo <- mtcars_am_lo |>
  mutate(fit2 = mtcars_am_fit2$fitted.values)
lg2 <- ggplot(mtcars_ordered, aes(x=mpg, y=ami)) +
  geom_point() +
  geom_point(data=mtcars_am_lo, aes(x=mpg, y=fit2), colour="#D93F00") +
  geom_line(data=mtcars_am_lo, aes(x=mpg, y=fit2), colour="#D93F00") +
  ylab("automatic") +
  ggtitle("Logistic")

lg1 + lg2 + plot_layout(ncol=2)

Sliding a window and computing the average (a proportion) with loess, versus fitting the logistic function.

Time series

Trend and seasonality

Code
co2 <- read_csv("../data/CO2-monthly.csv") |>
  mutate(date = dmy(date))
co2_long <- co2 |>
  pivot_longer(ljo:spo, names_to = "stn", values_to = "co2") |>
  filter(stn %in% c("ptb", "ljo", "spo", "mlf")) |>
  mutate(stn = factor(stn, levels = c("ptb", "ljo", "mlf", "spo"))) |>
  filter(!is.na(co2)) |>
  filter(co2 < 430)
co2_p1 <- ggplot(co2_long, aes(x=date, y=co2, colour=stn)) +
  geom_point() +
  facet_wrap(~stn, ncol=2) +
  theme(legend.position="none") +
  ylim(c(330, 430)) +
  theme(aspect.ratio = 0.5)

co2_fit <- co2_long |>
  mutate(time = as.numeric(date)-7304) |>
  nest(dat = -stn) |>
  mutate(mod_trend = map(dat, ~lm(co2~time+I(time^2), data=.x)),
         mod_season = map(dat, ~lm(co2~time+I(time^2)+month, data=.x)),
        coef_trend = map(mod_trend, tidy),
        coef_seasonal = map(mod_season, tidy),
        fit_trend = map(mod_trend, augment),
        fit_season = map(mod_season, augment)) |>
    select(stn, starts_with('fit')) |> 
    pivot_longer(
        cols = starts_with('fit')
    ) |>
    unnest(value)
co2_p2 <- co2_fit |>
  filter(name == "fit_trend") |>
  ggplot(aes(x=time, y=.fitted, colour=stn)) +
  geom_point() +
  ylim(c(330, 430))

co2_p3 <- co2_fit |>
  filter(name == "fit_season") |>
  mutate(year = time %/% 365 + 1990) |>  # approximate conversion from days since 1990 to year
  group_by(stn, year) |>
  mutate(.fitted_seas = .fitted - mean(.fitted, na.rm=TRUE)) |>
  ggplot(aes(x=time, y=.fitted_seas, colour=stn)) +
  geom_line() + 
  facet_wrap(~stn, ncol=2) +
  theme(aspect.ratio=0.5)
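
# Display the raw series, the fitted trend, and the de-trended seasonal
# pattern (layout assumed; patchwork as elsewhere in these slides)
co2_p1
co2_p2 + co2_p3 + plot_layout(ncol=2)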

Data from Global CO2 monitoring stations

Exploring lags

Melbourne’s temperature, from high to low!

Today’s value is plotted vertically against yesterday’s value horizontally. Different types of plots correspond to different models applied to the data (here, lags).

Hyndman, Bashtannyk, Grunwald (1996)
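
A sketch of this kind of lag plot, assuming a hypothetical tibble melb_temp with columns date and maxtemp (the Melbourne data itself is not included in these notes):

Code
# Lag the series by one day and plot today against yesterday
melb_lag <- melb_temp |>
  arrange(date) |>
  mutate(maxtemp_lag1 = lag(maxtemp, 1))
ggplot(melb_lag, aes(x = maxtemp_lag1, y = maxtemp)) +
  geom_point(alpha = 0.5) +
  labs(x = "yesterday's maximum", y = "today's maximum")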

High-dimensions

Groups

Data: a little fuzzy.

Model view: clearer, but misses some quirks.

Relationships

Take-aways

  • Models provide different lenses for extracting the patterns in the data
    • Sharpen
    • Exaggerate
    • Hallucinate
  • Form a decomposition of the observed values into different strata
  • Provide a multitude of other numerical quantities with which to see various aspects of the data.
  • We are already using models, all the time, when making plots.

Resources

  • Cook & Weisberg (1994) An Introduction to Regression Graphics
  • Belsley, Kuh & Welsch (1980) Regression Diagnostics
  • Hyndman, Bashtannyk, Grunwald (1996) Estimating and Visualizing Conditional Densities