ETC5521: Diving Deeply into Data Exploration

Sculpting data using models, checking assumptions, co-dependency and performing diagnostics

Professor Di Cook

Department of Econometrics and Business Statistics

Outline

  • Different types of model fitting
  • Decomposing data from model
    • fitted
    • residual
  • Diagnostic calculations
    • anomalies
    • leverage
    • influence

Models can be used to re-focus the view of data

Different types of model fitting

The basic form for fitting a model with data (response \(Y\) and predictors \(X\)) is:

\[ Y = f(X) + \varepsilon \]

and \(X\) could include multiple variables, \(X = (X_{1}, X_{2}, \dots, X_{p})\) where \(p\) is the number of variables. We have a sample of \(n\) observations, \(y_i, x_{i1}, \dots, x_{ip}, ~~~ i=1, \dots, n\).

  • In a parametric model, the form of \(f\) is specified, e.g. \(\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2\), and one would estimate the parameters \(\beta_0, \beta_1, \beta_2, \beta_3\).
    • Frequentist fitting assumes that parameters are fixed values.
    • In a Bayesian framework, the parameters are assumed to have a distribution, e.g. Gaussian.
  • In a non-parametric model, the form of \(f\) is NOT specified but is estimated from the data. It may not have a specific functional form, typically needs more data, and imposes fewer assumptions. It can also be done in a Bayesian framework.
  • Different types of variables can change the model specification, e.g. binary or categorical \(Y\), or temporal or spatial context.
  • Different model products, e.g. fitted values or residuals, after the fit change the lens with which we view the data.

Parametric regression

Specification

Specify the

  • functional form, e.g. a form with linear and quadratic terms

\[f(X) = \beta_0 + \beta_1 X + \beta_2 X^2\]

  • distribution of errors, e.g. 

\[\varepsilon \sim N(0, \sigma^2)\]

Fitting results in:

  • fitted values, \(\widehat{y}\) (sharpening)
  • residuals, \(e = y-\widehat{y}\) (what did we miss)
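
The code in these slides assumes the following packages are loaded (the list is inferred from the functions used throughout; the original setup chunk is not shown):

Code
library(tidyverse)   # dplyr, tidyr, ggplot2, readr, purrr, tibble
library(lubridate)   # dmy()
library(broom)       # tidy(), glance(), augment()
library(patchwork)   # plot_layout() for combining ggplots
library(nullabor)    # lineup(), null_dist()
library(ggbeeswarm)  # geom_quasirandom()
library(colorspace)  # scale_color_discrete_divergingx()
library(mgcv)        # gam(), s()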
Code
mtcars_fit <- lm(mpg~poly(hp, 2), data=mtcars)
mtcars_all <- augment(mtcars_fit, mtcars)
p1 <- ggplot(mtcars_all, aes(x=hp, y=mpg)) +
  geom_point() +
  ylim(c(10, 35)) +
  ggtitle("data")
p2 <- ggplot(mtcars_all, aes(x=hp, y=.fitted)) +
  geom_point() +
  ylab("f") +
  ylim(c(10, 35)) +
  ggtitle("model")
p1 + p2 + plot_layout(ncol=2)

Code
mtcars_coefs <- tidy(mtcars_fit)
mtcars_stats <- glance(mtcars_fit)
mtcars_coefs
# A tibble: 3 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      20.1     0.544     36.9  6.15e-26
2 poly(hp, 2)1    -26.0     3.08      -8.46 2.51e- 9
3 poly(hp, 2)2     13.2     3.08       4.27 1.89e- 4
Code
mtcars_stats
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic      p.value    df
      <dbl>         <dbl> <dbl>     <dbl>        <dbl> <dbl>
1     0.756         0.739  3.08      45.0      1.30e-9     2
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>

Diagnostics (1/3)

Residuals, \(e = y-\widehat{y}\) (what doesn’t the fitted model see?)

  • Should be consistent with a sample from the specified error model
  • Should have no relationship with the explanatory variables (or the fitted values)
Code
set.seed(1026)
ggplot(lineup(null_dist(".resid", "norm"), n=12, true=mtcars_all), 
       aes(x=.resid)) +
  geom_histogram(binwidth=1, colour="white") +
  facet_wrap(~.sample, ncol=4) +
  theme(axis.title = element_blank(),
        axis.text = element_blank())

Code
p3 <- ggplot(mtcars_all, aes(x=.resid)) +
  geom_histogram(binwidth=1, colour="white") 
p4 <- ggplot(mtcars_all, aes(sample=.resid)) +
  geom_qq_line() +
  geom_qq() 
p5 <- ggplot(mtcars_all, aes(x=1, y=.resid)) +
  geom_quasirandom() 
p6 <- ggplot(mtcars_all, aes(x=1, y=.resid)) +
  geom_violin(fill="#D93F00", colour="white", 
              draw_quantiles=c(0.25, 0.5, 0.75)) 

p3 + p4 + p5 + p6 + plot_layout(ncol=2)

Code
set.seed(1058)
ggplot(lineup(null_dist(".resid", "norm"), n=12, true=mtcars_all), 
       aes(x=hp, y=.resid)) +
  geom_point() +
  facet_wrap(~.sample, ncol=4) +
  theme(axis.title = element_blank(),
        axis.text = element_blank())

Code
p7 <- ggplot(mtcars_all, aes(x=hp, y=.resid)) +
  geom_point() +
  ylab("e") +
  ylim(c(-3, 3)) +
  ggtitle("residuals")
mt_full_fit <- tibble(hp = seq(50, 340, 10))
mt_full_fit <- mt_full_fit |>
                mutate(mpg = predict(mtcars_fit, mt_full_fit))
p8 <- ggplot(mtcars_all, aes(x=hp, y=mpg)) +
  geom_point() +
  geom_line(data=mt_full_fit, colour="#D93F00", linewidth=2) +
  ylim(c(10, 35)) +
  ggtitle("data+f")

p7 + p8 + plot_layout(ncol=2)

Diagnostics (2/3)

Leverage

  • The matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) is referred to as the hat matrix.
  • The \(i\)-th diagonal element of \(\mathbf{H}\), \(h_{ii}\), is called the leverage of the \(i\)-th observation.
  • Leverages are always between zero and one, \[0 \leq h_{ii} \leq 1.\]
  • Notice that leverages are not dependent on the response!
  • Points with high leverage can exert a lot of influence on the parameter estimates

Studentized residuals

  • In order to obtain residuals with equal variance, many texts recommend using the studentised residuals \[e_i^* = \dfrac{e_i} {\widehat{\sigma} \sqrt{1 - h_{ii}}}\] for diagnostic checks.

Cook’s distance

  • Cook’s distance, \(D\), is another measure of influence: \[\begin{eqnarray*} D_i &=& \dfrac{(\widehat{\boldsymbol{\beta}}- \widehat{\boldsymbol{\beta}}_{[-i]})^\top Var(\widehat{\boldsymbol{\beta}})^{-1}(\widehat{\boldsymbol{\beta}}- \widehat{\boldsymbol{\beta}}_{[-i]})}{p}\\ &=&\frac{e_i^2 h_{ii}}{(1-h_{ii})^2p\widehat\sigma^2}, \end{eqnarray*}\] where \(p\) is the number of elements in \(\boldsymbol{\beta}\), and \(\widehat{\boldsymbol{\beta}}_{[-i]}\) is the least squares estimate obtained by fitting the model without the \(i\)-th data point \((\boldsymbol{x}_i,Y_i)\).
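
These quantities are returned by broom::augment() as .hat, .std.resid and .cooksd (used on the next slide). As a minimal check of the formulas, not part of the original slides, they can be computed directly from the quadratic fit mtcars_fit and compared with base R’s built-in diagnostic functions:

Code
# Hat matrix and leverages from the design matrix
X <- model.matrix(mtcars_fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h <- diag(H)
all.equal(h, hatvalues(mtcars_fit))

# Studentised residuals: e / (sigma_hat * sqrt(1 - h))
e <- residuals(mtcars_fit)
sigma_hat <- summary(mtcars_fit)$sigma
all.equal(e / (sigma_hat * sqrt(1 - h)), rstandard(mtcars_fit))

# Cook's distance: e^2 h / ((1 - h)^2 p sigma_hat^2)
p <- length(coef(mtcars_fit))
all.equal(e^2 * h / ((1 - h)^2 * p * sigma_hat^2), cooks.distance(mtcars_fit))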

Diagnostics (3/3)

Code
mtcars_all <- mtcars_all |>
  mutate(id = 1:n())
p10 <- ggplot(mtcars_all, aes(x=id, y=.hat, label=id)) +
  geom_point() +
  labs(title="leverage")
p11 <- ggplot(mtcars_all, aes(x=id, y=.cooksd, label=id)) +
  geom_point() +
  labs(title="cooks d") 
p12 <- ggplot(mtcars_all, aes(x=.resid, y=.std.resid, label=id)) +
  geom_point() +
  labs(title="studentised residuals")

Simulation

Generate response values for un-collected predictor values

mt_full_fit <- tibble(hp = seq(50, 340, 10))
mt_full_fit <- mt_full_fit |>
                mutate(mpg = predict(mtcars_fit, mt_full_fit))

Simulate new samples

Code
set.seed(1249)
mt_sample1 <- tibble(hp = runif(nrow(mtcars), 50, 340),
                     e = rnorm(nrow(mtcars), sd = mtcars_stats$sigma))
mt_sample1 <- mt_sample1 |>
                mutate(mpg = predict(mtcars_fit, mt_sample1) + e,
                       type = "sim") |>
  select(hp, mpg, type)
mtcars_sub <- mtcars |>
  select(hp, mpg) |>
  mutate(type = "data")
mt_sample1 <- bind_rows(mt_sample1, mtcars_sub)

ggplot(mt_sample1, aes(x=hp, y=mpg, colour=type)) +
  geom_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  ylim(c(10, 35)) 

Code
set.seed(1250)
mt_sample2 <- tibble(hp = runif(nrow(mtcars), 50, 340),
                     e = rnorm(nrow(mtcars), sd = mtcars_stats$sigma))
mt_sample2 <- mt_sample2 |>
                mutate(mpg = predict(mtcars_fit, mt_sample2) + e,
                       type = "sim") |>
  select(hp, mpg, type)
mtcars_sub <- mtcars |>
  select(hp, mpg) |>
  mutate(type = "data")
mt_sample2 <- bind_rows(mt_sample2, mtcars_sub)

ggplot(mt_sample2, aes(x=hp, y=mpg, colour=type)) +
  geom_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  ylim(c(10, 35)) 

Code
set.seed(1350)
mt_sample3 <- tibble(hp = runif(nrow(mtcars), 50, 340),
                     e = rnorm(nrow(mtcars), sd = mtcars_stats$sigma))
mt_sample3 <- mt_sample3 |>
                mutate(mpg = predict(mtcars_fit, mt_sample3) + e,
                       type = "sim") |>
  select(hp, mpg, type)
mtcars_sub <- mtcars |>
  select(hp, mpg) |>
  mutate(type = "data")
mt_sample3 <- bind_rows(mt_sample3, mtcars_sub)

ggplot(mt_sample3, aes(x=hp, y=mpg, colour=type)) +
  geom_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  ylim(c(10, 35)) 

What can go wrong with a parametric model?

Wrong specification

Specify a functional form that has only a linear term

\[f(X) = \beta_0 + \beta_1 X\]

Code
mtcars_fit2 <- lm(mpg~hp, data=mtcars)
mtcars_all2 <- augment(mtcars_fit2, mtcars)
sp1 <- ggplot(mtcars_all2, aes(x=hp, y=mpg)) +
  geom_point() +
  geom_point(aes(x=hp, y=.fitted), colour="#D93F00") + 
  ylim(c(10, 35)) +
  ggtitle("data+model")
sp2 <- ggplot(mtcars_all2, aes(x=hp, y=.resid)) +
  geom_hline(yintercept=0, colour="grey60") +
  geom_point() +
  ylab("e") +
  ggtitle("residuals")
sp1 + sp2 + plot_layout(ncol=2)

Polynomial

# A tibble: 3 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      20.1     0.544     36.9  6.15e-26
2 poly(hp, 2)1    -26.0     3.08      -8.46 2.51e- 9
3 poly(hp, 2)2     13.2     3.08       4.27 1.89e- 4
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic      p.value    df
      <dbl>         <dbl> <dbl>     <dbl>        <dbl> <dbl>
1     0.756         0.739  3.08      45.0      1.30e-9     2
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>

Linear

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  30.1       1.63       18.4  6.64e-18
2 hp           -0.0682    0.0101     -6.74 1.79e- 7
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic     p.value    df
      <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>
1     0.602         0.589  3.86      45.5 0.000000179     1
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>

Extrapolating

Generating response values for un-collected predictor values OUTSIDE the domain of the collected data can produce HALLUCINATIONS.

Code
mtcars_fit3 <- lm(mpg~poly(hp, 5), data=mtcars)
mtcars_all3 <- augment(mtcars_fit3, mtcars)
mtcars_stats3 <- glance(mtcars_fit3)

set.seed(129)
mt_extrap <- tibble(hp = seq(250, 450, 10),
                    e = rnorm(length(hp), sd=mtcars_stats3$sigma))
mt_extrap <- mt_extrap |>
                mutate(mpg = predict(mtcars_fit3, mt_extrap) + e) |>
  mutate(type = "new") |>
  select(-e)
mt_extrap <- bind_rows(mt_extrap, mtcars_sub)
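
# A sketch of how the extrapolated values might be displayed, mirroring
# the earlier simulation plots (plot style assumed, not from the original slides)
ggplot(mt_extrap, aes(x=hp, y=mpg, colour=type)) +
  geom_point() +
  scale_color_discrete_divergingx(palette = "Zissou 1")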

Multiple variables

Missing terms

Code
mtcars_fit4 <- lm(mpg~poly(hp, 2)+am, data=mtcars)
mtcars_all4 <- augment(mtcars_fit4, mtcars)
mtcars_coefs4 <- tidy(mtcars_fit4)
mtcars_stats4 <- glance(mtcars_fit4)
mtcars_fit5 <- lm(mpg~poly(hp, 2)*am, data=mtcars)
mtcars_all5 <- augment(mtcars_fit5, mtcars)
mtcars_coefs5 <- tidy(mtcars_fit5)
mtcars_stats5 <- glance(mtcars_fit5)
ip1 <- ggplot(mtcars_all4, aes(x=hp, y=.fitted, colour=am)) +
  geom_point() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  ggtitle("No interaction") +
  theme(legend.position = "None")
ip2 <- ggplot(mtcars_all5, aes(x=hp, y=.fitted, colour=am)) +
  geom_point() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  ggtitle("With interaction") +
  theme(legend.position = "None")
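
# Display the two fits side by side and print the coefficient tables
# shown below (layout assumed; patchwork as elsewhere in these slides)
ip1 + ip2 + plot_layout(ncol=2)
mtcars_coefs4
mtcars_coefs5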

# A tibble: 4 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     18.6      0.669     27.8  6.37e-22
2 poly(hp, 2)1   -23.5      2.79      -8.43 3.58e- 9
3 poly(hp, 2)2     7.88     3.14       2.51 1.80e- 2
4 ammanual         3.75     1.16       3.22 3.20e- 3
# A tibble: 6 × 5
  term                 estimate std.error statistic  p.value
  <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)             18.4      0.784    23.5   5.05e-19
2 poly(hp, 2)1           -20.4      5.02     -4.07  3.93e- 4
3 poly(hp, 2)2             7.02     7.09      0.990 3.32e- 1
4 ammanual                 3.55     1.22      2.92  7.11e- 3
5 poly(hp, 2)1:ammanu…    -5.97     6.36     -0.940 3.56e- 1
6 poly(hp, 2)2:ammanu…     2.93     8.12      0.361 7.21e- 1

Non-parametric model

Smoothing splines

We’ve seen loess, which fits a linear model in a sliding window over the predictor, where the span controls the size of the window.

Code
mtcars_fit6 <- loess(mpg~hp, data=mtcars, degree=1, span=0.2)
mtcars_fit7 <- loess(mpg~hp, data=mtcars, degree=1, span=0.5)
mtcars_lo <- mtcars |>
  mutate(fit6 = mtcars_fit6$fitted,
         fit7 = mtcars_fit7$fitted)
mtcars_lop <- tibble(hp = seq(50, 340, 5))
mtcars_lop <- mtcars_lop |>
                mutate(mpg6 = predict(mtcars_fit6, mtcars_lop),
                       mpg7 = predict(mtcars_fit7, mtcars_lop))

ggplot(data=mtcars_lo, aes(x=hp, y=mpg)) +
  geom_point() +
  geom_line(data=mtcars_lop, aes(x=hp, y=mpg6), colour="#D93F00") +
  geom_line(data=mtcars_lop, aes(x=hp, y=mpg7), colour="#027EB6") 

Smoothing splines provide a more advanced technique, with more stability.

Code
mtcars_fit8 <- gam(mpg~s(hp, bs="cr", k=3, m=1), data=mtcars, 
                   knots=list(seq(90, 300, length=3)))
mtcars_fit9 <- gam(mpg~s(hp, bs="cr", k=10, m=1), data=mtcars, 
                   knots=list(seq(90, 300, length=3)))
mtcars_gam <- mtcars |>
  mutate(fit8 = mtcars_fit8$fitted.values,
         fit9 = mtcars_fit9$fitted.values)
mtcars_gamp <- tibble(hp = seq(50, 450, 5))
mtcars_gamp <- mtcars_gamp |>
                mutate(mpg8 = predict(mtcars_fit8, mtcars_gamp),
                       mpg9 = predict(mtcars_fit9, mtcars_gamp))

ggplot(data=mtcars_gam, aes(x=hp, y=mpg)) +
  geom_point() +
  geom_line(data=mtcars_gamp, aes(x=hp, y=mpg8), colour="#D93F00") +
  geom_line(data=mtcars_gamp, aes(x=hp, y=mpg9), colour="#027EB6") 

And they can be used to fit non-linear models with multiple predictors, as in the sketch below.
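
For example, a minimal sketch (not from the original slides) of a GAM with smooth terms for two predictors, again using mgcv:

Code
# Smooth terms for both hp and wt; default basis dimensions assumed
mtcars_fit10 <- gam(mpg ~ s(hp) + s(wt), data = mtcars)
summary(mtcars_fit10)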

Logistic regression

  • Not all parametric models assume normally distributed errors, or even a continuous response.
  • Logistic regression models the relationship between a set of explanatory variables \((x_{i1}, ..., x_{ik})\) and a set of binary outcomes \(Y_i\) for \(i = 1, ..., n\).
  • We assume that \(Y_i \sim B(1, p_i)\equiv Bernoulli(p_i)\) and the model is given by

\[\text{logit}(p_i) = \text{ln}\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1x_{i1} + ... + \beta_k x_{ik}.\]

  • Taking the exponential of both sides and rearranging we get \[p_i = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1x_{i1} + ... + \beta_k x_{ik})}}.\]

  • The function \(f(p) = \text{ln}\left(\dfrac{p}{1 - p}\right)\) is called the logit function, continuous with range \((-\infty, \infty)\), and if \(p\) is the probability of an event, \(f(p)\) is the log of the odds.
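
A quick numeric check of the logit and inverse-logit relationship, using base R’s qlogis() and plogis() (not part of the original slides):

Code
pr <- 0.8
qlogis(pr)          # logit: log(0.8 / 0.2), approximately 1.386
plogis(qlogis(pr))  # inverse logit recovers 0.8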

Code
mtcars_ordered <- mtcars |>
  mutate(ami = as.numeric(am)-1) |>
  arrange(mpg)
mtcars_am_fit1 <- loess(ami~mpg, data=mtcars_ordered, degree=0, span=0.7)
mtcars_am_lo <- mtcars_ordered |>
  mutate(fit1 = mtcars_am_fit1$fitted)
lg1 <- ggplot(mtcars_ordered, aes(x=mpg, y=ami)) +
  geom_point() +
  geom_point(data=mtcars_am_lo, aes(x=mpg, y=fit1), colour="#D93F00") +
  geom_line(data=mtcars_am_lo, aes(x=mpg, y=fit1), colour="#D93F00") +
  ylab("automatic") +
  ggtitle("Loess")

mtcars_am_fit2 <- glm(ami~mpg, data=mtcars_ordered, family = binomial(link = "logit"))
mtcars_am_lo <- mtcars_am_lo |>
  mutate(fit2 = mtcars_am_fit2$fitted.values)
lg2 <- ggplot(mtcars_ordered, aes(x=mpg, y=ami)) +
  geom_point() +
  geom_point(data=mtcars_am_lo, aes(x=mpg, y=fit2), colour="#D93F00") +
  geom_line(data=mtcars_am_lo, aes(x=mpg, y=fit2), colour="#D93F00") +
  ylab("automatic") +
  ggtitle("Logistic")

lg1 + lg2 + plot_layout(ncol=2)

Sliding a window and computing the average (a proportion) with loess, versus fitting the logistic function.

Time series

Trend and seasonality

Code
co2 <- read_csv("../data/CO2-monthly.csv") |>
  mutate(date = dmy(date))
co2_long <- co2 |>
  pivot_longer(ljo:spo, names_to = "stn", values_to = "co2") |>
  filter(stn %in% c("ptb", "ljo", "spo", "mlf")) |>
  mutate(stn = factor(stn, levels = c("ptb", "ljo", "mlf", "spo"))) |>
  filter(!is.na(co2)) |>
  filter(co2 < 430)
co2_p1 <- ggplot(co2_long, aes(x=date, y=co2, colour=stn)) +
  geom_point() +
  facet_wrap(~stn, ncol=2) +
  theme(legend.position="none") +
  ylim(c(330, 430)) +
  theme(aspect.ratio = 0.5)

co2_fit <- co2_long |>
  mutate(time = as.numeric(date)-7304) |>
  nest(dat = -stn) |>
  mutate(mod_trend = map(dat, ~lm(co2~time+I(time^2), data=.x)),
         mod_season = map(dat, ~lm(co2~time+I(time^2)+month, data=.x)),
        coef_trend = map(mod_trend, tidy),
        coef_seasonal = map(mod_season, tidy),
        fit_trend = map(mod_trend, augment),
        fit_season = map(mod_season, augment)) |>
    select(stn, starts_with('fit')) |> 
    pivot_longer(
        cols = starts_with('fit')
    ) |>
    unnest(value)
co2_p2 <- co2_fit |>
  filter(name == "fit_trend") |>
  ggplot(aes(x=time, y=.fitted, colour=stn)) +
  geom_point() +
  ylim(c(330, 430))

co2_p3 <- co2_fit |>
  filter(name == "fit_season") |>
  mutate(year = time %/% 365 + 1990) |>  # approximate conversion from days since 1990 to year
  group_by(stn, year) |>
  mutate(.fitted_seas = .fitted - mean(.fitted, na.rm=TRUE)) |>
  ggplot(aes(x=time, y=.fitted_seas, colour=stn)) +
  geom_line() + 
  facet_wrap(~stn, ncol=2) +
  theme(aspect.ratio=0.5)
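
# Display the raw series, the fitted trend, and the de-trended seasonal
# pattern (layout assumed; patchwork as elsewhere in these slides)
co2_p1
co2_p2 + co2_p3 + plot_layout(ncol=2)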

Data from Global CO2 monitoring stations

Exploring lags

Melbourne’s temperature, from high to low!

Today’s value is plotted vertically against yesterday’s value horizontally. Different types of plots correspond to different models applied to the data (here, lags).

Hyndman, Bashtannyk, Grunwald (1996)
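
A sketch of this kind of lag plot, assuming a hypothetical tibble melb_temp with columns date and maxtemp (the Melbourne data itself is not included in these notes):

Code
# Lag the series by one day and plot today against yesterday
melb_lag <- melb_temp |>
  arrange(date) |>
  mutate(maxtemp_lag1 = lag(maxtemp, 1))
ggplot(melb_lag, aes(x = maxtemp_lag1, y = maxtemp)) +
  geom_point(alpha = 0.5) +
  labs(x = "yesterday's maximum", y = "today's maximum")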

High-dimensions

Groups

Data: a little fuzzy.

Model view: clearer, but misses some quirks.

Relationships

Take-aways

  • Models provide different lenses for extracting the patterns in the data
    • Sharpen
    • Exaggerate
    • Hallucinate
  • Form a decomposition of the observed values into different strata
  • Provide a multitude of other numerical quantities with which to see various aspects of the data.
  • We are already using models, all the time, when making plots.

Resources

  • Cook & Weisberg (1994) An Introduction to Regression Graphics
  • Belsley, Kuh & Welsch (1980) Regression Diagnostics
  • Hyndman, Bashtannyk, Grunwald (1996) Estimating and Visualizing Conditional Densities