Sculpting data using models, checking assumptions, co-dependency and performing diagnostics
Professor Di Cook
Department of Econometrics and Business Statistics
Outline
Different types of model fitting
Decomposing data from model
fitted
residual
Diagnostic calculations
anomalies
leverage
influence
Models can be used to re-focus the view of data
Different types of model fitting
The basic form for fitting a model with data (response \(Y\) and predictors \(X\)) is:
\[
Y = f(X) + \varepsilon
\]
and \(X\) could include multiple variables, \(X = (X_{1}, X_{2}, \dots, X_{p})\), where \(p\) is the number of variables. We have a sample of \(n\) observations, \(y_i, x_{i1}, \dots, x_{ip}, ~~~ i=1, \dots, n\).
In a parametric model, the form of \(f\) is specified, e.g. \(\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2\), and one would estimate the parameters \(\beta_0, \beta_1, \beta_2, \beta_3\).
Frequentist fitting assumes that parameters are fixed values.
In a Bayesian framework, the parameters are assumed to have a distribution, e.g. Gaussian.
In a non-parametric model, the form of \(f\) is NOT specified, but is estimated from the data. It may not have a specific functional form, typically needs more data, and imposes fewer assumptions. It can also be fitted in a Bayesian framework.
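As a minimal sketch of the distinction, assuming R and the built-in cars data (not a dataset from this lecture), a non-parametric smoother such as loess can be placed alongside a parametric quadratic fit:

```r
# Parametric fit: the functional form (linear + quadratic) is specified up front
fit_param <- lm(dist ~ poly(speed, 2), data = cars)

# Non-parametric fit: loess lets the data determine the shape of f
fit_np <- loess(dist ~ speed, data = cars)

# Compare fitted values from the two approaches
head(data.frame(
  speed      = cars$speed,
  parametric = fitted(fit_param),
  loess      = fitted(fit_np)
))
```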
Different types of variables can change the model specification, e.g. binary or categorical \(Y\), or temporal or spatial context.
Different model products, e.g. fitted values or residuals, after the fit change the lens with which we view the data.
Parametric regression
Specification
Specify the
functional form, e.g. a form with linear and quadratic terms
\[f(X) = \beta_0 + \beta_1 X + \beta_2 X^2\]
distribution of errors, e.g.
\[\varepsilon \sim N(0, \sigma^2)\]
Fitting results in:
fitted values, \(\widehat{y}\) (sharpening)
residuals, \(e = y-\widehat{y}\) (what did we miss)
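A minimal sketch in R of extracting these two products of a fit, assuming the built-in cars data and a hypothetical quadratic model:

```r
# Fit a parametric model with linear and quadratic terms
fit <- lm(dist ~ speed + I(speed^2), data = cars)

fitted_vals <- fitted(fit)      # yhat: the sharpened view of the data
resids      <- residuals(fit)   # e = y - yhat: what the model missed

head(data.frame(observed = cars$dist, fitted = fitted_vals, residual = resids))
```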
The matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) is referred to as the hat matrix.
The \(i\)-th diagonal element of \(\mathbf{H}\), \(h_{ii}\), is called the leverage of the \(i\)-th observation.
Leverages are always between zero and one, \[0 \leq h_{ii} \leq 1.\]
Notice that leverages are not dependent on the response!
Points with high leverage can exert a lot of influence on the parameter estimates.
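A small sketch (re-using the hypothetical fit on the cars data from above) showing that the diagonal of \(\mathbf{H}\) matches R's hatvalues():

```r
# Same hypothetical fit as in the earlier sketch
fit <- lm(dist ~ speed + I(speed^2), data = cars)

X <- model.matrix(fit)                          # design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)           # hat matrix H = X (X'X)^{-1} X'
lev <- diag(H)                                  # leverages h_ii

all.equal(unname(lev), unname(hatvalues(fit)))  # TRUE: same quantity
range(lev)                                      # always within [0, 1]
```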
Studentized residuals
In order to obtain residuals with equal variance, many texts recommend using the studentized residuals \[e_i^* = \dfrac{e_i} {\widehat{\sigma} \sqrt{1 - h_{ii}}}\] for diagnostic checks.
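As a sketch, again on the hypothetical cars fit, this quantity corresponds to R's rstandard() (sometimes called internally studentized residuals):

```r
fit <- lm(dist ~ speed + I(speed^2), data = cars)   # same hypothetical fit

sigma_hat <- summary(fit)$sigma                     # residual standard error
e_star <- residuals(fit) / (sigma_hat * sqrt(1 - hatvalues(fit)))

all.equal(unname(e_star), unname(rstandard(fit)))   # TRUE
```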
Cook’s distance
Cook’s distance, \(D\), is another measure of influence: \[\begin{eqnarray*}
D_i &=& \dfrac{(\widehat{\boldsymbol{\beta}}- \widehat{\boldsymbol{\beta}}_{[-i]})^\top Var(\widehat{\boldsymbol{\beta}})^{-1}(\widehat{\boldsymbol{\beta}}- \widehat{\boldsymbol{\beta}}_{[-i]})}{p}\\
&=&\frac{e_i^2 h_{ii}}{(1-h_{ii})^2p\widehat\sigma^2},
\end{eqnarray*}\] where \(p\) is the number of elements in \(\boldsymbol{\beta}\), and \(\widehat{\boldsymbol{\beta}}_{[-i]}\) and \(\widehat Y_{j[-i]}\) are the least squares estimates and fitted values obtained by refitting the model without the \(i\)-th data point \((\boldsymbol{x}_i,Y_i)\).
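A small check, continuing the same hypothetical cars fit, that the closed-form expression matches R's cooks.distance():

```r
fit <- lm(dist ~ speed + I(speed^2), data = cars)   # same hypothetical fit

h <- hatvalues(fit)
e <- residuals(fit)
p <- length(coef(fit))                              # number of elements in beta
sigma2_hat <- summary(fit)$sigma^2

D <- e^2 * h / ((1 - h)^2 * p * sigma2_hat)
all.equal(unname(D), unname(cooks.distance(fit)))   # TRUE

which(D > 4 / nrow(cars))                           # a common rule-of-thumb flag
```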
And are used to fit non-linear models to multiple predictors.
Logistic regression
Not all parametric models assume normally distributed errors, or even a continuous response.
Logistic regression models the relationship between a set of explanatory variables \((x_{i1}, ..., x_{ik})\) and a set of binary outcomes \(Y_i\), for \(i = 1, ..., n\).
We assume that \(Y_i \sim B(1, p_i)\equiv Bernoulli(p_i)\) and the model is given by \[\text{ln}\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1x_{i1} + ... + \beta_k x_{ik}.\]
Taking the exponential of both sides and rearranging we get \[p_i = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1x_{i1} + ... + \beta_k x_{ik})}}.\]
The function \(f(p) = \text{ln}\left(\dfrac{p}{1 - p}\right)\) is called the logit function, continuous with range \((-\infty, \infty)\), and if \(p\) is the probability of an event, \(f(p)\) is the log of the odds.
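A minimal sketch in R of fitting this model and recovering \(p_i\), assuming the built-in mtcars data with am as a stand-in binary response:

```r
# Logistic regression: binary response modelled on the logit (log-odds) scale
fit_logit <- glm(am ~ hp + wt, data = mtcars, family = binomial())

coef(fit_logit)                                # beta estimates, on the log-odds scale

# Fitted probabilities: p_i = 1 / (1 + exp(-(b0 + b1*hp + b2*wt)))
eta <- predict(fit_logit, type = "link")       # linear predictor
p_manual <- 1 / (1 + exp(-eta))
all.equal(p_manual, fitted(fit_logit))         # TRUE
```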