ETC5521 Worksheet Week 3

Initial data analysis

Author

Prof. Di Cook

🎯 Objectives

Practice conducting initial data analyses, and make a start on learning how to assess the significance of patterns.

install.packages(c("tidyverse", "ggbeeswarm", "broom", "visdat"))

🧩 Tasks

1. Take a glimpse of the penguins data. What types of variables are present in the data?

2. How was this data collected? You will need to read the documentation for the palmerpenguins package, or see if AI knows.

3. Using the visdat package, make an overview plot to examine the types of variables and to check for missing values.

4. Check the distribution of each size variable for each species with a jittered dotplot, made with the geom_quasirandom() function in the ggbeeswarm package. Some species appear to be bimodal on some variables, e.g. bill_len. Why do you think this might be? Check your thinking by making a suitable plot.

5. Is there any indication of outliers from the jittered dotplots of different variables?

6. Make a scatterplot of body_mass_g vs flipper_length_mm for all the penguins. What do the vertical stripes indicate? Are there any other unusual patterns to note, such as outliers, clustering, or nonlinearity?

7. How well can penguin body mass be predicted from flipper length? Fit a linear model to check. Report the fitted equation, \(R^2\), and \(\sigma\), and make a plot of the residuals vs flipper_length_mm. From the residual plot, are there any concerns about the model fit?
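
If you are unsure where to begin, the sketch below is one possible starting point for the tasks above, not a model answer. It assumes the palmerpenguins package is installed (it is not in the install.packages() call above) and uses that package's column names. If you are instead using the penguins data built into R 4.5 or later, the columns have shorter names such as bill_len and body_mass. The object names (p_overview, p_dots, p_scatter, fit, fit_aug, p_resid) are made up for illustration.

```r
library(tidyverse)
library(ggbeeswarm)
library(broom)
library(visdat)

# Assumption: use the palmerpenguins copy explicitly, so the column
# names match body_mass_g and flipper_length_mm used in the tasks.
penguins <- palmerpenguins::penguins

# Task 1: variable names and types
glimpse(penguins)

# Task 3: overview of variable types and missing values
p_overview <- vis_dat(penguins)

# Task 4: jittered dotplot of one size variable by species
# (repeat for the other size variables)
p_dots <- ggplot(penguins, aes(x = species, y = bill_length_mm)) +
  geom_quasirandom(na.rm = TRUE)

# Task 6: body mass (y) vs flipper length (x)
p_scatter <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Task 7: simple linear model; lm() drops rows with missing values
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
tidy(fit)    # intercept and slope for the fitted equation
glance(fit)  # includes r.squared and sigma

# Residuals vs the predictor, with a reference line at zero
fit_aug <- augment(fit)
p_resid <- ggplot(fit_aug, aes(x = flipper_length_mm, y = .resid)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point()
```

Print each plot object (for example, p_resid) to display it. The interpretation questions in each task are still yours to answer from what the plots and summaries show.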