ETC5521 Worksheet Week 3

Initial data analysis

Author

Prof. Di Cook

🎯 Objectives

Practice conducting initial data analyses, and make a start on learning how to assess significance of patterns.

install.packages(c("tidyverse", "ggbeeswarm", "broom", "visdat"))

🧩 Tasks

1. Take a glimpse of the penguins data. What types are variables are present in the data?

Code
library(tidyverse)
glimpse(penguins)
Rows: 344
Columns: 8
$ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len    <dbl> 39, 40, 40, NA, 37, 39, 39, 39, 34, 42, 38, 38, 41, 39, 35…
$ bill_dep    <dbl> 19, 17, 18, NA, 19, 21, 18, 20, 18, 20, 17, 17, 18, 21, 21…
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
$ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

2. How was this data collected? You will need to read the documentation for the palmerpenguins package, or see if AI knows.

Details are at https://allisonhorst.github.io/palmerpenguins/articles/intro.html, and you learn “These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal.” It is necessary to also read the original data collection article https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081 to obtain details of the sampling. Breeding pairs of penguins were included based on sampling of nests where pairs of adults were present, were chosen and marked, before onset of egg-laying.

3. Using the visdat package make an overview plot to examine types of variables and for missing values.

There are three factor variables - species, island, sex - and three integer variables - flipper_len, body_mass and year - and two numeric variables - bill_len and bill_dep.

It is interesting that flipper_len and body_mass are reported without decimal places, and thus are integers, whereas bill_len and bill_dep are reported with one decimal place, and thus are doubles. Both are numeric variables, though.

Four variables have missing values: sex, bill_len, bill_dep, flipper_len, body_mass. There are more missings on sex. The other missings occur together, the penguins are missing on all five variables.

Code
library(visdat)
vis_dat(penguins)

Code
p_nomiss <- penguins |>
  select(species, bill_len, bill_dep, 
         flipper_len, body_mass) |>
  na.omit()

4. Check the distributions of each species on each of the size variables, using a jittered dotplot, using the geom_quasirandom() function in the ggbeeswarm package. There seems to be some bimodality in some species on some variables eg bill_len. Why do you think this might be? Check your thinking by making a suitable plot.

Particularly on bill_len multimodality can be seen in the Chinstrap and Gentoo penguins. This corresponds to differences in the two species. Differences can be seen in the sexes for all of the variables and species, but it is only big enough in bill_len to be noticeable as bimodality.

Code
library(ggbeeswarm)
ggplot(penguins, aes(x=species, 
                     y=bill_len)) +
  geom_quasirandom() 

Code
ggplot(penguins, aes(x=species, 
                     y=bill_len, 
                     colour=sex)) +
  geom_quasirandom() +
  scale_color_brewer("", palette="Dark2")

5. Is there any indication of outliers from the jittered dotplots of different variables?

Outliers can be seen on bill_len for Chinstrap and Gentoo. Interestingly, there is one female penguins with a really big bill, bigger than all the males even.

Also on flipper_len there is one female penguins with a much smaller value than others.

6. Make a scatterplot of body_mass_g vs flipper_length_mm for all the penguins. What do the vertical stripes indicate? Are there any other unusual patterns to note, such as outliers or clustering or nonlinearity?

The striping corresponds to rounded values of flipper length, that have been reported to the nearest mm. It appears that it is done across all measuring as it is present for both sexes, for all species, for each island and each year. Otherwise there is nothing much to report as unusual.

Code
penguins |>
  select(flipper_len, body_mass, species) |>
  na.omit() |>
  ggplot(aes(x=flipper_len,
             y=body_mass, 
             colour=species)) +
  geom_point() +
  theme(aspect.ratio=1)

Code
penguins |>
  na.omit() |>
  ggplot(aes(x=flipper_len,
             y=body_mass)) +
  geom_point() +
  facet_wrap(~island, ncol=3, scales = "free") +
  theme(aspect.ratio=1)

Code
penguins |>
  na.omit() |>
  ggplot(aes(x=flipper_len,
             y=body_mass)) +
  geom_point() +
  facet_wrap(~sex, ncol=2, scales = "free") +
  theme(aspect.ratio=1)

Code
penguins |>
  na.omit() |>
  ggplot(aes(x=flipper_len,
             y=body_mass)) +
  geom_point() +
  facet_wrap(~species, ncol=3, scales = "free") +
  theme(aspect.ratio=1)

Code
penguins |>
  na.omit() |>
  ggplot(aes(x=flipper_len,
             y=body_mass)) +
  geom_point() +
  facet_wrap(~year, ncol=3, scales = "free") +
  theme(aspect.ratio=1)

⚠️ Note: That using na.omit() is VERY DANGEROUS. You should very rarely use it and only in very clear situations like this data.

7. How well can penguin body mass be predicted based on the flipper length? Fit a linear model to check. Report the equation, the \(R^2\), \(\sigma\), and make a residual plot of residuals vs flipper_length_mm. From the residual plot, are there any concerns about the model fit?

The model fit statistics suggest it is a reasonably good model, with flipper length explaining about 76% of body mass. From the estimated standard deviation of the error, \(\sigma=393\), we could say that the estimated body mass is likely accurate to within 800g. (Assuming normal distribution and 95% of observations within two \(\sigma\).)

The residual suggests no major problems. There is a little heteroskedasticity. Perhaps if you look carefully, though, it might indicate that different models should have been fitted for the smaller penguins and the bigger penguins, perhaps models separately for sex and species would be advised.

Code
library(broom)
penguins_nona <- penguins |>
  select(flipper_len, body_mass, species) |>
  na.omit()
penguins_fit <- lm(body_mass~flipper_len,
                   data=penguins_nona)
tidy(penguins_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -5781.     306.       -18.9 5.59e- 55
2 flipper_len     49.7      1.52      32.7 4.37e-107
Code
glance(penguins_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.759         0.758  394.     1071. 4.37e-107     1 -2528. 5063. 5074.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Code
penguins_m <- augment(penguins_fit)
ggplot(penguins_m, aes(x=flipper_len, y=.resid)) +
  geom_hline(yintercept=1, colour="grey70") +
  geom_point() +
  theme(aspect.ratio=1)