install.packages(c("tidyverse", "ggbeeswarm", "broom", "visdat"))
ETC5521 Worksheet Week 3
Initial data analysis
🎯 Objectives
Practice conducting initial data analyses, and make a start on learning how to assess significance of patterns.
Initial data analysis
Prof. Di Cook
Practice conducting initial data analyses, and make a start on learning how to assess significance of patterns.
glimpse
of the penguins
data. What types are variables are present in the data?Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len <dbl> 39, 40, 40, NA, 37, 39, 39, 39, 34, 42, 38, 38, 41, 39, 35…
$ bill_dep <dbl> 19, 17, 18, NA, 19, 21, 18, 20, 18, 20, 17, 17, 18, 21, 21…
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex <fct> male, female, female, NA, female, male, female, male, NA, …
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
palmerpenguins
package, or see if AI knows.Details are at https://allisonhorst.github.io/palmerpenguins/articles/intro.html, and you learn “These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal.” It is necessary to also read the original data collection article https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081 to obtain details of the sampling. Breeding pairs of penguins were included based on sampling of nests where pairs of adults were present, were chosen and marked, before onset of egg-laying.
visdat
package make an overview plot to examine types of variables and for missing values.There are three factor variables - species, island, sex - and three integer variables - flipper_len, body_mass and year - and two numeric variables - bill_len and bill_dep.
It is interesting that flipper_len and body_mass are reported without decimal places, and thus are integers, whereas bill_len and bill_dep are reported with one decimal place, and thus are doubles. Both are numeric variables, though.
Four variables have missing values: sex, bill_len, bill_dep, flipper_len, body_mass. There are more missings on sex. The other missings occur together, the penguins are missing on all five variables.
geom_quasirandom()
function in the ggbeeswarm
package. There seems to be some bimodality in some species on some variables eg bill_len
. Why do you think this might be? Check your thinking by making a suitable plot.Particularly on bill_len
multimodality can be seen in the Chinstrap and Gentoo penguins. This corresponds to differences in the two species. Differences can be seen in the sexes for all of the variables and species, but it is only big enough in bill_len
to be noticeable as bimodality.
Outliers can be seen on bill_len
for Chinstrap and Gentoo. Interestingly, there is one female penguins with a really big bill, bigger than all the males even.
Also on flipper_len
there is one female penguins with a much smaller value than others.
body_mass_g
vs flipper_length_mm
for all the penguins. What do the vertical stripes indicate? Are there any other unusual patterns to note, such as outliers or clustering or nonlinearity?The striping corresponds to rounded values of flipper length, that have been reported to the nearest mm. It appears that it is done across all measuring as it is present for both sexes, for all species, for each island and each year. Otherwise there is nothing much to report as unusual.
⚠️ Note: That using na.omit()
is VERY DANGEROUS. You should very rarely use it and only in very clear situations like this data.
flipper_length_mm
. From the residual plot, are there any concerns about the model fit?The model fit statistics suggest it is a reasonably good model, with flipper length explaining about 76% of body mass. From the estimated standard deviation of the error, \(\sigma=393\), we could say that the estimated body mass is likely accurate to within 800g. (Assuming normal distribution and 95% of observations within two \(\sigma\).)
The residual suggests no major problems. There is a little heteroskedasticity. Perhaps if you look carefully, though, it might indicate that different models should have been fitted for the smaller penguins and the bigger penguins, perhaps models separately for sex and species would be advised.
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5781. 306. -18.9 5.59e- 55
2 flipper_len 49.7 1.52 32.7 4.37e-107
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.759 0.758 394. 1071. 4.37e-107 1 -2528. 5063. 5074.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>