ETC5521 Tutorial 9
Exploring data having a space and time context Part I
🎯 Objectives
These exercises involve exploratory analysis with graphics and statistical models, focusing on temporal data analysis.
🔧 Preparation
- The readings for this week are "Reintroducing tsibble: data tools that melt the clock" and "brolgar: An R package to BRowse Over Longitudinal Data Graphically and Analytically in R".
- Complete the weekly quiz before the deadline!
- Install the following R packages if you do not have them already:
install.packages(c("tidyverse", "here", "tsibble", "lubridate", "DAAG", "broom", "patchwork", "colorspace", "GGally", "tsibbledata", "forcats", "chron", "sugrrants", "brolgar"))
- Open your RStudio Project for this unit (the one you created in week 1, ETC5521). Create a .qmd document for this week's activities.
📥 Exercises
Exercise 1: Australian rain
This exercise is based on one from Unwin (2015), and uses the bomregions data from the DAAG package. The data contains regional rainfall for the years 1900-2008. The regional rainfall numbers are area-weighted averages for the respective regions. Extract just the rainfall columns from the data, along with year.
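The extraction step can be sketched on a small stand-in data frame. The values below are made up, and the column names (in particular the `Rain` suffix) are assumptions about how the real data is laid out, so check `names(bomregions)` before reusing the pattern.

```r
library(dplyr)

# Hypothetical stand-in for the real data: made-up values, assumed column names
climate <- tibble(
  Year     = 1900:1904,
  eastAVt  = c(20.1, 20.3, 19.9, 20.0, 20.2),
  eastRain = c(610, 540, 480, 700, 655),
  westRain = c(330, 360, 290, 310, 345),
  SOI      = c(-5.2, 0.9, -3.1, 4.4, -0.7)
)

# Keep the year plus every column whose name ends in "Rain"
rain <- climate |>
  select(Year, ends_with("Rain"))

names(rain)
```

Selecting by a name pattern rather than by position keeps the code readable and avoids breaking if the column order differs.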
- What do you think area-weighted averages are, and how would these be calculated?
- Make line plots of the rainfall for each of the regions, the states and the Australian averages. What do you learn about rainfall patterns across the years and regions?
- It can be difficult to assess correlation between multiple series using line plots; scatterplots show the association between pairs of series more directly. Make a scatterplot matrix (splom) for this data, ignoring year. Which regions have strong positive correlation between their rainfall averages?
- One of the consequences of climate change for Australia is that some regions are likely getting drier. Transform the data to compute the difference between the rainfall average in each year and the mean over all years. Using a bar for each year, make a barchart that examines the differences in the yearly rainfall over time. (Hint: you will need to pivot the data into tidy long form to make this easier.) Are there regions that have negative differences in recent years? What else do you notice?
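The steps in the list above can be tried out on made-up numbers before moving to the real data. The sketch below is one possible approach, not a model answer: the cell areas, rainfall values and the names `regionA` to `regionC` are all hypothetical, and base R's `pairs()` stands in for whichever splom function you prefer.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Area-weighted mean: each cell's rainfall is weighted by the area it covers
cells <- data.frame(area_km2 = c(1000, 3000, 6000), rain_mm = c(800, 500, 200))
weighted.mean(cells$rain_mm, w = cells$area_km2)  # 350, versus a simple mean of 500

# Hypothetical wide data: one row per year, one column per region
set.seed(1)
rain_wide <- tibble(
  Year    = 2000:2009,
  regionA = rnorm(10, mean = 600, sd = 80),
  regionB = regionA + rnorm(10, sd = 30),   # built to correlate with regionA
  regionC = rnorm(10, mean = 350, sd = 60)
)

# Scatterplot matrix of the regions, ignoring year
pairs(select(rain_wide, -Year))

# Pivot to long form, then subtract each region's mean over all years
rain_long <- rain_wide |>
  pivot_longer(-Year, names_to = "region", values_to = "rain") |>
  group_by(region) |>
  mutate(diff = rain - mean(rain)) |>
  ungroup()

# One bar per year, one panel per region, coloured by sign of the difference
p <- ggplot(rain_long, aes(x = Year, y = diff, fill = diff > 0)) +
  geom_col() +
  facet_wrap(~region, ncol = 1) +
  theme(legend.position = "none")
p
```

Grouping by region before subtracting the mean matters: without it, every region would be compared against the overall mean across all regions, and wetter regions would look permanently positive.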
Exercise 2: Imputing missings for pedestrian sensor using a model
Sometimes imputing by a simple method such as mean or moving average doesn’t work well with multiple seasonality in a time series. Here we will use a linear model to capture the seasonality and produce better imputations for the pedestrian sensor data (from the tsibble package). This data has counts for four sensors, for the two years 2015-2016.
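To see why a single mean struggles with seasonal data, here is a small simulated illustration in base R. Everything in it is invented for the purpose: the sinusoidal daily pattern, the noise level, and the choice of day 10 as the missing day.

```r
set.seed(5521)

# Simulated hourly counts over 14 days with a strong daily cycle (made-up data)
toy <- data.frame(day = rep(1:14, each = 24), hour = rep(0:23, times = 14))
toy$count <- 100 + 80 * sin(2 * pi * (toy$hour - 6) / 24) + rnorm(nrow(toy), sd = 5)

# Hide one full day so the imputations can be compared with the truth
truth <- toy$count
missing <- toy$day == 10
toy$count[missing] <- NA

# Option 1: impute every missing hour with the overall mean
mean_fill <- mean(toy$count, na.rm = TRUE)

# Option 2: impute with a linear model that has one level per hour of the day
fit <- lm(count ~ factor(hour), data = toy)  # rows with NA counts are dropped
model_fill <- predict(fit, newdata = toy[missing, ])

# Root mean squared error of each imputation on the hidden day
rmse <- function(est) sqrt(mean((truth[missing] - est)^2))
c(mean = rmse(mean_fill), model = rmse(model_fill))
```

The mean imputation draws a flat line through the missing day, so its error is driven by the size of the daily cycle, while the hour-of-day model only misses by roughly the noise.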
- What are the multiple seasons of the pedestrian sensor data, for QV Market-Elizabeth St (West)? (Hint: Make a plot to check. You might filter to a single month to make it easier to see seasonality. You might also want to check when Queen Victoria Market is open.)
- Check temporal gaps for all the pedestrian sensor data. Subset to just the QV market sensor for the two years. Where are the missing values? Fill these with NA. (Note that fill_gaps doesn’t fill in the additional variables Date, Time, year and month, so these will need to be computed after filling.)
- Create a new variable, called hol, to indicate if a day is a non-working day. We need this to accurately model the differences between pedestrian patterns on working vs non-working days. Make hour a factor; this helps to make a simple model for a non-standard daily pattern.
- Fit a linear model with Count as the response and the predictors Time and hol, interacted.
- Predict the count for all the data at the sensor.
- Make a line plot focusing on the last two weeks of 2015, where there was a day of missings, with the missing counts substituted by the model predictions. Do you think these imputed values match the rest of the series well?
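One way the steps above might fit together is sketched below. It is a starting point rather than a model answer, and it makes some simplifying assumptions that are flagged in the comments: only weekends are treated as non-working days (public holidays are ignored), and the helper names `qv`, `Time_f`, `pred` and `Count_imp` are made up for this sketch.

```r
library(dplyr)
library(tsibble)
library(lubridate)
library(ggplot2)

# Which sensors have implicit gaps, and where are they?
has_gaps(pedestrian)
count_gaps(pedestrian)

qv <- pedestrian |>
  filter(Sensor == "QV Market-Elizabeth St (West)") |>
  fill_gaps() |>                      # missing hours become rows with NA Count
  as_tibble() |>
  mutate(
    # Recompute the variables that fill_gaps() leaves as NA
    Date  = as_date(Date_Time),
    Time  = hour(Date_Time),
    year  = year(Date_Time),
    month = month(Date_Time),
    # Assumption: only Saturday (6) and Sunday (7) count as non-working days
    hol    = wday(Date_Time, week_start = 1) >= 6,
    Time_f = factor(Time)             # hour as a factor, one level per hour
  )

# Hour of day interacted with the non-working day indicator
fit <- lm(Count ~ Time_f * hol, data = qv)

# Predict for every row, then use predictions only where Count is missing
qv <- qv |>
  mutate(
    pred      = unname(predict(fit, newdata = qv)),
    Count_imp = coalesce(as.numeric(Count), pred)
  )

# Last two weeks of 2015, with imputed hours marked in red
last_weeks <- qv |>
  filter(Date >= as.Date("2015-12-17"), Date <= as.Date("2015-12-31"))
imputed <- filter(last_weeks, is.na(Count))

p <- ggplot(last_weeks, aes(x = Date_Time, y = Count_imp)) +
  geom_line() +
  geom_point(data = imputed, colour = "red")
p
```

Adding public holidays to `hol` (for example from a holiday calendar for Victoria) would likely improve the predictions for late December, since a weekday holiday otherwise gets a working-day profile.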
👌 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.