ETC5521: Diving Deeply into Data Exploration
Exploring data having a space and time context Part I
Outline
- What is temporal data?
- What is exploratory temporal data analysis?
- Using temporal objects in R: tsibble
- Data wrangling: aggregation, creating temporal components, missing values
- Plotting conventions: connect the dots; aspect ratio, landscape or portrait
- Calendar plots: arranging daily records into a calendar format
- Visual inference for temporal data
- tignostics: cognostics for temporal data
- Interactive graphics for temporal data
- Exploring longitudinal data, with the brolgar package
Philosophy
Time series analysis is what you do after all the interesting stuff has been done!
Time series analysis focuses on modeling the temporal dependence, so trend, seasonality and anomalies need to be removed from the data first.
Exploratory temporal analysis involves exploring and discovering temporal trend, seasonal patterns, and anomalies, and possibly also unusual temporal dependence.
What is temporal data?
- Temporal data has a date/time/ordering index variable; call it time.
- A time variable has special structure:
- it can have cyclical patterns, eg seasonality (summer, winter), an over in cricket
- the cyclical patterns can be nested, eg postcode within state, over within innings
- Measurements are also NOT independent - yesterday may influence today.
- It still likely has non-cyclical patterns, trends and associations with other variables, eg temperature increasing over time, or which over is bowled by Ellyse Perry or Sophie Molineux
tsibble: R temporal object
The tsibble package provides a data infrastructure for tidy temporal data, together with wrangling tools. Adapting the tidy data principles, tsibble is a data- and model-oriented object. In tsibble:
- Index is a variable with inherent ordering from past to present.
- Key is a set of variables that define observational units over time.
- Each observation should be uniquely identified by index and key.
- Each observational unit should be measured at a common interval, if regularly spaced.
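As a small illustration of these ideas (a minimal sketch with made-up data; the sales data frame and its columns are hypothetical, not one of the lecture examples), a tsibble is constructed by nominating the index and key:
library(tsibble)
# Hypothetical monthly counts for two stores: `store` is the key and
# `month` (a yearmonth) is the index; each row is uniquely identified
# by the key + index combination, at a regular 1-month interval.
sales <- tibble::tibble(
  month = rep(yearmonth("2020 Jan") + 0:2, times = 2),
  store = rep(c("A", "B"), each = 3),
  count = c(10, 12, 9, 20, 18, 25)
) |>
  as_tsibble(index = month, key = store)
sales   # prints as "A tsibble: 6 x 3 [1M]", with "Key: store [2]"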
Regular vs irregular
The Melbourne pedestrian sensor data has a regular period. Counts are provided for every hour, at numerous locations.
options(width=55)
pedestrian
# A tsibble: 66,037 x 5 [1h] <Australia/Melbourne>
# Key: Sensor [4]
Sensor Date_Time Date Time Count
<chr> <dttm> <date> <int> <int>
1 Birrarun… 2015-01-01 00:00:00 2015-01-01 0 1630
2 Birrarun… 2015-01-01 01:00:00 2015-01-01 1 826
3 Birrarun… 2015-01-01 02:00:00 2015-01-01 2 567
4 Birrarun… 2015-01-01 03:00:00 2015-01-01 3 264
5 Birrarun… 2015-01-01 04:00:00 2015-01-01 4 139
6 Birrarun… 2015-01-01 05:00:00 2015-01-01 5 77
7 Birrarun… 2015-01-01 06:00:00 2015-01-01 6 44
8 Birrarun… 2015-01-01 07:00:00 2015-01-01 7 56
9 Birrarun… 2015-01-01 08:00:00 2015-01-01 8 113
10 Birrarun… 2015-01-01 09:00:00 2015-01-01 9 166
# ℹ 66,027 more rows
In contrast, the US flights data, below, is irregular.
options(width=55)
library(nycflights13)
flights_ts <- flights |>
  mutate(dt = ymd_hm(paste(paste(year, month, day, sep="-"),
                           paste(hour, minute, sep=":")))) |>
  as_tsibble(index = dt, key = c(origin, dest, carrier, tailnum), regular = FALSE)
flights_ts
# A tsibble: 336,776 x 20 [!] <UTC>
# Key: origin, dest, carrier, tailnum [52,807]
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 30 2224 2000 144
2 2013 2 17 2012 2010 2
3 2013 2 26 2356 2000 236
4 2013 3 13 1958 2005 -7
5 2013 5 16 2214 2000 134
6 2013 5 30 2045 2000 45
7 2013 9 11 2254 2159 55
8 2013 9 12 NA 2159 NA
9 2013 9 8 2156 2159 -3
10 2013 1 26 1614 1620 -6
# ℹ 336,766 more rows
# ℹ 14 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, dt <dttm>
Is pedestrian traffic really regular?
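One way to check (a sketch, using the gap tools introduced more fully below): the interval is reported as [1h], but the series may still have implicit gaps, for example hours when a sensor was offline.
# Any missing hours for any of the four sensors?
has_gaps(pedestrian, .full = TRUE)
# Where are they, and how long do they run?
count_gaps(pedestrian, .full = TRUE)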
Getting started
Wrangling prior to analysing temporal data includes:
- aggregating by temporal unit,
- constructing temporal units to study seasonality, such as month, week, day of the week, quarter, …
- checking and imputing missings.
For the airlines data, you can aggregate by multiple quantities, eg number of arrivals, departures, average hourly arrival delay and departure delays.
Aggregating
The US flights data already has some temporal components created, so aggregating by these is easy. Here is departure delay.
Code
flights_mth <- flights_ts |>
  as_tibble() |>
  group_by(month, origin) |>
  summarise(dep_delay = mean(dep_delay, na.rm=TRUE)) |>
  as_tsibble(key=origin, index=month)
ggplot(flights_mth, aes(x=month, y=dep_delay, colour=origin)) +
geom_point() +
geom_smooth(se=F) +
scale_x_continuous("", breaks = seq(1, 12, 1),
labels=c("J","F","M","A","M","J",
"J","A","S","O","N","D")) +
scale_y_continuous("av dep delay (mins)", limits=c(0, 25)) +
theme(aspect.ratio = 0.5)
Aggregate by month, but examine arrival delays.
Code
flights_mth_arr <- flights_ts |>
  as_tibble() |>
  group_by(month, origin) |>
  summarise(arr_delay = mean(arr_delay, na.rm=TRUE)) |>
  as_tsibble(key=origin, index=month)
ggplot(flights_mth_arr, aes(x=month, y=arr_delay, colour=origin)) +
geom_point() +
geom_smooth(se=F) +
scale_x_continuous("", breaks = seq(1, 12, 1),
labels=c("J","F","M","A","M","J",
"J","A","S","O","N","D")) +
scale_y_continuous("av arr delay (mins)", limits=c(0, 25)) +
theme(aspect.ratio = 0.5)
Constructing temporal units
Weekday vs weekend would be expected to have different patterns of delay, but this variable is not provided, so it needs to be constructed.
Code
flights_wk <- flights_ts |>
  as_tibble() |>
  mutate(wday = wday(dt, label=TRUE, week_start = 1)) |>
  group_by(wday, origin) |>
  summarise(dep_delay = mean(dep_delay, na.rm=TRUE)) |>
  mutate(weekend = ifelse(wday %in% c("Sat", "Sun"), "yes", "no")) |>
  as_tsibble(key=origin, index=wday)
ggplot(flights_wk, aes(x=wday, y=dep_delay, fill=weekend)) +
geom_col() +
facet_wrap(~origin, ncol=1, scales="free_y") +
xlab("") +
ylab("av dep delay (mins)") +
theme(aspect.ratio = 0.5, legend.position = "none")
Be careful of times!
Code
flights_airtm <- flights |>
  mutate(dep_min = dep_time %% 100,
         dep_hr = dep_time %/% 100,
         arr_min = arr_time %% 100,
         arr_hr = arr_time %/% 100) |>
  mutate(dep_dt = ymd_hm(paste(paste(year, month, day, sep="-"),
                               paste(dep_hr, dep_min, sep=":")))) |>
  mutate(arr_dt = ymd_hm(paste(paste(year, month, day, sep="-"),
                               paste(arr_hr, arr_min, sep=":")))) |>
  mutate(air_time2 = as.numeric(difftime(arr_dt, dep_dt)))
fp <- flights_airtm |>
  sample_n(3000) |>
  ggplot(aes(x=air_time, y=air_time2, label = paste(origin, dest))) +
  geom_abline(intercept=0, slope=1) +
  geom_point()
ggplotly(fp, width=500, height=500)
Why is this not what we expect?
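One clue worth checking (a sketch, not the full story): flights that land after midnight have an arrival clock time earlier than the departure time, so air_time2 computed this way comes out negative for them.
# How many flights appear to "arrive before they depart"?
flights_airtm |>
  filter(arr_dt < dep_dt) |>
  count()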
Checking and filling missings (1/4)
set.seed(328)
harvest <- tsibble(
  year = c(2010, 2011, 2013, 2011,
           2012, 2013),
  fruit = rep(c("kiwi", "cherry"),
              each = 3),
  kilo = sample(1:10, size = 6),
  key = fruit, index = year
)
harvest
# A tsibble: 6 x 3 [1Y]
# Key: fruit [2]
year fruit kilo
<dbl> <chr> <int>
1 2011 cherry 2
2 2012 cherry 7
3 2013 cherry 1
4 2010 kiwi 6
5 2011 kiwi 5
6 2013 kiwi 8
has_gaps(harvest, .full = TRUE)
# A tibble: 2 × 2
fruit .gaps
<chr> <lgl>
1 cherry TRUE
2 kiwi TRUE
Can you see the gaps in time?
Both levels of the key have missings.
Checking and filling missings (2/4)
# A tsibble: 6 x 3 [1Y]
# Key: fruit [2]
year fruit kilo
<dbl> <chr> <int>
1 2011 cherry 2
2 2012 cherry 7
3 2013 cherry 1
4 2010 kiwi 6
5 2011 kiwi 5
6 2013 kiwi 8
count_gaps(harvest, .full=TRUE)
# A tibble: 2 × 4
fruit .from .to .n
<chr> <dbl> <dbl> <int>
1 cherry 2010 2010 1
2 kiwi 2012 2012 1
One missing in each level, although it is a different year.
Notice how tsibble handles this summary so neatly.
Checking and filling missings (3/4)
# A tsibble: 6 x 3 [1Y]
# Key: fruit [2]
year fruit kilo
<dbl> <chr> <int>
1 2011 cherry 2
2 2012 cherry 7
3 2013 cherry 1
4 2010 kiwi 6
5 2011 kiwi 5
6 2013 kiwi 8
Make the implicit missing values explicit.
harvest <- fill_gaps(harvest,
                     .full=TRUE)
harvest
# A tsibble: 8 x 3 [1Y]
# Key: fruit [2]
year fruit kilo
<dbl> <chr> <int>
1 2010 cherry NA
2 2011 cherry 2
3 2012 cherry 7
4 2013 cherry 1
5 2010 kiwi 6
6 2011 kiwi 5
7 2012 kiwi NA
8 2013 kiwi 8
Checking and filling missings (4/4)
# A tsibble: 6 x 3 [1Y]
# Key: fruit [2]
year fruit kilo
<dbl> <chr> <int>
1 2011 cherry 2
2 2012 cherry 7
3 2013 cherry 1
4 2010 kiwi 6
5 2011 kiwi 5
6 2013 kiwi 8
We have already seen the na_ma() function, which imputes using a moving average. Alternatively, na_interpolation() uses the previous and next values to impute.
harvest_nomiss <- harvest |>
  group_by(fruit) |>
  mutate(kilo =
           na_interpolation(kilo)) |>
  ungroup()
harvest_nomiss
# A tsibble: 8 x 3 [1Y]
# Key: fruit [2]
year fruit kilo
<dbl> <chr> <dbl>
1 2010 cherry 2
2 2011 cherry 2
3 2012 cherry 7
4 2013 cherry 1
5 2010 kiwi 6
6 2011 kiwi 5
7 2012 kiwi 6.5
8 2013 kiwi 8
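For comparison, a moving-average imputation on the same data would look like this (a sketch; na_ma() comes from the imputeTS package, like na_interpolation(), and k sets the number of observations used on each side):
harvest |>
  group_by(fruit) |>
  mutate(kilo = na_ma(kilo, k = 1, weighting = "exponential")) |>
  ungroup()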
Conventions
- lines: connecting sequential time points reminds the reader that the temporal dependence is important.
- aspect ratio: wide or tall? Cleveland, McGill and McGill (1988) argue that the average line slope in a line chart should be 45 degrees, which is called banking to 45 degrees. Talbot, Gerth and Hanrahan (2012) refute this, arguing that the conclusion was based on a flawed study. Nevertheless, choosing an aspect ratio is an inescapable part of designing effective plots. For time series, a wide aspect ratio is typically good.
- conventions:
- time on the horizontal axis,
- ordering of elements like week day and month. Most software orders them alphabetically by default, so this needs to be controlled (one way is sketched below).
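A minimal sketch of controlling weekday order with explicit factor levels (the tiny d data frame here is made up for illustration):
library(dplyr)
# Stored as character, wday would plot alphabetically (Fri, Mon, Sat, ...)
d <- tibble(wday = c("Mon", "Tue", "Sat", "Sun"),
            delay = c(5, 7, 12, 10))
d <- d |>
  mutate(wday = factor(wday,
                       levels = c("Mon", "Tue", "Wed", "Thu",
                                  "Fri", "Sat", "Sun")))
# Alternatively, lubridate::wday(date, label = TRUE, week_start = 1)
# returns an ordered factor with the days already in this order.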
Aspect ratio
- Is the trend linear or non-linear?
- Yes, slightly non-linear. We could fit a linear regression model, and examine the residuals to better assess non-linear trend.
- Is there a cyclical pattern?
- Yes, there is a yearly trend.
This type of data is easy to model, and forecast.
load(here::here("data/CO2_ptb.rda"))
CO2.ptb <- CO2.ptb |>
  filter(year > 1980) |>
  filter(co2_ppm > 100) # handle missing values
p <- ggplot(CO2.ptb, aes(x=date, y=co2_ppm)) +
  geom_line(size=1) + xlab("") + ylab("CO2 (ppm)")
p1 <- p + theme(aspect.ratio = 1) + ggtitle("1 to 1 (may be useless)")
p3 <- p + theme(aspect.ratio = 2) + ggtitle("tall & skinny: trend")
p2 <- ggplot(CO2.ptb, aes(x=date, y=co2_ppm)) +
  annotate("text", x=2000, y=375, label="CO2 at \n Point Barrow,\n Alaska", size=8) + theme_solid()
p4 <- p +
  scale_x_continuous("", breaks = seq(1980, 2020, 5)) +
  theme(aspect.ratio = 0.2) + ggtitle("short & wide: seasonality")
grid.arrange(p1, p2, p3, p4, layout_matrix= matrix(c(1,2,3,4,4,4), nrow=2, byrow=T))
Case study: NYC flights (1/2)
- Draws the daily data in the layout of a regular calendar
- A wonderful way to get a lot of data into a page
- Easy to examine daily, weekly and monthly patterns
- The daily pattern at JFK is very regular.
- It is similar for every day of the week, and for every month
- There is a peak in early flights, a drop around lunchtime, and then the number of flights picks up again.
Something is fishy here. What is it?
flights_hourly <- flights |>
  group_by(time_hour, origin) |>
  summarise(count = n(),
            dep_delay = mean(dep_delay,
                             na.rm = TRUE)) |>
  ungroup() |>
  as_tsibble(index = time_hour,
             key = origin) |>
  mutate(dep_delay =
           na_interpolation(dep_delay))
calendar_df <- flights_hourly |>
  filter(origin == "JFK") |>
  mutate(hour = hour(time_hour),
         date = as.Date(time_hour)) |>
  filter(year(date) < 2014) |>
  frame_calendar(x=hour, y=count, date=date, nrow=2)
p1 <- calendar_df |>
  ggplot(aes(x = .hour, y = .count, group = date)) +
  geom_line() + theme(axis.line.x = element_blank(),
                      axis.line.y = element_blank()) +
  theme(aspect.ratio=0.5)
prettify(p1, size = 3, label.padding = unit(0.15, "lines"))
Case study: NYC flights (2/2)
Delays are much more interesting to examine
- Most days have few delays
- June and July seem to have more delays
- A few days, sporadically in the year, have big delays
Can you find a reason for one of the days with a big delay?
From ChatGPT: As of my last update in September 2021, a significant late-season snowstorm did affect parts of the United States in April 2013, but it was more focused on the Midwest rather than the Northeast where JFK Airport (John F. Kennedy International Airport) is located. The storm impacted states like Minnesota, Wisconsin, and South Dakota, among others, and brought heavy snowfall and icy conditions.
However, weather conditions can have a cascading effect on flight schedules nationwide, so it’s possible that there were some delays at JFK related to this or other weather phenomena.
calendar_df <- flights_hourly |>
  filter(origin == "JFK") |>
  mutate(hour = hour(time_hour),
         date = as.Date(time_hour)) |>
  filter(year(date) < 2014) |>
  frame_calendar(x=hour, y=dep_delay, date=date, nrow=2)
p1 <- calendar_df |>
  ggplot(aes(x = .hour, y = .dep_delay, group = date)) +
  geom_line() + theme(axis.line.x = element_blank(),
                      axis.line.y = element_blank()) +
  theme(aspect.ratio=0.5)
prettify(p1, size = 3, label.padding = unit(0.15, "lines"))
Temporal patterns: simulation
Code
p_bourke <- pedestrian |>
  as_tibble() |>
  filter(Sensor == "Bourke Street Mall (North)",
         Date >= ymd("2015-05-03"), Date <= ymd("2015-05-16")) |>
  mutate(date_num =
           as.numeric(difftime(Date_Time, ymd_hms("2015-05-03 00:00:00"),
                               units="hours"))+11) |> # UTC to AEST
  mutate(day = wday(Date, label=TRUE, week_start=1)) |>
  select(date_num, Time, day, Count) |>
  rename(time = date_num, hour=Time, count = Count)
# Fit a linear model with categorical hour variable
p_bourke_lm <- glm(count~day+factor(hour), family="poisson",
                   data=p_bourke)
# Function to simulate from a Poisson
simulate_poisson <- function(model, newdata) {
  lambda_pred <- predict(model, newdata, type = "response")
  rpois(length(lambda_pred), lambda = lambda_pred)
}
set.seed(436)
pos <- sample(1:12)
p_bourke_lineup <- bind_cols(.sample = rep(pos[1],
                                           nrow(p_bourke)), p_bourke[,-2])
for (i in 1:11) {
  new <- simulate_poisson(p_bourke_lm, p_bourke)
  x <- tibble(time=p_bourke$time, count=new)
  x <- bind_cols(.sample = rep(pos[i+1],
                               nrow(p_bourke)), x)
  p_bourke_lineup <- bind_rows(p_bourke_lineup, x)
}
ggplot(p_bourke_lineup,
aes(x=time, y=count)) +
geom_line() +
facet_wrap(~.sample, ncol=4) +
theme(aspect.ratio=0.5,
axis.text = element_blank(),
axis.title = element_blank())
- Decide on a model
- Simulate from the model to generate nulls
Association: permutation
Code
set.seed(514)
ggplot(lineup(null_permute("origin"), true=flights_mth, n=12),
aes(x=month, y=dep_delay, colour=origin)) +
geom_point() +
geom_smooth(se=F) +
facet_wrap(~.sample, ncol=4) +
theme(aspect.ratio = 0.5,
legend.position = "none",
axis.text = element_blank(),
axis.title = element_blank())
Break the association between variables. Here origin is permuted, which breaks its association with dep_delay, while keeping month fixed.
Which plot has the biggest difference between the three groups?
feasts: time series features
The feasts package provides functions to calculate tignostics for time series.
Remember scagnostics?
Compute tignostics for each series, for example,
- trend
- seasonality
- linearity
- spikiness
- peak
- trough
Code
tourism_feat <- tourism |>
  features(Trips, feat_stl)
tourism_feat |>
  ggplot(aes(x = trend_strength, y = seasonal_strength_year)) +
  geom_point()
Interactive exploration with tsibbletalk
Code
tourism_shared <- tourism |>
  as_shared_tsibble(spec = (State / Region) * Purpose)
tourism_feat <- tourism_shared |>
  features(Trips, feat_stl)
p1 <- tourism_shared |>
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line(aes(group = Region), alpha = 0.5) +
  facet_wrap(~ Purpose, scales = "free_y")
p2 <- tourism_feat |>
  ggplot(aes(x = trend_strength, y = seasonal_strength_year)) +
  geom_point(aes(group = Region))
subplot(
ggplotly(p1, tooltip = "Region", width = 1400, height = 700),
ggplotly(p2, tooltip = "Region", width = 1200, height = 600),
nrows = 1, widths=c(0.5, 0.5), heights=1) |>
highlight(dynamic = FALSE)
Wrapping series
Pedestrian counts at Bourke St Mall have a daily seasonality.
DEMO
pp <- p_bourke |>
  as_tsibble(index = time) |>
  ggplot(aes(x=time, y=count)) +
  geom_line() +
  theme(aspect.ratio=0.5)
ui <- fluidPage(tsibbleWrapUI("tswrap"))
server <- function(input, output, session) {
  tsibbleWrapServer("tswrap", pp, period = "1 day")
}
shinyApp(ui, server)
Famous data: Lynx
Annual numbers of lynx trappings for 1821–1934 in Canada, with roughly a 10-year cycle. Explore the periodicity by wrapping the series on itself.
DEMO
lynx_tsb <- as_tsibble(lynx) |>
  rename(count = value)
pl <- ggplot(lynx_tsb,
             aes(x = index, y = count)) +
  geom_line(size = .2)
ui <- fluidPage(
  tsibbleWrapUI("tswrap"))
server <- function(input, output,
                   session) {
  tsibbleWrapServer("tswrap", pl,
                    period = "1 year")
}
shinyApp(ui, server)
Longitudinal vs time series
Longitudinal data tracks the same sample of individuals at different points in time. It often has different lengths and different time points for each individual.
When the time points are the same for each individual, it is usually referred to as panel data. When different individuals are measured at each time point, it is called cross-sectional data.
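As a sketch of how longitudinal data fits the tsibble structure (a tiny made-up example; individuals form the key, and each has its own, possibly irregular, set of time points):
library(tsibble)
# Two individuals measured at different numbers of time points
longit <- tibble::tibble(
  id = c(1, 1, 1, 2, 2),
  xp = c(0, 1, 3, 0, 2),          # years of experience at measurement
  ln_wages = c(1.2, 1.4, 1.5, 0.9, 1.1)
) |>
  as_tsibble(key = id, index = xp, regular = FALSE)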
Overall trend
Log(wages) of 888 individuals, measured at various times in their employment, from the US National Longitudinal Survey of Youth.
Code
wages |>
  ggplot() +
  geom_line(aes(x = xp, y = ln_wages, group = id), alpha=0.1) +
  geom_smooth(aes(x = xp, y = ln_wages), se=F) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6)
Wages tend to increase as time in the workforce gets longer, on average.
Code
wages |>
  ggplot() +
  geom_line(aes(x = xp, y = ln_wages, group = id), alpha=0.1) +
  geom_smooth(aes(x = xp, y = ln_wages,
                  group = high_grade, colour = high_grade), se=F) +
  xlab("years of experience") +
  ylab("wages (log)") +
  scale_colour_viridis_c("education") +
  theme(aspect.ratio = 0.6)
The higher the education level achieved, the higher the overall wage, on average.
Eating spaghetti
brolgar uses tsibble as the data object, and provides:
- sampling individuals
- longnostics for individuals
- diagnostics for statistical models
Code
set.seed(753)
wages |>
  sample_n_keys(size = 10) |>
ggplot(aes(x = xp,
y = ln_wages,
group = id,
colour = as.factor(id))) +
geom_line() +
xlim(c(0,13)) + ylim(c(0, 4.5)) +
xlab("years of experience") +
ylab("wages (log)") +
theme(aspect.ratio = 0.6, legend.position = "none")
Code
set.seed(749)
wages |>
  sample_n_keys(size = 10) |>
ggplot(aes(x = xp,
y = ln_wages,
group = id,
colour = as.factor(id))) +
geom_line() +
xlim(c(0,13)) + ylim(c(0, 4.5)) +
xlab("years of experience") +
ylab("wages (log)") +
theme(aspect.ratio = 0.6, legend.position = "none")
Code
set.seed(757)
wages |>
  sample_n_keys(size = 10) |>
ggplot(aes(x = xp,
y = ln_wages,
group = id,
colour = as.factor(id))) +
geom_line() +
xlim(c(0,13)) + ylim(c(0, 4.5)) +
xlab("years of experience") +
ylab("wages (log)") +
theme(aspect.ratio = 0.6, legend.position = "none")
Few individuals experience wages like the overall trend.
Individual patterns
Remember scagnostics?
Compute longnostics for each subject, for example,
- Slope, intercept from simple linear model
- Variance, standard deviation
- Jumps, differences
Code
wages_slope <- wages |>
  add_n_obs() |>
  filter(n_obs > 4) |>
  add_key_slope(ln_wages ~ xp) |>
  as_tsibble(key = id, index = xp)
wages_spread <- wages |>
  features(ln_wages, feat_spread) |>
  right_join(wages_slope, by="id")
wages_slope |>
  filter(.slope_xp > 0.3) |>
ggplot(aes(x = xp,
y = ln_wages,
group = id,
colour = factor(id))) +
geom_line() +
xlim(c(0, 4.5)) +
ylim(c(0, 4.5)) +
xlab("years of experience") +
ylab("wages (log)") +
theme(aspect.ratio = 0.6, legend.position = "none")
Code
wages_slope |>
  filter(.slope_xp < (-0.4)) |>
ggplot(aes(x = xp,
y = ln_wages,
group = id,
colour = factor(id))) +
geom_line() +
xlim(c(0, 4.5)) +
ylim(c(0, 4.5)) +
xlab("years of experience") +
ylab("wages (log)") +
theme(aspect.ratio = 0.6, legend.position = "none")
Code
wages_spread |>
  filter(sd < 0.1) |>
ggplot(aes(x = xp,
y = ln_wages,
group = id,
colour = factor(id))) +
geom_line() +
xlim(c(0, 12)) +
ylim(c(0, 4.5)) +
xlab("years of experience") +
ylab("wages (log)") +
theme(aspect.ratio = 0.6, legend.position = "none")
Code
wages_spread |>
  filter(sd > 0.8) |>
ggplot(aes(x = xp,
y = ln_wages,
group = id,
colour = factor(id))) +
geom_line() +
xlim(c(0, 12)) +
ylim(c(0, 4.5)) +
xlab("years of experience") +
ylab("wages (log)") +
theme(aspect.ratio = 0.6, legend.position = "none")
Individual summaries
A different style of five number summary: What does average look like? What do extremes look like?
Find those individuals who are representative of the min, median, maximum, etc of a particular feature, e.g. trend, using keys_near(). This reports the individual who is closest to a particular statistic.
wages_threenum() returns the three individuals: min, max and closest to the median value, for a particular feature.
wages_fivenum() returns the five individuals: min, max and closest to the median, Q1 and Q3 values, for a particular feature.
Code
wages_fivenum <- wages |>
  add_n_obs() |>
  filter(n_obs > 6) |>
  key_slope(ln_wages ~ xp) |>
  keys_near(key = id,
            var = .slope_xp,
            funs = l_five_num) |>
  left_join(wages, by = "id") |>
  as_tsibble(key = id, index = xp)
wages_fivenum |>
  ggplot(aes(x = xp,
y = ln_wages,
group = id)) +
geom_line() +
ylim(c(0, 4.5)) +
facet_wrap(~stat, ncol=3) +
xlab("years of experience") +
ylab("wages (log)") +
theme(aspect.ratio = 0.6, legend.position = "none")
Assessing model fits
Fit a mixed effects model: years of experience and education (high_grade) as fixed effects, with a random intercept and slope for each subject.
Summary of the fit
Code
wages_fit_int <-
  lmer(ln_wages ~ xp + high_grade +
         (xp|id), data = wages)
wages_aug <- wages |>
  add_predictions(wages_fit_int,
                  var = "pred_int") |>
  add_residuals(wages_fit_int,
                var = "res_int")
m1 <- ggplot(wages_aug,
             aes(x = xp,
                 y = pred_int,
                 group = id)) +
  geom_line(alpha = 0.2) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6)
m2 <- ggplot(wages_aug,
             aes(x = pred_int,
                 y = res_int,
                 group = id)) +
  geom_point(alpha = 0.5) +
  xlab("fitted values") + ylab("residuals")
m1 + m2 + plot_layout(ncol=2)
Diagnosing the fit: each individual model
Code
wages_aug |> add_n_obs() |> filter(n_obs > 4) |>
  sample_n_keys(size = 12) |>
ggplot() +
geom_line(aes(x = xp, y = pred_int, group = id,
colour = factor(id))) +
geom_point(aes(x = xp, y = ln_wages,
colour = factor(id))) +
facet_wrap(~id, ncol=3) +
xlab("Years of experience") + ylab("Log wages") +
theme(aspect.ratio = 0.6, legend.position = "none")
Resources
- Wang, Cook, Hyndman (2019) A New Tidy Data Structure to Support Exploration and Modeling of Temporal Data
- Wang, Cook, Hyndman, O’Hara-Wild (2019) tsibble
- O’Hara-Wild, Hyndman, Wang (2020). fabletools: Core Tools for Packages in the ‘fable’ Framework
- O’Hara-Wild, Hyndman, Wang (2024). feasts: Feature Extraction and Statistics for Time Series
- Tierney, Cook, Prvan (2020) Browse Over Longitudinal Data Graphically and Analytically in R