ETC5521: Diving Deeply into Data Exploration

Exploring data having a space and time context Part I

Professor Di Cook

Department of Econometrics and Business Statistics

Outline

  • What is temporal data?
  • What is exploratory temporal data analysis?
  • Using temporal objects in R: tsibble
  • Data wrangling: aggregation, creating temporal components, missing values
  • Plotting conventions: connect the dots; aspect ratio, landscape or portrait
  • Calendar plots: arranging daily records into a calendar format
  • Visual inference for temporal data
  • tignostics: cognostics for temporal data
  • Interactive graphics for temporal data
  • Exploring longitudinal data, with the brolgar package

Philosophy

Time series analysis is what you do after all the interesting stuff has been done!

Heike Hofmann, 2005

Time series analysis focuses on modeling the temporal dependence. The data needs to have trend, seasonality and anomalies removed first.

Exploratory temporal analysis involves exploring and discovering temporal trends, seasonal patterns and anomalies, and possibly also unusual temporal dependence.

What is temporal data?

  • Temporal data has a date/time/ordering index variable; call it time.
  • A time variable has special structure:
    • it can have cyclical patterns, eg seasonality (summer, winter), an over in cricket
    • the cyclical patterns can be nested, eg day within week within year, over within innings (a sketch of extracting such components follows this list)
  • Measurements are also NOT independent - yesterday may influence today.
  • It also likely has non-cyclical patterns: trends and associations with other variables, eg temperature increasing over time, or which bowler (Ellyse Perry or Sophie Molineux) delivers the over.
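For example, cyclical components such as month of the year or day of the week can be extracted from a date index with lubridate. A minimal sketch on a hypothetical daily index (the dates are made up for illustration):

library(lubridate)
# A hypothetical daily index covering one year
x <- ymd("2023-01-01") + days(0:364)
# Cyclical components that can be computed from the index
head(month(x, label = TRUE))  # month of year
head(wday(x, label = TRUE))   # day of week
head(yday(x))                 # day of year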

tsibble: R temporal object

The tsibble package provides a data infrastructure for tidy temporal data with wrangling tools. Adapting the tidy data principles, tsibble is a data- and model-oriented object. In tsibble:

  • Index is a variable with inherent ordering from past to present.
  • Key is a set of variables that define observational units over time.
  • Each observation should be uniquely identified by index and key.
  • Each observational unit should be measured at a common interval, if regularly spaced.
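As a minimal sketch (with made-up values), a tibble with a time column and a unit column can be declared a tsibble by nominating the index and key:

library(tsibble)
library(dplyr)
# Hypothetical monthly measurements for two stations
sales <- tibble(
  month   = rep(yearmonth("2020-01") + 0:2, times = 2),
  station = rep(c("A", "B"), each = 3),
  value   = c(10, 12, 9, 20, 18, 22)
)
sales_ts <- as_tsibble(sales, index = month, key = station)
sales_ts  # a tsibble with a 1-month interval and key station [2]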

Regular vs irregular

The Melbourne pedestrian sensor data has a regular period. Counts are provided for every hour, at numerous locations.

options(width=55)
library(tsibble)  # pedestrian data and tsibble print methods
pedestrian 
# A tsibble: 66,037 x 5 [1h] <Australia/Melbourne>
# Key:       Sensor [4]
   Sensor    Date_Time           Date        Time Count
   <chr>     <dttm>              <date>     <int> <int>
 1 Birrarun… 2015-01-01 00:00:00 2015-01-01     0  1630
 2 Birrarun… 2015-01-01 01:00:00 2015-01-01     1   826
 3 Birrarun… 2015-01-01 02:00:00 2015-01-01     2   567
 4 Birrarun… 2015-01-01 03:00:00 2015-01-01     3   264
 5 Birrarun… 2015-01-01 04:00:00 2015-01-01     4   139
 6 Birrarun… 2015-01-01 05:00:00 2015-01-01     5    77
 7 Birrarun… 2015-01-01 06:00:00 2015-01-01     6    44
 8 Birrarun… 2015-01-01 07:00:00 2015-01-01     7    56
 9 Birrarun… 2015-01-01 08:00:00 2015-01-01     8   113
10 Birrarun… 2015-01-01 09:00:00 2015-01-01     9   166
# ℹ 66,027 more rows

In contrast, the US flights data, below, is irregular.

options(width=55)
library(nycflights13)
library(dplyr)     # mutate
library(lubridate) # ymd_hm
library(tsibble)   # as_tsibble
flights_ts <- flights |>
  mutate(dt = ymd_hm(paste(paste(year, month, day, sep="-"), 
                           paste(hour, minute, sep=":")))) |>
  as_tsibble(index = dt, key = c(origin, dest, carrier, tailnum), regular = FALSE)
flights_ts 
# A tsibble: 336,776 x 20 [!] <UTC>
# Key:       origin, dest, carrier, tailnum [52,807]
    year month   day dep_time sched_dep_time dep_delay
   <int> <int> <int>    <int>          <int>     <dbl>
 1  2013     1    30     2224           2000       144
 2  2013     2    17     2012           2010         2
 3  2013     2    26     2356           2000       236
 4  2013     3    13     1958           2005        -7
 5  2013     5    16     2214           2000       134
 6  2013     5    30     2045           2000        45
 7  2013     9    11     2254           2159        55
 8  2013     9    12       NA           2159        NA
 9  2013     9     8     2156           2159        -3
10  2013     1    26     1614           1620        -6
# ℹ 336,766 more rows
# ℹ 14 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, dt <dttm>

Is pedestrian traffic really regular?
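One way to check, as a sketch (assuming tsibble is loaded): the declared interval is one hour, but there can still be implicit gaps in the records.

library(tsibble)
interval(pedestrian)                 # declared interval: 1 hour
has_gaps(pedestrian, .full = TRUE)   # are there implicit gaps per sensor?
count_gaps(pedestrian, .full = TRUE)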

Getting started

Wrangling prior to analysing temporal data includes:

  • aggregating by a temporal unit (see the index_by() sketch after the next paragraph).
  • constructing temporal units to study seasonality, such as month, week, day of the week, quarter, …
  • checking and imputing missings.


For the airlines data, you can compute multiple aggregated quantities, eg the number of arrivals, the number of departures, the average hourly arrival delay and the average departure delay.
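One way to aggregate by a temporal unit directly on a tsibble is index_by(), which plays the role that group_by() plays for the index. A sketch on the pedestrian data (assuming tsibble, lubridate and dplyr are loaded):

# Aggregate hourly pedestrian counts to daily totals, one series per sensor
pedestrian |>
  group_by_key() |>
  index_by(date = as_date(Date_Time)) |>
  summarise(count = sum(Count))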

Aggregating

The US flights data already has some temporal components created, so aggregating by these is easy. Here is departure delay.

Code
flights_mth <- flights_ts |> 
  as_tibble() |>
  group_by(month, origin) |>
  summarise(dep_delay = mean(dep_delay, na.rm=TRUE)) |>
  as_tsibble(key=origin, index=month)
ggplot(flights_mth, aes(x=month, y=dep_delay, colour=origin)) +
  geom_point() +
  geom_smooth(se=F) +
  scale_x_continuous("", breaks = seq(1, 12, 1), 
                     labels=c("J","F","M","A","M","J",
                              "J","A","S","O","N","D")) +
  scale_y_continuous("av dep delay (mins)", limits=c(0, 25)) +
  theme(aspect.ratio = 0.5)

Aggregate by month, but examine arrival delays.

Code
flights_mth_arr <- flights_ts |> 
  as_tibble() |>
  group_by(month, origin) |>
  summarise(arr_delay = mean(arr_delay, na.rm=TRUE)) |>
  as_tsibble(key=origin, index=month)
ggplot(flights_mth_arr, aes(x=month, y=arr_delay, colour=origin)) +
  geom_point() +
  geom_smooth(se=F) +
  scale_x_continuous("", breaks = seq(1, 12, 1), 
                     labels=c("J","F","M","A","M","J",
                              "J","A","S","O","N","D")) +
  scale_y_continuous("av arr delay (mins)", limits=c(0, 25)) +
  theme(aspect.ratio = 0.5)

Constructing temporal units

Week day vs weekend would be expected to have different patterns of delay, but this variable is not provided, so it needs to be constructed.

Code
flights_wk <- flights_ts |> 
  as_tibble() |>
  mutate(wday = wday(dt, label=TRUE, week_start = 1)) |>
  group_by(wday, origin) |>
  summarise(dep_delay = mean(dep_delay, na.rm=TRUE)) |>
  mutate(weekend = ifelse(wday %in% c("Sat", "Sun"), "yes", "no")) |>
  as_tsibble(key=origin, index=wday)
ggplot(flights_wk, aes(x=wday, y=dep_delay, fill=weekend)) +
  geom_col() +
  facet_wrap(~origin, ncol=1, scales="free_y") +
  xlab("") +
  ylab("av dep delay (mins)") +
  theme(aspect.ratio = 0.5, legend.position = "none")

Be careful of times!

Code
flights_airtm <- flights |>
  mutate(dep_min = dep_time %% 100,
         dep_hr = dep_time %/% 100,
         arr_min = arr_time %% 100,
         arr_hr = arr_time %/% 100) |>
  mutate(dep_dt = ymd_hm(paste(paste(year, month, day, sep="-"), 
                           paste(dep_hr, dep_min, sep=":")))) |>
  mutate(arr_dt = ymd_hm(paste(paste(year, month, day, sep="-"), 
                           paste(arr_hr, arr_min, sep=":")))) |>
  mutate(air_time2 = as.numeric(difftime(arr_dt, dep_dt, units = "mins")))

fp <- flights_airtm |> 
  sample_n(3000) |>
  ggplot(aes(x=air_time, y=air_time2, label = paste(origin, dest))) + 
    geom_abline(intercept=0, slope=1) +
    geom_point()
ggplotly(fp, width=500, height=500)

Why is this not what we expect?
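One possible check, as a sketch (only part of the story, not the full answer): flights that arrive after midnight have a constructed arrival datetime that is earlier than the departure datetime, so the naive difference goes negative.

# How many flights appear to arrive before they depart (overnight arrivals)?
flights_airtm |>
  filter(arr_dt < dep_dt) |>
  select(origin, dest, dep_dt, arr_dt, air_time, air_time2) |>
  head()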

Checking and filling missings (1/4)

set.seed(328)
harvest <- tsibble(
  year = c(2010, 2011, 2013, 2011, 
           2012, 2013),
  fruit = rep(c("kiwi", "cherry"), 
              each = 3),
  kilo = sample(1:10, size = 6),
  key = fruit, index = year
)
harvest
# A tsibble: 6 x 3 [1Y]
# Key:       fruit [2]
   year fruit   kilo
  <dbl> <chr>  <int>
1  2011 cherry     2
2  2012 cherry     7
3  2013 cherry     1
4  2010 kiwi       6
5  2011 kiwi       5
6  2013 kiwi       8
has_gaps(harvest, .full = TRUE) 
# A tibble: 2 × 2
  fruit  .gaps
  <chr>  <lgl>
1 cherry TRUE 
2 kiwi   TRUE 


Can you see the gaps in time?

Both levels of the key have missings.

Checking and filling missings (2/4)

# A tsibble: 6 x 3 [1Y]
# Key:       fruit [2]
   year fruit   kilo
  <dbl> <chr>  <int>
1  2011 cherry     2
2  2012 cherry     7
3  2013 cherry     1
4  2010 kiwi       6
5  2011 kiwi       5
6  2013 kiwi       8
count_gaps(harvest,  .full=TRUE)
# A tibble: 2 × 4
  fruit  .from   .to    .n
  <chr>  <dbl> <dbl> <int>
1 cherry  2010  2010     1
2 kiwi    2012  2012     1


One missing in each level, although it is a different year.



Notice how tsibble handles this summary so neatly.

Checking and filling missings (3/4)

# A tsibble: 6 x 3 [1Y]
# Key:       fruit [2]
   year fruit   kilo
  <dbl> <chr>  <int>
1  2011 cherry     2
2  2012 cherry     7
3  2013 cherry     1
4  2010 kiwi       6
5  2011 kiwi       5
6  2013 kiwi       8

Make the implicit missing values explicit.

harvest <- fill_gaps(harvest, 
                     .full=TRUE) 
harvest 
# A tsibble: 8 x 3 [1Y]
# Key:       fruit [2]
   year fruit   kilo
  <dbl> <chr>  <int>
1  2010 cherry    NA
2  2011 cherry     2
3  2012 cherry     7
4  2013 cherry     1
5  2010 kiwi       6
6  2011 kiwi       5
7  2012 kiwi      NA
8  2013 kiwi       8

Checking and filling missings (4/4)

# A tsibble: 6 x 3 [1Y]
# Key:       fruit [2]
   year fruit   kilo
  <dbl> <chr>  <int>
1  2011 cherry     2
2  2012 cherry     7
3  2013 cherry     1
4  2010 kiwi       6
5  2011 kiwi       5
6  2013 kiwi       8

We have already seen the na_ma() function, which imputes using a moving average. Alternatively, na_interpolation() imputes using the previous and next values.

harvest_nomiss <- harvest |> 
  group_by(fruit) |> 
  mutate(kilo = 
    na_interpolation(kilo)) |> 
  ungroup()
harvest_nomiss 
# A tsibble: 8 x 3 [1Y]
# Key:       fruit [2]
   year fruit   kilo
  <dbl> <chr>  <dbl>
1  2010 cherry   2  
2  2011 cherry   2  
3  2012 cherry   7  
4  2013 cherry   1  
5  2010 kiwi     6  
6  2011 kiwi     5  
7  2012 kiwi     6.5
8  2013 kiwi     8  
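For comparison, a sketch of the moving average alternative mentioned above (na_ma() is from the imputeTS package; the k and weighting choices here are illustrative):

library(imputeTS)
harvest_ma <- harvest |> 
  group_by(fruit) |> 
  mutate(kilo = na_ma(kilo, k = 1, weighting = "simple")) |> 
  ungroup()
harvest_ma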

Plotting conventions

Conventions

  • lines: connecting sequential time points reminding the reader that the temporal dependence is important.
  • aspect ratio: wide or tall? Cleveland, McGill & McGill (1988) argue that the average slope of line segments in a line chart should be 45 degrees, known as banking to 45 degrees. Talbot, Gerth & Hanrahan (2012) refute this, arguing that the conclusion was based on a flawed study. Nevertheless, choosing an aspect ratio is an inescapable skill for designing effective plots. For time series, a wide aspect ratio is typically good.
  • conventions:
    • time on the horizontal axis,
    • ordering of elements like week day, month. Most software organises by alphabetical order, so this needs to be controlled.
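As noted above, week day labels sort alphabetically unless the ordering is set explicitly. A small sketch with lubridate (the dates are made up; labels assume an English locale):

library(lubridate)
x <- ymd("2023-01-02") + days(0:6)  # a Monday-to-Sunday week
# Plain character labels sort alphabetically:
sort(as.character(wday(x, label = TRUE)))
# wday(label = TRUE) returns an ordered factor, so week-day order is kept:
wday(x, label = TRUE, week_start = 1)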

Aspect ratio

  • Is the trend linear or non-linear?
    • It is slightly non-linear. We could fit a linear regression model and examine the residuals to better assess the non-linear trend (a sketch follows the code below).
  • Is there a cyclical pattern?
    • Yes, there is a yearly seasonal pattern.


This type of data is easy to model, and forecast.

load(here::here("data/CO2_ptb.rda"))
CO2.ptb <- CO2.ptb |> 
  filter(year > 1980) |>
  filter(co2_ppm > 100) # handle missing values
p <- ggplot(CO2.ptb, aes(x=date, y=co2_ppm)) + 
  geom_line(size=1) + xlab("") + ylab("CO2 (ppm)")
p1 <- p + theme(aspect.ratio = 1) + ggtitle("1 to 1 (may be useless)")
p3 <- p + theme(aspect.ratio = 2) + ggtitle("tall & skinny:  trend")
p2 <- ggplot(CO2.ptb, aes(x=date, y=co2_ppm)) + 
  annotate("text", x=2000, y=375, label="CO2 at \n Point Barrow,\n Alaska", size=8) + theme_solid()
p4 <- p + 
  scale_x_continuous("", breaks = seq(1980, 2020, 5)) + 
  theme(aspect.ratio = 0.2) + ggtitle("short & wide: seasonality")
grid.arrange(p1, p2, p3, p4, layout_matrix= matrix(c(1,2,3,4,4,4), nrow=2, byrow=T))
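A sketch of the earlier suggestion: remove a linear trend and look at what is left. The residuals should make the remaining non-linearity and the yearly seasonality easier to see (assumes CO2.ptb as prepared above, with dplyr and ggplot2 loaded):

co2_lm <- lm(co2_ppm ~ date, data = CO2.ptb)
CO2.ptb |>
  mutate(res = residuals(co2_lm)) |>
  ggplot(aes(x = date, y = res)) + 
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("") + ylab("residual (ppm)") +
  theme(aspect.ratio = 0.2)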

Calendar plot

Case study: NYC flights (1/2)

  • Draws the daily data in the layout of a regular calendar
  • A wonderful way to get a lot of data into a page
  • Easy to examine daily, weekly and monthly patterns
  • The daily pattern at JFK is very regular.
  • It is similar for every day of the week, and for every month
  • There is a peak in early flights, a drop around lunchtime, and then the number of flights picks up again.

Something is fishy here. What is it?

flights_hourly <- flights |>
  group_by(time_hour, origin) |> 
  summarise(count = n(), 
    dep_delay = mean(dep_delay, 
                     na.rm = TRUE)) |> 
  ungroup() |>
  as_tsibble(index = time_hour, 
             key = origin) |>
    mutate(dep_delay = 
    na_interpolation(dep_delay)) 
calendar_df <- flights_hourly |> 
  filter(origin == "JFK") |>
  mutate(hour = hour(time_hour), 
         date = as.Date(time_hour)) |>
  filter(year(date) < 2014) |>
  frame_calendar(x=hour, y=count, date=date, nrow=2) 
p1 <- calendar_df |>
  ggplot(aes(x = .hour, y = .count, group = date)) +
  geom_line() + theme(axis.line.x = element_blank(),
                      axis.line.y = element_blank()) +
  theme(aspect.ratio=0.5)
prettify(p1, size = 3, label.padding = unit(0.15, "lines"))

Case study: NYC flights (2/2)

Delays are much more interesting to examine

  • Most days have few delays
  • June and July seem to have more delays
  • A few days, sporadically in the year, have big delays

Can you find a reason for one of the days with a big delay?

From ChatGPT: As of my last update in September 2021, a significant late-season snowstorm did affect parts of the United States in April 2013, but it was more focused on the Midwest rather than the Northeast where JFK Airport (John F. Kennedy International Airport) is located. The storm impacted states like Minnesota, Wisconsin, and South Dakota, among others, and brought heavy snowfall and icy conditions.

However, weather conditions can have a cascading effect on flight schedules nationwide, so it’s possible that there were some delays at JFK related to this or other weather phenomena.

calendar_df <- flights_hourly |> 
  filter(origin == "JFK") |>
  mutate(hour = hour(time_hour), 
         date = as.Date(time_hour)) |>
  filter(year(date) < 2014) |>
  frame_calendar(x=hour, y=dep_delay, date=date, nrow=2) 
p1 <- calendar_df |>
  ggplot(aes(x = .hour, y = .dep_delay, group = date)) +
  geom_line() + theme(axis.line.x = element_blank(),
                      axis.line.y = element_blank()) +
  theme(aspect.ratio=0.5)
prettify(p1, size = 3, label.padding = unit(0.15, "lines"))

Visual inference

Temporal patterns: simulation

Code
p_bourke <- pedestrian |>
  as_tibble() |>
  filter(Sensor == "Bourke Street Mall (North)",
         Date >= ymd("2015-05-03"), Date <= ymd("2015-05-16")) |>
  mutate(date_num = 
    as.numeric(difftime(Date_Time,ymd_hms("2015-05-03 00:00:00"),
       units="hours"))+11) |> # UTC to AEST
  mutate(day = wday(Date, label=TRUE, week_start=1)) |>
  select(date_num, Time, day, Count) |>
  rename(time = date_num, hour=Time, count = Count)
# Fit a linear model with categorical hour variable
p_bourke_lm <- glm(count~day+factor(hour), family="poisson", 
  data=p_bourke)
# Function to simulate from a Poisson
simulate_poisson <- function(model, newdata) {
  lambda_pred <- predict(model, newdata, type = "response")
  rpois(length(lambda_pred), lambda = lambda_pred)
}

set.seed(436)
pos <- sample(1:12)
p_bourke_lineup <- bind_cols(.sample = rep(pos[1], 
  nrow(p_bourke)), p_bourke[,-2])
for (i in 1:11) {
  new <- simulate_poisson(p_bourke_lm, p_bourke)
  x <- tibble(time=p_bourke$time, count=new)
  x <- bind_cols(.sample = rep(pos[i+1], 
         nrow(p_bourke)), x)
  p_bourke_lineup <- bind_rows(p_bourke_lineup, x)
}

ggplot(p_bourke_lineup,
  aes(x=time, y=count)) + 
  geom_line() +
  facet_wrap(~.sample, ncol=4) +
  theme(aspect.ratio=0.5, 
        axis.text = element_blank(),
        axis.title = element_blank())

  1. Decide on a model
  2. Simulate from the model to generate nulls
  3. Embed the plot of the real data at a random position among the null plots, and ask which plot is different

Association: permutation

Code
set.seed(514)
ggplot(lineup(null_permute("origin"), true=flights_mth, n=12), 
       aes(x=month, y=dep_delay, colour=origin)) +
  geom_point() +
  geom_smooth(se=F) +
  facet_wrap(~.sample, ncol=4) +
  theme(aspect.ratio = 0.5, 
        legend.position = "none",
        axis.text = element_blank(),
        axis.title = element_blank())

Break the association between variables. Here origin is permuted, which breaks its association with dep_delay, while keeping month fixed.

Which plot has the biggest difference between the three groups?

Tignostics

feasts: time series features

The feasts package provides functions to calculate tignostics for time series.

Remember scagnostics?

Compute tignostics for each series, for example,

  • trend
  • seasonality
  • linearity
  • spikiness
  • peak
  • trough
Code
tourism_feat <- tourism |>
  features(Trips, feat_stl)
tourism_feat |>
  ggplot(aes(x = trend_strength, y = seasonal_strength_year)) +
  geom_point()  

Interactivity

Interactive exploration with tsibbletalk

Code
tourism_shared <- tourism |>
  as_shared_tsibble(spec = (State / Region) * Purpose)

tourism_feat <- tourism_shared |>
  features(Trips, feat_stl)

p1 <- tourism_shared |>
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line(aes(group = Region), alpha = 0.5) +
  facet_wrap(~ Purpose, scales = "free_y") 
p2 <- tourism_feat |>
  ggplot(aes(x = trend_strength, y = seasonal_strength_year)) +
  geom_point(aes(group = Region))
  
subplot(
    ggplotly(p1, tooltip = "Region", width = 1400, height = 700),
    ggplotly(p2, tooltip = "Region", width = 1200, height = 600),
    nrows = 1, widths=c(0.5, 0.5), heights=1) |>
  highlight(dynamic = FALSE)

Wrapping series

Pedestrian counts at Bourke St Mall have a daily seasonality.

DEMO

pp <- p_bourke |>
        as_tsibble(index = time) |>
        ggplot(aes(x=time, y=count)) + 
          geom_line() +
          theme(aspect.ratio=0.5)
 
  
ui <- fluidPage(tsibbleWrapUI("tswrap"))
server <- function(input, output, session) {
  tsibbleWrapServer("tswrap", pp, period = "1 day")
}

shinyApp(ui, server)

Famous data: Lynx

Annual numbers of lynx trappings for 1821–1934 in Canada, with a cycle of almost 10 years. Explore the periodicity by wrapping the series on itself.

DEMO

lynx_tsb <- as_tsibble(lynx) |>
  rename(count = value)
pl <- ggplot(lynx_tsb, 
  aes(x = index, y = count)) +
  geom_line(size = .2) 

ui <- fluidPage(
  tsibbleWrapUI("tswrap"))
server <- function(input, output, 
                   session) {
  tsibbleWrapServer("tswrap", pl, 
       period = "1 year")
}
shinyApp(ui, server)

Longitudinal data

Longitudinal vs time series

Longitudinal data tracks the same sample of individuals at different points in time. It often has different lengths and different time points for each individual.

When the time points are the same for each individual, it is usually referred to as panel data. When different individuals are measured at each time point, it is called cross-sectional data.
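Such data fits naturally into a tsibble with the individual as the key. A minimal sketch with made-up values (brolgar's wages data, used below, is stored this way, with key = id and index = xp):

library(tsibble)
# Hypothetical long-format measurements: individuals measured at
# irregular, individual-specific time points
df <- data.frame(
  id       = c(1, 1, 1, 2, 2),
  xp       = c(0, 1.2, 3.5, 0, 2.1),
  ln_wages = c(1.8, 1.9, 2.2, 1.5, 1.7)
)
df_ts <- as_tsibble(df, key = id, index = xp, regular = FALSE)
df_ts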

Overall trend

Log(wages) of 888 individuals, measured at various times during their employment, from the US National Longitudinal Survey of Youth.

Code
wages |>
  ggplot() +
    geom_line(aes(x = xp, y = ln_wages, group = id), alpha=0.1) +
    geom_smooth(aes(x = xp, y = ln_wages), se=F) +
    xlab("years of experience") +
    ylab("wages (log)") +
  theme(aspect.ratio = 0.6)

Wages tend to increase as time in the workforce gets longer, on average.

Code
wages |>
  ggplot() +
    geom_line(aes(x = xp, y = ln_wages, group = id), alpha=0.1) +
    geom_smooth(aes(x = xp, y = ln_wages, 
      group = high_grade, colour = high_grade), se=F) +
    xlab("years of experience") +
    ylab("wages (log)") +
  scale_colour_viridis_c("education") +
  theme(aspect.ratio = 0.6)

The higher the education level achieved, the higher the wage overall, on average.

Eating spaghetti

brolgar uses tsibble as the data object, and provides:

  • sampling individuals
  • longnostics for individuals
  • diagnostics for statistical models
Code
set.seed(753)
wages |>
  sample_n_keys(size = 10) |> 
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id,
             colour = as.factor(id))) + 
  geom_line() +
  xlim(c(0,13)) + ylim(c(0, 4.5)) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")

Code
set.seed(749)
wages |>
  sample_n_keys(size = 10) |> 
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id,
             colour = as.factor(id))) + 
  geom_line() +
  xlim(c(0,13)) + ylim(c(0, 4.5)) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")

Code
set.seed(757)
wages |>
  sample_n_keys(size = 10) |> 
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id,
             colour = as.factor(id))) + 
  geom_line() +
  xlim(c(0,13)) + ylim(c(0, 4.5)) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")

Few individuals experience wages like the overall trend.

Individual patterns

Remember scagnostics?

Compute longnostics for each subject, for example,

  • Slope, intercept from simple linear model
  • Variance, standard deviation
  • Jumps, differences
Code
wages_slope <- wages |>   
  add_n_obs() |>
  filter(n_obs > 4) |>
  add_key_slope(ln_wages ~ xp) |> 
  as_tsibble(key = id, index = xp) 
wages_spread <- wages |>
  features(ln_wages, feat_spread) |>
  right_join(wages_slope, by="id")

wages_slope |> 
  filter(.slope_xp > 0.3) |> 
  ggplot(aes(x = xp, 
             y = ln_wages, 
             group = id,
             colour = factor(id))) + 
  geom_line() +
  xlim(c(0, 4.5)) +
  ylim(c(0, 4.5)) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")

Code
wages_slope |> 
  filter(.slope_xp < (-0.4)) |> 
  ggplot(aes(x = xp, 
             y = ln_wages, 
             group = id,
             colour = factor(id))) + 
  geom_line() +
  xlim(c(0, 4.5)) +
  ylim(c(0, 4.5)) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")

Code
wages_spread |> 
  filter(sd < 0.1) |> 
  ggplot(aes(x = xp, 
             y = ln_wages, 
             group = id,
             colour = factor(id))) + 
  geom_line() +
  xlim(c(0, 12)) +
  ylim(c(0, 4.5)) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")

Code
wages_spread |> 
  filter(sd > 0.8) |> 
  ggplot(aes(x = xp, 
             y = ln_wages, 
             group = id,
             colour = factor(id))) + 
  geom_line() +
  xlim(c(0, 12)) +
  ylim(c(0, 4.5)) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")

Individual summaries

A different style of five number summary: What does average look like? What do extremes look like?

Find those individuals who are representative of the min, median, maximum, etc of a particular feature, e.g. trend, using keys_near(). This reports the individual who is closest to a particular statistic.

Using keys_near() with l_three_num returns the three individuals closest to the minimum, median and maximum values of a particular feature.

Using keys_near() with l_five_num (as in the code below) returns the five individuals closest to the minimum, Q1, median, Q3 and maximum values.

Code
wages_fivenum <- wages |>   
  add_n_obs() |>
  filter(n_obs > 6) |>
  key_slope(ln_wages ~ xp) |>
  keys_near(key = id,
            var = .slope_xp,
            funs = l_five_num) |> 
  left_join(wages, by = "id") |>
  as_tsibble(key = id, index = xp) 
  
wages_fivenum |>
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) + 
  geom_line() + 
  ylim(c(0, 4.5)) +
  facet_wrap(~stat, ncol=3) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")

Assessing model fits

Fit a mixed effects model: years of experience and education (high_grade) as fixed effects, with a random intercept and slope in experience for each subject.

Summary of the fit

Code
library(lme4)      # lmer
library(modelr)    # add_predictions, add_residuals
library(patchwork) # combining the two plots below
wages_fit_int <- 
  lmer(ln_wages ~ xp + high_grade + 
         (xp |id), data = wages) 
wages_aug <- wages |>
  add_predictions(wages_fit_int, 
                  var = "pred_int") |>
  add_residuals(wages_fit_int, 
                var = "res_int")
  
m1 <- ggplot(wages_aug,
       aes(x = xp,
           y = pred_int,
           group = id)) + 
  geom_line(alpha = 0.2) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6)
  
m2 <- ggplot(wages_aug,
       aes(x = pred_int,
           y = res_int,
           group = id)) + 
  geom_point(alpha = 0.5) +
  xlab("fitted values") + ylab("residuals")  

m1 + m2 + plot_layout(ncol=2) 

Diagnosing the fit: each individual model

Code
wages_aug |> add_n_obs() |> filter(n_obs > 4) |>
  sample_n_keys(size = 12) |>
  ggplot() + 
  geom_line(aes(x = xp, y = pred_int, group = id, 
             colour = factor(id))) + 
  geom_point(aes(x = xp, y = ln_wages, 
                 colour = factor(id))) + 
  facet_wrap(~id, ncol=3)  +
  xlab("Years of experience") + ylab("Log wages") +
  theme(aspect.ratio = 0.6, legend.position = "none")

Resources