Exploring data having a space and time context: Part I
Department of Econometrics and Business Statistics
Packages: tsibble, brolgar
"Time series analysis is what you do after all the interesting stuff has been done!"
Time series analysis focuses on modeling the temporal dependence, and requires trend, seasonality and anomalies to be removed first.
Exploratory temporal analysis involves discovering and exploring temporal trend, patterns related to seasons, and anomalies, and possibly also unusual temporal dependence.
tsibble: R temporal object
The tsibble package provides a data infrastructure for tidy temporal data, along with wrangling tools. Adapting the tidy data principles, tsibble is a data- and model-oriented object. In tsibble, the index is a variable with an inherent ordering from past to present, the key is a set of variables that defines the observational units over time, and each observation is uniquely identified by the index and key.
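To make this concrete, here is a minimal sketch (the data values are invented for illustration) of building a tsibble from a plain data frame:

library(dplyr)
library(tsibble)
# Two units measured annually: year is the index, unit is the key,
# and each (year, unit) combination appears exactly once
ts_demo <- tibble(
  year  = rep(2019:2021, times = 2),
  unit  = rep(c("A", "B"), each = 3),
  value = c(3, 5, 4, 10, 12, 11)
) |>
  as_tsibble(index = year, key = unit)
ts_demo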
The Melbourne pedestrian sensor data has a regular period. Counts are provided for every hour, at numerous locations.
# A tsibble: 66,037 x 5 [1h] <Australia/Melbourne>
# Key: Sensor [4]
Sensor Date_Time Date Time Count
<chr> <dttm> <date> <int> <int>
1 Birrarun… 2015-01-01 00:00:00 2015-01-01 0 1630
2 Birrarun… 2015-01-01 01:00:00 2015-01-01 1 826
3 Birrarun… 2015-01-01 02:00:00 2015-01-01 2 567
4 Birrarun… 2015-01-01 03:00:00 2015-01-01 3 264
5 Birrarun… 2015-01-01 04:00:00 2015-01-01 4 139
6 Birrarun… 2015-01-01 05:00:00 2015-01-01 5 77
7 Birrarun… 2015-01-01 06:00:00 2015-01-01 6 44
8 Birrarun… 2015-01-01 07:00:00 2015-01-01 7 56
9 Birrarun… 2015-01-01 08:00:00 2015-01-01 8 113
10 Birrarun… 2015-01-01 09:00:00 2015-01-01 9 166
# ℹ 66,027 more rows
In contrast, the US flights data, below, is irregular.
options(width=55)
library(nycflights13)
library(dplyr)
library(lubridate)
library(tsibble)
flights_ts <- flights |>
  mutate(dt = ymd_hm(paste(paste(year, month, day, sep="-"),
                           paste(hour, minute, sep=":")))) |>
  as_tsibble(index = dt,
             key = c(origin, dest, carrier, tailnum),
             regular = FALSE)
flights_ts
# A tsibble: 336,776 x 20 [!] <UTC>
# Key: origin, dest, carrier, tailnum [52,807]
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 30 2224 2000 144
2 2013 2 17 2012 2010 2
3 2013 2 26 2356 2000 236
4 2013 3 13 1958 2005 -7
5 2013 5 16 2214 2000 134
6 2013 5 30 2045 2000 45
7 2013 9 11 2254 2159 55
8 2013 9 12 NA 2159 NA
9 2013 9 8 2156 2159 -3
10 2013 1 26 1614 1620 -6
# ℹ 336,766 more rows
# ℹ 14 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, dt <dttm>
Is pedestrian traffic really regular?
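A quick check with tsibble's gap tools; has_gaps() and count_gaps() report implicit missing time points for each key:

library(tsibble)
# Any implicit gaps in each sensor's hourly record?
pedestrian |> has_gaps()
# If so, where are they and how long do they run?
pedestrian |> count_gaps()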
Wrangling prior to analysing temporal data includes aggregating to coarser time resolutions, constructing new temporal components such as day of the week, checking for gaps, and imputing missing values.
For the airlines data, you can aggregate multiple quantities, e.g. the number of arrivals and departures, and the average hourly arrival and departure delays.
The US flights data already has some temporal components created, so aggregating by these is easy. Here is departure delay.
library(ggplot2)
flights_mth <- flights_ts |>
  as_tibble() |>
  group_by(month, origin) |>
  summarise(dep_delay = mean(dep_delay, na.rm=TRUE)) |>
  as_tsibble(key = origin, index = month)
ggplot(flights_mth, aes(x=month, y=dep_delay, colour=origin)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  scale_x_continuous("", breaks = seq(1, 12, 1),
                     labels = c("J","F","M","A","M","J",
                                "J","A","S","O","N","D")) +
  scale_y_continuous("av dep delay (mins)", limits = c(0, 25)) +
  theme(aspect.ratio = 0.5)
Aggregate by month, but examine arrival delays.
flights_mth_arr <- flights_ts |>
  as_tibble() |>
  group_by(month, origin) |>
  summarise(arr_delay = mean(arr_delay, na.rm=TRUE)) |>
  as_tsibble(key = origin, index = month)
ggplot(flights_mth_arr, aes(x=month, y=arr_delay, colour=origin)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  scale_x_continuous("", breaks = seq(1, 12, 1),
                     labels = c("J","F","M","A","M","J",
                                "J","A","S","O","N","D")) +
  scale_y_continuous("av arr delay (mins)", limits = c(0, 25)) +
  theme(aspect.ratio = 0.5)
Weekdays vs weekends would be expected to have different patterns of delay, but this variable is not provided, so we construct it.
flights_wk <- flights_ts |>
  as_tibble() |>
  mutate(wday = wday(dt, label=TRUE, week_start = 1)) |>
  group_by(wday, origin) |>
  summarise(dep_delay = mean(dep_delay, na.rm=TRUE)) |>
  mutate(weekend = ifelse(wday %in% c("Sat", "Sun"), "yes", "no")) |>
  as_tsibble(key = origin, index = wday)
ggplot(flights_wk, aes(x=wday, y=dep_delay, fill=weekend)) +
  geom_col() +
  facet_wrap(~origin, ncol=1, scales="free_y") +
  xlab("") +
  ylab("av dep delay (mins)") +
  theme(aspect.ratio = 0.5, legend.position = "none")
Be careful of times!
library(plotly)
flights_airtm <- flights |>
  mutate(dep_min = dep_time %% 100,
         dep_hr  = dep_time %/% 100,
         arr_min = arr_time %% 100,
         arr_hr  = arr_time %/% 100) |>
  mutate(dep_dt = ymd_hm(paste(paste(year, month, day, sep="-"),
                               paste(dep_hr, dep_min, sep=":")))) |>
  mutate(arr_dt = ymd_hm(paste(paste(year, month, day, sep="-"),
                               paste(arr_hr, arr_min, sep=":")))) |>
  mutate(air_time2 = as.numeric(difftime(arr_dt, dep_dt)))
fp <- flights_airtm |>
  sample_n(3000) |>
  ggplot(aes(x=air_time, y=air_time2, label = paste(origin, dest))) +
  geom_abline(intercept=0, slope=1) +
  geom_point()
ggplotly(fp, width=500, height=500)
Why is this not what we expect?
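One way to probe this, as a sketch using flights_airtm from the chunk above: inspect the rows where the computed duration is negative.

library(dplyr)
# Flights that land after midnight get an arrival stamped with the
# departure date, so arr_dt < dep_dt and air_time2 is negative.
# Time zone differences between origin and destination also distort
# the difference between local clock times.
flights_airtm |>
  filter(air_time2 < 0) |>
  select(origin, dest, dep_dt, arr_dt, air_time, air_time2) |>
  head()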
set.seed(328)
harvest <- tsibble(
  year = c(2010, 2011, 2013, 2011, 2012, 2013),
  fruit = rep(c("kiwi", "cherry"), each = 3),
  kilo = sample(1:10, size = 6),
  key = fruit, index = year
)
harvest
# A tsibble: 6 x 3 [1Y]
# Key: fruit [2]
year fruit kilo
<dbl> <chr> <int>
1 2011 cherry 2
2 2012 cherry 7
3 2013 cherry 1
4 2010 kiwi 6
5 2011 kiwi 5
6 2013 kiwi 8
We have already seen the na_ma() function, which imputes using a moving average. Alternatively, na_interpolation() uses the previous and next values to impute.
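A minimal sketch, assuming the imputeTS package and the harvest tsibble from above: first make the implicit gaps explicit with fill_gaps(), then interpolate within each key.

library(imputeTS)
library(tsibble)
library(dplyr)
harvest |>
  fill_gaps() |>                  # turn implicit gaps into explicit NAs
  group_by_key() |>               # interpolate within each fruit separately
  mutate(kilo = na_interpolation(kilo)) |>
  ungroup()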
This type of data, with strong trend and regular seasonality, is easy to model and forecast.
library(gridExtra)
library(ggthemes)   # theme_solid()
load(here::here("data/CO2_ptb.rda"))
CO2.ptb <- CO2.ptb |>
  filter(year > 1980) |>
  filter(co2_ppm > 100) # handle missing values
p <- ggplot(CO2.ptb, aes(x=date, y=co2_ppm)) +
  geom_line(linewidth=1) + xlab("") + ylab("CO2 (ppm)")
p1 <- p + theme(aspect.ratio = 1) + ggtitle("1 to 1 (may be useless)")
p3 <- p + theme(aspect.ratio = 2) + ggtitle("tall & skinny: trend")
p2 <- ggplot(CO2.ptb, aes(x=date, y=co2_ppm)) +
  annotate("text", x=2000, y=375,
           label="CO2 at \n Point Barrow,\n Alaska", size=8) +
  theme_solid()
p4 <- p +
  scale_x_continuous("", breaks = seq(1980, 2020, 5)) +
  theme(aspect.ratio = 0.2) + ggtitle("short & wide: seasonality")
grid.arrange(p1, p2, p3, p4,
             layout_matrix = matrix(c(1,2,3,4,4,4), nrow=2, byrow=TRUE))
Something is fishy here. What is it?
library(sugrrants)   # frame_calendar(), prettify()
library(imputeTS)    # na_interpolation()
flights_hourly <- flights |>
  group_by(time_hour, origin) |>
  summarise(count = n(),
            dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  ungroup() |>
  as_tsibble(index = time_hour, key = origin) |>
  mutate(dep_delay = na_interpolation(dep_delay))
calendar_df <- flights_hourly |>
  filter(origin == "JFK") |>
  mutate(hour = hour(time_hour),
         date = as.Date(time_hour)) |>
  filter(year(date) < 2014) |>
  frame_calendar(x=hour, y=count, date=date, nrow=2)
p1 <- calendar_df |>
  ggplot(aes(x = .hour, y = .count, group = date)) +
  geom_line() +
  theme(axis.line.x = element_blank(),
        axis.line.y = element_blank(),
        aspect.ratio = 0.5)
prettify(p1, size = 3, label.padding = unit(0.15, "lines"))
Delays are much more interesting to examine.
Can you find a reason for one of the days with a big delay?
From ChatGPT: As of my last update in September 2021, a significant late-season snowstorm did affect parts of the United States in April 2013, but it was more focused on the Midwest rather than the Northeast where JFK Airport (John F. Kennedy International Airport) is located. The storm impacted states like Minnesota, Wisconsin, and South Dakota, among others, and brought heavy snowfall and icy conditions.
However, weather conditions can have a cascading effect on flight schedules nationwide, so it’s possible that there were some delays at JFK related to this or other weather phenomena.
calendar_df <- flights_hourly |>
  filter(origin == "JFK") |>
  mutate(hour = hour(time_hour),
         date = as.Date(time_hour)) |>
  filter(year(date) < 2014) |>
  frame_calendar(x=hour, y=dep_delay, date=date, nrow=2)
p1 <- calendar_df |>
  ggplot(aes(x = .hour, y = .dep_delay, group = date)) +
  geom_line() +
  theme(axis.line.x = element_blank(),
        axis.line.y = element_blank(),
        aspect.ratio = 0.5)
prettify(p1, size = 3, label.padding = unit(0.15, "lines"))
p_bourke <- pedestrian |>
  as_tibble() |>
  filter(Sensor == "Bourke Street Mall (North)",
         Date >= ymd("2015-05-03"), Date <= ymd("2015-05-16")) |>
  mutate(date_num = as.numeric(difftime(Date_Time,
           ymd_hms("2015-05-03 00:00:00"),
           units="hours")) + 11) |> # UTC to AEST
  mutate(day = wday(Date, label=TRUE, week_start=1)) |>
  select(date_num, Time, day, Count) |>
  rename(time = date_num, hour = Time, count = Count)
# Fit a Poisson GLM with day and categorical hour effects
p_bourke_lm <- glm(count ~ day + factor(hour), family="poisson",
                   data = p_bourke)
# Function to simulate counts from the fitted Poisson model
simulate_poisson <- function(model, newdata) {
  lambda_pred <- predict(model, newdata, type = "response")
  rpois(length(lambda_pred), lambda = lambda_pred)
}
set.seed(436)
pos <- sample(1:12)
# Build the lineup: the real data plus 11 simulated null data sets
p_bourke_lineup <- bind_cols(.sample = rep(pos[1], nrow(p_bourke)),
                             p_bourke[,-2])
for (i in 1:11) {
  new <- simulate_poisson(p_bourke_lm, p_bourke)
  x <- tibble(time = p_bourke$time, count = new)
  x <- bind_cols(.sample = rep(pos[i+1], nrow(p_bourke)), x)
  p_bourke_lineup <- bind_rows(p_bourke_lineup, x)
}
ggplot(p_bourke_lineup, aes(x=time, y=count)) +
  geom_line() +
  facet_wrap(~.sample, ncol=4) +
  theme(aspect.ratio=0.5,
        axis.text = element_blank(),
        axis.title = element_blank())
Break the association between variables. Here origin is permuted, which breaks its association with dep_delay, while keeping month fixed.
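A sketch of how this could be done with the nullabor package, using the monthly aggregates in flights_mth from earlier; lineup() embeds the data plot among eleven permuted nulls:

library(nullabor)
library(ggplot2)
set.seed(127)
# null_permute("origin") shuffles origin, breaking its association
# with dep_delay while leaving month untouched
l <- lineup(null_permute("origin"), as_tibble(flights_mth), n = 12)
ggplot(l, aes(x = month, y = dep_delay, colour = origin)) +
  geom_line() +
  facet_wrap(~ .sample, ncol = 4) +
  theme(legend.position = "none")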
Which plot has the biggest difference between the three groups?
feasts: time series features
The feasts package provides functions to calculate tignostics for time series.
Remember scagnostics?
Compute tignostics for each series, for example, trend and seasonal strength.
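For example, feat_stl computes STL-based features; this call (using the tourism data that comes with tsibble) returns one row of tignostics per series:

library(feasts)
library(tsibble)
# One row per key: trend_strength, seasonal_strength_year, ...
tourism |> features(Trips, feat_stl)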
tsibbletalk
tourism_shared <- tourism |>
  as_shared_tsibble(spec = (State / Region) * Purpose)
tourism_feat <- tourism_shared |>
  features(Trips, feat_stl)
p1 <- tourism_shared |>
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line(aes(group = Region), alpha = 0.5) +
  facet_wrap(~ Purpose, scales = "free_y")
p2 <- tourism_feat |>
  ggplot(aes(x = trend_strength, y = seasonal_strength_year)) +
  geom_point(aes(group = Region))
subplot(
  ggplotly(p1, tooltip = "Region", width = 1400, height = 700),
  ggplotly(p2, tooltip = "Region", width = 1200, height = 600),
  nrows = 1, widths = c(0.5, 0.5), heights = 1) |>
  highlight(dynamic = FALSE)
Pedestrian counts at Bourke St Mall have a daily seasonality.
DEMO
Famous data: Lynx
Annual numbers of lynx trappings in Canada, 1821–1934, which show an almost 10-year cycle. Explore the periodicity by wrapping the series on itself.
DEMO
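As a sketch of such a demo, assuming the wrapping Shiny module from the tsibbletalk package (tsibbleWrapUI()/tsibbleWrapServer()), where dragging the slider changes the wrapping period to hunt for the cycle length:

library(shiny)
library(dplyr)
library(tsibble)
library(tsibbletalk)
library(ggplot2)
# lynx is a built-in ts object; as_tsibble() gives index/value columns
lynx_tsb <- as_tsibble(lynx) |> rename(trappings = value)
p <- ggplot(lynx_tsb, aes(x = index, y = trappings)) +
  geom_line()
ui <- fluidPage(tsibbleWrapUI("lynx"))
server <- function(input, output, session) {
  tsibbleWrapServer("lynx", p, period = "1 year")
}
shinyApp(ui, server)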
Longitudinal data tracks the same sample of individuals at different points in time. It often has different lengths and different time points for each individual.
When the time points are the same for each individual, it is usually referred to as panel data. When different individuals are measured at each time point, it is called cross-sectional data.
Log(wages) of 888 individuals, measured at various times in their employment, from the US National Longitudinal Survey of Youth.
Wages tend to increase as time in the workforce gets longer, on average.
The higher the education level achieved, the higher overall wage, on average.
brolgar uses tsibble as the data object, and provides tools to sample and facet individuals, to compute features (longnostics) for each individual, and to identify individuals who are representative of summary statistics.
Few individuals experience wages like the overall trend.
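One way to see this, as a minimal sketch using brolgar's key-sampling tools (sample_n_keys() draws whole individuals rather than rows):

library(brolgar)
library(ggplot2)
set.seed(84)
# Spaghetti plot of a random sample of 12 individuals:
# their wage paths rarely resemble the smooth overall trend
wages |>
  sample_n_keys(size = 12) |>
  ggplot(aes(x = xp, y = ln_wages, group = id)) +
  geom_line()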
Remember scagnostics?
Compute longnostics for each subject, for example, the slope of a least-squares line fitted to each individual's log(wages).
wages_slope <- wages |>
  add_n_obs() |>
  filter(n_obs > 4) |>
  add_key_slope(ln_wages ~ xp) |>
  as_tsibble(key = id, index = xp)
wages_spread <- wages |>
  features(ln_wages, feat_spread) |>
  right_join(wages_slope, by = "id")
wages_slope |>
  filter(.slope_xp > 0.3) |>
  ggplot(aes(x = xp, y = ln_wages,
             group = id, colour = factor(id))) +
  geom_line() +
  xlim(c(0, 4.5)) +
  ylim(c(0, 4.5)) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")
A different style of five number summary: What does average look like? What do extremes look like?
Find the individuals who are representative of the minimum, median, maximum, etc. of a particular feature, e.g. trend, using keys_near(). This reports the individual closest to a particular statistic.
wages_threenum() returns three individuals: the minimum, the maximum, and the one closest to the median value of a particular feature. wages_fivenum() returns five individuals: the minimum, the maximum, and those closest to the median, Q1 and Q3 values.
wages_fivenum <- wages |>
  add_n_obs() |>
  filter(n_obs > 6) |>
  key_slope(ln_wages ~ xp) |>
  keys_near(key = id,
            var = .slope_xp,
            funs = l_five_num) |>
  left_join(wages, by = "id") |>
  as_tsibble(key = id, index = xp)
wages_fivenum |>
  ggplot(aes(x = xp, y = ln_wages, group = id)) +
  geom_line() +
  ylim(c(0, 4.5)) +
  facet_wrap(~stat, ncol = 3) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6, legend.position = "none")
Fit a mixed effects model: years of experience (xp) and education (high_grade) as fixed effects, with a random intercept and slope in xp for each subject.
Summary of the fit
library(lme4)      # lmer()
library(modelr)    # add_predictions(), add_residuals()
library(patchwork) # plot_layout()
wages_fit_int <- lmer(ln_wages ~ xp + high_grade + (xp | id),
                      data = wages)
wages_aug <- wages |>
  add_predictions(wages_fit_int, var = "pred_int") |>
  add_residuals(wages_fit_int, var = "res_int")
m1 <- ggplot(wages_aug,
             aes(x = xp, y = pred_int, group = id)) +
  geom_line(alpha = 0.2) +
  xlab("years of experience") +
  ylab("wages (log)") +
  theme(aspect.ratio = 0.6)
m2 <- ggplot(wages_aug,
             aes(x = pred_int, y = res_int, group = id)) +
  geom_point(alpha = 0.5) +
  xlab("fitted values") + ylab("residuals")
m1 + m2 + plot_layout(ncol = 2)
Diagnosing the fit: each individual model
wages_aug |>
  add_n_obs() |>
  filter(n_obs > 4) |>
  sample_n_keys(size = 12) |>
  ggplot() +
  geom_line(aes(x = xp, y = pred_int, group = id,
                colour = factor(id))) +
  geom_point(aes(x = xp, y = ln_wages,
                 colour = factor(id))) +
  facet_wrap(~id, ncol = 3) +
  xlab("Years of experience") + ylab("Log wages") +
  theme(aspect.ratio = 0.6, legend.position = "none")
ETC5521 Lecture 9 | ddde.numbat.space