ETC5521: Diving Deeply into Data Exploration

Exploring data having a space and time context Part II

Professor Di Cook

Department of Econometrics and Business Statistics







You show me continents, I see the islands,

You count the centuries, I blink my eyes

~Björk

Outline

  • Breaking up data by time, and by space
  • Changing focus:
    • Maps of space over time
    • Exploring time over space with glyph maps
  • Inference for spatial trends
  • A flash back to the 1970s: Tukey’s median polish
  • Working with spatial polygon data
    • Making a choropleth map
    • Bending the choropleth into a cartogram
    • Tiling spatial regions

Approach

With spatiotemporaral data, you need to be able to pivot to focus on (1) time, (2) space, and (3) both together, although this latter task is harder.

The cubble object in the cubble R package, makes these operations easier.

Example: Temperature change

6 years of monthly measurements of a 24x24 spatial grid from Central America collated by Paul Murrell, U. Auckland.

Rows: 41,472
Columns: 15
$ time        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ id          <chr> "1-1", "1-2", "1-3", "1-4", "1-5", "1-…
$ lat         <dbl> -21, -21, -21, -21, -21, -21, -21, -21…
$ long        <dbl> -114, -111, -109, -106, -104, -101, -9…
$ elevation   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ date        <dttm> 1995-01-01, 1995-01-01, 1995-01-01, 1…
$ cloudlow    <dbl> 31, 32, 32, 39, 48, 50, 51, 52, 54, 56…
$ cloudmid    <dbl> 2.0, 2.5, 3.5, 4.0, 4.5, 2.5, 4.5, 5.0…
$ cloudhigh   <dbl> 0.5, 1.5, 1.5, 1.0, 0.5, 0.0, 0.0, 0.0…
$ ozone       <int> 260, 260, 260, 258, 258, 258, 256, 258…
$ pressure    <int> 1000, 1000, 1000, 1000, 1000, 1000, 10…
$ surftemp    <dbl> 297, 297, 297, 297, 296, 296, 296, 296…
$ temperature <dbl> 297, 296, 296, 296, 296, 295, 296, 295…
$ month       <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan…
$ year        <int> 1995, 1995, 1995, 1995, 1995, 1995, 19…
Code
nasa_cb <- as_cubble(as_tibble(nasa), 
                     key=id, 
                     index=time, 
                     coords=c(long, lat))
nasa_cb
# cubble:   key: id [576], index: time, nested form
# spatial:  [-113.75, -21.25, -56.25, 36.25], Missing CRS!
# temporal: time [int], date [dttm], cloudlow [dbl],
#   cloudmid [dbl], cloudhigh [dbl], ozone [int], pressure
#   [int], surftemp [dbl], temperature [dbl], month [ord],
#   year [int]
   id      lat   long elevation ts                
   <chr> <dbl>  <dbl>     <dbl> <list>            
 1 1-1   -21.2 -114.          0 <tibble [72 × 11]>
 2 1-2   -21.2 -111.          0 <tibble [72 × 11]>
 3 1-3   -21.2 -109.          0 <tibble [72 × 11]>
 4 1-4   -21.2 -106.          0 <tibble [72 × 11]>
 5 1-5   -21.2 -104.          0 <tibble [72 × 11]>
 6 1-6   -21.2 -101.          0 <tibble [72 × 11]>
 7 1-7   -21.2  -98.8         0 <tibble [72 × 11]>
 8 1-8   -21.2  -96.2         0 <tibble [72 × 11]>
 9 1-9   -21.2  -93.8         0 <tibble [72 × 11]>
10 1-10  -21.2  -91.2         0 <tibble [72 × 11]>
# ℹ 566 more rows

Spatial and temporal

Code
ggplot() + 
  geom_point(data=nasa_cb, aes(x=long, y=lat)) +
  geom_point(data=dplyr::filter(nasa_cb, 
       id == "5-20"),
       aes(x=long, y=lat),
       colour="orange", size=4) +
  geom_point(data=dplyr::filter(nasa_cb, 
       id == "20-2"),
       aes(x=long, y=lat),
       colour="turquoise", size=4)

Code
nasa_cb_f <- nasa_cb |> 
  face_temporal() 
ggplot(nasa_cb_f) + 
  geom_line(aes(x=date, 
                 y=surftemp, 
                 group=id), alpha=0.2) +
  geom_line(data=filter(nasa_cb_f , 
       id=="5-20"),
       aes(x=date, 
                 y=surftemp, 
                 group=id),
       colour="orange", linewidth=2) +
  geom_line(data=filter(nasa_cb_f , 
       id=="20-2"),
       aes(x=date, 
                 y=surftemp, 
                 group=id),
       colour="turquoise", linewidth=2) +
  theme(aspect.ratio = 0.5)

Pre-processing of time and space

Think of time and space as ordered categorical variables.

  • Time may need to be converted to categories.
  • Spatial variable might need to be discretised, or gridded.


This data is already gridded. Time is an integer from 1 to 72 (6 years of 12 months), as well as a date, and month and year. Space is a 24x24 grid of longitude and latitude, and also provided as an integer 1 to 24 in both x and y.

Focus on spatial (1/2)



Slice one month, and show gridded temperatures as a tiled display on spatial coordinates.

In January 2005, temperatures are

  • cool over land in the north
  • cool over the Andes in south america
  • warm on the equator, and along the coastline

There are 12*6=72 maps to make!! No problem.

# Get the map
sth_america <- map_data("world") |>
  filter(between(long, -115, -53), between(lat, -20.5, 41))

nasa_cb |> 
  face_temporal() |>
  filter(month == "Jan", year == 1995) |>
  select(id, time, surftemp) |>
  unfold(long, lat) |>
  ggplot() + 
  geom_tile(aes(x=long, y=lat, fill=surftemp)) +
  geom_path(data=sth_america, 
            aes(x=long, y=lat, group=group), 
            colour="white", linewidth=1) +
  scale_fill_viridis_c("", option = "magma") +
  ggtitle("January 1995") +
  theme_map() +
  theme(legend.position = "bottom", 
        plot.title = element_text(size = 24)) 

Focus on spatial (2/2)

  • Exploring spatial trend over time is obtained by .monash-blue2[faceting the maps by time].

  • Can you see El Nino in 1997? Can you see the summer vs winter in the different hemispheres?

nasa_cb |> face_temporal() |>
  select(id, time, month, year, surftemp) |>
  unfold(long, lat) |>
  ggplot() + 
  geom_tile(aes(x=long, y=lat, fill=surftemp)) +
  facet_grid(year~month) +
  scale_fill_viridis_c("", option = "magma") +
  theme_map() +
  theme(legend.position = "bottom") 

Focus on temporal (1/4)

Code
nasa_cb %>% face_temporal() %>%
  select(id, time, month, year, surftemp) %>%
  unfold(long, lat) %>%
  ggplot() +
    geom_polygon(data=sth_america, 
            aes(x=long, y=lat, group=group), 
            fill="#014221", alpha=0.2, colour="#ffffff") +
    cubble::geom_glyph_box(data=nasa, 
                           aes(x_major = long, 
                               x_minor = date,
                               y_major = lat, 
                               y_minor = surftemp), fill=NA) +
    cubble::geom_glyph(data=nasa, 
                       aes(x_major = long, 
                           x_minor = date,
                           y_major = lat, 
                           y_minor = surftemp)) +
    theme_map() 

A glyphmap shows a (small) time series at each spatial location.

  • Several scaling choices:
    • global: overall min and max used to scale all locations
    • local: each location scaled on it’s own min/max


  • Global scale used here
  • Can see differences in the overall magnitude, particularly north to south.

Focus on temporal (2/4)

Code
nasa_cb %>% face_temporal() %>%
  select(id, time, month, year, surftemp) %>%
  unfold(long, lat) %>%
  ggplot() +
    geom_polygon(data=sth_america, 
            aes(x=long, y=lat, group=group), 
            fill="#014221", alpha=0.2, colour="#ffffff") +
    cubble::geom_glyph_box(data=nasa, 
                           aes(x_major = long, 
                               x_minor = date,
                               y_major = lat, 
                               y_minor = surftemp), fill=NA) +
    cubble::geom_glyph(data=nasa, 
                       aes(x_major = long, 
                           x_minor = date,
                           y_major = lat, 
                           y_minor = surftemp), 
                       global_rescale = FALSE) +
    theme_map() 

  • Note: Local scale used, min/max for each spatial location

  • El Nino year in equatorial region may be visible.

  • Notice also odd patterns on the west (Andes mountains) of South America.

Focus on temporal (3/4)

Code
nasa_cb %>% face_temporal() %>%
  select(id, time, month, year, surftemp) %>%
  unfold(long, lat) %>%
  ggplot() +
    geom_polygon(data=sth_america, 
            aes(x=long, y=lat, group=group), 
            fill="#014221", alpha=0.2, colour="#ffffff") +
    cubble::geom_glyph(data=nasa, 
                       aes(x_major = long, 
                           x_minor = date,
                           y_major = lat, 
                           y_minor = surftemp), 
                       global_rescale = FALSE,
                       polar = TRUE) +
    theme_map() 

  • Note: Local scale used, min/max for each spatial location, and polar coordinates used

Focus on temporal (4/4)

Code
nasa_mth <- nasa_cb %>% 
  face_temporal() %>%
  select(id, time, month, year, surftemp) %>%
  unfold(long, lat) %>%
  as_tibble() |>
  group_by(id, month) |>
  dplyr::summarise(tmin = min(surftemp),
            tmax = max(surftemp), 
            long = min(long),
            lat = min(lat)) |>
  ungroup() |>
  mutate(month = as.numeric(month))
ggplot() +
    geom_polygon(data=sth_america, 
            aes(x=long, y=lat, group=group), 
            fill="#014221", alpha=0.2, colour="#ffffff") +
    geom_glyph_ribbon(data = nasa_mth, 
                      aes(x_major = long, 
                          x_minor = month,
                          y_major = lat, 
                          ymin_minor = tmin,
                          ymax_minor = tmax), 
                          width = 2) +
    theme_map() 

Seasonality can be the focus in the glyphs.

Here monthly min and max temperature over the 6 years is shown, as ribbon glyphs.

Adding interaction

DEMO
library(tsibble)
library(tsibbletalk)
library(lubridate)
library(plotly)
nasa_shared <- nasa %>% 
  mutate(date = ymd(date)) %>%
  select(long, lat, date, surftemp, id) %>%
  as_tsibble(index=date, key=id) %>%
  as_shared_tsibble()
p1 <- ggplot() +
  geom_polygon(data=sth_america, 
            aes(x=long, y=lat, group=group), 
            colour="#ffffff", alpha=0.2, fill="#014221") +
  geom_point(data=nasa_shared, aes(x = long, 
         y = lat, group = id)) 
p2 <- nasa_shared %>%
  ggplot(aes(x = date, y = surftemp)) +
  geom_line(aes(group = id), alpha = 0.5) 
subplot(
    ggplotly(p1, tooltip = "Region"),
    ggplotly(p2, tooltip = "Region"),
    nrows = 1, widths=c(0.3, 0.7)) %>%
  highlight(dynamic = TRUE)

Inference for spatial trend

generate-data
# Set up a simple example
set.seed(945)
x <- 1:24
y <- 1:24
xy <- expand.grid(x, y)
d <- tibble(x=xy$Var1, y=xy$Var2) %>%
  mutate(v = x+2*y) 
d_sf <- SpatialPointsDataFrame(d[,1:2],
                   data.frame(d[,3]))
vgm_mod <- vgm(psill=5, model = "Sph", range=20, nmax=30)
d_dummy <- gstat(formula = v~1, dummy=TRUE, beta=0,
           model=vgm_mod)
d_err <- predict(d_dummy, d_sf, nsim=1)
d <- d %>%
  mutate(e = d_err@data$sim1*3) %>%
  mutate(ve = v+e)
plot
obs <- ggplot(d, aes(x, y, fill = ve)) +
  geom_tile() +
  scale_fill_viridis_c("") +
  theme(aspect.ratio = 1) +
  ggtitle("Observed") +
  theme(legend.position = "none",
              axis.text = element_blank(),
              axis.title = element_blank())
trend <- ggplot(d, aes(x, y, fill = v)) +
  geom_tile() +
  scale_fill_viridis_c("", option = "magma") +
  theme(aspect.ratio = 1) +
  ggtitle("Trend") +
  theme(legend.position = "none",
              axis.text = element_blank(),
              axis.title = element_blank())
err <- ggplot(d, aes(x, y, fill = e)) +
  geom_tile() +
  scale_fill_distiller("", palette = "PRGn") +
  theme(aspect.ratio = 1) +
  ggtitle("Residual") +
  theme(legend.position = "none",
              axis.text = element_blank(),
              axis.title = element_blank())
obs + trend + err + plot_layout(ncol=3)

Generate nulls by simulating from the spatial dependence model

generate-nulls
set.seed(953)
d_null <- predict(d_dummy, d_sf, nsim=5)
pos <- sample(1:6, 1)
lineup_plots <- list()
j <- 1
for (i in 1:6) {
  if (pos == i) { # plot data
    p <- ggplot(d, aes(x, y, fill = scale(ve))) +
           geom_tile() 
  } 
  else { # plot nulls
    null_df <- tibble(x=d$x, y=d$y, v=d_null@data[,j])
    p <- ggplot(null_df, aes(x, y, fill = scale(v))) +
           geom_tile() 
   j <- j + 1
  }
  p <- p +
        scale_fill_viridis_c("", option = "magma") +
        theme(legend.position = "none",
              axis.text = element_blank(),
              axis.title = element_blank())
    
  lineup_plots[[paste(i)]] <- p
}
wrap_plots(lineup_plots, ncol = 3)

Median polish (1/2)

  1. Compute overall median and residual table.
  2. Compute the row medians.
  3. Create a new residual table from the row medians.
  4. Compute the column medians.
  5. Create a new residual table from the column medians.
  6. Second iteration – row effects.
  7. Second iteration – column effects
  8. Iterate through steps 2-5 until row and column effect medians are close to 0.

Median polish (2/2)

Column and row effects from median polish

Code
library(tukeyedar)
d_pol <- eda_pol(d, row = x, col = y, val = ve, plot=FALSE)
long <- ggplot(d, aes(x, y=ve)) +
  geom_point() +
  geom_smooth(se=F) +
  geom_point(data = d_pol$row, aes(x=x, 
                       y=effect + d_pol$global), 
    colour = "#D93F00", size=3) +
  ylab("obs") +
  theme(aspect.ratio = 1)
lat <- ggplot(d, aes(y, y=ve)) +
  geom_point() +
  ylab("obs") +
  geom_smooth(se=F) +
  geom_point(data = d_pol$col, aes(x=y, 
                      y=effect + d_pol$global), 
    colour = "#D93F00", size=3) +
  theme(aspect.ratio = 1)
long + lat + plot_layout(ncol=2)

Resulting in the residuals as:

Code
pol_res <- ggplot(d_pol$long, aes(x, y, fill = ve)) +
  geom_tile() +
  scale_fill_distiller("", palette = "PRGn") +
  theme(aspect.ratio = 1) +
  ggtitle("Polish Residuals") +
  theme(legend.position = "none",
              axis.text = element_blank(),
              axis.title = element_blank())
err <- err +
  theme(legend.position = "none",
              axis.text = element_blank(),
              axis.title = element_blank()) 
err + pol_res + plot_layout(ncol=2)

Spatial data needs maps

Spatial polygon data

Spatial polygon data

Code
oz <- world_map %>% 
  filter(region == "Australia") %>%
  filter(lat > -50)
m1 <- ggplot(oz, aes(x = long, y = lat)) + 
  geom_point(size=0.2) + #<<
  coord_map() +
  ggtitle("Points")
m2 <- ggplot(oz, aes(x = long, y = lat, 
               group = group)) + #<<
  geom_path() + #<<
  coord_map() +
  ggtitle("Path")
m3 <- ggplot(oz, aes(x = long, y = lat, 
               group = group)) + #<<
  geom_polygon(fill = "#607848", colour = "#184848") + #<<
  coord_map() +
  ggtitle("Filled polygon")
m1 + m2 + m3

Spatial polygon data, includes measured values (variables) associated with a spatial polygon.

Spatial polygon data

STEP 1: Thin your map!

Most spatial polygon data is large, with high resolution on the polygons.

This makes them SLOW to plot.

For data analysis needs fast plotting, and resolution can be smaller.

rmapshaper::ms_simplify() is the best function.

sf: Simple spatial polygon objects in R


Has a coordinate system (projection), and bounding box. Supports technically accurate distance calculations between coordinates (on a sphere).

Choropleth maps

A choropleth map is used to show a measured variable associated with a political or geographic region. Polygons for the region are filled with colour.

The purpose is to examine the spatial distribution of a variable.

Code
sa2 <- strayr::read_absmap("sa22011") %>% 
  filter(!st_is_empty(geometry)) %>% 
  filter(!state_name_2011 == "Other Territories") %>% 
  filter(!sa2_name_2011 == "Lord Howe Island")
sa2 <- sa2 %>% rmapshaper::ms_simplify(keep = 0.5, keep_shapes = TRUE) # Simplify the map!!!
SIR <- read_csv(here::here("data/SIR Downloadable Data.csv")) %>% 
  filter(SA2_name %in% sa2$sa2_name_2011) %>% 
  dplyr::select(Cancer_name, SA2_name, Sex_name, p50) %>% 
  filter(Cancer_name == "Thyroid", Sex_name == "Females")
ERP <- read_csv(here::here("data/ERP.csv")) %>%
  filter(REGIONTYPE == "SA2", Time == 2011, Region %in% SIR$SA2_name) %>% 
  dplyr::select(Region, Value)
# Alternative maps
# Join with sa2 sf object
sa2thyroid_ERP <- SIR %>% 
  left_join(sa2, ., by = c("sa2_name_2011" = "SA2_name")) %>%
  left_join(., ERP %>% 
              dplyr::select(Region, 
              Population = Value), by = c("sa2_name_2011"= "Region")) %>% 
  filter(!st_is_empty(geometry))
sa2thyroid_ERP <- sa2thyroid_ERP %>% 
  #filter(!is.na(Population)) %>% 
  filter(!sa2_name_2011 == "Lord Howe Island") %>% 
  mutate(SIR = map_chr(p50, aus_colours)) %>% 
  st_as_sf() 
save(sa2, file="data/sa2.rda")
save(sa2thyroid_ERP, file="data/sa2thyroid_ERP.rda")
Code
# Plot the choropleth
load("../data/sa2thyroid_ERP.rda")
aus_ggchoro <- ggplot(sa2thyroid_ERP) + 
  geom_sf(aes(fill = SIR), size = 0.1) + 
  scale_fill_identity() + invthm
aus_ggchoro

The problem with choropleth maps

The problem is that high density population areas may be very small geographically. They can disappear in a choropleth map, which means that we get a biased sense of the spatial distribution of a variable.

Cartograms

A cartogram transforms the geographic shape to match the value of a statistic or the population. Its a useful exploratory technique for examining the spatial distribution of a measured variable.



BUT they don’t work for Australia.

Code
# transform to NAD83 / UTM zone 16N
nc <- nc |>
  mutate(lBIR79 = log(BIR79))
nc_utm <- st_transform(nc, 26916)

orig <- ggplot(nc) + 
  geom_sf(aes(fill = lBIR79)) +
  ggtitle("original") +
  theme_map() +
  theme(legend.position = "none")

nc_utm_carto <- cartogram_cont(nc_utm, weight = "BIR74", itermax = 5)

carto <- ggplot(nc_utm_carto) + 
  geom_sf(aes(fill = lBIR79)) +
  ggtitle("cartogram") +
  theme_map() +
  theme(legend.position = "none")

nc_utm_dorl <- cartogram_dorling(nc_utm, weight = "BIR74")

dorl <- ggplot(nc_utm_dorl) + 
  geom_sf(aes(fill = lBIR79)) +
  ggtitle("dorling") +
  theme_map() +
  theme(legend.position = "none")

orig + carto + dorl + plot_layout(ncol=1)

Hexagon tile

A hexagon tile map represents every spatial polygon with an equal sized hexagon. In dense areas these will be tesselated, but separated hexagons are placed at centroids of the remote spatial regions.


It’s not perfect, but now the higher incidence in Perth suburbs, some melbourne suburbs, and Sydney are more visible.

Code
if (!file.exists(here::here("data/aus_hexmap.rda"))) {
  
## Create centroids set
centroids <- sa2 %>% 
  create_centroids(., "sa2_name_2011")
## Create hexagon grid
grid <- create_grid(centroids = centroids,
                    hex_size = 0.2,
                    buffer_dist = 5)
## Allocate polygon centroids to hexagon grid points
aus_hexmap <- allocate(
  centroids = centroids,
  hex_grid = grid,
  sf_id = "sa2_name_2011",
  ## same column used in create_centroids
  hex_size = 0.2,
  ## same size used in create_grid
  hex_filter = 10,
  focal_points = capital_cities,
  width = 35,
  verbose = FALSE
)
save(aus_hexmap, 
     file = here::here("data/aus_hexmap.rda")) 
}

load(here::here("data/aus_hexmap.rda"))
## Prepare to plot
fort_hex <- fortify_hexagon(data = aus_hexmap,
                            sf_id = "sa2_name_2011",
                            hex_size = 0.2) %>% 
            left_join(sa2thyroid_ERP %>% select(sa2_name_2011, SIR, p50))
## Make a plot
aus_hexmap_plot <- ggplot() +
  geom_sf(data=sa2thyroid_ERP, fill=NA, colour="grey60", size=0.1) +
  geom_polygon(data = fort_hex, aes(x = long, y = lat, group = hex_id, fill = SIR)) +
  scale_fill_identity() +
  invthm 
aus_hexmap_plot  

Resources