ETC5521 Tutorial 10

Exploring data having a space and time context

Author

Prof. Di Cook

🎯 Objectives

This tutorial practices rearranging spatiotemporal data to focus on spatial or temporal patterns, and constructing choropleth maps and cartograms.

🔧 Preparation

install.packages(c("tidyverse", "here", "lubridate", "GGally", "tsibble", "cubble", "forcats", "cartogram", "sf", "patchwork", "ggthemes", "sugarbag", "viridis", "rmapshaper", "readxl"))
remotes::install_github("runapp-aus/strayr")
  • Open your RStudio Project for this unit (the one you created in week 1, ETC5521). Create a .qmd document for this week's activities.
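Once the packages are installed, a setup chunk near the top of your .qmd along these lines will load everything used in the code below (a suggested sketch; load only what you need):

library(tidyverse)   # read_csv(), dplyr, ggplot2, stringr
library(lubridate)   # ydm() for parsing dates
library(tsibble)     # as_tsibble()
library(cubble)      # check_key()
library(sf)          # spatial data objects, st_transform(), st_cast()
library(rmapshaper)  # ms_simplify() for thinning polygons
library(cartogram)   # cartogram_cont()
library(sugarbag)    # create_hexmap(), fortify_hexagon()
library(ggthemes)    # theme_map()
library(readxl)      # read_xlsx() for the population data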

📥 Exercises

Exercise 1: Melbourne Covid-19 outbreak

In Melbourne we were in a strict lockdown for much of 2020, and large chunks of 2021. Each week we got our hopes up that restrictions might be eased, and each week these hopes were dashed by announcements keeping the restrictions a little longer. The data we have collected here are the case counts by Victorian local government area (LGA) since the beginning of July 2020. We will examine the spatiotemporal distribution of these counts.

Working with spatial data is always painful! It almost always requires some ugly code.

  • Part of the reason for the difficulty is the use of special data objects that describe maps. There are several different choices, and some packages and tools use one while others use another, so not all tools work together. The sf package helps enormously, but when you run into errors it can be hard to debug.
  • Another reason is that map objects can be very large, which makes sense for accurate mapping, but for data analysis and visualisation we’d rather have smaller, even if slightly inaccurate, spatial objects. It is virtually always necessary to thin out map data before doing further analysis - you need special tools for this, e.g. rmapshaper. We don’t really need this for the exercises here, because the strayr version of the LGAs is already thinned.
  • Another problem commonly encountered is that there are numerous coordinate systems, and types of projections of the 3D globe onto a 2D canvas. We have become accustomed to lat/long, but like time it’s an awkward scale to compute on, because a translation from E/W and N/S to positive and negative values is needed. More commonly a Universal Transverse Mercator (UTM) projection is the standard, but it’s far less intuitive to use (see the short sketch after this list).
  • And yet another reason is that keys linking data tables and spatial tables may not match perfectly because there are often synonyms or slightly different name preferences between different data collectors.
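As a small illustration of the coordinate system juggling, here is a sketch using sf, with an approximate lat/long for Melbourne:

# A point in lat/long (EPSG:4326, WGS84): roughly Melbourne CBD
melb <- st_sfc(st_point(c(144.96, -37.81)), crs = 4326)
# Project to metres (EPSG:3395, world Mercator), as needed later for the cartogram
melb_m <- st_transform(melb, 3395)
# ... and back to lat/long
st_transform(melb_m, 4326)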

The code for all the analysis is provided for you in the solution. We recommend that you run the code in steps to see what it is doing, and why the mutating and text manipulations are necessary. Talk about the code with each other to help you understand it.

a. Read case counts for 2020

The file melb_lga_covid.csv contains the cases by LGA. Read the data in and inspect the result. You should find that some variables are type chr because “null” has been used to code entries on some days. This needs fixing, and missings should also be converted to 0. Why does it make sense to substitute missings with 0 here?

NAs really have to be 0s. It’s likely that the cells were left blank when the numbers were recorded, because there were no cases that day.

# Read the data
# Replace null with 0, for three LGAs
# Convert to long form to join with polygons
# Make the date variables a proper date
# Set NAs to 0, this is a reasonable assumption
covid <- read_csv("https://raw.githubusercontent.com/numbats/ddde/master/data/melb_lga_covid.csv") |>
  mutate(Buloke = as.numeric(ifelse(Buloke == "null", "0", Buloke))) |>
  mutate(Hindmarsh = as.numeric(ifelse(Hindmarsh == "null", "0", Hindmarsh))) |>
  mutate(Towong = as.numeric(ifelse(Towong == "null", "0", Towong))) |>
  pivot_longer(cols = Alpine:Yarriambiack, names_to="NAME", values_to="cases") |>
  mutate(Date = ydm(paste0("2020/",Date))) |>
  mutate(cases=replace_na(cases, 0))

b. Check the data

Check the case counts to learn whether they are daily or cumulative. The best way to do this is to select one LGA where there were substantial cases, and make a time series plot. If the counts are cumulative, calculate the daily counts, and re-check the temporal trend for your chosen LGA. Describe the temporal trend, and any visible artifacts.

This is cumulative data, so take the most recent case count as the total to use for each LGA. To explore the temporal trend, you need to take lags to compute the differences between days. Interestingly, this also generates some negative daily counts, likely corrections to earlier figures!

# Check the case counts
covid |> filter(NAME == "Brimbank") |>
  ggplot(aes(x=Date, y=cases)) +
    geom_point()

# Case counts are cumulative, so take lags to get daily case counts
covid <- covid |>
  group_by(NAME) |>
  mutate(new_cases = cases - dplyr::lag(cases)) |>
  na.omit()
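
# A quick check (sketch): list any negative daily counts,
# likely corrections to earlier cumulative figures
# covid |> filter(new_cases < 0)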

# Check the case counts
covid |> filter(NAME == "Brimbank") |>
  ggplot(aes(x=Date, y=new_cases)) +
    geom_col() 

# Only keep the latest date information
covid_tot <- covid |>
  filter(Date == max(Date))
# Double-check we have all LGAs
# covid |> count(NAME)

c. Spatial polygons size

Now let’s get polygon data of Victorian LGAs using the strayr package. The map is already fairly small, so it doesn’t need any more thinning, but we’ll look at how thinning works.

Get a copy of the lga2018 map using strayr::read_absmap(). Save the resulting data as an .rda file, and plot the map.

Now run rmapshaper::ms_simplify(), saving the result as a different object. Save this object as an .rda file too, and plot the map.

What is the difference in file size before and after thinning? Can you see a difference in the map?

Before thinning the file is 523 KB, and after thinning it is just 33 KB.

The thinning has removed some small islands, and smoothed the boundaries between LGAs.

# Read the LGA data from the strayr package.
# This has LGAs for all of Australia, so we need to
# filter to the Victorian LGAs, which also avoids LGAs
# from other states with the same name. The names are
# matched to the covid data names in the next part,
# where a regex removes the (state/LGA type) text strings.
# Good reference: https://r-spatial.github.io/sf/articles/sf1.html
lga <- strayr::read_absmap("lga2018") |>
  rename(lga = lga_name_2018) |>
  filter(state_name_2016 == "Victoria") 
save(lga, file="data/lga.rda")
ggplot(lga) + geom_sf() + theme_map()

lga_sm <- ms_simplify(lga)
save(lga_sm, file="data/lga_sm.rda")
ggplot(lga_sm) + geom_sf() + theme_map()
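
To get the before and after file sizes quoted above, you can check the saved files directly:

# Compare file sizes (in KB) before and after thinning
file.size("data/lga.rda") / 1024
file.size("data/lga_sm.rda") / 1024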

d. Spatial polygons matching

Now let’s match the polygon data of Victorian LGAs to the COVID counts. The function cubble::check_key() can be used to check whether the keys match between the spatial and temporal data sets.

You will find that we need to fix some names of LGAs, even though cubble does a pretty good job of working out which are supposed to match.

load("../data/lga.rda")

# Turn the full covid data into a tsibble

covid <- covid |>
  select(-cases) |>
  rename(lga = NAME, date=Date, cases = new_cases) 
covid_ts <- as_tsibble(covid, key=lga, index=date)

# Check the name matching using cubble function, check_key
covid_matching <- check_key(spatial = lga, temporal = covid_ts)
covid_matching
$paired
# A tibble: 0 × 0

$potential_pairs
# A tibble: 78 × 2
   spatial        temporal  
   <chr>          <chr>     
 1 Alpine (S)     Alpine    
 2 Ararat (RC)    Ararat    
 3 Ballarat (C)   Ballarat  
 4 Banyule (C)    Banyule   
 5 Bass Coast (S) Bass Coast
 6 Baw Baw (S)    Baw Baw   
 7 Bayside (C)    Bayside   
 8 Benalla (RC)   Benalla   
 9 Boroondara (C) Boroondara
10 Brimbank (C)   Brimbank  
# ℹ 68 more rows

$others
$others$spatial
[1] "Colac-Otway (S)"                       
[2] "Unincorporated Vic"                    
[3] "No usual address (Vic.)"               
[4] "Migratory - Offshore - Shipping (Vic.)"

$others$temporal
[1] "Colac Otway"


attr(,"class")
[1] "key_tbl" "list"   
# Fix matching
lga <- lga |> 
  mutate(lga = ifelse(lga == "Colac-Otway (S)", "Colac Otway (S)", lga)) |>
  filter(!(lga %in% covid_matching$others$spatial)) |>
  mutate(lga = str_replace(lga, " \\(.+\\)", "")) # Remove (.)

# Re-do the name matching
covid_matching <- check_key(spatial = lga, temporal = covid_ts)
covid_matching
$paired
# A tibble: 79 × 2
   spatial    temporal  
   <chr>      <chr>     
 1 Alpine     Alpine    
 2 Ararat     Ararat    
 3 Ballarat   Ballarat  
 4 Banyule    Banyule   
 5 Bass Coast Bass Coast
 6 Baw Baw    Baw Baw   
 7 Bayside    Bayside   
 8 Benalla    Benalla   
 9 Boroondara Boroondara
10 Brimbank   Brimbank  
# ℹ 69 more rows

$potential_pairs
# A tibble: 0 × 0

$others
$others$temporal
character(0)

$others$spatial
character(0)


attr(,"class")
[1] "key_tbl" "list"   
# Join covid cases to spatial polygons
covid_tot <- covid_tot |>
  rename(lga = NAME)
covid_lga <- left_join(lga, covid_tot) |>
  st_as_sf()

e. Choropleth map

Sum the counts over the time period for each LGA, merge the COVID data with the map polygons (LGA) and create a choropleth map. The LGA data is an sf object, so geom_sf will automatically grab the geometry from the object to make the spatial polygons. Where was the highest COVID incidence?

The high count LGAs are all in Melbourne, mostly in the western suburbs.

# Make choropleth map, with appropriate colour palette
ggplot(covid_lga) + 
  geom_sf(aes(fill = cases, label=lga), colour="white") + 
  scale_fill_distiller("Cases", palette = "YlOrRd",
                       direction=1) + 
  theme_map() +
  theme(legend.position="bottom")

# Make it interactive
# plotly::ggplotly() 

f. Cartogram

To make a population-transformed polygon we need to get population data for each LGA. The file VIF2019_Population_Service_Ages_LGA_2036.xlsx has been extracted from the Vic Gov web site. It is a complicated xlsx file, with the data in sheet 3, starting 13 rows down. The readxl package is handy here to extract the population data needed. You’ll need to join the population counts to the map data to make a cartogram. Once you have the transformed polygon data, the same plotting code that created the choropleth map can be used.

Interestingly, the populations of the LGAs are quite different, with the densely populated LGAs in Melbourne. These get greatly enlarged by the algorithm, and the LGA polygons from the rural areas are much smaller. It makes it easier to see the LGAs with high case counts, and also all of the LGAs in the city with low counts. (Note: the white inner-city polygons are not actually LGAs, just unfortunate artifacts of the cartogram transformation.)

# Incorporate population data to make cartogram
# Population from https://www.planning.vic.gov.au/land-use-and-population-research/victoria-in-future/tab-pages/victoria-in-future-data-tables
# Data can be downloaded from https://github.com/numbats/eda/blob/master/data/VIF2019_Population_Service_Ages_LGA_2036.xlsx
pop <- read_xlsx("../data/VIF2019_Population_Service_Ages_LGA_2036.xlsx", sheet=3, skip=13, col_names = FALSE) |>
  select(`...4`, `...22`) |>
  rename(lga = `...4`, pop=`...22`) |>
  filter(lga != "Unincorporated Vic") |> 
  mutate(lga = str_replace(lga, " \\(.+\\)", "")) |>
  mutate(lga = ifelse(lga == "Colac-Otway", "Colac Otway", lga)) 

covid_lga <- covid_lga |>
  left_join(pop) 

# Compute additional statistics
covid_lga <- covid_lga |>
  mutate(cases_per10k = cases/pop*10000,
         lcases = log10(cases + 1)) 

# Make a contiguous cartogram
# The sf object is in lat/long (WGS84) which is 
# an angle on the globe, but the cartogram 
# needs spatial locations in metres/numeric 
# as given by EPSG:3395.
# So we convert to metres and then back to 
# lat/long with st_transform
covid_carto <- covid_lga |> 
  st_transform(3395) |> 
  cartogram_cont("pop") |>
  st_transform("WGS84") 
# The cartogram object contains a mix of MULTIPOLYGON
# and POLYGON - yes, amazing! - st_cast() forces all 
# to be MULTIPOLYGON and is necessary for plotly 
covid_carto <- st_cast(covid_carto, "MULTIPOLYGON") 
# st_geometry() is a good function for checking
# the projection (lat/long vs metres) and POLYGON
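# e.g. st_geometry(covid_carto)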

ggplot(covid_carto) + 
  geom_sf(aes(fill = cases, label=lga), colour="white") + 
  scale_fill_distiller("Cases", palette = "YlOrRd",
                       direction=1) + 
  theme_map() +
  theme(legend.position="bottom") 

# ggplotly()

g. Hexagon tile map

Use the provided code to make a hexagon tile map, with functions from the sugarbag package. Is it easier to see the spatial distribution of incidence from the hexagon tile map, or from the choropleth or the cartogram?

The hexagon tile map leaves the geography more recognisable than the cartogram, and it is easier to see the high incidence LGAs, and that they are in the city.

# Placement of hexmaps depends on position relative to
# Melbourne central
data(capital_cities)
covid_hexmap <- create_hexmap(
  shp = covid_lga,
  sf_id = "lga",
  focal_points = capital_cities, verbose = TRUE)
# This shows the centroids of the hexagons
ggplot(covid_hexmap, aes(x=hex_long, y=hex_lat)) +
  geom_point()

# Hexagons are made with the `fortify_hexagon` function
covid_hexmap_poly <- covid_hexmap |>
  fortify_hexagon(sf_id = "lga", hex_size = 0.1869) |>
  left_join(covid_tot, by="lga") # hexmap code removed cases!
ggplot() +
  geom_sf(data=covid_lga, 
          fill = "grey95", colour = "white", linewidth=0.1) +
  geom_polygon(data=covid_hexmap_poly, 
               aes(x=long, y=lat, group=hex_id, 
                   fill = cases, 
                   colour = cases,
                   label=lga), linewidth=0.2) +
  scale_fill_distiller("Cases", palette = "YlOrRd",
                       direction=1) +
  scale_colour_distiller("Cases", palette = "YlOrRd",
                       direction=1) +
  theme_map() +
  theme(legend.position="bottom")

# ggplotly()

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.