ETC5521 Tutorial 8

Going beyond two variables, exploring high dimensions

Author

Prof. Di Cook

🎯 Objectives

These are exercises in making plots to explore relationships between multiple variables. You will use interactive scatterplot matrices, interactive parallel coordinate plots and tours to explore the world beyond 2D.

🔧 Preparation

install.packages(c("tidyverse", "tourr", "GGally", "plotly", "colorspace", "mulgar", "simputation", "naniar", "crosstalk", "sf", "ozmaps", "ggthemes", "patchwork"))
  • Open your RStudio Project for this unit (the one you created in week 1, ETC5521). Create a .qmd document for this week’s activities.

📥 Exercises

Exercise 1: Melbourne housing

  1. Read in a copy of the Melbourne housing data from Nick Tierney’s github repo which is a collation from the version at kaggle. It’s fairly large, so let’s start simply, and choose two suburbs to focus on. I recommend “South Yarra” and “Brighton”.
library(tidyverse)
mel_houses <- read_csv("https://raw.githubusercontent.com/njtierney/melb-housing-data/master/data/housing.csv") |>
  dplyr::filter(suburb %in% c("South Yarra", "Brighton")) 
  2. There are a substantial number of missing values. These need to be handled first because examining multiple variables, with almost every method, requires complete data. Examine the missing value distribution using naniar, and strategise on how to handle the missings.

This is an awful data set to wrangle into a form that can be analysed! The first step is to get a sense of the missing values in each of the chosen suburbs. It is always a good idea to break the data into these groups, if there aren’t too many, in case there is a difference in data handling between locations.

library(naniar)
library(patchwork)
m1_sy <- mel_houses |> filter(suburb == "South Yarra") |> gg_miss_var(show_pct = TRUE) + ggtitle("South Yarra")
m1_b <- mel_houses |> filter(suburb == "Brighton") |> gg_miss_var(show_pct = TRUE) + ggtitle("Brighton")
m2_sy <- mel_houses |> filter(suburb == "South Yarra") |> gg_miss_case(show_pct = TRUE) + ggtitle("South Yarra")
m2_b <- mel_houses |> filter(suburb == "Brighton") |> gg_miss_case(show_pct = TRUE) + ggtitle("Brighton")
m1_sy + m1_b + m2_sy + m2_b + plot_layout(ncol=2)
Figure 1: Plots of missingness by variable and by case.

Summary:

  • Missingness is similar in both suburbs. Although there are some slight differences, the same variables have missings and a similar proportion of cases have substantial missings.
  • The main information about variables to note is that building_area and year_built both have more than 50% missing (checked numerically in the sketch after this list).
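To back up the visual summary with numbers, here is a minimal sketch using naniar’s miss_var_summary(), which respects group_by(); the 50% filter simply pulls out the worst variables.

# Numeric check: percentage missing per variable, by suburb
mel_houses |>
  group_by(suburb) |>
  miss_var_summary() |>
  filter(pct_miss > 50)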
m3 <- vis_miss(mel_houses, cluster = TRUE, sort_miss = TRUE)
m3
Figure 2: Overall summary of missing values, with variables and cases grouped.

Summary:

There are co-occurring missings (the upset plot sketched after this list shows the same patterns).

  • If bedroom2 is missing, then bathroom, landsize, car, latitude and longitude are likely also missing.
  • When price is missing, most other variables are available - this is good!
  • But often price and landsize are both missing.
  • There is a small amount of sporadic missing values other than these.
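The co-occurring patterns can also be checked with an upset plot. A minimal sketch, using gg_miss_upset() from naniar (it relies on the UpSetR package being installed):

gg_miss_upset(mel_houses, nsets = 10)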

Strategy:

  • remove/ignore the two worst variables, building_area and year_built, which have more missing than observed values.
  • remove observations that have missings on all of these variables: landsize, car, bathroom, bedroom2, latitude, longitude. There is not enough information in the remaining variables to be able to impute the missing values.

To impute the remaining missing values, the choice is to fit a regression model on complete records. (We have previously noted in lecture that to impute price we need to take other variables into account, because missingness in price appears to depend on them.) This will be done in pieces: to impute price, we will use the observations where all other variables are available.
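As a quick check that missingness in price is indeed related to other variables, here is a minimal sketch using naniar’s bind_shadow(), comparing the distribution of rooms (which appears to be fully observed) for records with and without a price:

# Does missingness of price depend on the size of the property?
mel_houses |>
  bind_shadow() |>
  ggplot(aes(x = price_NA, y = rooms)) +
  geom_boxplot() +
  facet_wrap(~suburb)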

  3. Now implement your strategy for removing and imputing missing values, until you have a complete data set. This can involve some iteration on the strategy, as you may find that a different approach is needed.

We will separate date into year and quarter, as these might be useful for the imputation - prices will change from year to year, and, from experience, we know that there are some seasonal differences in property prices.

Note that the filtering and imputation should be done in small steps, checking the resulting data after each one.

mel_houses_na <- mel_houses |>
  mutate(year = lubridate::year(date),
         quarter = lubridate::quarter(date)) |>
  select(-building_area, -year_built) |>
  filter(!(is.na(car) & 
             is.na(bathroom) &
             is.na(bedroom2) &
             is.na(latitude) &
             is.na(longitude) &
             is.na(landsize)))

# check data again
(nrow(mel_houses) - nrow(mel_houses_na))/nrow(mel_houses)
[1] 0.22
# about 22% of observations removed - that's a lot but 
# realistically it is not possible to impute 
# price without information on 
# bedrooms, baths, landsize and location
vis_miss(mel_houses_na, cluster = TRUE, sort_miss = TRUE)

Notice that about 22% of the observations were removed - that’s a lot, but realistically it is not possible to impute price without information on bedrooms, baths, landsize and location.

Next, handling landsize might be done better with some logic. Check landsize against price and the type of property. It may be that landsize is missing for units, which would be ok, and then we would use 0 as the imputed value.

ggplot(mel_houses_na, aes(x=landsize, y=price)) +
  geom_miss_point() +
  facet_wrap(~type, ncol=3, scales="free") +
  theme(aspect.ratio = 1)

Hmm, no, but for units and townhouses it is reasonable to set the missing landsize values to 0, because owners of units and townhouses technically own very little land.

mel_houses_na <- mel_houses_na |>
  mutate(landsize = if_else(type %in% c("u", "t") & is.na(landsize), 0, landsize))
vis_miss(mel_houses_na, cluster = TRUE, sort_miss = TRUE)

Now remove observations missing on both landsize and price, because houses can’t really be priced without knowing landsize.

mel_houses_na <- mel_houses_na |>
  filter(!(is.na(landsize) & is.na(price)))
vis_miss(mel_houses_na, cluster = TRUE, sort_miss = TRUE)

Here is the imputation using regression, with the impute_lm function from the simputation package, used together with naniar. I recommend running each part of the pipe, and making a couple of plots to examine the result. I’ve put the piped code together after doing this, to make it more readable, but it took some work checking the results, and fixing a few things each time, to confidently make it all one big imputation.

Note also that the imputed values for bedroom2, bathroom and car have been rounded to integers. The regression model imputes these with decimal places, which is not realistic for the later analysis. This is one place where I iterated on the handling of missings, after noticing that bedrooms and bathrooms looked odd.

# Conduct imputation, carefully!
library(simputation)
mel_houses_imputed <- mel_houses_na |> 
  nabular() |>
  as.data.frame() |> 
  # bedroom2
  impute_lm(bedroom2 ~ type + quarter + year +  
              latitude + longitude +
              postcode + distance) |>
  mutate(bedroom2 = round(bedroom2, 0)) |>
  # bathroom
  impute_lm(bathroom ~ type + quarter + year + 
              latitude + longitude +
              postcode + distance) |>
  mutate(bathroom = round(bathroom, 0)) |>
  # car
  impute_lm(car ~ type + quarter + year +  
              bedroom2 + bathroom + latitude +
              longitude + postcode + distance) |>
  mutate(car = round(car, 0)) |>
  # landsize
  impute_lm(landsize ~ type + quarter + year + 
              bedroom2 + bathroom + 
              latitude + longitude +
              postcode + distance)

# check again
vis_miss(mel_houses_imputed, cluster = TRUE,
         sort_miss = TRUE)

ggplot(mel_houses_na, aes(x=landsize, y=price)) +
  geom_miss_point() +
  facet_wrap(~type, ncol=3, scales="free") +
  theme(aspect.ratio = 1)

ggplot(mel_houses_imputed, aes(x=landsize, y=price)) +
  geom_miss_point() +
  facet_wrap(~type, ncol=3, scales="free") +
  theme(aspect.ratio = 1)

# Also check bedroom2, bathroom and car
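# A sketch of one way to do this, assuming the *_NA shadow columns
# created by nabular() are present: compare the distributions of
# observed ("!NA") and imputed ("NA") values for each count variable.
ggplot(mel_houses_imputed, aes(x = bedroom2, fill = bedroom2_NA)) +
  geom_bar(position = "dodge")
ggplot(mel_houses_imputed, aes(x = bathroom, fill = bathroom_NA)) +
  geom_bar(position = "dodge")
ggplot(mel_houses_imputed, aes(x = car, fill = car_NA)) +
  geom_bar(position = "dodge")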

# price
mel_houses_imputed <- mel_houses_imputed |>
  impute_lm(price ~ type + quarter + year + landsize + 
              bedroom2 + bathroom + latitude +
              longitude + postcode + distance) 
# check again
vis_miss(mel_houses_imputed, cluster = TRUE, sort_miss = TRUE)

ggplot(mel_houses_imputed, aes(x=landsize, y=price)) +
  geom_point() +
  facet_wrap(~type, ncol=3, scales="free") +
  theme(aspect.ratio = 1)

ggplot(mel_houses_imputed, aes(x=bedroom2, y=price)) +
  geom_point() +
  facet_wrap(~type, ncol=3, scales="free") +
  theme(aspect.ratio = 1)

ggplot(mel_houses_imputed, aes(x=bathroom, y=price)) +
  geom_point() +
  facet_wrap(~type, ncol=3, scales="free") +
  theme(aspect.ratio = 1)

# Now add label
mel_houses_imputed <- mel_houses_imputed |>
  add_label_shadow()
# Note: the method variable was not used in the imputation models because it causes problems with missing factor levels. Ideally it would be included.

One more check of the resulting data.

ggplot(mel_houses, aes(x=bedroom2, y=price)) +
         geom_miss_point() +
         facet_grid(type~suburb)

library(colorspace)
ggplot(mel_houses_imputed, aes(x=bedroom2, y=price, colour=price_NA)) +
         geom_point() +
         facet_grid(type~suburb) +
  scale_colour_discrete_divergingx(palette="Zissou 1")

Ok, we can be quite confident now that we have a reasonable set of data to work with. There are still some odd values in the data that will need handling at some point, like negative landsize. But for now these can be part of the analysis, as is.

Having the additional _NA variables will allow us to identify observations that have been imputed as we work through the analysis.
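For example, a quick tally of how many observations have an imputed price, by suburb and property type, is a one-liner using the shadow column:

mel_houses_imputed |>
  count(suburb, type, price_NA)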

  4. Make a scatterplot matrix of price, rooms, bedroom2, bathroom, landsize, latitude, longitude and suburb. The order of variables can affect the readability. The plot will be easier to read if you order the variables a little; at least put price first. What associations can be seen?
library(GGally)
ggpairs(mel_houses_imputed,
        columns=c(4,2,10,11,13,15,16,1,3))

  • Several of these quantitative variables are discrete (counts of rooms, bedrooms, bathrooms and car spaces). We can still examine the associations. It could be useful to use a jittered scatterplot, but that requires passing a special plot function to ggpairs (shown in the code block below).
  • There is positive linear association between price, rooms, bedroom2 and bathroom, which indicates that the bigger the house, the higher the price.
  • From the boxplots: properties in Brighton tend to be higher priced and bigger than in South Yarra, and houses tend to be worth more than townhouses and units.
  • There are lots of outliers.
# To add jitter
ggpairs(mel_houses_imputed,
        columns=c(4,2,10,11,13,15,16),
        lower=list(continuous=wrap("points",
                  position=position_jitter(height=0.3, width=0.3))))
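The objectives also mention parallel coordinate plots, which give a complementary view of the same variables. Here is a minimal sketch using GGally::ggparcoord(), with the column indices assumed to match those in the ggpairs() call above; piping the result through ggplotly() would make it interactive.

ggparcoord(mel_houses_imputed,
           columns = c(4, 2, 10, 11, 13),
           groupColumn = "type",
           scale = "uniminmax",
           alphaLines = 0.3) +
  scale_colour_discrete_divergingx(palette = "Zissou 1")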

  5. Here’s where we will start using interactive plots to explore the multivariate relationships. Subset the data to Brighton only; this will make the analysis easier. Make an interactive scatterplot matrix of price, rooms, bedroom2, bathroom and landsize, coloured by type of property. There are some high-priced properties. Select these cases, and determine what’s special about them – if anything.
library(plotly)
brighton <- mel_houses_imputed |> 
  dplyr::filter(suburb=="Brighton") 
highlight_key(brighton) |>
  ggpairs(aes(shape = price_NA, colour=type), 
          columns = c(4,2,10,11,13),
                  upper=list(continuous="points")) |>
  ggplotly(900, 900) |>
  highlight("plotly_selected")
# To add jittering
# which makes it a little easier to see all points
# in the discrete variables
highlight_key(brighton) |>
  ggpairs(aes(shape = price_NA, colour=type), 
          columns = c(4,2,10,11,13),
          lower=list(continuous=wrap("points",
                  position=position_jitter(height=0.3, width=0.3))),
          upper=list(continuous=wrap("points",
                  position=position_jitter(height=0.3, width=0.3)))) |>
  ggplotly(900, 900) |>
  highlight("plotly_selected") 
  • There is one very high price, for a property of modest size with 4 bedrooms and 3 bathrooms.
  • There is one property with an extraordinarily large number of rooms; its price is an imputed value. Something may have gone wrong with the imputation, because it has a low price despite being a big property. Despite the large number of rooms it only has 3 bedrooms and 2 bathrooms. (Both of these cases can also be pulled out programmatically, as sketched below.)
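To pin down both of these cases without interaction, here is a minimal sketch using dplyr’s slice_max() (the variable names follow the wrangled data above):

# The highest priced property
brighton |>
  slice_max(price, n = 1) |>
  select(price, rooms, bedroom2, bathroom, landsize, type, price_NA)
# The property with the most rooms
brighton |>
  slice_max(rooms, n = 1) |>
  select(price, rooms, bedroom2, bathroom, landsize, type, price_NA)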
  6. The realtor’s mantra is location, location, location! Next check the location of properties relative to price and size, using two linked plots. One should show longitude and latitude, and the other price by rooms. Ideally, you can put a map underneath the spatial coordinates, to better set these in context.

The ozmaps package has quick maps of Australia that can be used. The code below checks that the map can be made.

# Check location
library(sf)
library(ozmaps)
library(ggthemes)
library(crosstalk)
lga_sf <- ozmap_data("abs_lga")
brighton_map <- lga_sf |>
  filter(NAME %in% c("Bayside (C)", 
                     "Glen Eira (C)",
                     "Stonnington (C)",
                     "Port Phillip (C)"))
basic_map <- ggplot(brighton_map) + geom_sf() +
  geom_point(data=brighton, 
             aes(x=longitude, y=latitude,
                 colour=type, shape=price_NA)) +
  theme_map() +
  theme(legend.position = "none")
basic_map

This code can be used to make the interactive linked plots.

# Now make interactive linked plots
shared_brighton <- SharedData$new(brighton)

map <- ggplot(brighton_map) + 
  geom_sf() +
  geom_point(data=shared_brighton, 
             aes(x=longitude, y=latitude,
                 colour=type, shape=price_NA)) +
  theme_map() +
  theme(legend.position = "none")

sp <- ggplot(shared_brighton, 
             aes(x=rooms, y=price, 
                 colour=type, 
                 shape=price_NA)) +
  geom_point() +
  theme(legend.position="none")

mapi <- ggplotly(map) |>
  highlight(on = "plotly_selected", 
              off = "plotly_deselect")
spi <- ggplotly(sp) |>
  highlight(on = "plotly_selected", 
              off = "plotly_deselect")

bscols(
  widths = c(5, 5), 
  mapi, 
  spi 
 )
  • The high priced house is right against the bay.
  • The property with the large number of rooms is also close to the bay.
  • Otherwise, if you select the properties near the bay, they don’t seem to be especially higher in price.

Exercise 2: Challenges

For each of the data sets c1, …, c7 from the mulgar package, use the grand tour to view them, and try to identify structure (outliers, clusters, non-linear relationships).

library(mulgar)
library(tourr)
animate_xy(c1)
# four small clusters, two big clusters
# linear dependence
animate_xy(c2) 
# Six spherical clusters
animate_xy(c3)
# tetrahedron with lots of smaller triangles,
# barriers, linear dependence
animate_xy(c4) 
# Four linear connected pieces
animate_xy(c5)
# Spiral in lower dimensional space
# Non-linear and linear dependence
animate_xy(c6)
# Two curved clusters
animate_xy(c7)
# spherical cluster, curve cluster and a lot of noise points
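If you want to keep a record of what you see in a tour, it can be saved as an animated gif. A minimal sketch using render_gif() from tourr (it may require the gifski package; the file name is just an example):

render_gif(c1, grand_tour(), display_xy(),
           gif_file = "c1_tour.gif", frames = 100)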

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.