ETC5521 Tutorial 8
Going beyond two variables, exploring high dimensions
🎯 Objectives
These exercises give you practice making plots that explore relationships between multiple variables. You will use interactive scatterplot matrices, interactive parallel coordinate plots and tours to explore the world beyond 2D.
🔧 Preparation
- The reading for this week is Cook and Laa (2023) “Interactively exploring high-dimensional data and models in R” Chapter 1.
- Complete the weekly quiz, before the deadline!
- Install the following R packages if you do not have them already:
install.packages(c("tidyverse", "tourr", "GGally", "plotly", "colorspace", "mulgar", "simputation", "naniar", "crosstalk", "sf", "ozmaps", "ggthemes"))
- Open your RStudio Project for this unit (the one you created in week 1, ETC5521). Create a .qmd document for this week's activities.
📥 Exercises
Exercise 1: Melbourne housing
- Read in a copy of the Melbourne housing data from Nick Tierney’s github repo which is a collation from the version at kaggle. It’s fairly large, so let’s start simply, and choose two suburbs to focus on. I recommend “South Yarra” and “Brighton”.
library(tidyverse)
mel_houses <- read_csv("https://raw.githubusercontent.com/njtierney/melb-housing-data/master/data/housing.csv") |>
  dplyr::filter(suburb %in% c("South Yarra", "Brighton"))
- There are a substantial number of missing values. These need to be handled first because examining multiple variables, with almost every method, requires complete data. Examine the missing value distribution using naniar, and strategise on how to handle the missings.
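One possible starting point for the missingness check is sketched below. It re-reads the two-suburb subset so that it runs on its own, and it uses three naniar summaries; which of them is most useful for deciding on a strategy is a judgment call, not a prescribed answer.

```r
library(tidyverse)
library(naniar)

# Re-read the two-suburb subset so this chunk is self-contained
mel_houses <- read_csv("https://raw.githubusercontent.com/njtierney/melb-housing-data/master/data/housing.csv") |>
  dplyr::filter(suburb %in% c("South Yarra", "Brighton"))

# Count and percentage of missing values for each variable
miss_summary <- miss_var_summary(mel_houses)
miss_summary

# Heatmap of missingness: rows are cases, columns are variables
vis_miss(mel_houses)

# Number of missings per variable, shown separately for each suburb
gg_miss_var(mel_houses, facet = suburb)
```

Variables that are missing for most cases are candidates for dropping, while variables with only a few gaps are candidates for imputation.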
- Now implement your strategy for removing and imputing missing values, until you have a complete data set. This can involve some iteration on the strategy, as you may find that a different approach is needed.
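The sketch below shows one way such a strategy might look; it is an illustration under assumptions, not the intended solution. It assumes the column names used in this tutorial (price, rooms, bedroom2, bathroom, landsize, latitude, longitude, suburb, type), keeps only those variables, drops cases with no price, imputes bedroom2 and bathroom from rooms with a linear model, and removes whatever incomplete cases remain. The name mel_complete is made up for this example.

```r
library(tidyverse)
library(simputation)

# Re-read the two-suburb subset so this chunk is self-contained
mel_houses <- read_csv("https://raw.githubusercontent.com/njtierney/melb-housing-data/master/data/housing.csv") |>
  dplyr::filter(suburb %in% c("South Yarra", "Brighton"))

mel_complete <- mel_houses |>
  # Keep only the variables used later (assumed column names)
  select(price, rooms, bedroom2, bathroom, landsize,
         latitude, longitude, suburb, type) |>
  # The response cannot sensibly be imputed here, so drop cases without it
  dplyr::filter(!is.na(price)) |>
  # Impute both count variables from rooms using a linear model
  impute_lm(bedroom2 + bathroom ~ rooms) |>
  # Imputed counts should be whole numbers
  mutate(across(c(bedroom2, bathroom), round)) |>
  # Remove any cases that are still incomplete
  drop_na()

# Confirm nothing is missing
anyNA(mel_complete)
```

If too many cases are lost at the final drop_na() step, that is the signal to iterate, for example by imputing landsize as well.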
- Make a scatterplot matrix of price, rooms, bedroom2, bathroom, landsize, latitude, longitude and suburb. The order of variables can affect readability, so arrange them thoughtfully and, at the least, put price first. What associations can you see?
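A minimal sketch of the scatterplot matrix using GGally::ggpairs() follows. To stay self-contained it uses complete cases only (drop_na()) as a stand-in for whatever imputation strategy you settled on, and it assumes the column names listed in this exercise. The object names mel_sub and p_matrix are made up for the example.

```r
library(tidyverse)
library(GGally)

# Re-read the two-suburb subset so this chunk is self-contained
mel_houses <- read_csv("https://raw.githubusercontent.com/njtierney/melb-housing-data/master/data/housing.csv") |>
  dplyr::filter(suburb %in% c("South Yarra", "Brighton"))

mel_sub <- mel_houses |>
  # price first, then size variables, then location, then suburb
  select(price, rooms, bedroom2, bathroom, landsize,
         latitude, longitude, suburb) |>
  drop_na() |>                         # stand-in for your own missing-value strategy
  mutate(suburb = factor(suburb))

# Colouring by suburb makes it easier to compare the two areas in each panel
p_matrix <- ggpairs(mel_sub, mapping = aes(colour = suburb), progress = FALSE)
p_matrix
```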
- Here’s where we will start using interactive plots to explore the multivariate relationships. Subset the data to Brighton only. This will make the analysis easier. Make an interactive scatterplot matrix of price, rooms, bedroom2, bathroom and landsize, coloured by type of property. There are some high price properties. Select these cases, and determine what’s special about them – if anything.
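One way to make the matrix interactive is to wrap the data with plotly::highlight_key() and pass the ggpairs() result through ggplotly(), as sketched below. This is a sketch under assumptions: it uses complete cases only, assumes the column names from this exercise, and relies on ggplotly() supporting GGally matrices, which can vary with package versions. The names brighton, brighton_key and p_brighton are made up for the example.

```r
library(tidyverse)
library(GGally)
library(plotly)

# Brighton only, complete cases as a stand-in for your own imputation
brighton <- read_csv("https://raw.githubusercontent.com/njtierney/melb-housing-data/master/data/housing.csv") |>
  dplyr::filter(suburb == "Brighton") |>
  select(price, rooms, bedroom2, bathroom, landsize, type) |>
  drop_na() |>
  mutate(type = factor(type))

# Shared data object so that a selection in one panel highlights in all panels
brighton_key <- highlight_key(brighton)

p_brighton <- ggpairs(brighton_key, columns = 1:5,
                      mapping = aes(colour = type))

# Brush over the high-price points to see where they fall in the other panels
ggplotly(p_brighton) |>
  highlight(on = "plotly_selected")
```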
- The realtor's mantra is location, location, location! Next, check the location of properties relative to price and size, using two linked plots. One should show longitude and latitude, and the other price by rooms. Ideally, you can draw a map underneath the spatial coordinates, to better put these in context.
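The two linked plots can be sketched as below, using a shared data object so that brushing in one plot highlights the same properties in the other. The map underlay is left out of this sketch, complete cases are used as a stand-in for your own imputation, and the object names (brighton_loc, shared, p_location, p_price) are made up for the example.

```r
library(tidyverse)
library(plotly)
library(crosstalk)

# Brighton only, complete cases as a stand-in for your own imputation
brighton_loc <- read_csv("https://raw.githubusercontent.com/njtierney/melb-housing-data/master/data/housing.csv") |>
  dplyr::filter(suburb == "Brighton") |>
  select(price, rooms, longitude, latitude) |>
  drop_na()

# Both plots draw from the same shared object, which is what links them
shared <- highlight_key(brighton_loc)

p_location <- ggplot(shared, aes(x = longitude, y = latitude)) +
  geom_point()

p_price <- ggplot(shared, aes(x = rooms, y = price)) +
  geom_point()

# Place the plots side by side; select points in either to highlight in both
bscols(
  ggplotly(p_location) |> highlight(on = "plotly_selected"),
  ggplotly(p_price) |> highlight(on = "plotly_selected")
)
```

Selecting the most expensive properties in the price plot shows whether they cluster in one part of the suburb.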
Exercise 2: Challenges
For each of the data sets, c1, …, c7 from the mulgar package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships).
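A minimal sketch for the first data set is below; the same call can be repeated for the others by swapping the data set name. It assumes that c1 contains only numeric columns, which the tour requires, so it is worth checking the structure first.

```r
library(tourr)
library(mulgar)

# Load the first challenge data set from mulgar
data("c1", package = "mulgar")

# Check dimensions and column types before touring (assumed all numeric)
str(c1)

# Grand tour shown as a sequence of 2D projections;
# axes drawn in the corner so they do not cover the points
animate_xy(c1, tour_path = grand_tour(), axes = "bottomleft")
```

Run this in an interactive R session, since the animation plays in the graphics device. Watch for points that stay apart from the rest (outliers), groups that separate in some projections (clusters), and curved shapes (non-linear relationships).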
👌 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.