ETC5521 Tutorial 2

Introduction to exploratory data analysis

Author

Prof. Di Cook

Published

29 July 2024

🎯 Objectives

The purpose of this tutorial is to scope out the software available for doing EDA in R, and to understand its capabilities and limitations.

🔧 Preparation

The reading for this week is The Landscape of R Packages for Automated Exploratory Data Analysis. This is a lovely summary of software available that is considered to do exploratory data analysis (EDA). (Note: Dr Cook considers these to be mostly descriptive statistics packages, not exploratory data analysis in the true spirit of the term.) This reading will be the basis of the tutorial exercises today.

  • Complete the weekly quiz, before the deadline!
  • Install this list of R packages, in addition to what you installed in the previous weeks:
install.packages(c("arsenal", "autoEDA", "DataExplorer", "dataMaid", "dlookr", "ExPanDaR", "explore", "exploreR", "funModeling", "inspectdf", "RtutoR", "SmartEDA", "summarytools", "visdat", "xray", "cranlogs", "tidyverse", "nycflights13"))
  • Open your RStudio Project for this unit (the one you created in week 1, ETC5521). Create a .qmd document for this week's activities.

📥 Exercises

The article lists a number of R packages that might be used for EDA: arsenal, autoEDA, DataExplorer, dataMaid, dlookr, ExPanDaR, explore, exploreR, funModeling, inspectdf, RtutoR, SmartEDA, summarytools, visdat, xray.

1.

What package had the highest number of CRAN downloads as of 12.07.2019? (Based on the paper.)

2.

Open up the shiny server for checking download rates at https://hadley.shinyapps.io/cran-downloads/. Which of these packages has the highest download rate over the period Jan 1, 2024 to today?
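
You can also query the download counts directly from R with the cranlogs package (it is in the install list above). This is a minimal sketch, assuming you want daily counts from Jan 1, 2024 up to the most recent day available; the totals and the time series plot it produces should help with this question and the next.

# Query CRAN download logs with cranlogs
library(cranlogs)
library(tidyverse)

eda_pkgs <- c("arsenal", "autoEDA", "DataExplorer", "dataMaid", "dlookr",
              "ExPanDaR", "explore", "exploreR", "funModeling", "inspectdf",
              "RtutoR", "SmartEDA", "summarytools", "visdat", "xray")

# Daily download counts (packages no longer on CRAN will simply show zeros)
dl <- cran_downloads(packages = eda_pkgs, from = "2024-01-01", to = "last-day")

# Total downloads per package over the period
dl |>
  group_by(package) |>
  summarise(total = sum(count)) |>
  arrange(desc(total))

# Time series of daily downloads
ggplot(dl, aes(x = date, y = count, colour = package)) +
  geom_line()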

3.

What is an interesting pattern to observe from the time series plot of all the downloads?

4.

How many functions do Staniak and Biecek (2019) say visdat has for doing EDA? Explore what each of them does, by running the example code for each function. What do you think are the features that make visdat a really popular package?
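
To get started, here is a minimal sketch of a few of the visdat functions, using the built-in airquality data (chosen only because it has missing values); run the example code in the package documentation to see the full set.

# A few visdat functions on a small built-in data set
library(visdat)

vis_dat(airquality)    # variable types and missingness in one overview
vis_miss(airquality)   # missing-value pattern, with percent missing per variable
vis_guess(airquality)  # guess the likely type of every individual value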

5.

The package summarytools appears to be gaining popularity relative to visdat. Take a look at this package and explain what tools it has that are not available in visdat.
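
As a starting point, this is a minimal sketch of some summarytools functions; the choice of a sample of the flights data is purely for illustration.

# A quick tour of summarytools
library(summarytools)
library(nycflights13)
library(dplyr)

flights_small <- slice_sample(flights, n = 10000)

dfSummary(flights_small)       # whole-data-frame summary, one row per variable
descr(flights_small)           # descriptive statistics for the numeric variables
freq(flights_small$carrier)    # frequency table for a categorical variable
ctable(flights_small$origin, flights_small$carrier)  # cross-tabulation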

6.

Why do you think the package SmartEDA has gone out of favour?

7.

SmartEDA::ExpReport() and DataExplorer::create_report() are functions that generate a long report when provided with a data set. Try the latter on the nycflights13 data, using this code:

# DataExplorer
library(DataExplorer)
library(nycflights13)
library(tidyverse)

# Create a big data set
airlines_all <- flights |> 
  full_join(airlines, by = "carrier") |>
  full_join(planes, by = "tailnum", 
            suffix = c("_flights", "_planes")) |>
  full_join(airports, by = c("origin"="faa"), 
            suffix = c("_carrier", "_origin")) |>
  full_join(airports, by = c("dest"="faa"), 
            suffix = c("_origin", "_dest"))
create_report(airlines_all, y = "arr_delay")

The code instructs the report to treat arr_delay as a “response variable”. Give some reasons why this report is not very useful!

  1. Have your Generative AI assistant suggest what to look at when it is given the variable summary produced by:
glimpse(airlines_all)

8.

In a limited fashion, let's work through some areas suggested by Claude.

  1. Use visdat to examine the variable types and missing values. You'll need to take a sample of the data because there are too many observations to plot reasonably; however, a sample should still give good insight into the reliability of most variables (one way to do this is sketched below). Which variables may not be useful because they have too many missing values?
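
One possible approach, assuming the airlines_all data frame created in question 7 is still in your session:

# Sample the joined data, then inspect types and missingness with visdat
library(visdat)
library(dplyr)

set.seed(2024)
airlines_sample <- slice_sample(airlines_all, n = 5000)

vis_dat(airlines_sample)    # variable types and missingness in one plot
vis_miss(airlines_sample)   # proportion of missing values per variable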

For the rest of these questions, you decide how to process the data and make summaries or plots to provide answers. (One possible starting point for the carrier counts and the data-error check appears after the list.)

  1. Which carrier had the most flights?
  2. Is this the same for each month? Or day of the week?
  3. Are there more departure delays for flights in the morning hours, or evening hours?
  4. Find an error in the data, e.g. a flight that arrived before it left.
  5. With your neighbour in the tutorial come up with one thing that is a bit surprising to you that you can learn from this data. Make sure you state what you expected to see, and why what you saw was then a surprise. (It is possible that you can use the DataExplorer report to look at something you had not thought to examine, as motivation.)
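
If you are unsure where to begin, here is a sketch of starting points for the first and fourth questions; treat it as one possible approach, not the only (or best) one.

# Starting points for two of the questions above
library(tidyverse)
library(nycflights13)

# Which carrier had the most flights? Count, then attach the airline names.
flights |>
  count(carrier, sort = TRUE) |>
  left_join(airlines, by = "carrier")

# Candidate data errors: arrival clock time earlier than departure time.
# Many of these are legitimate flights that crossed midnight, so inspect
# them (e.g. against air_time) before calling them errors.
flights |>
  filter(arr_time < dep_time) |>
  select(month, day, carrier, flight, dep_time, arr_time, air_time)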

9.

Table 2 of the Landscape paper summarises the activities of two early phases of the CRISP-DM standard. What does CRISP-DM mean? The implication is that EDA is related to “data understanding” and “data preparation”. Would you agree with this or disagree? Why?

10.

Table 1 of the paper, which summarises CRAN downloads and GitHub activity, is hard to read. How are the rows sorted? What is the most important information communicated by the table? In what way(s) might revising this table make it easier to read and digest the most important information?

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.