ETC5521: Diving Deeply into Data Exploration

Working with a single variable, making transformations, detecting outliers, using robust statistics

Professor Di Cook

Department of Econometrics and Business Statistics

Quantitative variables

Features of a single quantitative variable

Feature	Example	Description
Asymmetry		The distribution is not symmetrical.
Outliers		Some observations are that are far from the rest.
Multimodality		There are more than one "peak" in the observations.
Gaps		Some continuous interval that are contained within the range but no observations exists.
Heaping		Some values occur unexpectedly often.
Discretized		Only certain values are found, e.g. due to rounding.
Implausible		Values outside of plausible or likely range.

Numerical features of a single quantitative variables

A measure of central tendency, e.g. mean, median and mode
A measure of dispersion (also called variability or spread), e.g. variance, standard deviation and interquartile range
There are other measures, e.g. skewness and kurtosis that measures “tailedness”, but these are not as common as the measures of first two
The mean is also the first moment and variance, skewness and kurtosis are second, third, and fourth central moments

Significance tests or hypothesis tests

Testing for $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$ (often $\mu_0 = 0$)
The $t$-test is commonly used if the underlying data are believed to be normally distributed

2019 Australian Federal Election (1/8)

Context

There are 151 seats in the House of Representative for the 2019 Australian federal election
The major parties in Australia are:
- the Coalition, comprising of the:
  - Liberal,
  - Liberal National (Qld),
  - National, and
  - Country Liberal (NT) parties, and
- the Australian Labor party
The Greens party is a small but notable party

Source: PRObono

2019 Australian Federal Election (2/8)

Data source: Australian Electoral Commission. (2019)

2019 Australian Federal Election (3/8)

What is the number of the seats won in the House of Representatives by parties?

📊
data
R

Party	# of seats
Coalition	77
Liberal	44
Liberal National Party Of Queensland	23
The Nationals	10
Australian Labor Party	68
The Greens	1
Centre Alliance	1
Katter's Australian Party (Kap)	1
Independent	3

What does this table tell you?

The Coalition won the government
Labor and Coalition hold majority of the seats in the House of Representatives (lower house)
Parties such as The Greens, Centre Alliance and Katter’s Australian Party (KAP) won only a single seat

Only?

Wait… Did the parties compete in all electoral districts?

skimr::skim(df1)

── Data Summary ────────────────────────
                           Values
Name                       df1   
Number of rows             1207  
Number of columns          18    
_______________________          
Column type frequency:           
  character                11    
  numeric                  7     
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────
   skim_variable   n_missing complete_rate min max empty
 1 StateAb                 0         1       2   3     0
 2 DivisionID              0         1       3   3     0
 3 DivisionNm              0         1       4  15     0
 4 CandidateID             0         1       3   5     0
 5 Surname                 0         1       2  18     0
 6 GivenNm                 0         1       1  25     0
 7 BallotPosition          0         1       1   3     0
 8 Elected                 0         1       1   1     0
 9 HistoricElected         0         1       1   1     0
10 PartyAb               151         0.875   2   4     0
11 PartyNm                 2         0.998   5  61     0
   n_unique whitespace
 1        8          0
 2      151          0
 3      151          0
 4     1057          0
 5      890          0
 6      613          0
 7       14          0
 8        2          0
 9        2          0
10       40          0
11       45          0

── Variable type: numeric ──────────────────────────────────
  skim_variable    n_missing complete_rate     mean       sd
1 OrdinaryVotes            0             1 10401.   12446.  
2 AbsentVotes              0             1   511.     569.  
3 ProvisionalVotes         0             1    41.4     51.7 
4 PrePollVotes             0             1   514.     607.  
5 PostalVotes              0             1  1033.    1476.  
6 TotalVotes               0             1 12501.   14860.  
7 Swing                    0             1     1.07     4.26
     p0     p25     p50      p75    p100 hist 
1 167   1867    4317    14768.   54535   ▇▁▁▁▁
2  13    117     246      711     3287   ▇▂▁▁▁
3   0      8      20       56      444   ▇▁▁▁▁
4  11    108.    211      761     5248   ▇▂▁▁▁
5  14    181     317     1216.    9837   ▇▁▁▁▁
6 250   2348    5196    18142    61202   ▇▁▁▁▁
7 -28.1   -0.73    1.21     2.75    43.5 ▁▆▇▁▁

df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"),
  skip = 1,
  col_types = cols(
    .default = col_character(),
    OrdinaryVotes = col_double(),
    AbsentVotes = col_double(),
    ProvisionalVotes = col_double(),
    PrePollVotes = col_double(),
    PostalVotes = col_double(),
    TotalVotes = col_double(),
    Swing = col_double()
  )
)

recode_party_names <- c(
  "Australian Labor Party (Northern Territory) Branch" = "Australian Labor Party",
  "Labor" = "Australian Labor Party",
  "The Greens (Vic)" = "The Greens",
  "The Greens (Wa)" = "The Greens",
  "Katter's Australian Party (KAP)" = "Katter's Australian Party",
  "Country Liberals (Nt)" = "Country Liberals (NT)"
)

tdf1 <- df1 |>
  filter(Elected == "Y") |>
  mutate(
    PartyNm = str_to_title(PartyNm),
    PartyNm = recode(PartyNm, !!!recode_party_names)
  ) |>
  count(PartyNm, sort = TRUE) |>
  slice(2:4, 1, 8, 6, 7, 5)

data.frame(PartyNm = "Coalition", n = sum(tdf1$n[1:3])) |>
  rbind(tdf1) |>
  knitr::kable(col.names = c("Party", "# of seats")) |>
  kableExtra::kable_classic() |>
  kableExtra::kable_styling(    
    full_width = FALSE,
    font_size = 24
  ) |>
  kableExtra::add_indent(2:4) |>
  kableExtra::row_spec(2:4, color = "#C8C8C8")

What do you notice from this table?

The Greens are represented in every electoral districts
United Australia Party is the only other non-major party to be represented in every electoral district
KAP is represented in 7 electoral districts
Centre Alliance is only represented in 3 electoral districts!

Let’s have a closer look at the Greens party…

tdf2 <- df1 |>
  mutate(
    PartyNm = str_to_title(PartyNm),
    PartyNm = recode(PartyNm, !!!recode_party_names)
  ) |>
  count(PartyNm, sort = TRUE)

You can omit table_options and toggle_select or have a look at the source Rmd to find out what it is

tdf2 |>
  DT::datatable(
    rownames = FALSE,
    escape = FALSE,
    width = "900px",
    options = table_options(
      scrollY = "400px",
      title = "Australian Federal Election 2019 - Party Distribution",
      csv = "aus-election-2019-party-dist"
    ),
    elementId = "tab1B",
    colnames = c("Party", "# of electorates"),
    callback = toggle_select
  )

2019 Australian Federal Election (5/8)

📊
data
R

What does this graph tell you?

Majority of the country does not have first preference for the Greens
Some constituents are slightly more supportive than the others

What further questions does it raise?

Notes:

Australia uses full-preference instant-runoff voting in single member seats
Following the full allocation of preferences, it is possible to derive a two-party-preferred figure, where the votes have been allocated between the two main candidates in the election.
In Australia, this is usually between the candidates from the Coalition parties and the Australian Labor Party.

skimr::skim(tdf3)

── Data Summary ────────────────────────
                           Values
Name                       tdf3  
Number of rows             151   
Number of columns          6     
_______________________          
Column type frequency:           
  character                3     
  numeric                  3     
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────
  skim_variable n_missing complete_rate min max empty
1 DivisionID            0             1   3   3     0
2 DivisionNm            0             1   4  15     0
3 State                 0             1   2   3     0
  n_unique whitespace
1      151          0
2      151          0
3        8          0

── Variable type: numeric ──────────────────────────────────
  skim_variable n_missing complete_rate     mean      sd
1 votes_GRN             0             1  9821.   5581.  
2 votes_total           0             1 99925.   9801.  
3 perc_GRN              0             1     9.87    5.63
        p0      p25       p50      p75     p100 hist 
1  2744     6555      8676     11532.   45876   ▇▂▁▁▁
2 51009    96372.   100936    105588   116216   ▁▁▁▇▅
3     2.89     6.43      8.55     11.4     47.8 ▇▂▁▁▁

tdf3 <- df1 |>
  group_by(DivisionID) |>
  summarise(
    DivisionNm = unique(DivisionNm),
    State = unique(StateAb),
    votes_GRN = TotalVotes[which(PartyAb == "GRN")],
    votes_total = sum(TotalVotes)
  ) |>
  mutate(perc_GRN = votes_GRN / votes_total * 100)

tdf3 |>
  ggplot(aes(perc_GRN)) +
  geom_histogram(color = "white", fill = "#00843D") +
  labs(
    x = "First preference votes %",
    y = "Count",
    title = "Greens party"
  )

Formulating questions for EDA vs making observations from a plot

BEFORE plotting or making summaries think broad (open-ended) questions about the distribution of values
Questions with simple answers (i.e. yes or no) less helpful in encouraging exploration using graphics
For example,
- What is the distribution of the first preference vote percentages for the Labor party across Australia?
- Is it evenly spread across electorates or are there clusters of popularity?

AFTER plotting or making summaries think was this what you expected, are there any surprises. Detail what you learn, and how you should follow up on these observations.

Is the outlying observation the electoral district that won the seat?

Visual inference

Suitable null models for a single variable all focus on potential distributions.

Typical plot description:

ggplot(data, aes(x=var1)) +
  geom_histogram()

Is the distribution consistent with a sample from a

normal distribution?
uniform distribution?
skewed distribution?
MANY OTHER POTENTIAL DISTRIBUTIONS

Potential simulation methods from specific distributions

# Symmetric, unimodal, bell-shaped
null_dist("var1", "norm")
null_dist("var1", "cauchy")
null_dist("var1", "t")

# Skewed right
null_dist("var1", "exp")
null_dist("var1", "chisq")
null_dist("var1", "gamma")

# Constant 
null_dist("var1", "uniform")

Lineup of Greens first preference percentages

📊
Explanation
R

Using the exponential distribution as the null says that we expect most electorates to have small tallies for Greens, and only a few electorates will have large, potentially winnable tallies.

NOTE: We’ve already seen the data so we can’t be impartial judges for choosing the most different plot. We can use the null plots to check whether the small mode of moderately high tallies is unusual if the tallies really are samples from an exponential distribution.

library(nullabor)
set.seed(241)
ggplot(lineup(null_dist("perc_GRN", "exp"), tdf3, n=10),
       aes(x=perc_GRN)) +
  geom_histogram(color = "white", fill = "#00843D", bins = 30) +
  facet_wrap(~.sample, ncol=5, scales="free") +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        panel.grid.major = element_blank())

2019 Australian Federal Election (6/8)

📊
data
R

	% of first preference for the Greens
State	Mean	Median	SD	IQR	Skewness	Kurtosis
ACT	16.4	14.0	5.6	5.20	0.65	1.5
VIC	11.4	8.6	8.2	6.72	2.60	11.4
WA	11.0	10.8	3.0	3.12	0.80	3.0
QLD	9.8	8.8	5.1	4.75	1.09	3.9
TAS	9.7	9.3	4.0	0.98	0.33	2.5
NT	9.6	9.6	2.5	1.75	0.00	1.0
SA	9.1	8.9	3.0	3.41	0.38	2.9
NSW	8.1	6.6	4.1	3.95	1.50	4.9
National	9.9	8.5	5.6	5.00	2.67	15.8

Why are the means and the medians different?
How are the standard deviations and the interquartile ranges similar or different?
Are there some other numerical statistics we should show?

tdf3 <- df1 |>
  group_by(DivisionID) |>
  summarise(
    DivisionNm = unique(DivisionNm),
    State = unique(StateAb),
    votes_GRN = TotalVotes[which(PartyAb == "GRN")],
    votes_total = sum(TotalVotes)
  ) |>
  mutate(perc_GRN = votes_GRN / votes_total * 100)

tdf3 |>
  group_by(State) |>
  summarise(
    mean = mean(perc_GRN),
    median = median(perc_GRN),
    sd = sd(perc_GRN),
    iqr = IQR(perc_GRN),
    skewness = moments::skewness(perc_GRN),
    kurtosis = moments::kurtosis(perc_GRN)
  ) |>
  arrange(desc(mean)) |>
  rbind(data.frame(
    State = "National",
    mean = mean(tdf3$perc_GRN),
    median = median(tdf3$perc_GRN),
    sd = sd(tdf3$perc_GRN),
    iqr = IQR(tdf3$perc_GRN),
    skewness = moments::skewness(tdf3$perc_GRN),
    kurtosis = moments::kurtosis(tdf3$perc_GRN)
  )) |>
  knitr::kable(col.names = c("State", "Mean", "Median", "SD", "IQR", "Skewness", "Kurtosis"), digits = 3) |>
  kableExtra::kable_classic() |>
  kableExtra::add_header_above(c(" ", "% of first preference for the Greens" = 4, " " = 2)) |>
  kableExtra::row_spec(9, extra_css = "border-top: 2px solid black;")

Robust measure of central tendency

Mean is a non-robust measure of location.
Median is the 50% quantile of the observations
Trimmed mean is the sample mean after discarding observations at the tails.
Winsorized mean is the sample mean after replacing observations at the tails with the minimum or maximum of the observations that remain.

Both trimmed and Winsorized mean trimmed 20% of the tails.

Plot	Mean	Median	Trimmed Mean	Winsorized Mean
1	0.109	0.114	0.120	0.103
2	0.054	-0.045	-0.016	-0.029
3	1.177	0.729	0.820	0.888
4	0.533	0.541	0.543	0.542
5	0.468	0.329	0.355	0.390
6	5.626	6.656	5.918	5.688

Robust measure of dispersion

Standard deviation or its square, variance, is a popular choice of measure of dispersion but is not robust to outliers
Standard deviation for sample $x_1, ..., x_n$ is

\[\sqrt{\sum_{i=1}^n \frac{(x_i - \bar{x})^2}{n - 1}}\]

Interquartile range difference between 1st and 3rd quartile, more robust measure of spread
Median absolute deviance (MAD) is even more robust

\[\text{median}(|x_i - \text{median}(x_i)|)\]

	Measure of dispersion
Plot	SD	IQR	MAD	Skewness	Kurtosis
1	0.90	1.19	0.87	-0.072	3.0
2	0.99	1.41	1.08	0.358	2.2
3	1.33	1.18	0.79	1.944	7.2
4	0.29	0.45	0.34	-0.126	1.8
5	0.47	0.50	0.34	1.691	6.4
6	2.78	5.36	2.98	-0.351	1.7

Inference for robust statistics

We have seen the re-sampling methods simulation and permutation used for generating null plots in a lineup. Re-sampling methods can be used with numeric statistics also.

Simulation from distribution, can be used to to check for outliers.

We can also compute how many simulated values are more than the observed which gives a simulation $p$-value: 0.61.

For sample means, conventional tests provide a means for assessing what might be observed if different samples were taken.

Bootstrapping the current sample, can be used for robust statistics. If we have a sample of values:

[1] 2 2 3 6 7 7 8 8

to bootstrap sample with replacement:

sort(sample(x, replace=TRUE))

[1] 2 2 3 3 7 7 7 7

sort(sample(x, replace=TRUE))

[1] 2 3 6 6 6 6 8 8

Here’s an example of bootstrapping to get a confidence interval for a median.

[1] "Median: 6.34"

[1] "95% CI: ( 4.99 , 9.16 )"

2019 Australian Federal Election (7/8)

📊
data
R

Where are these electorates?

The width of the boxplot is proportional to the number of electoral districts in the corresponding state (which is roughly proportional to the population).

tdf3 <- df1 |>
  group_by(DivisionID) |>
  summarise(
    DivisionNm = unique(DivisionNm),
    State = unique(StateAb),
    votes_GRN = TotalVotes[which(PartyAb == "GRN")],
    votes_total = sum(TotalVotes)
  ) |>
  mutate(perc_GRN = votes_GRN / votes_total * 100)

tdf3 |>
  mutate(State = fct_reorder(State, perc_GRN)) |>
  ggplot(aes(perc_GRN, State)) +
  geom_boxplot(varwidth = TRUE) +
  labs(
    x = "First preference votes %",
    y = "Count",
    title = "Greens party"
  )

Outliers

Outliers are observations that are significantly different from the majority.

Outliers can occur by chance in almost all distributions, but could be indicative of:
- a measurement error,
- a different population, or
- an issue with the sampling process.

Closer look at the boxplot

Observations that are outside the range of lower to upper fence (1.5 times the box length) are often referred to as outliers.
Plotting boxplots for data from a skewed distribution will almost always show these “outliers” but these are not necessarily outliers.
Some definitions of outliers assume a symmetrical population distribution (e.g. in boxplots or observations a certain standard deviations away from the mean) and these definitions are ill-suited for asymmetrical distributions.
Declaring observations outliers typically requires additional data context.

What cannot be seen from boxplots?

2019 Australian Federal Election (8/8)

📊
data
R

Now what do you notice from this graph that you didn’t notice before?

Only two electoral districts in NT.
And only 3 and 5 electoral districts in ACT and TAS, respectively!
Boxplots requires 5 points!
We should have summarised the number of electoral districts for each state with numerical statistics as a first step.
Also the outlier (yes, safe to call this an outlier!) and the cluster in the Victoria electorates.

tdf3 <- df1 |>
  group_by(DivisionID) |>
  summarise(
    DivisionNm = unique(DivisionNm),
    State = unique(StateAb),
    votes_GRN = TotalVotes[which(PartyAb == "GRN")],
    votes_total = sum(TotalVotes)
  ) |>
  mutate(perc_GRN = votes_GRN / votes_total * 100)

tdf3 |>
  mutate(State = fct_reorder(State, perc_GRN)) |>
  ggplot(aes(perc_GRN, State)) +
  ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
  labs(
    x = "First preference votes %",
    y = "State",
    title = "Greens party"
  )

Both numerical and graphical summaries can reveal and/or hide aspects of the data

Transformations

Melbourne Housing Prices (1/6)

Suburb	Rooms	Type	Price ($)	Date
Abbotsford	3	Home	1,490,000	2017-04-01
Abbotsford	3	Home	1,220,000	2017-04-01
Abbotsford	3	Home	1,420,000	2017-04-01
Aberfeldie	3	Home	1,515,000	2017-04-01
Airport West	2	Home	670,000	2017-04-01
Airport West	2	Townhouse	530,000	2017-04-01
Airport West	2	Unit	540,000	2017-04-01
Airport West	3	Home	715,000	2017-04-01
Albanvale	6	Home	NA	2017-04-01
Albert Park	3	Home	1,925,000	2017-04-01
Albion	3	Unit	515,000	2017-04-01
Albion	4	Home	717,000	2017-04-01
Alphington	2	Home	1,675,000	2017-04-01
Alphington	4	Home	2,008,000	2017-04-01
Altona	2	Home	860,000	2017-04-01
Altona Meadows	4	Home	NA	2017-04-01
Altona North	3	Home	720,000	2017-04-01
Armadale	2	Unit	836,000	2017-04-01
Armadale	2	Home	2,110,000	2017-04-01
Armadale	3	Home	1,386,000	2017-04-01

This data was scraped each week from domain.com.au from 2016-01-28 to 2018-10-13
In total there are 63,023 observations
All variables shown (there are more variables not shown here), except price, have complete records
The are 48,433 property prices across Melbourne (roughly 23% missing)

Data source: Tony Pio (2018) Melbourne Housing Market

How would you explore this data first?

Yes, with an overview plot.

Melbourne Housing Prices (2/6)

📊
data
R
lineup
R

Is missingness more likely for expensive houses?

Check with a lineup

To impute missings other variables will need to be used.

Note: Houses with more than 8 rooms removed. Why?

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
  col_types = cols(
    .default = col_character(),
    Rooms = col_double(),
    Price = col_double(),
    Date = col_date(format = "%d/%m/%Y"),
    Propertycount = col_double(),
    Distance = col_double()
  )
)

skimr::skim(df2)

── Data Summary ────────────────────────
                           Values
Name                       df2   
Number of rows             63023 
Number of columns          13    
_______________________          
Column type frequency:           
  character                8     
  Date                     1     
  numeric                  4     
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────
  skim_variable n_missing complete_rate min max empty
1 Suburb                0             1   3  18     0
2 Address               0             1   7  27     0
3 Type                  0             1   1   1     0
4 Method                0             1   1   2     0
5 SellerG               0             1   1  27     0
6 Postcode              0             1   4   4     0
7 Regionname            0             1  16  26     0
8 CouncilArea           0             1  17  30     0
  n_unique whitespace
1      380          0
2    57754          0
3        3          0
4        9          0
5      476          0
6      225          0
7        8          0
8       34          0

── Variable type: Date ─────────────────────────────────────
  skim_variable n_missing complete_rate min       
1 Date                  0             1 2016-01-28
  max        median     n_unique
1 2018-10-13 2017-09-03      112

── Variable type: numeric ──────────────────────────────────
  skim_variable n_missing complete_rate      mean         sd
1 Rooms                 0         1          3.11      0.958
2 Price             14590         0.768 997898.   593499.   
3 Propertycount         0         1       7618.     4424.   
4 Distance              0         1         12.7       7.59 
     p0    p25      p50       p75       p100 hist 
1     1      3      3         4         31   ▇▁▁▁▁
2 85000 620000 830000   1220000   11200000   ▇▁▁▁▁
3    39   4380   6795     10412      21650   ▅▇▅▂▁
4     0      7     11.4      16.7       64.1 ▇▆▁▁▁

df2 |>
  select(Suburb, Rooms, Type, Price, Date) |>
  arrange(Suburb, Date) |>
  visdat::vis_miss()

df2 |>
  mutate(miss = ifelse(is.na(Price), 
    "Missing", "Recorded")) |>
  count(Rooms, miss) |>
  filter(Rooms < 8) |>
  group_by(miss) |>
  mutate(perc = n / sum(n) * 100) |>
  ggplot(aes(as.factor(Rooms), perc, fill = miss)) +
    geom_col(position = "dodge") +
    scale_fill_viridis_d(begin=0.3, end=0.7) +
    labs(x = "Rooms", y = "Percentage", fill = "Price") +
    theme(aspect.ratio = 0.8)

Is there a suspicious plot?

Break the association between Rooms and Missing/Not on Price, because the null hypothesis is that there is no difference in missing status for price based on the size of the house. Why?

library(nullabor)
df2_d <- df2 |>
  mutate(miss = ifelse(is.na(Price), "Missing", "Recorded")) |>
  select(Rooms, miss) |>
  filter(Rooms < 8)
df2_l <- lineup(null_permute("miss"), df2_d, n=10, pos=7) 
df2_l_agg <- df2_l |>
  group_by(.sample) |>
  count(Rooms, miss) |>
  ungroup() |>
  group_by(miss) |>
  mutate(perc = n / sum(n) * 100) |>
  mutate(Rooms = as.factor(Rooms))
ggplot(df2_l_agg, aes(x=Rooms, y=perc, fill = miss)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d(begin=0.3, end=0.7) +
  facet_wrap(~.sample, ncol=5) +
  theme(legend.position = "none", 
        axis.text = element_blank(),
        axis.title = element_blank(),
        panel.grid.major.x = element_blank())

🧩 Your turn

What might be alternative plots? Especially to reveal the relationship more clearly.

Check the support of your data

If you have too few measurements in any region (extreme), summaries for these regions will be unreliable.

For quantitative variables, it may be necessary to remove extremes.
If the variable is categorical it might be best to combine levels.
It is important to script so decisions can be reversed or rare events are not ignored.

We removed houses with 8 or more rooms. What other way might we have handled these houses?

Melbourne Housing Prices (3/6)

📊
data
R

What can we say from this plot?

The housing prices are right-skewed
There appears to be a lot of outlying housing prices (how can we tell?)

Note: We determined that it is likely that more higher price houses have not disclosed the sale price. The distribution of price will need to be checked again after imputation.

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
  col_types = cols(
    .default = col_character(),
    Rooms = col_double(),
    Price = col_double(),
    Date = col_date(format = "%d/%m/%Y"),
    Propertycount = col_double(),
    Distance = col_double()
  )
)

df2 |>
  ggplot(aes(Price / 1e6)) +
  geom_histogram(color = "white") +
  labs(
    x = "Price (mil)",
    y = "Count"
  )

Melbourne Housing Prices (4/6)

📊
data
R

The x-axis has been $\log_{10}$-transformed in this plot
The plot appears more symmetrical now
What is a useful measure of central tendency here?

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
  col_types = cols(
    .default = col_character(),
    Rooms = col_double(),
    Price = col_double(),
    Date = col_date(format = "%d/%m/%Y"),
    Propertycount = col_double(),
    Distance = col_double()
  )
)

df2 |>
  ggplot(aes(Price / 1e6)) +
  geom_histogram(color = "white") +
  labs(
    x = "Price (mil)",
    y = "Count"
  ) +
  scale_x_log10()

Melbourne Housing Prices (5/6)

Central tendency
R

With no transformation:

Mean	Median	Trimmed Mean	Winsorised Mean
$997,898	$830,000	$871,375	$903,823

With log transformation (and back-transformed to original scale):

Mean	Median	Trimmed Mean	Winsorised Mean
$874,166	$830,000	$847,973	$859,325

df2 |>
  filter(!is.na(Price)) |>
  summarise(
    Mean = scales::dollar(mean(Price)),
    Median = scales::dollar(median(Price)),
    `Trimmed Mean` = scales::dollar(mean(Price, trim = 0.2)),
    `Winsorised Mean` = scales::dollar(psych::winsor.mean(Price))
  ) |>
  knitr::kable(align = "r") |>
  kableExtra::kable_classic() |>
  kableExtra::kable_styling(full_width=FALSE)

df2 |>
  filter(!is.na(Price)) |>
  mutate(lPrice = log10(Price)) |>
  summarise(
    Mean = scales::dollar(10^mean(lPrice)),
    Median = scales::dollar(10^median(lPrice)),
    `Trimmed Mean` = scales::dollar(10^mean(lPrice, trim = 0.2)),
    `Winsorised Mean` = scales::dollar(10^psych::winsor.mean(lPrice))
  ) |>
  knitr::kable(align = "r") |>
  kableExtra::kable_classic() |>
  kableExtra::kable_styling(full_width=FALSE)

General rules for transform quantitative variables

Non-shape changing, scaling:

standardise to mean 0, sd 1
standardise to min 0, max 1
z-score

Shape changing, transformations: Remember the ladder of power transformations. (eg transforming left-skewed to more uniform using $^2$)

Distribution changing: quantile

Some features cannot be fixed: gaps, multimodality, heaping. You need to find some explaining variable.

Some features can be artificially fixed: discreteness. If regularly discretized, add random uniform noise to spread equally between gaps.

Multi-modality

Melbourne Housing Prices (6/6)

📊
data
R

Distribution from side-by-side univariate plots shows that higher number of rooms generally are pricier.
This strata could be responsible for multimodality in price distribution, even though it is not visible in the histogram.
Accounting for rooms is important.

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
  col_types = cols(
    .default = col_character(),
    Rooms = col_double(),
    Price = col_double(),
    Date = col_date(format = "%d/%m/%Y"),
    Propertycount = col_double(),
    Distance = col_double()
  )
)

df2 |>
  filter(Rooms < 8) |>
  ggplot(aes(x=as.factor(Rooms), y=Price / 1e6, )) +
  ggbeeswarm::geom_quasirandom(varwidth=TRUE, alpha=0.3) +
  scale_y_log10() +
  labs(y = "Price (mil)", x = "# of Rooms")

Bins and Bandwidths: More details

Hidalgo stamps thickness

📊
data
R

Famous historical example

A stamp collector, Walton von Winkle, bought several collections of Mexican stamps from 1872-1874 and measured the thickness of all of them.
The different bandwidth for the density plot suggests different possibilities for number of modes.

Which do you think most accurately reflects what’s in the data?

load(here::here("data/Hidalgo1872.rda"))
skimr::skim(Hidalgo1872)

── Data Summary ────────────────────────
                           Values     
Name                       Hidalgo1872
Number of rows             485        
Number of columns          3          
_______________________               
Column type frequency:                
  numeric                  3          
________________________              
Group variables            None       

── Variable type: numeric ──────────────────────────────────
  skim_variable n_missing complete_rate   mean      sd    p0
1 thickness             0         1     0.0860 0.0150  0.06 
2 thicknessA          195         0.598 0.0922 0.0162  0.068
3 thicknessB          289         0.404 0.0768 0.00508 0.06 
     p25   p50   p75  p100 hist 
1 0.075  0.08  0.098 0.131 ▅▇▃▂▁
2 0.0772 0.092 0.105 0.131 ▇▃▆▃▂
3 0.072  0.078 0.08  0.097 ▁▃▇▁▁

hid <- ggplot(Hidalgo1872, aes(x=thickness, y=25)) +
  geom_quasirandom(width=15, alpha=0.5, size=1) +
  labs(x = "Thickness (0.001 mm)", y = "Density") + 
  ylim(c(0, 70)) +
  theme(aspect.ratio = 0.6) 
hid_p1 <- hid +
  geom_density(aes(x=thickness), alpha=0.5,
    color = "#E16A86", bw = 0.01, linewidth=2,
    inherit.aes = FALSE)
hid_p2 <- hid +
  geom_density(aes(x=thickness), alpha=0.5,
    color = "#E16A86", bw = 0.0075, linewidth=2,
    inherit.aes = FALSE) 
hid_p3 <- hid +
  geom_density(aes(x=thickness), alpha=0.5,
    color = "#E16A86", bw = 0.004, linewidth=2,
    inherit.aes = FALSE) 
hid_p4 <- hid +
  geom_density(aes(x=thickness), alpha=0.5,
    color = "#E16A86", bw = 0.001, linewidth=2,
    inherit.aes = FALSE) 
hid_p1 + hid_p2 + hid_p3 + hid_p4 + plot_layout(ncol=2)

Olive oil content

What do you see?

Mixture of discreteness and normal shape of continuous values. Why might this happen?

Check if there is a difference in the strata (here 1 thru 9), implying measurement policy differences.

Re-focus

Movie length

📊
data
R

Upon further exploration, you can find the two movies that are well over 16 hours long are “Cure for Insomnia”, “Four Stars”, and “Longest Most Meaningless Movie in the World”
We can restrict our attention to films under 3 hours:

Notice that there is a peak at particular times. Why do you think so?

data(movies, package = "ggplot2movies")
skimr::skim(movies)

── Data Summary ────────────────────────
                           Values
Name                       movies
Number of rows             58788 
Number of columns          24    
_______________________          
Column type frequency:           
  character                2     
  numeric                  22    
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────
  skim_variable n_missing complete_rate min max empty
1 title                 0             1   1 121     0
2 mpaa                  0             1   0   5 53864
  n_unique whitespace
1    56007          0
2        5          0

── Variable type: numeric ──────────────────────────────────
   skim_variable n_missing complete_rate          mean
 1 year                  0        1          1976.    
 2 length                0        1            82.3   
 3 budget            53573        0.0887 13412513.    
 4 rating                0        1             5.93  
 5 votes                 0        1           632.    
 6 r1                    0        1             7.01  
 7 r2                    0        1             4.02  
 8 r3                    0        1             4.72  
 9 r4                    0        1             6.37  
10 r5                    0        1             9.80  
11 r6                    0        1            13.0   
12 r7                    0        1            15.5   
13 r8                    0        1            13.9   
14 r9                    0        1             8.95  
15 r10                   0        1            16.9   
16 Action                0        1             0.0797
17 Animation             0        1             0.0628
18 Comedy                0        1             0.294 
19 Drama                 0        1             0.371 
20 Documentary           0        1             0.0591
21 Romance               0        1             0.0807
22 Short                 0        1             0.161 
             sd   p0      p25       p50        p75
 1       23.7   1893   1958      1983       1997  
 2       44.3      1     74        90        100  
 3 23350085.       0 250000   3000000   15000000  
 4        1.55     1      5         6.1        7  
 5     3830.       5     11        30        112  
 6       10.9      0      0         4.5        4.5
 7        5.96     0      0         4.5        4.5
 8        6.45     0      0         4.5        4.5
 9        7.59     0      0         4.5        4.5
10        9.73     0      4.5       4.5       14.5
11       11.0      0      4.5      14.5       14.5
12       11.6      0      4.5      14.5       24.5
13       11.3      0      4.5      14.5       24.5
14        9.44     0      4.5       4.5       14.5
15       15.7      0      4.5      14.5       24.5
16        0.271    0      0         0          0  
17        0.243    0      0         0          0  
18        0.455    0      0         0          1  
19        0.483    0      0         0          1  
20        0.236    0      0         0          0  
21        0.272    0      0         0          0  
22        0.367    0      0         0          0  
          p100 hist 
 1      2005   ▁▁▃▃▇
 2      5220   ▇▁▁▁▁
 3 200000000   ▇▁▁▁▁
 4        10   ▁▃▇▆▁
 5    157608   ▇▁▁▁▁
 6       100   ▇▁▁▁▁
 7        84.5 ▇▁▁▁▁
 8        84.5 ▇▁▁▁▁
 9       100   ▇▁▁▁▁
10       100   ▇▁▁▁▁
11        84.5 ▇▂▁▁▁
12       100   ▇▃▁▁▁
13       100   ▇▃▁▁▁
14       100   ▇▁▁▁▁
15       100   ▇▃▁▁▁
16         1   ▇▁▁▁▁
17         1   ▇▁▁▁▁
18         1   ▇▁▁▁▃
19         1   ▇▁▁▁▅
20         1   ▇▁▁▁▁
21         1   ▇▁▁▁▁
22         1   ▇▁▁▁▂

ggplot(movies, aes(length)) +
  geom_histogram(color = "white") +
  labs(x = "Length of movie (minutes)", y = "Frequency") +
  theme(aspect.ratio = 0.6)

ggplot(movies, aes(length)) +
  geom_histogram(color = "white") +
  labs(x = "Length of movie (minutes)", y = "Frequency") +
  scale_x_log10() +
  theme(aspect.ratio = 0.6)

movies |>
  filter(length < 180) |>
  ggplot(aes(length)) +
  geom_histogram(binwidth = 1, fill = "#795549", color = "black") +
  labs(x = "Length of movie (minutes)", y = "Frequency")

Categorical variables

There are two types of categorical variables

Nominal where there is no intrinsic ordering to the categories
E.g. blue, grey, black, white.

Ordinal where there is a clear order to the categories.
E.g. Strongly disagree, disagree, neutral, agree, strongly agree.

Categorical variables in R

In R, categorical variables may be encoded as factors.

data <- c(2, 2, 1, 1, 3, 3, 3, 1)
factor(data)

[1] 2 2 1 1 3 3 3 1
Levels: 1 2 3

You can easily change the labels of the variables:

factor(data, labels = c("I", "II", "III"))

[1] II  II  I   I   III III III I  
Levels: I II III

Order of the factors are determined by the input:

# numerical input are ordered in increasing order 
factor(c(1, 3, 10))

[1] 1  3  10
Levels: 1 3 10

# character input are ordered by first char, alphabetically 
factor(c("1", "3", "10"))

[1] 1  3  10
Levels: 1 10 3

# you can specify order of levels explicitly 
factor(c("1", "3", "10"),
  levels = c("1", "3", "10")
)

[1] 1  3  10
Levels: 1 3 10

Numerical summaries: counts, proportions, percentages and odds

Tuberculosis counts in Australia

Code

options(digits=2)
tb_oz |>
  filter(year >= 2000) |>
  mutate(p = count/sum(count),
         pct = p*100, 
         odds = count/count[year==2000]) |>
  print(n=100)

# A tibble: 22 × 7
   country   iso3   year count      p   pct  odds
   <chr>     <chr> <dbl> <dbl>  <dbl> <dbl> <dbl>
 1 Australia AUS    2000   982 0.0522  5.22 1    
 2 Australia AUS    2001   953 0.0507  5.07 0.970
 3 Australia AUS    2002  1008 0.0536  5.36 1.03 
 4 Australia AUS    2003   926 0.0493  4.93 0.943
 5 Australia AUS    2004  1036 0.0551  5.51 1.05 
 6 Australia AUS    2005  1030 0.0548  5.48 1.05 
 7 Australia AUS    2006  1127 0.0600  6.00 1.15 
 8 Australia AUS    2007  1081 0.0575  5.75 1.10 
 9 Australia AUS    2008  1182 0.0629  6.29 1.20 
10 Australia AUS    2009  1176 0.0626  6.26 1.20 
11 Australia AUS    2010  1146 0.0610  6.10 1.17 
12 Australia AUS    2011  1202 0.0640  6.40 1.22 
13 Australia AUS    2012  1259 0.0670  6.70 1.28 
14 Australia AUS    2013   512 0.0272  2.72 0.521
15 Australia AUS    2014   474 0.0252  2.52 0.483
16 Australia AUS    2015   438 0.0233  2.33 0.446
17 Australia AUS    2016   481 0.0256  2.56 0.490
18 Australia AUS    2017   524 0.0279  2.79 0.534
19 Australia AUS    2018   502 0.0267  2.67 0.511
20 Australia AUS    2019   554 0.0295  2.95 0.564
21 Australia AUS    2020   609 0.0324  3.24 0.620
22 Australia AUS    2021   593 0.0316  3.16 0.604

For qualitative data, compute

count/frequency,
proportion/percentage
and sometimes, an odds ratio. Here we have used ratio relative to the count in year 2000.

Note: For exploration, no rounding of digits was done, but to report you would need to make the numbers pretty.

2019 Australian Federal Election

📊
data
R

Sorting levels sensibly is (almost) always better when plotting

df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"),
  skip = 1,
  col_types = cols(
    .default = col_character(),
    OrdinaryVotes = col_double(),
    AbsentVotes = col_double(),
    ProvisionalVotes = col_double(),
    PrePollVotes = col_double(),
    PostalVotes = col_double(),
    TotalVotes = col_double(),
    Swing = col_double()
  )
)

tdf3 <- df1 |>
  group_by(DivisionID) |>
  summarise(
    DivisionNm = unique(DivisionNm),
    State = unique(StateAb),
    votes_GRN = TotalVotes[which(PartyAb == "GRN")],
    votes_total = sum(TotalVotes)
  ) |>
  mutate(perc_GRN = votes_GRN / votes_total * 100)

tdf3 |>
  ggplot(aes(perc_GRN, State)) +
  ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
  labs(
    x = "First preference votes %",
    y = "State",
    title = "Greens party"
  )

tdf3 |>
  mutate(State = fct_reorder(State, perc_GRN)) |>
  ggplot(aes(perc_GRN, State)) +
  ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
  labs(
    x = "First preference votes %",
    y = "State",
    title = "Greens party"
  )

Order nominal variables meaningfully

Coding tip: use below functions to easily change the order of factor levels

stats::reorder(factor, value, mean)
forcats::fct_reorder(factor, value, median)
forcats::fct_reorder2(factor, value1, value2, func)

Visual inference

Typical plot description:

ggplot(data, aes(x=var1)) +
  geom_col()

ggplot(data, aes(x=var1)) +
  geom_bar()

Is the distribution consistent with a sample from a binomial distribution with a given p?

Potential simulation method from binomial

# Only one option
null_dist("var1", "binom", 
  list(size=n, p=phat))

Lineup of tuberculosis count between sexes

📊
Use conventional test
R

For this problem there is nothing more to learn from a lineup that what can be learned from a conventional hypothesis test of $H_0: p=0.5$.

binom.test(tb_oz_2012$count, n, p = 0.5, alternative = "two.sided")


    Exact binomial test

data:  tb_oz_2012$count
number of successes = 314, number of trials = 997,
p-value <2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.29 0.34
sample estimates:
probability of success 
                  0.31

library(nullabor)
set.seed(252)
tb_oz_2012 <- tb |>
  filter(iso3 == "AUS",
         year == 2012) |>
  select(iso3, year, new_sp_m04:new_ep_m65) |>
  pivot_longer(cols=new_sp_m04:new_ep_m65, 
               names_to = "var", 
               values_to = "count") |>
  separate(var, into=c("new", "type", "sexage")) |>
  select(-new) |>
  filter(!(sexage %in% c("sexunk014", "sexunk04", 
                         "sexunk15plus", "sexunk514",
                         "f15plus", "m15plus"))) |>
  group_by(sexage) |>
  summarise(count = sum(count, na.rm=TRUE)) |>
  mutate(sex=str_sub(sexage, 1, 1),
         age=str_sub(sexage, 2, str_length(sexage))) |>
  group_by(sex) |>
  summarise(count=sum(count)) |>
  ungroup() |>
  mutate(sex01 = ifelse(sex=="m", 0, 1)) |>
  select(-sex)

ggplot(lineup(null_dist("count", "binom", 
                        list(size=sum(tb_oz_2012$count),
                               p=0.5)), 
              tb_oz_2012, n=10),
       aes(x=sex01, y=count)) +
  geom_col() +
  facet_wrap(~.sample, ncol=5) +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        panel.grid.major = element_blank())

Key points

Make many plots
Drill down to subsets provided by other variables
Changing bins or binwidth/bandwidth in histogram, violin or density plots can paint a different picture
Transformation can re-focus to different aspects of the same data
Compute a variety of statistics, compare and contrast values
Consider different representations of categorical variables
- reordering meaningfully,
- lumping low frequencies together,
- plot or table, pie or barplot,
- include a missing category
Be careful in making conclusions when there are sub-regions with few observations

Imputing missings for univariate distributions

Quantitative variable: Simulate from a fitted distribution.

df2 <- df2 |>
  mutate(lPrice = log10(Price),
         price_miss = ifelse(is.na(Price), "yes", "no"))

df2_smry <- df2 |>
  summarise(m = mean(lPrice, na.rm=TRUE),
            s = sd(lPrice, na.rm=TRUE))
set.seed(1003)  
df2 <- df2 |>
  rowwise() |>
  mutate(lPrice = ifelse(price_miss == "yes", 
    rnorm(1, df2_smry$m, df2_smry$s), lPrice)) |>
  mutate(Price = ifelse(price_miss == "yes", 10^lPrice, Price))

Categorical variable: Simulate from multinomial.

# A tibble: 7 × 2
  age   count
  <chr> <dbl>
1 1524     27
2 2534     48
3 3544     15
4 4554     11
5 5564      9
6 65       15
7 u        12

 [1] 5 3 1 1 2 1 1 2 1 1 1 1

# A tibble: 7 × 2
  age   count
  <chr> <dbl>
1 1524     35
2 2534     50
3 3544     16
4 4554     11
5 5564     10
6 65       15
7 u        12

imputeMulti library can automate for multiple variables.

Resources

Unwin (2015) Graphical Data Analysis with R
Harrison, David, and Daniel L. Rubinfeld (1978) Hedonic Housing Prices and the Demand for Clean Air, Journal of Environmental Economics and Management 5 81-102. Original data.
Gilley, O.W. and R. Kelley Pace (1996) On the Harrison and Rubinfeld Data. Journal of Environmental Economics and Management 31 403-405. Provided corrections and examined censoring.
Maindonald, John H. and Braun, W. John (2020). DAAG: Data Analysis and Graphics Data and Functions. R package version 1.24
British Board of Trade (1990), Report on the Loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing
Hand, D. J., Daly, F., McConway, K., Lunn, D. and Ostrowski, E. eds (1993) A Handbook of Small Data Sets. Chapman & Hall, Data set 285 (p. 229)
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
Fleiss JL (1993): The statistical basis of meta-analysis. Statistical Methods in Medical Research 2 121–145
Balduzzi S, Rücker G, Schwarzer G (2019), How to perform a meta-analysis with R: a practical tutorial, Evidence-Based Mental Health.
Josse et al (2022) R-miss-tastic, https://rmisstastic.netlify.app