The majority of the country does not give its first preference to the Greens
Some electorates are slightly more supportive than others
What further questions does it raise?
Notes:
Australia uses full-preference instant-runoff voting in single-member seats.
Following the full allocation of preferences, it is possible to derive a two-party-preferred figure, where the votes have been allocated between the two main candidates in the election.
In Australia, this is usually between the candidates from the Coalition parties and the Australian Labor Party.
skimr::skim(tdf3)
── Data Summary ────────────────────────
Values
Name tdf3
Number of rows 151
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 3
________________________
Group variables None
── Variable type: character ────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 DivisionID            0             1   3   3     0      151          0
2 DivisionNm            0             1   4  15     0      151          0
3 State                 0             1   2   3     0        8          0
── Variable type: numeric ──────────────────────────────────
  skim_variable n_missing complete_rate     mean      sd       p0     p25       p50      p75     p100 hist
1 votes_GRN             0             1  9821.    5581.    2744     6555     8676     11532.    45876   ▇▂▁▁▁
2 votes_total           0             1 99925.    9801.   51009    96372.  100936    105588    116216   ▁▁▁▇▅
3 perc_GRN              0             1     9.87     5.63     2.89     6.43     8.55     11.4      47.8 ▇▂▁▁▁
Formulating questions for EDA vs making observations from a plot
BEFORE plotting or making summaries, think of broad (open-ended) questions about the distribution of values.
Questions with simple answers (i.e. yes or no) are less helpful in encouraging exploration using graphics.
For example,
What is the distribution of the first preference vote percentages for the Labor party across Australia?
Is it evenly spread across electorates or are there clusters of popularity?
AFTER plotting or making summaries, ask whether this was what you expected and whether there are any surprises. Detail what you learn, and how you should follow up on these observations.
Is the outlying observation the electoral district that won the seat?
Visual inference
Suitable null models for a single variable all focus on potential distributions.
Typical plot description:
ggplot(data, aes(x = var1)) + geom_histogram()
Is the distribution consistent with a sample from a
normal distribution?
uniform distribution?
skewed distribution?
MANY OTHER POTENTIAL DISTRIBUTIONS
Potential simulation methods from specific distributions
Using the exponential distribution as the null says that we expect most electorates to have small tallies for Greens, and only a few electorates will have large, potentially winnable tallies.
NOTE: We’ve already seen the data so we can’t be impartial judges for choosing the most different plot. We can use the null plots to check whether the small mode of moderately high tallies is unusual if the tallies really are samples from an exponential distribution.
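A minimal sketch of how null samples for such a lineup could be simulated with base R. The data here (`obs`) are made up to stand in for the Greens percentages, and the rate of the exponential is estimated from the sample mean; neither is taken from the original analysis.

```r
# Sketch only: `obs` is synthetic data standing in for the observed values
set.seed(2024)
obs <- c(rexp(140, rate = 1 / 8), runif(11, 25, 45))

# For an exponential distribution, rate = 1 / mean
rate_hat <- 1 / mean(obs)

# Simulate 19 null samples, each the same size as the observed data
n_null <- 19
nulls <- replicate(n_null, rexp(length(obs), rate = rate_hat))

# Hide the observed data at a random position among the null samples
pos <- sample(n_null + 1, 1)
panels <- vector("list", n_null + 1)
panels[[pos]] <- obs
panels[-pos] <- lapply(seq_len(n_null), function(j) nulls[, j])

# Draw the 20 histograms in a 4 x 5 grid
op <- par(mfrow = c(4, 5), mar = c(2, 2, 2, 1))
for (i in seq_along(panels)) hist(panels[[i]], main = i, xlab = "", ylab = "")
par(op)
```

The position `pos` is kept so the data plot can be identified after the lineup has been judged.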
Interquartile range (IQR): the difference between the 1st and 3rd quartiles, a more robust measure of spread than the standard deviation
Median absolute deviation (MAD) is even more robust:
\[\text{median}(|x_i - \text{median}(x_i)|)\]
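As a small illustration, the three measures of spread can be computed in base R. The sample `x` and the added extreme value are chosen only for this sketch. Note that R's `mad()` multiplies by 1.4826 by default, so `constant = 1` is needed to match the formula above.

```r
x <- c(2, 2, 3, 6, 7, 7, 8, 8)

sd(x)                        # standard deviation
IQR(x)                       # interquartile range: 4.5
median(abs(x - median(x)))   # MAD computed directly from the formula: 1.5
mad(x, constant = 1)         # same value using the built-in function
mad(x)                       # default is scaled by 1.4826

# Add one extreme value and compare how much each measure changes
x_out <- c(x, 80)
c(sd = sd(x_out) / sd(x),
  iqr = IQR(x_out) / IQR(x),
  mad = mad(x_out) / mad(x))
```

The ratios in the last line show how strongly a single extreme value inflates each measure.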
Measure of dispersion
Plot    SD     IQR    MAD    Skewness   Kurtosis
1       0.90   1.19   0.87   -0.072     3.0
2       0.99   1.41   1.08    0.358     2.2
3       1.33   1.18   0.79    1.944     7.2
4       0.29   0.45   0.34   -0.126     1.8
5       0.47   0.50   0.34    1.691     6.4
6       2.78   5.36   2.98   -0.351     1.7
Inference for robust statistics
We have seen the re-sampling methods simulation and permutation used for generating null plots in a lineup. Re-sampling methods can also be used with numeric statistics.
Simulation from a distribution can be used to check for outliers.
We can also compute the proportion of simulated values that are greater than the observed value, which gives a simulation \(p\)-value: 0.61.
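A sketch of how a simulation \(p\)-value of this kind could be computed. The data are synthetic, and the choice of the maximum as the statistic and the exponential as the null distribution are assumptions made for illustration; the result will not reproduce the 0.61 reported above.

```r
set.seed(1)
obs <- rexp(150, rate = 1 / 10)   # synthetic sample (assumption)

# Statistic of interest: the largest observed value
obs_stat <- max(obs)

# Simulate the same statistic under the null (exponential, rate from the data)
rate_hat <- 1 / mean(obs)
sim_stat <- replicate(1000, max(rexp(length(obs), rate = rate_hat)))

# Proportion of simulated maxima at least as large as the observed maximum
p_value <- mean(sim_stat >= obs_stat)
p_value
```

A large value suggests the observed maximum is not unusual under the null distribution.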
For sample means, conventional tests provide a means for assessing what might be observed if different samples were taken.
Bootstrapping the current sample can be used for robust statistics. If we have a sample of values:
[1] 2 2 3 6 7 7 8 8
then to bootstrap, sample with replacement:
sort(sample(x, replace=TRUE))
[1] 2 2 3 3 7 7 7 7
sort(sample(x, replace=TRUE))
[1] 2 3 6 6 6 6 8 8
Here’s an example of bootstrapping to get a confidence interval for a median.
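A minimal sketch of a percentile bootstrap interval for the median, using the small sample shown above. The number of resamples (1000) and the 95% level are choices made for this illustration.

```r
set.seed(123)
x <- c(2, 2, 3, 6, 7, 7, 8, 8)

# Resample with replacement and compute the median of each resample
boot_medians <- replicate(1000, median(sample(x, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap medians
ci <- quantile(boot_medians, c(0.025, 0.975))
ci
```

With a sample this small the interval is coarse, because the bootstrap medians can only take a limited set of values.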
The width of the boxplot is proportional to the number of electoral districts in the corresponding state (which is roughly proportional to the population).
Outliers are observations that are significantly different from the majority.
Outliers can occur by chance in almost all distributions, but could be indicative of:
a measurement error,
a different population, or
an issue with the sampling process.
Closer look at the boxplot
Observations outside the lower and upper fences (1.5 times the box length, i.e. the IQR, beyond the first and third quartiles) are often referred to as outliers.
Plotting boxplots for data from a skewed distribution will almost always show these “outliers” but these are not necessarily outliers.
Some definitions of outliers assume a symmetrical population distribution (e.g. in boxplots, or observations a certain number of standard deviations away from the mean) and these definitions are ill-suited for asymmetrical distributions.
Declaring observations outliers typically requires additional data context.
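To illustrate the point about skewed distributions, here is a sketch that computes the boxplot fences by hand for a simulated exponential sample. The sample is synthetic, and quartiles from `quantile()` are used here, which can differ slightly from the hinges used by `boxplot()`.

```r
set.seed(42)
x <- rexp(1000)   # a right-skewed sample with no contamination

# Fences: 1.5 * IQR beyond the first and third quartiles
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Observations flagged by the fence rule
flagged <- x[x < lower | x > upper]
length(flagged)
range(flagged)
```

Every flagged value comes from the same exponential distribution as the rest of the sample, so none is an outlier in the sense of a different population or a measurement error.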
df2 |>
  mutate(miss = ifelse(is.na(Price), "Missing", "Recorded")) |>
  count(Rooms, miss) |>
  filter(Rooms < 8) |>
  group_by(miss) |>
  mutate(perc = n / sum(n) * 100) |>
  ggplot(aes(as.factor(Rooms), perc, fill = miss)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d(begin = 0.3, end = 0.7) +
  labs(x = "Rooms", y = "Percentage", fill = "Price") +
  theme(aspect.ratio = 0.8)
Is there a suspicious plot?
To generate the null plots, break the association between Rooms and the missing/recorded status of Price, because the null hypothesis is that there is no difference in missing status for price based on the size of the house. Why?
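One way to break the association is to permute the missing/recorded labels across houses. The sketch below uses a made-up data frame (`df`, with hypothetical `Rooms` and `miss` columns) rather than the housing data.

```r
set.seed(7)
# Synthetic stand-in for the housing data (assumption)
df <- data.frame(
  Rooms = sample(1:6, 200, replace = TRUE),
  miss  = sample(c("Missing", "Recorded"), 200, replace = TRUE,
                 prob = c(0.2, 0.8))
)

# Permute the labels: any link between Rooms and miss is now due to chance
null_df <- df
null_df$miss <- sample(df$miss)

# Percentage of each Rooms value within each missing status
perc <- prop.table(table(null_df$Rooms, null_df$miss), margin = 2) * 100
round(perc, 1)
```

Permuting keeps the number of missing and recorded prices fixed, so only the relationship with Rooms changes between null plots.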
There appear to be many outlying housing prices (how can we tell?)
Note: We determined that higher-priced houses are more likely to have an undisclosed sale price. The distribution of price will need to be checked again after imputation.
Upon further exploration, you can find that the three movies that are well over 16 hours long are “Cure for Insomnia”, “Four Stars”, and “Longest Most Meaningless Movie in the World”.
We can restrict our attention to films under 3 hours:
Notice that there are peaks at particular lengths. Why do you think so?
ggplot(movies, aes(length)) +
  geom_histogram(color = "white") +
  labs(x = "Length of movie (minutes)", y = "Frequency") +
  theme(aspect.ratio = 0.6)
ggplot(movies, aes(length)) +
  geom_histogram(color = "white") +
  labs(x = "Length of movie (minutes)", y = "Frequency") +
  scale_x_log10() +
  theme(aspect.ratio = 0.6)
movies |>
  filter(length < 180) |>
  ggplot(aes(length)) +
  geom_histogram(binwidth = 1, fill = "#795549", color = "black") +
  labs(x = "Length of movie (minutes)", y = "Frequency")
Categorical variables
There are two types of categorical variables
Nominal: there is no intrinsic ordering to the categories, e.g. blue, grey, black, white.
Ordinal: there is a clear order to the categories, e.g. strongly disagree, disagree, neutral, agree, strongly agree.
Categorical variables in R
In R, categorical variables may be encoded as factors.
data <- c(2, 2, 1, 1, 3, 3, 3, 1)
factor(data)
[1] 2 2 1 1 3 3 3 1
Levels: 1 2 3
You can easily change the labels of the levels:
factor(data, labels = c("I", "II", "III"))
[1] II II I I III III III I
Levels: I II III
The order of the levels is determined by the input:
# numerical input is ordered in increasing order
factor(c(1, 3, 10))
[1] 1 3 10
Levels: 1 3 10
# character input is sorted as strings, so "10" comes before "3"
factor(c("1", "3", "10"))
[1] 1 3 10
Levels: 1 10 3
# you can specify the order of levels explicitly
factor(c("1", "3", "10"), levels = c("1", "3", "10"))
[1] 1 3 10
Levels: 1 3 10
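For ordinal variables, the explicit level order can also be declared as meaningful with `ordered = TRUE`. The responses below are made up for this sketch, reusing the agreement scale mentioned earlier.

```r
# Hypothetical survey responses (assumption)
responses <- c("agree", "neutral", "strongly agree", "disagree", "agree")

likert <- factor(
  responses,
  levels = c("strongly disagree", "disagree", "neutral",
             "agree", "strongly agree"),
  ordered = TRUE
)

levels(likert)      # levels keep the order given, not alphabetical order
likert < "agree"    # comparisons respect the level order
table(likert)       # unused levels still appear, with a count of zero
```

Keeping unused levels in the table is often useful when plotting, since the full scale is shown even when a category has no responses.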
Numerical summaries: counts, proportions, percentages and odds
Tuberculosis counts in Australia
# A tibble: 22 × 7
country iso3 year count p pct odds
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Australia AUS 2000 982 0.0522 5.22 1
2 Australia AUS 2001 953 0.0507 5.07 0.970
3 Australia AUS 2002 1008 0.0536 5.36 1.03
4 Australia AUS 2003 926 0.0493 4.93 0.943
5 Australia AUS 2004 1036 0.0551 5.51 1.05
6 Australia AUS 2005 1030 0.0548 5.48 1.05
7 Australia AUS 2006 1127 0.0600 6.00 1.15
8 Australia AUS 2007 1081 0.0575 5.75 1.10
9 Australia AUS 2008 1182 0.0629 6.29 1.20
10 Australia AUS 2009 1176 0.0626 6.26 1.20
11 Australia AUS 2010 1146 0.0610 6.10 1.17
12 Australia AUS 2011 1202 0.0640 6.40 1.22
13 Australia AUS 2012 1259 0.0670 6.70 1.28
14 Australia AUS 2013 512 0.0272 2.72 0.521
15 Australia AUS 2014 474 0.0252 2.52 0.483
16 Australia AUS 2015 438 0.0233 2.33 0.446
17 Australia AUS 2016 481 0.0256 2.56 0.490
18 Australia AUS 2017 524 0.0279 2.79 0.534
19 Australia AUS 2018 502 0.0267 2.67 0.511
20 Australia AUS 2019 554 0.0295 2.95 0.564
21 Australia AUS 2020 609 0.0324 3.24 0.620
22 Australia AUS 2021 593 0.0316 3.16 0.604
For qualitative data, compute
count/frequency,
proportion/percentage
and sometimes, an odds ratio. Here we have used the ratio of each year's count to the count in year 2000.
Note: For exploration, no rounding of digits was done, but to report you would need to make the numbers pretty.
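The summaries in the table can be reproduced from the counts alone. This sketch assumes that `p` is each year's share of the total count over all years, which is consistent with the values shown (982 / 18795 ≈ 0.0522), and that `odds` is the count divided by the count in 2000.

```r
tb <- data.frame(
  year  = 2000:2021,
  count = c(982, 953, 1008, 926, 1036, 1030, 1127, 1081, 1182, 1176, 1146,
            1202, 1259, 512, 474, 438, 481, 524, 502, 554, 609, 593)
)

tb$p    <- tb$count / sum(tb$count)              # proportion of all cases
tb$pct  <- tb$p * 100                            # the same as a percentage
tb$odds <- tb$count / tb$count[tb$year == 2000]  # ratio relative to year 2000

head(tb, 3)
```

For reporting, the columns could be rounded, for example with `round(tb$pct, 1)`.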
For this problem there is nothing more to learn from a lineup than what can be learned from a conventional hypothesis test of \(H_0: p=0.5\).
binom.test(tb_oz_2012$count, n, p = 0.5, alternative = "two.sided")
Exact binomial test
data: tb_oz_2012$count
number of successes = 314, number of trials = 997,
p-value <2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.29 0.34
sample estimates:
probability of success
0.31
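The same test can be run directly from the numbers printed in the output (314 successes in 997 trials), without the original data objects. This is a sketch using those two values only.

```r
# Values taken from the printed output above
res <- binom.test(314, 997, p = 0.5, alternative = "two.sided")

res$estimate   # sample proportion, 314 / 997
res$conf.int   # exact (Clopper-Pearson) 95% confidence interval
res$p.value    # two-sided p-value
```

Extracting the components this way makes it easy to round them for reporting.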
The imputeMulti package can automate this for multiple variables.
Resources
Unwin (2015) Graphical Data Analysis with R
Harrison, David, and Daniel L. Rubinfeld (1978) Hedonic Housing Prices and the Demand for Clean Air, Journal of Environmental Economics and Management 5, 81-102. Original data.
Gilley, O.W. and R. Kelley Pace (1996) On the Harrison and Rubinfeld Data. Journal of Environmental Economics and Management 31, 403-405. Provided corrections and examined censoring.
Maindonald, John H. and Braun, W. John (2020). DAAG: Data Analysis and Graphics Data and Functions. R package version 1.24
British Board of Trade (1990), Report on the Loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing
Hand, D. J., Daly, F., McConway, K., Lunn, D. and Ostrowski, E. eds (1993) A Handbook of Small Data Sets. Chapman & Hall, Data set 285 (p. 229)
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
Fleiss JL (1993): The statistical basis of meta-analysis. Statistical Methods in Medical Research 2, 121–145
Balduzzi S, Rücker G, Schwarzer G (2019), How to perform a meta-analysis with R: a practical tutorial, Evidence-Based Mental Health.
Josse et al (2022) R-miss-tastic, https://rmisstastic.netlify.app