With the birth of photography, and in particular motion photography, Muybridge was able to illustrate that all four legs of a galloping horse are never extended simultaneously.
When you look at data, you might discover that there is a different story, or many different stories.
Try to see with fresh eyes
Outline
The humble but powerful scatterplot
Additions and variations
Transformations to linearity
(Robust) numerical measures of association
Simpson’s paradox
Making null samples to test for association
Imputing missings
The scatterplot
Scatterplots are the natural plot to make to explore association between two continuous (quantitative or numeric) variables.
They are not just for linear relationships but are useful for examining nonlinear patterns, clustering and outliers.
We also can think about scatterplots in terms of statistical distributions: if a histogram shows a marginal distribution, a scatterplot allows us to examine the bivariate distribution of a sample.
History
Descartes provided the Cartesian coordinate system in the 17th century, with perpendicular lines indicating two axes.
It wasn’t until 1832 that the scatterplot appeared, when John Frederick Herschel plotted position and time of double stars.
This is 200 years after the Cartesian coordinate system, and 50 years after bar charts and line charts appeared, used in the work of William Playfair to examine economic data.
Kopf argues that "the scatter plot, by contrast, proved more useful for scientists", but it is clearly useful in economics today.
http://www.datavis.ca/milestones/
Language and terminology
Are the words “correlation” and “association” interchangeable?
"In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related." (Wikipedia)
If the relationship is not linear, call it "association", and avoid "correlated".
Features of a pair of continuous variables (1/3)
Feature | Description
positive trend | Low value corresponds to low value, and high to high.
negative trend | Low value corresponds to high value, and high to low.
no trend | No relationship
strong | Very little variation around the trend
moderate | Variation around the trend is almost as much as the trend
weak | A lot of variation making it hard to see any trend
Features of a pair of continuous variables (2/3)
Feature | Description
linear form | The shape is linear
nonlinear form | The shape is more of a curve
outliers | There are one or more points that do not fit the pattern of the others
clusters | The observations group into multiple clumps
gaps | There is a gap, or gaps, but the observations are not clumped
Features of a pair of continuous variables (3/3)
Feature | Description
barrier | There is a combination of the variables which appears impossible
l-shape | When one variable changes the other is approximately constant
discreteness | The relationship between the two variables differs from the overall pattern, and observations fall in a striped pattern
heteroskedastic | Variation is different in different areas, and may depend on the value of the x variable
weighted | If observations have an associated weight, reflect it in the scatterplot, e.g. with a bubble chart
Additional considerations (Unwin, 2015):
causation: one variable has a direct influence on the other variable, in some way. For example, people who are taller tend to weigh more. The dependent variable is conventionally on the y axis. It’s not generally possible to tell from the plot that the relationship is causal, which typically needs to be argued from other sources of information.
association: variables may be related to one another, but only through a different variable, e.g. ice cream sales are positively correlated with beach drownings, most likely because both depend on temperature.
conditional relationships: the relationship between two variables depends on a third, e.g. income against age likely has a different relationship depending on whether a person is retired or not.
Famous data examples
Famous scatterplot examples
Anscombe’s quartet
All four sets of Anscombe’s quartet have the same means, standard deviations and correlations, \(\bar{x}\) = 9, \(\bar{y}\) = 7.5, \(s_x\) = 3.3, \(s_y\) = 2, \(r\) = 0.82.
The numerical statistics are the same, yet the associations are very different.
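One way to check this claim is with the `anscombe` data frame that ships with base R, where the four sets are stored as columns `x1` to `x4` and `y1` to `y4`. The following is a minimal sketch; the object name `quartet_stats` is just an illustrative choice.

```r
# anscombe is part of base R (datasets package): columns x1..x4, y1..y4
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y), r = cor(x, y))
})
colnames(quartet_stats) <- paste0("set", 1:4)

# Each column should agree with the others to about two decimal places
round(quartet_stats, 2)
```

Plotting each pair, for example with `plot(anscombe$x2, anscombe$y2)`, is what reveals how different the four sets really are.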
Datasaurus dozen
And similarly all 13 sets of the datasaurus dozen have the same means, standard deviations and correlations, \(\bar{x}\) = 54, \(\bar{y}\) = 48, \(s_x\) = 17, \(s_y\) = 27, \(r\) = -0.06.
We learned that the association between height and weight differs across strata defined by categorical variables: sport, gender, and possibly country and age, too.
Some of the association may be due to unmeasured variables, for example, “Athletics” is masking different body types in throwing vs running. This is a lurking variable.
If you were given just the Height and Weight in this data, could you have detected the presence of conditional relationships?
It’s not easy to detect the presence of the additional variable, and thus accurately describe the relationship between height and weight among Olympic athletes.
Generally, people don’t do very well at this task. Typically people under-estimate \(r\) from scatterplots, particularly when it is around 0.4-0.7. The perceived variation in a scatterplot does not vary linearly with \(r\).
When someone says correlation is 0.5 it sounds impressive. BUT when someone shows you a scatterplot of data that has correlation 0.5, you will say that’s a weak relationship.
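One possible way to see why 0.5 looks weak is to simulate data whose sample correlation is exactly 0.5 and note that a linear fit then explains only \(r^2 = 0.25\) of the variation in \(y\). This is a sketch on made-up data; the construction (standardising `x` and adding noise that is forced to be uncorrelated with it) is an assumption chosen for convenience, not something taken from the slides.

```r
set.seed(11)
n <- 200
r <- 0.5

x <- rnorm(n)
# Residuals from regressing noise on x are exactly uncorrelated with x
e <- residuals(lm(rnorm(n) ~ x))

zx <- (x - mean(x)) / sd(x)   # standardised x
ze <- e / sd(e)               # standardised noise (already mean zero)

# By construction the sample correlation between x and y equals r
y <- r * zx + sqrt(1 - r^2) * ze

cor(x, y)
summary(lm(y ~ x))$r.squared  # proportion of variance explained, r^2

plot(x, y, main = "Simulated data with r = 0.5")
```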
Kendall \(\tau\) (based on comparing pairs of observations)
Sort each variable, and return rank (of actual value)
For all pairs of observations \((x_i, y_i), (x_j, y_j)\), determine if concordant, \(x_i < x_j, y_i < y_j\) or \(x_i > x_j, y_i > y_j\), or discordant, \(x_i < x_j, y_i > y_j\) or \(x_i > x_j, y_i < y_j\).
\[\tau = \frac{n_c-n_d}{\frac12 n(n-1)}\]
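The pair-counting definition above can be sketched directly in R. The function name `kendall_tau` and the simulated data are illustrative assumptions; the formula as written assumes there are no ties, in which case the result should agree with `cor(..., method = "kendall")`.

```r
# Hypothetical helper implementing tau = (n_c - n_d) / (n(n-1)/2), no ties assumed
kendall_tau <- function(x, y) {
  n <- length(x)
  idx <- combn(n, 2)                 # all pairs of observations with i < j
  dx <- x[idx[1, ]] - x[idx[2, ]]
  dy <- y[idx[1, ]] - y[idx[2, ]]
  n_c <- sum(dx * dy > 0)            # concordant: x and y differ in the same direction
  n_d <- sum(dx * dy < 0)            # discordant: x and y differ in opposite directions
  (n_c - n_d) / (0.5 * n * (n - 1))
}

set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)

kendall_tau(x, y)
cor(x, y, method = "kendall")        # built-in version for comparison
```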
cor(df$x, df$y)
[1] 0.94
cor(df$x, df$y, method = "kendall")
[1] 0.067
Comparison of correlation measures
sample | corr | spearman | kendall
(plot 1) | 0.52 | 0.512 | 0.355
(plot 2) | -0.05 | -0.087 | -0.073
(plot 3) | 0.30 | -0.023 | -0.014
Robust calculation corrects outlier problems, but nothing measures the non-linear association.
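Both halves of this statement can be illustrated with a small simulation. The data below are made up for the purpose: 50 unrelated points plus one extreme outlier, and a symmetric parabola on integer `x` values. The helper `all_cors` is a hypothetical convenience wrapper around `cor()`.

```r
# Hypothetical helper: the three measures side by side
all_cors <- function(x, y) {
  sapply(c(pearson = "pearson", spearman = "spearman", kendall = "kendall"),
         function(m) cor(x, y, method = m))
}

# 1. No association, plus a single extreme outlier
set.seed(2024)
x_out <- c(rnorm(50), 20)
y_out <- c(rnorm(50), 20)
outlier_cors <- all_cors(x_out, y_out)
outlier_cors      # Pearson is inflated by the one point; rank-based measures much less so

# 2. Perfect but nonlinear (and non-monotone) association
x_par <- -10:10
y_par <- x_par^2
parabola_cors <- all_cors(x_par, y_par)
parabola_cors     # all three are essentially zero despite the exact relationship
```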
Transformations
for skewness, heteroskedasticity and linearising relationships, and to emphasize association
Circle of transformations for linearising
Remember the power ladder:
-1, 0, 1/3, 1/2, 1, 2, 3, 4
Look at the shape of the relationship.
Imagine this to be a number plane, and depending on which quadrant the shape falls in, you either transform \(x\) or \(y\), up or down the ladder: +,+ both up; +,- x up, y down; -,- both down; -,+ x down, y up
If there is heteroskedasticity, try transforming \(y\); this may or may not help.
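As a rough illustration of moving down the ladder, the sketch below simulates a curved relationship and compares the linear correlation before and after transforming \(y\). The helper `ladder` is hypothetical, and reading the power 0 as the log transformation is the usual convention, assumed here rather than stated above.

```r
# Hypothetical helper: move a positive variable along the power ladder;
# power 0 is taken to mean the log transformation
ladder <- function(v, power) {
  if (power == 0) log(v) else v^power
}

set.seed(7)
x <- runif(200, min = 1, max = 10)
# Simulated data: y curves upward and its spread grows with x
y <- exp(0.8 * x + rnorm(200, sd = 0.3))

cor_raw <- cor(x, y)              # linear correlation on the raw scale
cor_log <- cor(x, ladder(y, 0))   # after moving y down the ladder to log

c(raw = cor_raw, log = cor_log)
```

Here the log of \(y\) also evens out the spread, but as noted above that will not happen for every data set.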
Scatterplot case studies
Case study: Soils (1/4)
Interplay between skewness and association
Data is from a soil chemical analysis of a farm field in Iowa. Is there a relationship between Yield and Boron?
You can get a marginal plot of each variable added to the scatterplot using ggMarginal. This is useful for assessing the skewness in each variable.
Boron is right-skewed; Yield is left-skewed. With skewed distributions in marginal variables it is hard to assess the relationship between the two. Make a transformation to fix the skewness first.
Case study: Soils (2/4)
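library(ggplot2)
library(ggExtra)  # provides ggMarginal()

# Square yield (left-skewed) and log-scale boron (right-skewed) to reduce skewness
p <- ggplot(baker, aes(x = B, y = Corn97BU^2)) +
  geom_point() +
  xlab("log Boron (ppm)") +
  ylab("Corn Yield^2 (bushels)") +
  scale_x_log10()
ggMarginal(p, type = "density")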
Case study: Soils (3/4)
Lurking variable?
# Overlay 2D density contours to show where the points concentrate
p <- ggplot(baker, aes(x = Fe, y = Corn97BU^2)) +
  geom_density2d(colour = "orange") +
  geom_point() +
  xlab("Iron (ppm)") +
  ylab("Corn Yield^2 (bushels)")
ggMarginal(p, type = "density")
If calcium levels in the soil are high, yield is consistently high. If calcium levels are low, then there is a positive relationship between yield and iron, with higher iron leading to higher yields.
In a bubble plot, the size of each point is mapped to another variable.
This bubble plot here shows total count of COVID-19 incidence (as of Aug 30, 2020) for every county in the USA, inspired by the New York Times coverage.
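library(ggplot2)   # map_data() also needs the maps package installed
library(here)
library(ggthemes)  # assumed source of theme_map()

load(here("data/nyt_covid.rda"))
usa <- map_data("state")
ggplot() +
  geom_polygon(
    data = usa,
    aes(x = long, y = lat, group = group),
    fill = "grey90", colour = "white"
  ) +
  # open circles outline each county's bubble
  geom_point(
    data = nyt_county_total,
    aes(x = lon, y = lat, size = cases),
    colour = "red", shape = 1
  ) +
  # translucent solid circles shade the bubbles
  geom_point(
    data = nyt_county_total,
    aes(x = lon, y = lat, size = cases),
    colour = "red", fill = "red", alpha = 0.1, shape = 16
  ) +
  scale_size("", range = c(1, 30)) +
  theme_map() +
  theme(legend.position = "none")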
Scales matter
Where has COVID-19 hit the hardest?
Where are there more people?
This plot tells you NOTHING except where the population centres are in the USA.
To understand relative incidence/risk, report COVID numbers relative to the population. For example, number of cases per 100,000 people.
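The rescaling itself is simple arithmetic. The sketch below uses three made-up counties (the names, case counts and populations are all hypothetical) to show how the ranking can reverse once counts are expressed per 100,000 people.

```r
# Hypothetical counts for three made-up counties
counties <- data.frame(
  county     = c("A", "B", "C"),
  cases      = c(50000, 5000, 500),
  population = c(10000000, 250000, 10000)
)

# Cases per 100,000 people
counties$cases_per_100k <- counties$cases / counties$population * 1e5

# County A has the most cases but the lowest rate
counties[order(-counties$cases_per_100k), ]
```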
Beyond quantitative variables
When variables are not quantitative
What do you do if the variables are not continuous/quantitative?
Type of variable determines the appropriate mapping.
Continuous and categorical: side-by-side boxplots, side-by-side density plots
Both categorical: faceted bar charts, stacked bar charts, mosaic plots, double decker plots
Stay tuned!
Paradoxes
Simpson’s paradox
If there is an additional variable which, when used for conditioning, changes the association between the variables, you have a paradox.
Simpson’s paradox: famous example (1/2)
Did Berkeley discriminate against female applicants?
Example from Unwin (2015)
Simpson’s paradox: famous example (2/2)
Based on separately examining each department, there is no evidence of discrimination against female applicants.
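The 1973 Berkeley admissions counts are available in base R as the three-way table `UCBAdmissions` (Admit by Gender by Dept). Assuming these are the same counts used in the example, the following sketch compares the overall admission rates with the rates within each department; the object names are illustrative.

```r
# Overall admission rate by gender, ignoring department
overall <- apply(UCBAdmissions, c(1, 2), sum)          # Admit x Gender
overall_rate <- overall["Admitted", ] / colSums(overall)
round(overall_rate, 2)

# Admission rate by gender within each department
applicants <- apply(UCBAdmissions, c(2, 3), sum)       # Gender x Dept
dept_rate <- UCBAdmissions["Admitted", , ] / applicants
round(dept_rate, 2)
```

Aggregated over departments the male rate is clearly higher, while within most departments the female rate is similar or higher: conditioning on department changes the story.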
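library(ggplot2)
library(nullabor)  # provides lineup() and null_permute()

# Hide the real data among 11 null plots made by permuting Corn97BU
ggplot(lineup(null_permute("Corn97BU"), baker, n = 12),
       aes(x = B, y = Corn97BU)) +
  geom_point() +
  facet_wrap(~.sample, ncol = 4)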
11 of the panels have had the association broken by permuting one variable. There is no association in these data sets, and hence plots. Does the data plot stand out as being different from the null (no association) plots?
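The same permutation idea can be sketched numerically in base R. This is not the visual lineup procedure itself, only a numerical analogue on simulated data: shuffling one variable breaks any association, so the correlations from many shuffles show what "no association" looks like for a sample of this size.

```r
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)   # simulated data with a real association

observed <- cor(x, y)

# Each null sample keeps x fixed and shuffles y, breaking the pairing
null_cors <- replicate(1000, cor(x, sample(y)))

# Proportion of null samples at least as extreme as the observed correlation
p_value <- mean(abs(null_cors) >= abs(observed))
c(observed = observed, p_value = p_value)
```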
library(dplyr)
library(forcats)
library(ggplot2)
library(nullabor)

data(oly12, package = "VGAMdata")
oly12_sub <- oly12 |>
  filter(Sport %in% c("Swimming", "Archery", "Hockey", "Tennis")) |>
  filter(Sex == "F") |>
  mutate(Sport = fct_drop(Sport), Sex = fct_drop(Sex))

# Permuting Sport removes any difference in the height-weight association across sports
ggplot(lineup(null_permute("Sport"), oly12_sub, n = 12),
       aes(x = Height, y = Weight, colour = Sport)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_brewer("", palette = "Dark2") +
  facet_wrap(~.sample, ncol = 4) +
  theme(legend.position = "none")
11 of the panels have had the association broken by permuting the Sport label. There is no difference in the association between weight and height across sports in these data sets, and hence plots. Does the data plot stand out as being different from the null (no association difference between sports) plots?
Handling and imputing missings (1/2)
Check if missings on one variable are related to distribution of the other variable.
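A minimal base R sketch of this check, using the built-in `airquality` data as a stand-in example (it is not a data set from these slides): split one variable by whether the other is missing and compare the two distributions.

```r
# airquality ships with base R; Ozone has missing values, Temp does not
ozone_missing <- is.na(airquality$Ozone)
table(ozone_missing)

# Compare the distribution of Temp when Ozone is missing vs observed
tapply(airquality$Temp, ozone_missing, summary)

# The same comparison as side-by-side boxplots
boxplot(airquality$Temp ~ ozone_missing,
        xlab = "Ozone missing?", ylab = "Temp")
```

If the two distributions look clearly different, the missings are related to the other variable and should not be treated as missing completely at random.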
Wilke (2019) Fundamentals of Data Visualization https://clauswilke.com/dataviz/
Friendly and Denis “Milestones in History of Thematic Cartography, Statistical Graphics and Data Visualisation” available at http://www.datavis.ca/milestones/
Tierney et al (2023) Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.