ETC5521: Diving Deeply into Data Exploration

Week 2: Learning from history

Professor Di Cook

Department of Econometrics and Business Statistics

Birth of EDA

The field of exploratory data analysis came of age when this book appeared in 1977.


Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test.

John W. Tukey




Image source: wikimedia.org

  • Born in 1915, in New Bedford, Massachusetts.
  • Mum was a private tutor who home-schooled John. Dad was a Latin teacher.
  • BA and MSc in Chemistry, and PhD in Mathematics
  • Awarded the National Medal of Science in 1973, by President Nixon
  • By some reports, his home-schooling was unorthodox and contributed to his thinking and working differently.

Taking a glimpse back in time

is possible with the American Statistical Association video lending library.


We’re going to watch John Tukey talking about exploring high-dimensional data with an amazing new computer in 1973, four years before the EDA book.

Look out for these things:

how Tukey’s expertise is described, the emphasis on trial-and-error learning, and the computing equipment.

First 4.25 minutes

Setting the frame of mind

Excerpt from the introduction

This book is based on an important principle.


It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.


Learning first what you can do will help you to work more easily and effectively.


This book is about exploratory data analysis, about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. Its concern is with appearance, not with confirmation.


Examples, NOT case histories


The book does not exist to make the case that exploratory data analysis is useful. Rather it exists to expose its readers and users to a considerable variety of techniques for looking more effectively at one’s data. The examples are not intended to be complete case histories. Rather they show isolated techniques in action on real data. The emphasis is on general techniques, rather than specific problems.

A basic problem about any body of data is to make it more easily and effectively handleable by minds – our minds, her mind, his mind. To this general end:

  • anything that makes a simpler description possible makes the description more easily handleable.
  • anything that looks below the previously described surface makes the description more effective.


So we shall always be glad (a) to simplify description and (b) to describe one layer deeper. In particular,

  • to be able to say that we looked one layer deeper, and found nothing, is a definite step forward – though not as far as to be able to say that we looked deeper and found thus-and-such.
  • to be able to say that “if we change our point of view in the following way … things are simpler” is always a gain–though not quite so much as to be able to say “if we don’t bother to change our point of view (some other) things are equally simple.”



Consistent with this view, we believe, is a clear demand that pictures based on exploration of data should force their messages upon us. Pictures that emphasize what we already know–“security blankets” to reassure us–are frequently not worth the space they take. Pictures that have to be gone over with a reading glass to see the main point are wasteful of time and inadequate of effect. The greatest value of a picture is when it forces us to notice what we never expected to see.


Confirmation


The principles and procedures of what we call confirmatory data analysis are both widely used and one of the great intellectual products of our century. In their simplest form, these principles and procedures look at a sample–and at what that sample has told us about the population from which it came–and assess the precision with which our inference from sample to population is made. We can no longer get along without confirmatory data analysis. But we need not start with it.


The best way to understand what CAN be done is no longer–if it ever was–to ask what things could, in the current state of our skills and techniques, be confirmed (positively or negatively). Even more understanding is lost if we consider each thing we can do to data only in terms of some set of very restrictive assumptions under which that thing is best possible–assumptions we know we CANNOT check in practice.

Exploration AND confirmation

Once upon a time, statisticians only explored. Then they learned to confirm exactly–to confirm a few things exactly, each under very specific circumstances. As they emphasized exact confirmation, their techniques inevitably became less flexible. The connection of the most used techniques with past insights was weakened. Anything to which confirmatory procedure was not explicitly attached was decried as “mere descriptive statistics”, no matter how much we learned from it.


Today, the flexibility of (approximate) confirmation by the jackknife makes it relatively easy to ask, for almost any clearly specified exploration, “How far is it confirmed?”


Today, exploratory and confirmatory can–and should–proceed side by side. This book, of course, considers only exploratory techniques, leaving confirmatory techniques to other accounts.


About the problems


The teacher needs to be careful about assigning problems. Not too many, please. They are likely to take longer than you think. The number supplied is to accommodate diversity of interest, not to keep everybody busy.


Besides the length of our problems, both teacher and student need to realise that many problems do not have a single “right answer”. There can be many ways to approach a body of data. Not all are equally good. For some bodies of data this may be clear, but for others we may not be able to tell from a single body of data which approach is preferred. Even several bodies of data about very similar situations may not be enough to show which approach should be preferred. Accordingly, it will often be quite reasonable for different analysts to reach somewhat different analyses.


Yet more–to unlock the analysis of a body of data, to find the good way to approach it, may require a key, whose finding is a creative act. Not everyone can be expected to create the key to any one situation. And to continue to paraphrase Barnum, no one can be expected to create a key to each situation he or she meets.


To learn about data analysis, it is right that each of us try many things that do not work–that we tackle more problems than we make expert analyses of. We often learn less from an expertly done analysis than from one where, by not trying something, we missed–at least until we were told about it–an opportunity to learn more. Each teacher needs to recognize this in grading and commenting on problems.


Precision

The teacher who heeds these words and admits that there need be no one correct approach may, I regret to contemplate, still want whatever is done to be digit perfect. (Under such a requirement, the writer should still be able to pass the course, but it is not clear whether she would get an “A”.) One does, from time to time, have to produce digit-perfect, carefully checked results, but forgiving techniques that are not too disturbed by unusual data are also, usually, little disturbed by SMALL arithmetic errors. The techniques we discuss here have been chosen to be forgiving. It is hoped, then, that small arithmetic errors will take little off the problem’s grades, leaving severe penalties for larger errors, either of arithmetic or concept.

Outline

  1. Scratching down numbers
  2. Schematic summary
  3. Easy re-expression
  4. Effective comparison
  5. Plots of relationship
  6. Straightening out plots (using three points)
  7. Smoothing sequences
  8. Parallel and wandering schematic plots
  9. Delineations of batches of points
  10. Using two-way analyses
  11. Making two-way analyses
  12. Advanced fits
  13. Three-way fits
  14. Looking in two or more ways at batches of points
  15. Counted fractions
  16. Better smoothing
  17. Counts in bin after bin
  18. Product-ratio plots
  19. Shapes of distributions
  20. Mathematical distributions

Looking at numbers with Tukey

Scratching down numbers

Prices of Chevrolets in the local used car newspaper ads of 1968.

library(tibble)
options(width=20)
# Chevrolet prices from the 1968 used car ads
chevrolets <- tibble(
  prices = c(250, 150, 795, 895, 695, 
             1699, 1499, 1099, 1693, 1166,
             688, 1333, 895, 1775, 895,
             1895, 795))

Stem-and-leaf plot: still seen in introductory statistics texts

First stem-and-leaf, first digit on stem, second digit on leaf

Order any leaves which need it, e.g. stem 6

A benefit is that the numbers can be read off the plot, but the focus is still on the pattern. Also, quantiles like the median can be computed easily.
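For example, the median of the 17 Chevrolet prices is the 9th value when read off the ordered leaves; a quick check in R:

median(chevrolets$prices)
[1] 895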

Shrink the stem

Shrink the stem more

And, in R …

chevrolets$prices
 [1]  250  150  795
 [4]  895  695 1699
 [7] 1499 1099 1693
[10] 1166  688 1333
[13]  895 1775  895
[16] 1895  795
stem(chevrolets$prices)

  The decimal point is 3 digit(s) to the right of the |

  0 | 23
  0 | 7788999
  1 | 123
  1 | 57789

🔖 Remember the tips data

 [1] 1.01 1.66 3.50 3.31 3.61 4.71 2.00 3.12 1.96 3.23 1.71 5.00 1.57 3.00 3.02
[16] 3.92 1.67 3.71 3.50 3.35 4.08 2.75 2.23 7.58 3.18 2.34 2.00 2.00 4.30 3.00
[31] 1.45 2.50 3.00 2.45 3.27 3.60 2.00 3.07 2.31 5.00 2.24 2.54 3.06 1.32 5.60
[46] 3.00 5.00 6.00 2.05 3.00
stem(tips$tip, scale=0.5, width=120)

  The decimal point is at the |

   1 | 000001233334445555555555556666667777788889
   2 | 000000000000000000000000000000000000000001122222223333555555555555556666677788899
   3 | 00000000000000000000000011111112222222333344445555555555555666778889
   4 | 0000000000001112233335777
   5 | 00000000001122226799
   6 | 05577
   7 | 6
   8 | 
   9 | 0
  10 | 0

Refining the size

Five digits per stem


What is the number in parentheses? And why might this be useful?

Two digits per stem

stem(tips$tip, scale=2)

  The decimal point is 1 digit(s) to the left of the |

   10 | 0000107
   12 | 55526
   14 | 44578000000000678
   16 | 1346781356
   18 | 032678
   20 | 00000000000000000000000000000000011233598
   22 | 0033440114
   24 | 5700000000002456
   26 | 01412455
   28 | 382
   30 | 00000000000000000000000267891245688
   32 | 133557159
   34 | 0188800000000015
   36 | 0181566
   38 | 2
   40 | 0000000000006889
   42 | 09004
   44 | 0
   46 | 713
   48 | 
   50 | 000000000074567
   52 | 0
   54 | 
   56 | 05
   58 | 52
   60 | 0
   62 | 
   64 | 00
   66 | 03
   68 | 
   70 | 
   72 | 
   74 | 8
   76 | 
   78 | 
   80 | 
   82 | 
   84 | 
   86 | 
   88 | 
   90 | 0
   92 | 
   94 | 
   96 | 
   98 | 
  100 | 0

Why no number in parentheses?

median(tips$tip)
[1] 2.9

Summary

  • Stem-and-leaf plots show similar information to a histogram.
  • Generally it is also possible to read off the numbers, and then easily calculate the median, Q1 or Q3.
  • It’s great for small data sets, when you only have pencil and paper.
  • Alternatives are a histogram, (jittered) dotplot, density plot, box plot, violin plot, letter value plot.

a different style of number scratching

for categorical variables

We know about the usual way of tallying,

but it’s too easy to make a mistake.

Is this easier, or harder?

Count this data using the squares approach.

 [1] "F" "M" "M" "M" "F" "M"
 [7] "M" "M" "M" "M" "M" "F"
[13] "M" "M" "F" "M" "F" "M"
[19] "F" "M" "M" "F" "F" "M"
[25] "M" "M" "M" "M" "M" "F"
[31] "M" "M" "F" "F" "M" "M"
[37] "M" "F" "M" "M" "M" "M"
[43] "M" "M" "M" "M" "M" "M"
[49] "M" "M" "M" "F" "F" "M"
[55] "M" "M" "M" "F" "M" "M"
[61] "M" "M" "M" "M" "M" "M"
[67] "F" "F" "M" "M" "M" "F"

What does it mean to “feel what the data are like?”

This is a stem-and-leaf plot of the heights of the highest peaks in each of the 50 US states.


The states roughly fall into three groups.


It’s not really surprising, but we can imagine this grouping. Alaska is in a group of its own, with a much higher high peak. Then the Rocky Mountain states, California, Washington and Hawaii also have high peaks, and the rest of the states lump together.

Exploratory data analysis is detective work – in the purest sense – finding and revealing the clues.

More summaries of numerical values

Hinges and 5-number summaries

 [1] -3.2 -1.7 -0.4  0.1
 [5]  0.3  1.2  1.5  1.8
 [9]  2.4  3.0  4.3  6.4
[13]  9.8

You know the median is the middle number. What’s a hinge?

There are 13 data values here, provided already sorted. We are going to write them down in what Tukey called a down-up-down-up pattern, spread evenly.

The median will be the 7th value; the hinges will be the 4th value in from each end.

Hinges and 5-number summary

Hinges are almost always the same as Q1 and Q3
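In R, fivenum() gives Tukey’s five-number summary (minimum, lower hinge, median, upper hinge, maximum). Comparing it with quantile() on the 13 values above shows how close the hinges are to Q1 and Q3 (this check is mine, not from the slides):

x <- c(-3.2, -1.7, -0.4, 0.1, 0.3, 1.2, 1.5, 1.8, 2.4, 3.0, 4.3, 6.4, 9.8)
fivenum(x)    # min, lower hinge, median, upper hinge, max
[1] -3.2  0.1  1.5  3.0  9.8
quantile(x)   # here Q1 and Q3 coincide with the hinges
  0%  25%  50%  75% 100% 
-3.2  0.1  1.5  3.0  9.8 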

box-and-whisker display

Starting with a 5-number summary

Identified end values

Why are some individual points singled out?

Are the rules for this one clearer?

Isn’t this imposing a belief?

There is no excuse for failing to plot and look

Another Tukey wisdom drop

Fences and outside values

  • H-spread: difference between the hinges (we would call this Inter-Quartile Range)
  • step: 1.5 times H-spread
  • inner fences: 1 step outside the hinges
  • outer fences: 2 steps outside the hinges
  • the value at each end closest to, but still inside, the inner fence is “adjacent”
  • values between an inner fence and its neighbouring outer fence are “outside”
  • values beyond outer fences are “far out”
  • these rules produce a SCHEMATIC PLOT
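A minimal R sketch of these rules, applied to the Chevrolet prices from earlier (the helper objects are my own names, not Tukey’s):

hinges   <- fivenum(chevrolets$prices)[c(2, 4)]  # lower and upper hinges
h_spread <- diff(hinges)                         # H-spread (the IQR, near enough)
step     <- 1.5 * h_spread
inner    <- hinges + c(-1, 1) * step             # inner fences
outer    <- hinges + c(-1, 1) * 2 * step         # outer fences
# "outside" values lie beyond the inner fences, "far out" beyond the outer
# fences (this particular data set turns out to have none)
chevrolets$prices[chevrolets$prices < inner[1] | chevrolets$prices > inner[2]]
chevrolets$prices[chevrolets$prices < outer[1] | chevrolets$prices > outer[2]]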

New statistics: trimeans

The number that comes closest to

\[\frac{\text{lower hinge} + 2\times \text{median} + \text{upper hinge}}{4}\] is the trimean.



Think about trimmed means, where we might drop the highest and lowest 5% of observations.
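A small sketch of the trimean as an R function, built on fivenum() (my helper, not code from the book), applied to the 13 values above:

trimean <- function(x) {
  f <- fivenum(x)                  # min, lower hinge, median, upper hinge, max
  (f[2] + 2 * f[3] + f[4]) / 4
}
trimean(c(-3.2, -1.7, -0.4, 0.1, 0.3, 1.2, 1.5, 1.8, 2.4, 3.0, 4.3, 6.4, 9.8))
[1] 1.525

Like a trimmed mean, the trimean (1.525) sits closer to the median (1.5) than the ordinary mean (about 1.96), because the long right tail carries less weight.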

Letter value plots

Why break the data into quarters? Why not eighths, sixteenths? k-number summaries?

What does a 7-number summary look like?

How would you make an 11-number summary?

library(ggplot2)
library(lvplot)
# Letter value plot of highway mileage by vehicle class,
# with boxes shaded by letter value depth
p <- ggplot(mpg, 
            aes(class, hwy))
p + geom_lv(aes(fill=..LV..)) + 
  scale_fill_brewer() + 
  coord_flip() + 
  xlab("")

Box plots are ubiquitous in use today.



  • 🐈🐩 Mostly used to compare distributions across multiple subsets of the data.
  • Puts the emphasis on the middle 50% of observations, although variations can put emphasis on other aspects.

Easy re-expression

Logs, square roots, reciprocals

What do you need to know about logs?

  • how to find good enough logs fast and easily
  • that equal differences in logs correspond to equal ratios of raw values.

(This means that wherever you find people using products or ratios–even in such things as price indexes–using logs–thus converting products to sums and ratios to differences–is likely to help.)
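A quick numerical check of that claim (my example, not the book’s):

log(200 / 100)       # a 2:1 ratio ...
log(200) - log(100)  # ... becomes a difference of log(2), about 0.693
log(1000) - log(500) # every 2:1 ratio gives the same difference on the log scale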

The most common transformations are logs, square roots, reciprocals, and reciprocals of square roots

-1, -1/2, +1/2, +1

What happened to ZERO?

It turns out that the role of a zero power is, for the purposes of re-expression, neatly filled by the logarithm.

Re-express to symmetrize the distribution

Power ladder



⬅️ fix RIGHT-skewed values

-2, -1, -1/2, 0 (log), 1/3, 1/2, 1, 2, 3, 4


fix LEFT-skewed values ➡️
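A minimal sketch of walking down the power ladder on a right-skewed variable, assuming the tips data loaded earlier in the unit:

library(ggplot2)
ggplot(tips, aes(x = tip)) + geom_histogram()        # raw tips: right-skewed
ggplot(tips, aes(x = sqrt(tip))) + geom_histogram()  # power 1/2: skew reduced
ggplot(tips, aes(x = log(tip))) + geom_histogram()   # power 0 (log): close to symmetric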

We now regard re-expression as a tool, something to let us do a better job of grasping. The grasping is done with the eye and the better job is through a more symmetric appearance.

Another Tukey wisdom drop

Linearising bivariate relationships


Surprising observation: The small fluctuations in later years.

What might be possible reasons?

Linearising bivariate relationships


See some fluctuations in the early years, too. Note that the log transformation couldn’t fully linearise the relationship.
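The slide’s figures aren’t reproduced here; as an assumed stand-in, the US census population series uspop (shipped with base R) behaves the same way: the raw series curves upward, and the log scale straightens the early decades but bends in the later ones.

library(ggplot2)
us <- data.frame(year = seq(1790, 1970, by = 10),
                 pop  = as.numeric(uspop))      # US population (millions) at each census
ggplot(us, aes(year, pop)) + geom_point()       # curved: roughly exponential growth
ggplot(us, aes(year, log(pop))) + geom_point()  # straighter early on, flattens later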

Whatever the data, we can try to gain by straightening or by flattening.

When we succeed in doing one or both, we almost always see more clearly what is going on.

Rules and advice

  1. Graphics are friendly.
  2. Arithmetic often exists to make graphs possible.
  3. Graphs force us to notice the unexpected; nothing could be more important.
  4. Different graphs show us quite different aspects of the same data.
  5. There is no more reason to expect one graph to “tell all” than to expect one number to do the same.
  6. “Plotting \(y\) against \(x\)” involves significant choices–how we express one or both variables can be crucial.
  7. The first step in penetrating plotting is to straighten out the dependence or point scatter as much as reasonable.
  8. Plotting \(y^2\), \(\sqrt{y}\), \(\log(y)\), \(-1/y\) or the like instead of \(y\) is one plausible step to take in search of straightness.
  9. Plotting \(x^2\), \(\sqrt{x}\), \(\log(x)\), \(-1/x\) or the like instead of \(x\) is another.
  10. Once the plot is straightened, we can usually gain much by flattening it, usually by plotting residuals (see the sketch after this list).
  11. When plotting scatters, we may need to be careful about how we express \(x\) and \(y\) in order to avoid concealment by crowding.
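A hedged sketch of the “flattening” in rule 10, continuing the assumed uspop example: fit a line to the straightened (log) series and plot the residuals.

library(ggplot2)
us  <- data.frame(year = seq(1790, 1970, by = 10),
                  lpop = log(as.numeric(uspop)))  # straightened: log population
fit <- lm(lpop ~ year, data = us)                 # fit the straight line
us$res <- residuals(fit)
ggplot(us, aes(year, res)) +                      # flattened: residuals against year
  geom_point() + geom_hline(yintercept = 0)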




The book is a digest of 🌟 tricks and treats 🌟 of massaging numbers and drafting displays.

Many of the tools have made it into today’s analyses in various ways. Many have not.

Notice the new coinages, too: froots, fences. Tukey also gave us the word “software”.

The temperament of the book is an inspiration for the mind-set for this unit. There is such delight in working with numbers!

“We love data!”

Take-aways

  • Tukey’s approach was a reaction to many years of formalising data analysis using statistical hypothesis testing.
  • Methodology development in statistical testing was a reaction to the ad-hoc nature of data analysis.
  • Complex machine learning models like neural networks are in reaction to the inability of statistical models to capture highly non-linear relationships, and depend heavily on the data provided.
  • Exploring data today is in reaction to the need to explain complex models, and to support organisations against legal challenges to decisions made from those models.
  • It is much easier to accomplish with computers.
  • The term “exploratory data analysis”, as commonly used today, is unfortunately synonymous with “descriptive statistics”, but it is truly much more. Understanding its history, from Tukey’s advocacy, helps you see that it is the tooling to discover what you don’t know.

Resources