Week 2: Learning from history
Department of Econometrics and Business Statistics
The field of exploratory data analysis came of age when this book appeared in 1977.
Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test.
Watching this history is possible with the American Statistical Association video lending library.
We’re going to watch John Tukey talking about exploring high-dimensional data with an amazing new computer in 1973, four years before the EDA book.
Look out for these things:
How Tukey’s expertise is described, the trial-and-error learning, and the computing equipment.
Excerpt from the introduction
This book is based on an important principle.
It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.
Learning first what you can do will help you to work more easily and effectively.
This book is about exploratory data analysis, about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. Its concern is with appearance, not with confirmation.
Examples, NOT case histories
The book does not exist to make the case that exploratory data analysis is useful. Rather it exists to expose its readers and users to a considerable variety of techniques for looking more effectively at one’s data. The examples are not intended to be complete case histories. Rather they show isolated techniques in action on real data. The emphasis is on general techniques, rather than specific problems.
A basic problem about any body of data is to make it more easily and effectively handleable by minds – our minds, her mind, his mind. To this general end:
So we shall always be glad (a) to simplify description and (b) to describe one layer deeper. In particular,
…
Consistent with this view, we believe, is a clear demand that pictures based on exploration of data should force their messages upon us. Pictures that emphasize what we already know–“security blankets” to reassure us–are frequently not worth the space they take. Pictures that have to be gone over with a reading glass to see the main point are wasteful of time and inadequate of effect. The greatest value of a picture is when it forces us to notice what we never expected to see.
The principles and procedures of what we call confirmatory data analysis are both widely used and one of the great intellectual products of our century. In their simplest form, these principles and procedures look at a sample–and at what that sample has told us about the population from which it came–and assess the precision with which our inference from sample to population is made. We can no longer get along without confirmatory data analysis. But we need not start with it.
The best way to understand what CAN be done is no longer–if it ever was–to ask what things could, in the current state of our skill and technique, be confirmed (positively or negatively). Even more understanding is lost if we consider each thing we can do to data only in terms of some set of very restrictive assumptions under which that thing is best possible–assumptions we know we CANNOT check in practice.
Once upon a time, statisticians only explored. Then they learned to confirm exactly–to confirm a few things exactly, each under very specific circumstances. As they emphasized exact confirmation, their techniques inevitably became less flexible. The connection of the most used techniques with past insights was weakened. Anything to which confirmatory procedure was not explicitly attached was decried as “mere descriptive statistics”, no matter how much we learned from it.
Today, the flexibility of (approximate) confirmation by the jackknife makes it relatively easy to ask, for almost any clearly specified exploration, “How far is it confirmed?”
Today, exploratory and confirmatory can–and should–proceed side by side. This book, of course, considers only exploratory techniques, leaving confirmatory techniques to other accounts.
The teacher needs to be careful about assigning problems. Not too many, please. They are likely to take longer than you think. The number supplied is to accommodate diversity of interest, not to keep everybody busy.
Besides the length of our problems, both teacher and student need to realise that many problems do not have a single “right answer”. There can be many ways to approach a body of data. Not all are equally good. For some bodies of data this may be clear, but for others we may not be able to tell from a single body of data which approach is preferred. Even several bodies of data about very similar situations may not be enough to show which approach should be preferred. Accordingly, it will often be quite reasonable for different analysts to reach somewhat different analyses.
Yet more–to unlock the analysis of a body of data, to find the good way to approach it, may require a key, whose finding is a creative act. Not everyone can be expected to create the key to any one situation. And to continue to paraphrase Barnum, no one can be expected to create a key to each situation he or she meets.
To learn about data analysis, it is right that each of us try many things that do not work–that we tackle more problems than we make expert analyses of. We often learn less from an expertly done analysis than from one where, by not trying something, we missed–at least until we were told about it–an opportunity to learn more. Each teacher needs to recognize this in grading and commenting on problems.
The teacher who heeds these words and admits that there need be no one correct approach may, I regret to contemplate, still want whatever is done to be digit perfect. (Under such a requirement, the writer should still be able to pass the course, but it is not clear whether she would get an “A”.) One does, from time to time, have to produce digit-perfect, carefully checked results, but forgiving techniques that are not too disturbed by unusual data are also, usually, little disturbed by SMALL arithmetic errors. The techniques we discuss here have been chosen to be forgiving. It is hoped, then, that small arithmetic errors will take little off the problem’s grade, leaving severe penalties for larger errors, either of arithmetic or concept.
First stem-and-leaf, first digit on stem, second digit on leaf
Order any leaves which need it, e.g. stem 6
A benefit is that the numbers can be read off the plot, but the focus is still on the pattern. Also quantiles, like the median, can be computed easily.
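A minimal sketch of building such a display in base R, using just the first line of the data vector printed further down (the exact data and code behind the slide are not shown, so this is an assumption). The `scale` argument of `stem()` is what shrinks or stretches the stems in the slides that follow.

```r
# Sketch: stem-and-leaf in base R (assumed code, not the slide's own).
# First 15 values of the data vector printed below.
x <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23,
       1.71, 5.00, 1.57, 3.00, 3.02)

stem(x)               # default: integer part on the stem, tenths on the leaf
stem(x, scale = 0.5)  # "shrink the stem": fewer, coarser stems
stem(x, scale = 2)    # stretch the stem: each stem split across more lines
```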
Shrink the stem
Shrink the stem more
[1] 1.01 1.66 3.50 3.31 3.61 4.71 2.00 3.12 1.96 3.23 1.71 5.00 1.57 3.00 3.02
[16] 3.92 1.67 3.71 3.50 3.35 4.08 2.75 2.23 7.58 3.18 2.34 2.00 2.00 4.30 3.00
[31] 1.45 2.50 3.00 2.45 3.27 3.60 2.00 3.07 2.31 5.00 2.24 2.54 3.06 1.32 5.60
[46] 3.00 5.00 6.00 2.05 3.00
The decimal point is at the |
1 | 000001233334445555555555556666667777788889
2 | 000000000000000000000000000000000000000001122222223333555555555555556666677788899
3 | 00000000000000000000000011111112222222333344445555555555555666778889
4 | 0000000000001112233335777
5 | 00000000001122226799
6 | 05577
7 | 6
8 |
9 | 0
10 | 0
Five digits per stem
What is the number in parentheses? And why might this be useful?
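Base R’s `stem()` does not print depth counts, so a display with a number in parentheses most likely comes from another tool. One possibility (an assumption, not necessarily what produced the slide) is `aplpack::stem.leaf()`, which prints cumulative depths down the left and puts the count of the line containing the median in parentheses; its `m` argument controls how many lines each stem is split into.

```r
# Sketch using the aplpack package (assumed; install.packages("aplpack") first)
library(aplpack)

# Same 15 values as the earlier stem() sketch
x <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23,
       1.71, 5.00, 1.57, 3.00, 3.02)

stem.leaf(x, m = 2)  # each stem split into two lines, depths on the left
stem.leaf(x, m = 5)  # each stem split into five lines
```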
Two digits per stem
The decimal point is 1 digit(s) to the left of the |
10 | 0000107
12 | 55526
14 | 44578000000000678
16 | 1346781356
18 | 032678
20 | 00000000000000000000000000000000011233598
22 | 0033440114
24 | 5700000000002456
26 | 01412455
28 | 382
30 | 00000000000000000000000267891245688
32 | 133557159
34 | 0188800000000015
36 | 0181566
38 | 2
40 | 0000000000006889
42 | 09004
44 | 0
46 | 713
48 |
50 | 000000000074567
52 | 0
54 |
56 | 05
58 | 52
60 | 0
62 |
64 | 00
66 | 03
68 |
70 |
72 |
74 | 8
76 |
78 |
80 |
82 |
84 |
86 |
88 |
90 | 0
92 |
94 |
96 |
98 |
100 | 0
Why no number in parentheses?
for categorical variables
We know about
but it’s too easy to
make a mistake
Is this easier?
or harder?
[1] "F" "M" "M" "M" "F" "M"
[7] "M" "M" "M" "M" "M" "F"
[13] "M" "M" "F" "M" "F" "M"
[19] "F" "M" "M" "F" "F" "M"
[25] "M" "M" "M" "M" "M" "F"
[31] "M" "M" "F" "F" "M" "M"
[37] "M" "F" "M" "M" "M" "M"
[43] "M" "M" "M" "M" "M" "M"
[49] "M" "M" "M" "F" "F" "M"
[55] "M" "M" "M" "F" "M" "M"
[61] "M" "M" "M" "M" "M" "M"
[67] "F" "F" "M" "M" "M" "F"
This is a stem-and-leaf plot of the height of the highest peak in each of the 50 US states.
The states roughly fall into three groups.
It’s not really surprising, but we can imagine this grouping. Alaska is in a group of its own, with a much higher high peak. Then the Rocky Mountain states, California, Washington and Hawaii also have high peaks, and the rest of the states lump together.
[1] -3.2 -1.7 -0.4 0.1
[5] 0.3 1.2 1.5 1.8
[9] 2.4 3.0 4.3 6.4
[13] 9.8
You know the median is the middle number. What’s a hinge?
There are 13 data values here, provided already sorted. We are going to write them out evenly into what Tukey named a down-up-down-up pattern.
Median will be 7th, hinge will be 4th from each end.
Hinges are almost always the same as Q1 and Q3
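We can check this on the 13 values above. In base R, `fivenum()` returns Tukey’s five-number summary (minimum, lower hinge, median, upper hinge, maximum), and here the hinges coincide with the quartiles from `quantile()`.

```r
# The 13 sorted values printed above
x <- c(-3.2, -1.7, -0.4, 0.1, 0.3, 1.2, 1.5, 1.8, 2.4, 3.0, 4.3, 6.4, 9.8)

fivenum(x)   # min, lower hinge, median, upper hinge, max
#> [1] -3.2  0.1  1.5  3.0  9.8   (median = 7th value, hinges = 4th from each end)

quantile(x, c(0.25, 0.50, 0.75))  # Q1, median, Q3 -- the same values here
```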
Starting with a 5-number summary
Why are some individual points singled out?
The rules for this one may be clearer.
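The individual points come from Tukey’s fence rule, the usual boxplot convention: anything more than 1.5 times the hinge spread beyond a hinge is drawn on its own. A minimal sketch with the same 13 values:

```r
# Same 13 values as above
x <- c(-3.2, -1.7, -0.4, 0.1, 0.3, 1.2, 1.5, 1.8, 2.4, 3.0, 4.3, 6.4, 9.8)

h <- fivenum(x)                       # min, lower hinge, median, upper hinge, max
step <- 1.5 * (h[4] - h[2])           # 1.5 times the hinge spread (roughly the IQR)
fences <- c(h[2] - step, h[4] + step)

x[x < fences[1] | x > fences[2]]      # points beyond the fences are singled out
#> [1] 9.8
```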
Another Tukey wisdom drop
The number that comes closest to
\[\frac{\text{lower hinge} + 2\times \text{median} + \text{upper hinge}}{4}\] is the trimean.
Think about trimmed means, where we might drop the highest and lowest 5% of observations.
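A minimal sketch of both measures on the same 13 values; a 10% trim is used instead of 5% only because 5% of 13 observations rounds down to zero.

```r
# Same 13 values as above
x <- c(-3.2, -1.7, -0.4, 0.1, 0.3, 1.2, 1.5, 1.8, 2.4, 3.0, 4.3, 6.4, 9.8)

h <- fivenum(x)
trimean <- (h[2] + 2 * h[3] + h[4]) / 4  # (lower hinge + 2*median + upper hinge) / 4

mean(x)              # ordinary mean, pulled towards the extremes
trimean              # resistant summary of the centre
mean(x, trim = 0.1)  # trimmed mean: drops the lowest and highest 10% (one value each end)
```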
- 🐈🐩 Mostly used to compare distributions, multiple subsets of the data.
What do you need to know about logs?
(This means that wherever you find people using products or ratios–even in such things as price indexes–using logs–thus converting products to sums and ratios to differences–is likely to help.)
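A tiny check of that claim, with arbitrary numbers that are not from the slides:

```r
a <- 12.5
b <- 0.8   # any positive numbers

all.equal(log(a * b), log(a) + log(b))  # products become sums
all.equal(log(a / b), log(a) - log(b))  # ratios become differences
```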
The most common transformations are logs, square roots, reciprocals, and reciprocals of square roots.
What happened to ZERO?
It turns out that the role of a zero power is, for the purposes of re-expression, neatly solved by the logarithm.
⬅️ fix RIGHT-skewed values
-2, -1, -1/2, 0 (log), 1/3, 1/2, 1, 2, 3, 4
fix LEFT-skewed values ➡️
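A minimal sketch of walking a right-skewed variable down this ladder, using simulated data (the data and the plotting choices are assumptions, not from the slides):

```r
# Simulated right-skewed data, for illustration only
set.seed(1)
y <- rlnorm(500, meanlog = 0, sdlog = 1)

op <- par(mfrow = c(1, 4))
hist(y,       main = "raw (power 1)")
hist(sqrt(y), main = "square root (power 1/2)")
hist(log(y),  main = "log (the zero rung)")
hist(-1 / y,  main = "reciprocal (power -1, negated to keep order)")
par(op)
```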
Another Tukey wisdom drop
Surprising observation: The small fluctuations in later years.
What might be possible reasons?
There are some fluctuations in the early years, too. Note that the log transformation couldn’t linearise the series.
The book is a digest of 🌟 tricks and treats 🌟 of massaging numbers and drafting displays.
Many of the tools have made it into today’s analyses in various ways. Many have not.
Notice the word developments too: froots, fences. Tukey brought you the word “software”
The temperament of the book is an inspiration for the mind-set for this unit. There is such delight in working with numbers!
“We love data!”
tidyverse
suite of R packages
ETC5521 Lecture 2 | ddde.numbat.space