ETC5521 Worksheet Week 8

Going beyond two variables, exploring high dimensions

Author

Prof. Di Cook

Exercise 1: Parkinsons

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson’s disease (PD). Each column in the table is a particular voice measure, and each row corresponds one of 195 voice recording from these individuals (“name” column). The main aim of the data is to discriminate healthy people from those with PD, according to “status” column which is set to 0 for healthy and 1 for PD.

The data is available at The UCI Machine Learning Repository in ASCII CSV format. The rows of the CSV file contain an instance corresponding to one voice recording. There are around six recordings per patient, the name of the patient is identified in the first column. There are 24 variables in the file, including the persons name in column 1.

The data are originally analysed in: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), ‘Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease’, IEEE Transactions on Biomedical Engineering (to appear).

Code

library(cassowaryr)
# Load the data
data(pk)

How many pairwise plots would you need to look at, to look at all of them?

Solution

There are 23 numeric variables in the data set, which would require 253 pairwise plots to be made.

Compute several of the scagnostics (monotonic, outlying, clumpy2) for the first five variables of variables, except for name. (Note: We are using just five for computing speed, but the scagnostics could be calculated on all variables.)

Code

# Compute the scagnostics on the relevant variables
s <- calc_scags_wide(pk[,2:5],
                scags=c("outlying","monotonic",
                        "clumpy2"))
s

Solution

# A tibble: 6 × 5
  Var1           Var2         outlying clumpy2 monotonic
  <fct>          <fct>           <dbl>   <dbl>     <dbl>
1 MDVP:Fhi(Hz)   MDVP:Fo(Hz)     0.412   0.756    0.796 
2 MDVP:Flo(Hz)   MDVP:Fo(Hz)     0.172   0.541    0.324 
3 MDVP:Flo(Hz)   MDVP:Fhi(Hz)    0.336   0.671    0.0956
4 MDVP:Jitter(%) MDVP:Fo(Hz)     0.289   0        0.270 
5 MDVP:Jitter(%) MDVP:Fhi(Hz)    0.541   0.764    0.0978
6 MDVP:Jitter(%) MDVP:Flo(Hz)    0.430   0        0.407

Sort the scagnostics, separately by the values on (i) monotonic (ii) outlying (iii) clumpy2, and plot the pair of variables with the highest values on each.

Solution

Code

# Check the results for monotonic
s %>% 
  select(Var1, Var2, monotonic) %>% 
  arrange(desc(monotonic))

# A tibble: 6 × 3
  Var1           Var2         monotonic
  <fct>          <fct>            <dbl>
1 MDVP:Fhi(Hz)   MDVP:Fo(Hz)     0.796 
2 MDVP:Jitter(%) MDVP:Flo(Hz)    0.407 
3 MDVP:Flo(Hz)   MDVP:Fo(Hz)     0.324 
4 MDVP:Jitter(%) MDVP:Fo(Hz)     0.270 
5 MDVP:Jitter(%) MDVP:Fhi(Hz)    0.0978
6 MDVP:Flo(Hz)   MDVP:Fhi(Hz)    0.0956

Code

ggplot(data=pk, 
       aes(x=`MDVP:Fhi(Hz)`, y=`MDVP:Fo(Hz)`)) + 
  geom_point() + theme(aspect.ratio = 1)

The top pair of variables on monotonic has a strong positive association with the majority of points, and a few outliers. The top pair of variables on outlying, is also the top pair on clumpy2, and has outliers with some clumpiness in the mass of points with low values.

Make an interactive scatterplot matrix. Browse over it to choose other interesting pairs of variables and make the plots.

Solution

Code

# Create an interactive splom
s <- s %>%
  mutate(vars = paste(Var1, Var2))
highlight_key(s) %>%
  GGally::ggpairs(columns = 3:5, mapping = aes(label = vars)) %>%
  ggplotly(tooltip = "all", 
           width=500, 
           height=500) %>%
  highlight("plotly_selected")

The scagnostics help us to find interesting associations between pairs of variables. However, the problem here is to detect differences between Parkinsons’ patients and normal patients. How would you go about that? Think about some ideas long the line of scagnostics but look for differences between the two groups.

Solution

Code

# One way to examine difference between Parkinsons and healthy
pk_med <- pk %>% 
  select(-name) %>% 
  group_by(status) %>%
  summarise_all(list(median, sd)) %>%
  pivot_longer(
    cols=`MDVP:Fo(Hz)_fn1`:`PPE_fn2`,
    names_to="var", 
    values_to="value") %>%
  separate(var, c("var","stat"), "_") %>%
  mutate(stat = fct_recode(stat,
                           "m"="fn1",
                           "s"="fn2")) %>%
  pivot_wider(names_from=stat,
              values_from=value) %>%
  group_by(var) %>%
  summarise(
    d = (m[status==0]-m[status==1])/sqrt(s[status==0]^2+s[status==1]^2))
pk_med %>% arrange(desc(d)) %>% head()

# A tibble: 6 × 2
  var               d
  <chr>         <dbl>
1 MDVP:Fo(Hz)   0.870
2 HNR           0.647
3 MDVP:Fhi(Hz)  0.518
4 MDVP:Flo(Hz)  0.211
5 NHR          -0.243
6 MDVP:RAP     -0.356

Code

ggplot(pk, aes(x=factor(status), y=`MDVP:Fo(Hz)`)) + 
  geom_boxplot()

Generally we are looking for variables where the differences between the Parkinsons and normal patients are big. You need to measure big, relative to the variance of each group. Doing a two sample t-test for each variable is one approach. Here, I’ve computed the median for each group of patients and compared the difference in medians relative to the pooled standard deviation in each group.

Now try to do this using the scagnostics.

Solution

Code

# Check the pair of variables already examined
ggplot(data=pk, 
       aes(x=`MDVP:Fhi(Hz)`, 
           y=`MDVP:Fo(Hz)`, 
           colour=factor(status))) + 
  geom_point(alpha=0.5) + 
  scale_colour_discrete_divergingx(palette="Zissou 1") +
  theme(aspect.ratio = 1,
        legend.title = element_blank())

Code

# Compute scagnostics separately for each group and compare differences
pk_0 <- pk |>
  filter(status == 0)
pk_1 <- pk |>
  filter(status == 1)
s_0 <- calc_scags_wide(pk_0[,2:5],
                scags=c("outlying","monotonic",
                        "clumpy2"))
s_1 <- calc_scags_wide(pk_1[,2:5],
                scags=c("outlying","monotonic",
                        "clumpy2"))
s_0 <- s_0 |>
  mutate(status = 0)
s_1 <- s_1 |>
  mutate(status = 1)
s <- bind_rows(s_0, s_1)  
s_wide <- s |>
  pivot_longer(outlying:monotonic, names_to="scag", values_to="value") |>
  pivot_wider(names_from = status, values_from = "value") |>
  mutate(d = abs(`1`-`0`)) |>
  arrange(desc(d))

Code

# Plot top pair
ggplot(data=pk, 
       aes(x=`MDVP:Jitter(%)`, 
           y=`MDVP:Fo(Hz)`, 
           colour=factor(status))) + 
  geom_point(alpha=0.5) + 
  scale_colour_discrete_divergingx(palette="Zissou 1") +
  theme(aspect.ratio = 1,
        legend.title = element_blank())