ETC5521 Worksheet Week 5

Working with a single variable

Author

Prof. Di Cook

1. Understanding the velocity of galaxies

Load the galaxies data from the MASS package and answer the following questions based on this dataset.

data(galaxies, package = "MASS")

You can access the documentation for the data (if available) using the help function, specifying the package name in the package argument.

help(galaxies, package = "MASS")
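
Alongside the help page, a quick look at the object itself may be useful before answering the questions below. This is only a minimal sketch using base R functions; the output is not shown here so that you can inspect it yourself.

```r
# Load the data and take a first look at its structure
data(galaxies, package = "MASS")

str(galaxies)      # storage type and the first few values
length(galaxies)   # number of observations
summary(galaxies)  # minimum, quartiles, mean and maximum
```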

What does the data contain? And what is the data source?

Based on the description in the R Help for the data, what would be an appropriate null distribution of this data?

How many observations are there?

If the data is multimodal, which of the following displays do you think would be the best? Which would not be at all useful?

histogram
boxplot
density plot
violin plot
jittered dot plot
letter value plot

Make these plots for the data. Experiment with different binwidths for the histogram and different bandwidths for the density plot. Were you right in your thinking about which would be the best?
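
As a starting point for the experimentation, here is a minimal sketch of a histogram and a density plot with ggplot2. The data frame `gal`, the column name `velocity`, and the values `binwidth = 1000` and `adjust = 0.5` are assumptions chosen for illustration, not recommended settings; change them and watch how the apparent shape changes.

```r
library(ggplot2)

data(galaxies, package = "MASS")
gal <- data.frame(velocity = galaxies)  # hypothetical name for the plotting data

# Histogram: binwidth is given in the units of the data.
# Try several values to see when modes appear or disappear.
p_hist <- ggplot(gal, aes(x = velocity)) +
  geom_histogram(binwidth = 1000, colour = "white")

# Density plot: adjust multiplies the default bandwidth,
# so values below 1 give a wigglier curve and values above 1 a smoother one.
p_dens <- ggplot(gal, aes(x = velocity)) +
  geom_density(adjust = 0.5)

p_hist
p_dens
```

The same pattern extends to the other display types in the list by swapping the geom.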

Fit your best mixture model to the data, and simulate 19 nulls to make a lineup. Did you do a good job in matching the distribution, i.e. does the data plot stand out or not? (Extra bonus: What is the model that you have created? Can you make a plot to show how it looks relative to the observed data?)

This code might help you get started. It generates a jittered dot plot, but you can use whichever plot type you preferred in part e.

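# Fit a mixture model
library(mixtools)
data(galaxies, package = "MASS")

# Set the seed before fitting: normalmixEM() uses random starting values
set.seed(1138)
galaxies_fit <- normalmixEM(galaxies, k = 3)

# Simulate one sample from the fitted mixture
galaxies_sim1 <- rnormmix(n = length(galaxies),
                          lambda = galaxies_fit$lambda,
                          mu = galaxies_fit$mu,
                          sigma = galaxies_fit$sigma)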

# Plot the simulated data
library(ggplot2)
library(tibble)
library(ggbeeswarm)  # provides geom_quasirandom()

ggplot(tibble(galaxies_sim1), aes(x = 1, y = galaxies_sim1)) +
  geom_quasirandom() +
  coord_flip() +
  theme(
    aspect.ratio = 0.7,
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )

# Generate null plots and make a lineup
library(dplyr)  # for bind_rows()

n_obs <- length(galaxies)
galaxies_null <- tibble(.sample = 1, galaxies = galaxies_sim1)
for (i in 2:19) {
  gsim <- rnormmix(n = n_obs,
                   lambda = galaxies_fit$lambda,
                   mu = galaxies_fit$mu,
                   sigma = galaxies_fit$sigma)
  galaxies_null <- bind_rows(galaxies_null,
                             tibble(.sample = i, galaxies = gsim))
}
# Add the observed data as the 20th sample
galaxies_null <- bind_rows(galaxies_null,
                           tibble(.sample = 20, galaxies = galaxies))
# Randomise .sample to hide the data plot
galaxies_null$.sample <- rep(sample(1:20), each = n_obs)
# The position of the data plot can be recovered from the last row:
# tail(galaxies_null$.sample, 1)
ggplot(galaxies_null, aes(x = 1, y = galaxies)) +
  geom_quasirandom() +
  facet_wrap(~.sample, ncol = 5) +
  coord_flip() +
  theme(
    aspect.ratio = 0.7,
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )
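
For the extra bonus question, one possible way to show the fitted model against the observed data is to overlay the mixture density on a histogram. The sketch below assumes a three-component fit as in the starter code; the names `fit`, `grid`, `dens` and `model_curve`, the grid size of 500 and the binwidth of 1000 are illustrative choices, not part of the worksheet.

```r
library(mixtools)
library(ggplot2)

data(galaxies, package = "MASS")

set.seed(1138)
fit <- normalmixEM(galaxies, k = 3)

# Mixture density on a grid: the weighted sum of the component normal densities
grid <- seq(min(galaxies), max(galaxies), length.out = 500)
dens <- sapply(grid, function(x) {
  sum(fit$lambda * dnorm(x, mean = fit$mu, sd = fit$sigma))
})
model_curve <- data.frame(velocity = grid, density = dens)

# Histogram on the density scale so that the curve and the bars are comparable
ggplot(data.frame(velocity = galaxies), aes(x = velocity)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1000,
                 fill = "grey80", colour = "white") +
  geom_line(data = model_curve, aes(y = density), colour = "orange")
```

Printing `fit$lambda`, `fit$mu` and `fit$sigma` gives the mixing proportions, means and standard deviations that define the model.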

2. What is the best transformation to make?

For each of the variables in the data, which-transform.csv, decide on an appropriate transformation. For five of the variables the aim is to make the distribution more symmetric, and for one variable the aim is to remove discreteness.
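
To illustrate the kind of comparison involved, here is a minimal sketch on simulated stand-ins rather than the variables in which-transform.csv. The vectors `right_skewed` and `discrete`, the helper `skew()`, and the choice of jittering as a way to break up discreteness are all assumptions made for illustration; the variables in the file may call for different transformations.

```r
set.seed(5521)

# Simulated stand-ins, not the worksheet data
right_skewed <- rlnorm(500)                   # long right tail
discrete <- sample(1:5, 500, replace = TRUE)  # only five distinct values

# Simple moment-based sample skewness; values near 0 suggest symmetry
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Compare candidate transformations on the skewed variable
skew(right_skewed)
skew(sqrt(right_skewed))
skew(log(right_skewed))

# jitter() adds a small amount of uniform noise, spreading out tied values
smoothed <- jitter(discrete, amount = 0.5)
length(unique(discrete))
length(unique(smoothed))
```

A numerical summary such as skewness is only a rough guide; plotting each variable before and after transforming is the more reliable check.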