Code
data(galaxies, package = "MASS")
Working with a single variable
Prof. Di Cook
Load the galaxies
data in the MASS
package and answer the following questions based on this dataset.
You can access documentation of the data (if available) using the help
function specifying the package name in the argument.
The data contains velocities in km/sec of 82 galaxies from 6 well-separated conic sections of an unfilled survey of the Corona Borealis region. The original data is from Postman et al. (1986) and this data is from Roeder with 83rd observation removed from the original data as well as typo for the 78th observation.
The description in the R help for the data says Multimodality in such surveys is evidence for voids and superclusters in the far universe.
Deciding on an appropriate null hypothesis is always tricky. If we wanted to test the statement that the data is multimodal, we could compare against a unimodal distribution, either a normal or an exponential depending on what shape we might expect.
However, the published work has already made a claim that the data is multimodal, so it would be interesting to determine if we can generate samples from a multimodal distribution that are indistinguishable from the data.
\(H_0:\) The distribution is multimodal. \(H_a:\) The distribution is something other than multimodal.
There are 82 observations.
If you said a density plot, jittered dot plot, or a histogram, you’re on the right track, because each can give a fine resolution for showing modes. (The violin plot is not any different from a density plot, when only looking at one variable.)
g <- ggplot(tibble(galaxies), aes(galaxies)) +
theme(
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()
)
g1 <- g + geom_histogram(binwidth = 1000, color = "white")
g2 <- g + geom_boxplot()
g3 <- g + geom_density()
g4 <- g + geom_violin(aes(x=galaxies, y=1), draw_quantiles = c(0.25, 0.5, 0.75))
g5 <- g + geom_quasirandom(aes(x=1, y=galaxies)) + coord_flip()
g6 <- g + geom_lv(aes(x=1, y=galaxies)) + coord_flip()
g1 + g2 + g3 + g4 + g5 + g6 + plot_layout(ncol = 2)
This code might be helpful to get you started. This code generates a jittered dotplot, but you can use your preferred type from part e.
# Generate null plots and make a lineup
galaxies_null <- tibble(.sample=1, galaxies=galaxies_sim1)
for (i in 2:19) {
gsim <- rnormmix(n=length(galaxies),
lambda=galaxies_fit$lambda,
mu=galaxies_fit$mu,
sigma=galaxies_fit$sigma)
galaxies_null <- bind_rows(galaxies_null,
tibble(.sample=i, galaxies=gsim))
}
galaxies_null <- bind_rows(galaxies_null,
tibble(.sample=20,
galaxies=galaxies))
# Randomise .sample to hide data plot
galaxies_null$.sample <- rep(sample(1:20, 20), rep(82, 20))
ggplot(tibble(galaxies_null), aes(x=galaxies)) +
geom_quasirandom(aes(x=1, y=galaxies)) +
facet_wrap(~.sample, ncol=5) +
coord_flip() +
theme(
aspect.ratio = 0.7,
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()
)
For each of the variables in the data, which-transform.csv
, decide on an appropriate transformation to make the distribution more symmetric for five of the variables and remove discreteness on one variable.
Remember the power ladder, go down to fix right-skew, and up to fix left-skew. For multi-modal find an explanatory variable, or do a severe quantile transformation.