ETC5521 Worksheet Week 6

Exploring bivariate dependencies

Author

Prof. Di Cook

Exercise 1: Fisherman’s Reach crabs

Mud crabs are delicious to eat! Prof Cook’s father started a crab farm at Fisherman’s Reach, NSW, when he retired. He caught small crabs (with a special license) and nurtured and fed the crabs until they were marketable size. They were then sent to market, like Queen Victoria Market in Melbourne, for people to buy to eat. Mud crabs have a strong and nutty flavour, and a good to eat simply after steaming or boiling.

Early in the farming setup, he collected the measurements of 62 crabs of different sizes, because he wanted to learn when was the best time to send the crab to market. Crabs re-shell from time to time. They grow too big for their shell, and need to discard it. Ideally, the crabs should be sent to market just before they re-shell, because they will be crab will be fuller in the shell, less air, less juice and more crab meat.

Note: In NSW it is legal to sell female mud crabs, as long as they are not carrying eggs. In Queensland, it is illegal to keep and sell female mud crabs. Focusing only on males could be worthwhile.

Code

fr_crabs <- read_csv("https://ddde.numbat.space/data/fr-crab.csv") %>%
  mutate(Sex = factor(Sex, levels=c(1,2),
                      labels=c("m", "f")))

Where is Fisherman’s Reach? What would you expect the relationship between Length and Weight of a crab to be?

Solution

North coast of NSW, north-east of Kempsey, on the back-water of the Macleay River. We would expect it to be positive, and maybe nonlinear, because weight is more related to volume of the crab than length, which would be length$^p$.

Make a scatterplot of Weight by NSW Length. Describe the relationship. It might be even better if you can add marginal density plots to the sides of the scatterplot. (Aside: Should one variable be considered a dependent variable? If so, make sure this is on the $y$ axis.)

Solution

Wgt should be considered the dependent variable.

Code

ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt)) +
  geom_point()

It is a little nonlinear, positive and strong relationship. Weight should be considered dependent.

If you are unsure about a nonlinear relationship, fit a linear model and look at the residuals. In the plots below you can see the residuals have a U-shape, and also have major heteroskedasticity.

Code

library(broom)
library(patchwork)
cr_lm <- lm(Wgt ~ Length.NSW, data = fr_crabs)
fr_crabs <- augment(cr_lm, fr_crabs)
p1 <- ggplot(fr_crabs, 
             aes(x = Length.NSW, 
                 y = Wgt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p2 <- ggplot(fr_crabs, aes(x = Length.NSW, y = .resid)) +
  geom_point()
p1 + p2 + plot_layout(ncol = 2)

Examine transformations to linearise the relationship. (Think about why the relationship between Length and Weight is nonlinear.)

Solution

Code

p1 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt^1/3)) +
  geom_point() +
  ylab("Cube root Wgt")
p2 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt)) +
  geom_point() +
  scale_y_sqrt() +
  ylab("Square root Wgt")
p3 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt)) +
  geom_point() +
  scale_y_log10() +
  ylab("Log10 Wgt")
p1 + p2 + p3 + plot_layout(ncol = 3)

Is there possibly a lurking variable? Examine the variables in the data, and use colour in the plot to check for another variable explaining some of the relationship.

Solution

Code

p1 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt, colour = Sex)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  theme(legend.position = "bottom")
p2 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt, colour = Sex)) +
  geom_point() +
  scale_y_log10() +
  scale_colour_brewer(palette = "Dark2") +
  theme(legend.position = "bottom") +
  ylab("Log10 Wgt")
p1 + p2 + plot_layout(ncol = 2)

If you have determined that the is a lurking variable, make changes in the plots to find the best model of the relationship between Weight and Length.

Solution

When only looking at males, a cube root transformation works best, and this matches the original thinking that length is related to weight in a cubic relationship, perhaps.

Code

fr_crabs_m <- fr_crabs %>%
  filter(Sex == "m")
cr_m_lm <- lm(Wgt^(1/3) ~ Length.NSW, data = fr_crabs_m)
fr_crabs_m <- augment(cr_m_lm, fr_crabs_m)
coefs <- tidy(cr_m_lm)

p1 <- ggplot(fr_crabs_m, 
       aes(x = Length.NSW, 
           y = Wgt^(1/3))) +
  geom_point() +
  geom_abline(intercept=coefs$estimate[1], 
              slope=coefs$estimate[2])
p2 <- ggplot(fr_crabs_m, 
       aes(x = Length.NSW, 
           y = .resid)) +
  geom_point() 
p1 + p2 + plot_layout(ncol = 2)

How would you select the crabs that were close to re-shelling based on this data?

Solution

You would select the bigger crabs that are heavier for their length. If they are heavier, it should mean that they are more fully fitting into their shell.

Exercise 2: Bank discrimination

Code

data(case1202, package = "Sleuth2")

Look at the help page for the case1202 from the Sleuth2 package. What does the variable “Senior” measure? “Exper”? Age?

Solution

Senior is the seniority of the employee in the company. Experience is the months of prior experience when starting at the company. Age is given in months, which is a bit strange!

Make all the pairwise scatterplots of Senior, Exper and Age. What do you learn about the relationship between these three pairs of variables? How can the age be 600? Are there some wizards or witches or vampires in the data?

Solution

Code

p1 <- ggplot(case1202, aes(x = Age, y = Senior)) +
  geom_point() +
  theme(aspect.ratio = 1)
p2 <- ggplot(case1202, aes(x = Exper, y = Senior)) +
  geom_point() +
  theme(aspect.ratio = 1)
p3 <- ggplot(case1202, aes(x = Age, y = Exper)) +
  geom_point() +
  theme(aspect.ratio = 1)
p1 + p2 + p3 + plot_layout(ncol = 3)

Experience and age have a moderate to strong positive association. There is no apparent relationship between seniority and age, or experience.

It would be good to check these last two statements using visual inference.

Code

set.seed(956)
ggplot(lineup(null_permute("Exper"), case1202),
       aes(x = Exper, y = Senior)) +
  geom_point() + 
  #geom_density2d_filled() +
  facet_wrap(~.sample) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())

Colour the observations by Sex. What do you learn?

Solution

Code

p1 <- ggplot(case1202, aes(x = Age, y = Senior, colour = Sex)) +
  geom_point() +
  scale_colour_brewer("", palette = "Dark2") +
  theme(legend.position = "bottom", aspect.ratio = 1) 
p2 <- ggplot(case1202, aes(x = Exper, y = Senior, colour = Sex)) +
  geom_point() +
  scale_colour_brewer("", palette = "Dark2") +
  theme(legend.position = "bottom", aspect.ratio = 1)
p3 <- ggplot(case1202, aes(x = Age, y = Exper, colour = Sex)) +
  geom_point() +
  scale_colour_brewer("", palette = "Dark2") +
  theme(legend.position = "bottom", aspect.ratio = 1)
p1 + p2 + p3 + plot_layout(ncol = 3)

It’s not clear that there are any patterns here. Maybe some small patterns: (1) that at older age and low seniority, there are only female employees, (2) and older age less experience, there are only female employees.

Check this with visual inference.

Code

set.seed(659)
ggplot(lineup(null_permute("Sex"), case1202), 
       aes(x = Age, 
           y = Senior, 
           colour = Sex)) +
  geom_point(alpha=0.8) +
  #geom_density2d() +
  scale_colour_brewer("", palette = "Dark2") +
  facet_wrap(~.sample, ncol=5) +
  theme_bw() +
  theme(legend.position = "none",
    axis.text = element_blank(),
    axis.title = element_blank())

Instead of scatterplots, make faceted histograms of the three variables by Sex. What do you learn about the difference in distribution of these three variables between the sexes.

Solution

Code

p1 <- ggplot(case1202, aes(x = Senior)) +
  geom_histogram(binwidth = 5, colour = "white") +
  facet_wrap(~Sex, ncol = 1, scales="free_y")
p2 <- ggplot(case1202, aes(x = Age)) +
  geom_histogram(binwidth = 50, colour = "white") +
  facet_wrap(~Sex, ncol = 1, scales="free_y")
p3 <- ggplot(case1202, aes(x = Exper)) +
  geom_histogram(binwidth = 50, colour = "white") +
  facet_wrap(~Sex, ncol = 1, scales="free_y")
p1 + p2 + p3 + plot_layout(ncol = 3)

There are lot of different patterns!

The distribution of seniority for women is quite uniform, but for men has a clump around 85-90.

With age, there is a bimodal pattern for women, young and older - I wonder if there is a child-bearing years drop out among women. With men, there is a large spike, of young men at 350.

The experience that employees come into the company with has a different distribution for men and women. For men there is a peak at 50. For women the peak is at 50, too but it has a much stronger right-tail. This suggests that men are being hired into the company with less experience than the women.

The data also has 1975 salary and annual salary. Plot these two variables, in two ways: (1) coloured by Sex, and (2) faceted by Sex. Explain the relationships.

Solution

Code

ggplot(case1202, aes(x = Sal77, 
                     y = Bsal, 
                     colour=Sex)) +
  geom_point(size=2, alpha=0.8) +
  scale_colour_brewer("", palette = "Dark2") +
  theme(aspect.ratio=1)

Code

ggplot(case1202, aes(x = Sal77, y = Bsal)) +
  geom_point() +
  facet_wrap(~Sex) + 
  theme(aspect.ratio=1)

This is where we see a bigger difference: the women generally have lower salaries than men.

Examine the annual salary against Age, Senior and Exper, separately by Sex, by adding a fitted linear model to the scatterplot where Sex is mapped to colour. What is the relationship and what do you learn?

Solution

Code

p1 <- ggplot(case1202, aes(x = Age, 
                           y = Bsal, 
                           colour = Sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  scale_colour_brewer("", palette = "Dark2") 
p2 <- ggplot(case1202, aes(x = Senior, 
                           y = Bsal, 
                           colour = Sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  scale_colour_brewer("", palette = "Dark2") 
p3 <- ggplot(case1202, aes(x = Exper, 
                           y = Bsal, 
                           colour = Sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  scale_colour_brewer("", palette = "Dark2") 
p1 + p2 + p3 + 
  plot_layout(ncol = 3, guides = "collect") & 
  theme(legend.position = "bottom")

For the same age, seniority, experience women were paid less than the men, on average. The annual salary tends to decline with seniority - that’s a bit surprising. With seniority, there is the same decreasing trend, but women are paid a full $1000 less across the range of seniority, on average. The annual salary for women tends to increase with more experience, and is almost equal between the sexes at the most senior levels.

When you use geom_smooth(method="lm") to add a fitted model to the scatterplot, is it adding a model with interactions?

Solution

Technically no, but practically yes! ggplot fits a separate model to each subset, which will fit a different slope and intercept. It also fits a different error model to each subset, and this what makes it different from a single model with interaction terms.

There is danger of misinterpreting differences when only examining marginal plots. What we need to know is: for a person with the same age, same experience, same seniority, is the salary different for men and women. How would you make plots to try to examine this?

Solution

We can’t get exactly the same age, seniority and experience with a small sample of data, but we could get close to this by binning the variables.

Code

case1202 <- case1202 %>%
  mutate(Exper_c = cut(Exper, 
                       breaks=c(-1, 50, 100, 400), 
                       labels=c("E-low", 
                                "E-med", 
                                "E-high")),
         Senior_c = cut(Senior, 3, 
                        labels=c("S-low", 
                                 "S-med",
                                 "S-high")))
ggplot(case1202, aes(x = Age, 
                           y = Bsal, 
                           colour = Sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  scale_colour_brewer("", palette = "Dark2") +
  facet_grid(Exper_c~Senior_c)

Would you say that this data provides evidence of sex discrimination?

Solution

The plots suggest that there is evidence of sex discrimination. Particularly, the last comparison, when the data is subdivided into more homogeneous groups, there is still a disparity in salary, in every category.