Code
<- read_csv("https://ddde.numbat.space/data/fr-crab.csv") %>%
fr_crabs mutate(Sex = factor(Sex, levels=c(1,2),
labels=c("m", "f")))
Exploring bivariate dependencies
Prof. Di Cook
Mud crabs are delicious to eat! Prof Cook’s father started a crab farm at Fisherman’s Reach, NSW, when he retired. He caught small crabs (with a special license) and nurtured and fed the crabs until they were marketable size. They were then sent to market, like Queen Victoria Market in Melbourne, for people to buy to eat. Mud crabs have a strong and nutty flavour, and a good to eat simply after steaming or boiling.
Early in the farming setup, he collected the measurements of 62 crabs of different sizes, because he wanted to learn when was the best time to send the crab to market. Crabs re-shell from time to time. They grow too big for their shell, and need to discard it. Ideally, the crabs should be sent to market just before they re-shell, because they will be crab will be fuller in the shell, less air, less juice and more crab meat.
Note: In NSW it is legal to sell female mud crabs, as long as they are not carrying eggs. In Queensland, it is illegal to keep and sell female mud crabs. Focusing only on males could be worthwhile.
North coast of NSW, north-east of Kempsey, on the back-water of the Macleay River. We would expect it to be positive, and maybe nonlinear, because weight is more related to volume of the crab than length, which would be length\(^p\).
Wgt
should be considered the dependent variable.
It is a little nonlinear, positive and strong relationship. Weight should be considered dependent.
If you are unsure about a nonlinear relationship, fit a linear model and look at the residuals. In the plots below you can see the residuals have a U-shape, and also have major heteroskedasticity.
library(broom)
library(patchwork)
cr_lm <- lm(Wgt ~ Length.NSW, data = fr_crabs)
fr_crabs <- augment(cr_lm, fr_crabs)
p1 <- ggplot(fr_crabs,
aes(x = Length.NSW,
y = Wgt)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
p2 <- ggplot(fr_crabs, aes(x = Length.NSW, y = .resid)) +
geom_point()
p1 + p2 + plot_layout(ncol = 2)
p1 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt^1/3)) +
geom_point() +
ylab("Cube root Wgt")
p2 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt)) +
geom_point() +
scale_y_sqrt() +
ylab("Square root Wgt")
p3 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt)) +
geom_point() +
scale_y_log10() +
ylab("Log10 Wgt")
p1 + p2 + p3 + plot_layout(ncol = 3)
p1 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt, colour = Sex)) +
geom_point() +
scale_colour_brewer(palette = "Dark2") +
theme(legend.position = "bottom")
p2 <- ggplot(fr_crabs, aes(x = Length.NSW, y = Wgt, colour = Sex)) +
geom_point() +
scale_y_log10() +
scale_colour_brewer(palette = "Dark2") +
theme(legend.position = "bottom") +
ylab("Log10 Wgt")
p1 + p2 + plot_layout(ncol = 2)
When only looking at males, a cube root transformation works best, and this matches the original thinking that length is related to weight in a cubic relationship, perhaps.
fr_crabs_m <- fr_crabs %>%
filter(Sex == "m")
cr_m_lm <- lm(Wgt^(1/3) ~ Length.NSW, data = fr_crabs_m)
fr_crabs_m <- augment(cr_m_lm, fr_crabs_m)
coefs <- tidy(cr_m_lm)
p1 <- ggplot(fr_crabs_m,
aes(x = Length.NSW,
y = Wgt^(1/3))) +
geom_point() +
geom_abline(intercept=coefs$estimate[1],
slope=coefs$estimate[2])
p2 <- ggplot(fr_crabs_m,
aes(x = Length.NSW,
y = .resid)) +
geom_point()
p1 + p2 + plot_layout(ncol = 2)
You would select the bigger crabs that are heavier for their length. If they are heavier, it should mean that they are more fully fitting into their shell.
case1202
from the Sleuth2
package. What does the variable “Senior” measure? “Exper”? Age?Senior is the seniority of the employee in the company. Experience is the months of prior experience when starting at the company. Age is given in months, which is a bit strange!
p1 <- ggplot(case1202, aes(x = Age, y = Senior)) +
geom_point() +
theme(aspect.ratio = 1)
p2 <- ggplot(case1202, aes(x = Exper, y = Senior)) +
geom_point() +
theme(aspect.ratio = 1)
p3 <- ggplot(case1202, aes(x = Age, y = Exper)) +
geom_point() +
theme(aspect.ratio = 1)
p1 + p2 + p3 + plot_layout(ncol = 3)
Experience and age have a moderate to strong positive association. There is no apparent relationship between seniority and age, or experience.
It would be good to check these last two statements using visual inference.
p1 <- ggplot(case1202, aes(x = Age, y = Senior, colour = Sex)) +
geom_point() +
scale_colour_brewer("", palette = "Dark2") +
theme(legend.position = "bottom", aspect.ratio = 1)
p2 <- ggplot(case1202, aes(x = Exper, y = Senior, colour = Sex)) +
geom_point() +
scale_colour_brewer("", palette = "Dark2") +
theme(legend.position = "bottom", aspect.ratio = 1)
p3 <- ggplot(case1202, aes(x = Age, y = Exper, colour = Sex)) +
geom_point() +
scale_colour_brewer("", palette = "Dark2") +
theme(legend.position = "bottom", aspect.ratio = 1)
p1 + p2 + p3 + plot_layout(ncol = 3)
It’s not clear that there are any patterns here. Maybe some small patterns: (1) that at older age and low seniority, there are only female employees, (2) and older age less experience, there are only female employees.
Check this with visual inference.
set.seed(659)
ggplot(lineup(null_permute("Sex"), case1202),
aes(x = Age,
y = Senior,
colour = Sex)) +
geom_point(alpha=0.8) +
#geom_density2d() +
scale_colour_brewer("", palette = "Dark2") +
facet_wrap(~.sample, ncol=5) +
theme_bw() +
theme(legend.position = "none",
axis.text = element_blank(),
axis.title = element_blank())
p1 <- ggplot(case1202, aes(x = Senior)) +
geom_histogram(binwidth = 5, colour = "white") +
facet_wrap(~Sex, ncol = 1, scales="free_y")
p2 <- ggplot(case1202, aes(x = Age)) +
geom_histogram(binwidth = 50, colour = "white") +
facet_wrap(~Sex, ncol = 1, scales="free_y")
p3 <- ggplot(case1202, aes(x = Exper)) +
geom_histogram(binwidth = 50, colour = "white") +
facet_wrap(~Sex, ncol = 1, scales="free_y")
p1 + p2 + p3 + plot_layout(ncol = 3)
There are lot of different patterns!
The distribution of seniority for women is quite uniform, but for men has a clump around 85-90.
With age, there is a bimodal pattern for women, young and older - I wonder if there is a child-bearing years drop out among women. With men, there is a large spike, of young men at 350.
The experience that employees come into the company with has a different distribution for men and women. For men there is a peak at 50. For women the peak is at 50, too but it has a much stronger right-tail. This suggests that men are being hired into the company with less experience than the women.
This is where we see a bigger difference: the women generally have lower salaries than men.
Age
, Senior
and Exper
, separately by Sex, by adding a fitted linear model to the scatterplot where Sex is mapped to colour. What is the relationship and what do you learn?p1 <- ggplot(case1202, aes(x = Age,
y = Bsal,
colour = Sex)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
scale_colour_brewer("", palette = "Dark2")
p2 <- ggplot(case1202, aes(x = Senior,
y = Bsal,
colour = Sex)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
scale_colour_brewer("", palette = "Dark2")
p3 <- ggplot(case1202, aes(x = Exper,
y = Bsal,
colour = Sex)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
scale_colour_brewer("", palette = "Dark2")
p1 + p2 + p3 +
plot_layout(ncol = 3, guides = "collect") &
theme(legend.position = "bottom")
For the same age, seniority, experience women were paid less than the men, on average. The annual salary tends to decline with seniority - that’s a bit surprising. With seniority, there is the same decreasing trend, but women are paid a full $1000 less across the range of seniority, on average. The annual salary for women tends to increase with more experience, and is almost equal between the sexes at the most senior levels.
geom_smooth(method="lm")
to add a fitted model to the scatterplot, is it adding a model with interactions?Technically no, but practically yes! ggplot
fits a separate model to each subset, which will fit a different slope and intercept. It also fits a different error model to each subset, and this what makes it different from a single model with interaction terms.
We can’t get exactly the same age, seniority and experience with a small sample of data, but we could get close to this by binning the variables.
case1202 <- case1202 %>%
mutate(Exper_c = cut(Exper,
breaks=c(-1, 50, 100, 400),
labels=c("E-low",
"E-med",
"E-high")),
Senior_c = cut(Senior, 3,
labels=c("S-low",
"S-med",
"S-high")))
ggplot(case1202, aes(x = Age,
y = Bsal,
colour = Sex)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
scale_colour_brewer("", palette = "Dark2") +
facet_grid(Exper_c~Senior_c)
The plots suggest that there is evidence of sex discrimination. Particularly, the last comparison, when the data is subdivided into more homogeneous groups, there is still a disparity in salary, in every category.