ETC5521 Tutorial 4

Initial data analysis

Author

Prof. Di Cook

🎯 Objectives

Practice conducting initial data analyses, and make a start on learning how to assess significance of patterns.

🔧 Preparation

The reading for this week is Wickham et al. (2010) Graphical inference for Infovis.
- Complete the weekly quiz, before the deadline! - Make sure you have this list of R packages installed:

install.packages(c("tidyverse"))

Open your RStudio Project for this unit, (the one you created in week 1, ETC5521). Create a .qmd document for this weeks activities.

📥 Exercises

This tutorial focuses on IDA for the gardenR data, with the goal to answer this question:

Which variety of tomato produces the most return on investment, as measured by weight?

Exercise 1

How many types of vegetables were grown in each year?
How many vegetables were grown in 2020 that were not grown in 2021?
What are some of the data recording errors that can be seen by comparing vegetables grown in each year?

Solution

library(gardenR)
library(tidyverse)
library(ggbeeswarm)

data("garden_coords")
data("garden_harvest")
data("garden_planting")
data("garden_spending")
data("harvest_2021")
data("planting_2021")
data("spending_2021")

31 grown in 2020, and 33 grown in 2021.

To work out which differ between years, use the anti_join() function. Because there are multiple plots of vegetables use distinct() to remove duplicates, and then arrange alphabetically for ease of comparison.

in20butnot21 <- anti_join(select(garden_harvest, vegetable), select(harvest_2021, vegetable), by="vegetable") |> 
  distinct() |> 
  arrange(vegetable)
in21butnot20 <- anti_join(select(harvest_2021, vegetable), select(garden_harvest, vegetable), by="vegetable") |> 
  distinct() |> 
  arrange(vegetable)
in20butnot21$vegetable

[1] "Swiss chard" "apple"       "broccoli"    "chives"      "hot peppers"
[6] "jalapeño"    "kohlrabi"    "onions"      "pumpkins"

in21butnot20$vegetable

 [1] "apples"       "cabbage"      "dill"         "garlic"       "mint"        
 [6] "oregano"      "pumpkin"      "sweet potato" "swiss chard"  "tomatillos"  
[11] "watermelon"

swiss chard was grown in each year but capital “S” was used in 2020.
apples were grown in each year, but singular name was used in 2021.
similarly for pumpkins but in reverse, singular used in 2020!

harvest_2020 <- garden_harvest |>
  mutate(vegetable = tolower(vegetable),
         variety = tolower(variety)) |>
  mutate(vegetable = case_match(
    vegetable,
    "apple" ~ "apples",
    .default = vegetable)
  )
harvest_2021 <- harvest_2021 |>
  mutate(vegetable = case_match(
    vegetable,
    "pumpkin" ~ "pumpkins",
    .default = vegetable)
  )

So final answer for (b) is here are the vegetables not grown both years:

[1] "Swiss chard" "apple"       "broccoli"    "chives"      "hot peppers"
[6] "jalapeño"    "kohlrabi"    "onions"

 [1] "apples"       "cabbage"      "dill"         "garlic"       "mint"        
 [6] "oregano"      "sweet potato" "swiss chard"  "tomatillos"   "watermelon"

Exercise 2

Join the harvest, spending and planting data for the two years, after adding a new variable each, called year. Show your code.
Make a subset containing just the tomatoes, for each set.
Are the varieties of tomatoes grown each year the same?
Are the tomato varieties grown in the same plots each year?
When are tomatoes planted and harvested, in Lisa’s garden?

Solution

harvest_2020 <- garden_harvest |>
  mutate(year = "2020")
harvest_2021 <- harvest_2021 |>
  mutate(year = "2021")
harvest <- bind_rows(harvest_2020, harvest_2021)

spending_2020 <- garden_spending |>
  mutate(year = "2020")
spending_2021 <- spending_2021 |>
  mutate(year = "2021")
spending <- bind_rows(spending_2020, spending_2021)

planting_2020 <- garden_planting |>
  mutate(year = "2020")
planting_2021 <- planting_2021 |>
  mutate(year = "2021")
planting <- bind_rows(planting_2020, planting_2021)

harvest_toms <- harvest |> filter(vegetable == "tomatoes")
spending_toms <- spending |> filter(vegetable == "tomatoes")
planting_toms <- planting |> filter(vegetable == "tomatoes")

tom_smry <- harvest_toms |> 
  count(variety, year) |> 
  pivot_wider(names_from = year, values_from = n)
tom_smry

# A tibble: 19 × 3
   variety          `2020` `2021`
   <chr>             <int>  <int>
 1 Amish Paste          30     19
 2 Better Boy           23     NA
 3 Big Beef             21     24
 4 Black Krim           12     12
 5 Bonny Best           27     15
 6 Brandywine           16     NA
 7 Bush Goliath         NA      6
 8 Cherokee Purple      14     10
 9 Early Girl           NA     29
10 Jet Star             13     NA
11 Mortgage Lifter      18     22
12 Old German           19      4
13 San Marzano          NA     19
14 Striped German       NA      8
15 Sweet 100 Cherry     NA     28
16 grape                39     NA
17 volunteer            NA     16
18 volunteers           31     NA
19 yellow               NA      4

n_complete <- tom_smry |> mutate(m=`2020`+`2021`) |> filter(!is.na(m)) |> nrow()

There are a lot of varieties of tomatoes grown, only 7 are grown in both years.

Use only the varieties grown in both years.

Code

tom_both_yrs <- tom_smry |> mutate(m=`2020`+`2021`) |> filter(!is.na(m)) 
planting_toms |> 
  filter(variety %in% tom_both_yrs$variety) |>
  count(variety, plot, year) |>
  pivot_wider(names_from = plot, values_from = n)

# A tibble: 14 × 6
   variety         year      D     J     N     O
   <chr>           <chr> <int> <int> <int> <int>
 1 Amish Paste     2021      1    NA     1    NA
 2 Amish Paste     2020     NA     1     1    NA
 3 Big Beef        2021      1    NA     1    NA
 4 Big Beef        2020     NA    NA     1    NA
 5 Black Krim      2020     NA    NA     1    NA
 6 Black Krim      2021     NA    NA     1    NA
 7 Bonny Best      2020     NA     1    NA    NA
 8 Bonny Best      2021     NA    NA    NA     1
 9 Cherokee Purple 2020     NA     1    NA    NA
10 Cherokee Purple 2021     NA    NA     1     1
11 Mortgage Lifter 2021      1    NA     1    NA
12 Mortgage Lifter 2020     NA     1     1    NA
13 Old German      2021      1    NA    NA    NA
14 Old German      2020     NA     1    NA    NA

Not a single variety is grown in the same plot each year. This might cause problems, if the plots are not equally good for growing tomatoes.

Again, just use the tomatoes that are grown in each year.

Code

planting_toms_sub <- planting_toms |>
  filter(variety %in% tom_both_yrs$variety) |>
  mutate(type = "planting") |>
  select(variety, date, type, year)
harvest_toms_sub <- harvest_toms |>
  filter(variety %in% tom_both_yrs$variety) |>
  mutate(type = "harvest") |>
  select(variety, date, type, year)
tom_variety_order <- harvest_toms_sub |>
  group_by(variety) |>
  summarise(date = min(date)) |>
  arrange(date)
p_h_tom <- bind_rows(planting_toms_sub, harvest_toms_sub) |>
  mutate(day_yr = yday(date)) |>
  mutate(variety = factor(variety, tom_variety_order$variety))
ggplot(p_h_tom) +
  geom_point(aes(x=day_yr, y=year, colour=type)) +
  facet_wrap(~variety, ncol=4, scales="free_x") +
  scale_colour_brewer("", palette="Dark2") +
  xlab("day of year") + ylab("") +
  theme(legend.position="bottom")

Planting is usually in late May, and it is consistent for all the varieties.

Harvesting starts around 50 days after planting. 2020 had an earlier harvest than 2021 for all varieties. Big Beef tends to be harvested first, and Black Kim later. Old German had a poor harvest in 2021 relative to 2022.

Exercise 3 Try to answer the original question.

How should you calibrate weight of harvest by amount of seeds planted?
Which variety produces the most return on investment?

Solution

We’ll divide the weight by number of seeds planted. An interesting observation is that so few tomato seeds were planted! Packets of tomato seeds have lots of seeds.

Code

planting_smry <- planting_toms |>
  filter(variety %in% tom_both_yrs$variety) |>
  select(variety, date, year, number_seeds_planted, number_seeds_exact) |>
  group_by(variety, year) |>
  summarise(nseeds = sum(number_seeds_planted))
harvest_smry <- harvest_toms |>
  filter(variety %in% tom_both_yrs$variety) |>
  group_by(variety, year) |>
  summarise(weight = sum(weight))
tom_weight <- left_join(planting_smry, harvest_smry) |>
  mutate(wgt_ps = weight/nseeds)

ggplot(tom_weight, aes(
  x = fct_reorder(variety, wgt_ps),
  y = wgt_ps,
  colour = year)) +
  geom_point() + 
  coord_flip() +
  xlab("") + ylab("weight per seed") +
  scale_colour_brewer("", palette="Dark2")

Big Beef is a consistently high performer, over these two years. Old German is the most varied, top by weight in 2020 but failed in 2021.

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.