# remotes::install_github("llendway/gardenR")
# install.packages("tidytuesdayR")
library(tidytuesdayR)
tuesdata <- tidytuesdayR::tt_load('2024-05-28')
save(tuesdata, file="tuesdata.rda")
Introduction to exploratory data analysis
Prof. Di Cook
The goal of this worksheet is to tackle a data analysis together, by
The data is available from the Tidy Tuesday page for 28 May 2024. Download the data from that page, ideally using the tidytuesdayR package. You should download the data from Tidy Tuesday only once, and save a copy locally on your computer.
In addition, the gardenR package, which can be installed with remotes::install_github("llendway/gardenR"), has extra details about the garden.
This is the code to download the data, and load relevant libraries.
library(gardenR)
library(tidyverse)
library(ggbeeswarm)
load("tuesdata.rda")
harvest_2020 <- tuesdata$harvest_2020
harvest_2021 <- tuesdata$harvest_2021
planting_2020 <- tuesdata$planting_2020
planting_2021 <- tuesdata$planting_2021
spending_2020 <- tuesdata$spending_2020
spending_2021 <- tuesdata$spending_2021
data("garden_coords")
The data was collected in 2020 and 2021, and has a variety of numeric, categorical and time variables.
The link to the ChatGPT conversation is here. It was prompted by
We chose to tackle one of the Economics questions: “How much produce (by weight or value) was harvested per dollar spent?” However, we realised that it was not possible to answer this particular question with this data, so it was refined to:
Compare the ROI for varieties of beans.
The next step was to filter the data to focus on one vegetable, beans, and one year to get started.
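A minimal sketch of this filtering step, using a small stand-in table rather than the real data. The `vegetable` column and its value `"beans"` are assumptions about the Tidy Tuesday tables, and the lettuce row is made up so that the filter has something to drop. With the real data, `spending_2020` (and likewise `planting_2020` and `harvest_2020`) loaded above would take the place of the stand-in.

```r
library(dplyr)

# Stand-in for spending_2020. The bean rows copy values printed in this
# worksheet; the lettuce row is hypothetical.
spending_demo <- tribble(
  ~vegetable, ~variety,              ~brand,           ~price_with_tax,
  "beans",    "Bush Bush Slender",   "Renee's Garden", 3.01,
  "beans",    "Chinese Red Noodle",  "Baker Creek",    3.24,
  "beans",    "Classic Slenderette", "Renee's Garden", 3.23,
  "lettuce",  "Hypothetical Leaf",   "Made-up Seeds",  2.50
)

# Keep only the bean rows (the column name `vegetable` is an assumption)
beans_spending_demo <- spending_demo |>
  filter(vegetable == "beans")

# Show the columns used in the rest of the analysis
beans_spending_demo |>
  select(variety, brand, price_with_tax)
```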
We might expect some difference in ROI between varieties.
We need to decide on a common scale. Steps are:
# A tibble: 3 × 3
  variety             brand          price_with_tax
  <chr>               <chr>                   <dbl>
1 Bush Bush Slender   Renee's Garden           3.01
2 Chinese Red Noodle  Baker Creek              3.24
3 Classic Slenderette Renee's Garden           3.23
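# Total number of seeds planted for each variety
beans_planting_2020_smry <- beans_planting_2020 |>
  group_by(variety) |>
  summarise(number_seeds_planted = sum(number_seeds_planted))

# Attach the seed counts to the price paid for each variety
beans_2020_smry <- beans_spending_2020 |>
  select(variety, brand, price_with_tax) |>
  left_join(beans_planting_2020_smry, by = "variety")

# Compare packet price with the number of seeds planted
ggplot(beans_2020_smry,
       aes(price_with_tax,
           number_seeds_planted)) +
  geom_point()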
# A tibble: 3 × 2
  variety             pr_per_seed
  <chr>                     <dbl>
1 Bush Bush Slender        0.0752
2 Chinese Red Noodle       0.324
3 Classic Slenderette      0.111
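The price-per-seed values above are consistent with dividing the packet price by the number of seeds planted (for example, 3.24 / 10 = 0.324). The code that produced them is not shown, so the following is only a sketch under that assumption, using the values printed in this worksheet.

```r
library(dplyr)

# Values copied from the tables printed in this worksheet
beans_2020_demo <- tribble(
  ~variety,              ~price_with_tax, ~number_seeds_planted,
  "Bush Bush Slender",   3.01,            40,
  "Chinese Red Noodle",  3.24,            10,
  "Classic Slenderette", 3.23,            29
)

# Price per seed: packet price spread over the seeds that were planted
price_per_seed_demo <- beans_2020_demo |>
  mutate(pr_per_seed = price_with_tax / number_seeds_planted) |>
  select(variety, pr_per_seed)

price_per_seed_demo
```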
# A tibble: 3 × 2
  variety             weight
  <chr>                <dbl>
1 Bush Bush Slender    10038
2 Chinese Red Noodle     356
3 Classic Slenderette   1635
# A tibble: 3 × 6
  variety             brand     price_with_tax number_seeds_planted weight   psy
  <chr>               <chr>              <dbl>                <dbl>  <dbl> <dbl>
1 Bush Bush Slender   Renee's …           3.01                   40  10038 251.
2 Chinese Red Noodle  Baker Cr…           3.24                   10    356  35.6
3 Classic Slenderette Renee's …           3.23                   29   1635  56.4
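The `psy` column is consistent with total harvested weight divided by the number of seeds planted (for example, 356 / 10 = 35.6). A minimal sketch under that assumption, again using the values printed above rather than the real tables:

```r
library(dplyr)

# Seeds planted and total harvested weight, copied from this worksheet
beans_yield_demo <- tribble(
  ~variety,              ~number_seeds_planted, ~weight,
  "Bush Bush Slender",   40,                    10038,
  "Chinese Red Noodle",  10,                    356,
  "Classic Slenderette", 29,                    1635
) |>
  # Yield per seed planted (assumed definition of `psy`)
  mutate(psy = weight / number_seeds_planted)

beans_yield_demo
```

Dividing by seeds planted puts the three varieties on a common scale, since they were planted in very different quantities.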
Problems encountered.
Adjustments:
Conclusion:
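Bush Bush Slender outperforms the other two by a lot: it yields about 251 units of weight per seed planted, compared with 56.4 and 35.6 for the others. Combined with its having the lowest price per seed, this suggests that the variety also gives the best ROI.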
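One possible way to express this as a single ROI-style number is weight harvested per dollar spent on seed. This calculation is not part of the original analysis. It assumes the whole packet price is attributed to the seeds that were planted, in which case yield per seed divided by price per seed reduces to `weight / price_with_tax`. The column name `weight_per_dollar` is introduced here for illustration.

```r
library(dplyr)

# Packet price and total harvested weight, copied from this worksheet
beans_roi_demo <- tribble(
  ~variety,              ~price_with_tax, ~weight,
  "Bush Bush Slender",   3.01,            10038,
  "Chinese Red Noodle",  3.24,            356,
  "Classic Slenderette", 3.23,            1635
) |>
  # Weight harvested per dollar of seed (hypothetical ROI measure)
  mutate(weight_per_dollar = weight / price_with_tax) |>
  # Best-performing variety first
  arrange(desc(weight_per_dollar))

beans_roi_demo
```

This ignores every cost other than seed, so it is a rough comparison between varieties rather than a full return on investment.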
Caveats: Need to check that
Examine the 2021 data: if the same varieties were grown, do the same results hold?
Problems discovered in doing this: the name of the variety in 2021 might have changed to be just “Bush”, and the other two varieties were not used. We could instead compare against the one new variety.
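A quick way to spot renamed or dropped varieties is to compare the sets of variety names across years. The sketch below uses stand-in vectors: the 2020 names come from the tables in this worksheet, while the 2021 vector is hypothetical apart from "Bush", and the name of the new variety is a placeholder. With the real data these would be the distinct `variety` values from the filtered 2020 and 2021 tables.

```r
# Variety names in 2020, copied from this worksheet
varieties_2020 <- c("Bush Bush Slender", "Chinese Red Noodle", "Classic Slenderette")
# Hypothetical 2021 names: "Bush" is mentioned in the text, the other is a placeholder
varieties_2021 <- c("Bush", "Placeholder New Variety")

# Names that match exactly across the two years
exact_matches <- intersect(varieties_2020, varieties_2021)

# 2020 names with no exact match in 2021
missing_in_2021 <- setdiff(varieties_2020, varieties_2021)

# Possible renames: a 2021 name that is a prefix of a 2020 name
pairs <- expand.grid(v2021 = varieties_2021, v2020 = varieties_2020,
                     stringsAsFactors = FALSE)
possible_renames <- pairs[startsWith(pairs$v2020, pairs$v2021), ]

possible_renames
```

A prefix match is only a hint that two names refer to the same variety; it would still need to be confirmed against the planting notes or the seed packets.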