Session 4 Starter Code w/ Key

Author

Sierra Semko Krouse & Emily Rosenthal

Published

July 12, 2023

Set-up

options(stringsAsFactors=FALSE)

Warm-up

1. Sign in! Go to: https://forms.gle/exqnxGbkBMJur1FM8

2. Load the tidyverse and the psych packages

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.4.0     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1
Warning: package 'tidyr' was built under R version 4.0.5
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(psych)

Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':

    %+%, alpha

3. Load in penguins dataset and View your data

Don’t forget to double-check that your working directory is set where you want it to be! Use getwd() to check your working directory, and setwd() to set your working directory.

penguins <- read.csv("penguins.csv")
View(penguins)

4. Use tidyverse commands to do the following data processing steps (use pipes to connect your commands). Save your new dataset penguins_cleaned.

a. Remove two columns: body_mass_g and flipper_length_mm.

b. Remove NAs from the dataset

c. Recode one level of the column species.

Let’s say the researchers made a mistake! Those are actually Great penguins, not Gentoo penguins (oh no!). Make a new variable called species2, where those who have a species of Gentoo are fixed and are “Great”, and the other types of species stay the same (Adelie is still Adelie and Chinstrap is still Chinstrap).

penguins_cleaned <- penguins %>% 
  select(-body_mass_g, -flipper_length_mm) %>% 
  drop_na() %>%
  mutate(species_recoded = case_when(species == "Gentoo" ~ "Great", 
                              TRUE ~ species))

NOTE: This makes it so that the column species_recoded will be “Great” for rows where species is “Gentoo”. The second line (TRUE ~ species) makes it so that all other rows where species is not missing (e.g., where there is a value), species_recoded will equal the same value as species for that row. Or, you can do the final step one by one: mutate(species_recoded = case_when(species == "Adelie" ~ "Adelie", species == "Chinstrap" ~ "Chinstrap", species == "Gentoo" ~ "Great"))

5. Using the describe() function in the psych package, look at the variable bill_length_mm.

describe(penguins_cleaned$bill_length_mm)
   vars   n  mean   sd median trimmed  mad  min  max range skew kurtosis  se
X1    1 333 43.99 5.47   44.5   43.98 6.97 32.1 59.6  27.5 0.04     -0.9 0.3

Data Visualization Demo

Resources

  • This website has some helpful info about plotting with ggplot: https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html

  • ggplot cheat sheet: https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf

  • R-graph gallery: Help choosing and creating types of plots (NOTE: only use the ggplot section on each page) https://www.r-graph-gallery.com/index.html

Plot 1: Histogram

Let’s look at the distribution of bill_length_mm with a histogram.

1a. Set up our ggplot:

ggplot(data = penguins_cleaned, aes(x = bill_length_mm)) +

1b. Add our histogram object:

geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

NOTE: that fill changes the color of the fill, color changes the color of the outline, geom_histogram(color=black) outlines the boxes in black

Plot 2: Bar Plot

Let’s compare the bill length of each penguin species using a barplot

2a. Set up our ggplot

ggplot(data = penguins_cleaned, aes(x = species, y=bill_length_mm)) +

2b. Add a column object

geom_col(position = "dodge")

Notice that the species are ordered alphabetically. We could change this by creating an ordered factor. e.g., factor(species, levels = c("Chinstrap", "Adelie", "Gentoo"))

Bar charts aren’t the best way to look at data because it doesn’t tell us much about individual data points or the distribution of data.

Plot 3: Box Plot & Violin Plot

Let’s compare the bill length of each penguin species using some other kinds of plots.

3a. Let’s try a box plot.

i. Set up our ggplot

ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm)) +

ii. Add a box object

geom_boxplot()

This gives us some more descriptive stats about the data but we still don’t have a good feel for what the distribution of the data points look like.

3b. Let’s use a violin plot to visualize the data.

i. Set up our ggplot

ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, fill=species)) +

ii. Add a violin object

geom_violin(trim = FALSE) +

iii. Let’s say we want separate plots for each sex…

facet_wrap(~sex) +

iv. Maybe we want a boxplot on top…

`geom_boxplot(width=0.3)

NOTE: This ordering for layers is important! If we add the boxplot layer before the violin plot layer, the boxplot will not be visible:

ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, fill=species)) +
  geom_boxplot(width=0.3) +
  facet_wrap(~sex) +
  geom_violin(trim = FALSE)

Why is that? Because the violin plot is layered on top of it.

A violin plot immediately gives us an idea of the shape of our data. We also see that while the mean bill length of males and females may be similar though the distributions can be quite different.

3c. Now that we are happy with our plot choice, lets customize it further…

ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, color=species)) + geom_violin(trim = FALSE) + geom_boxplot(width=0.3) + facet_wrap(~sex) +

v. Change axis labels

labs(x = "Species", y = "Bill Length (mm)") +

vi. Set title

ggtitle("Bill Length by Species") +

vii. Change the y axis scale to start at 25 and end at 65

ylim(25,65) +

viii. Change the color scheme.

There are a few ways to do this. (We’ll just choose one to demonstrate!)

  • Some color names are built in: scale_fill_manual(values = c("gray", "midnightblue", "dodgerblue1")) +

  • We can give it hex values: scale_fill_manual(values = c("#999999", "#E69F00", "#26A2C6")) +

  • We can use color palettes: scale_fill_brewer(palette = "Dark2") +

ix. Change the overall theme

theme_classic()

In case of interest, this website lists a TON of the color names that R has built in: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

Here, you can see a list of the palettes to use in scale_fill_brewer(): https://r-graph-gallery.com/38-rcolorbrewers-palettes.html

EXTRA – What happens if we change how we fill the aes() in our graph?

Violin plot, no fill:

ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm)) + 
  geom_violin(trim = FALSE)  

Violin plot, fill = species:

ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, fill=species)) + 
  geom_violin(trim = FALSE)  

Violin plot, fill = sex:

ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, fill=sex)) +
  geom_violin(trim = FALSE)

Why does this happen? Remember, aes() stands for aesthetic mapping! When we tell R to fill by species, and we already have species as the x variable, we are not asking R to map any additional variables onto our graph. So, it sets a different color for each of the species (which was already our x variable).

But, when we set fill = sex, when species is still the x variable, we are asking R to map a third variable to its representation of the data. Therefore, it splits our species graphs by sex.

Plot 4. Scatter Plot

4a. Now lets look at the relationship between bill length and bill depth for each species.

i. Set up our ggplot

ggplot(penguins_cleaned, aes(y = bill_depth_mm, x=bill_length_mm, color=sex)) +

ii. Add a scatter plot

geom_point() +

iii. Add a regression line

geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

4b. Overall, there appears to be a negative relationship between bill length and bill depth, and this is true for males and females. BUT, could there be something we aren’t looking at or capturing? What if we break it up by species?

i. Set up our ggplot

ggplot(penguins_cleaned, aes(y = bill_depth_mm, x=bill_length_mm, color=sex)) +

ii. Add a scatter plot

geom_point() +

iii. Add a regression line

geom_smooth(method = "lm") +

iv. Split by species

facet_wrap(~species)

`geom_smooth()` using formula = 'y ~ x'

What do we see? For each species, the relationship between bill length and bill depth is positive! It seems that because of differences in bill length/depth across species, this relationship appeared negative when examined at the population level, but positive at the species level. This is why graphing (potentially by group) is so important!