options(stringsAsFactors=FALSE)
Session 4 Starter Code w/ Key
Set-up
Warm-up
1. Sign in! Go to: https://forms.gle/exqnxGbkBMJur1FM8
2. Load the tidyverse
and the psych
packages
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.4.0 ✔ purrr 0.3.4
✔ tibble 3.1.6 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.1
Warning: package 'tidyr' was built under R version 4.0.5
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
3. Load in penguins dataset and View
your data
Don’t forget to double-check that your working directory is set where you want it to be! Use getwd()
to check your working directory, and setwd()
to set your working directory.
<- read.csv("penguins.csv")
penguins View(penguins)
4. Use tidyverse commands to do the following data processing steps (use pipes to connect your commands). Save your new dataset penguins_cleaned
.
a. Remove two columns: body_mass_g
and flipper_length_mm
.
b. Remove NAs
from the dataset
c. Recode one level of the column species
.
Let’s say the researchers made a mistake! Those are actually Great penguins, not Gentoo penguins (oh no!). Make a new variable called species2, where those who have a species of Gentoo are fixed and are “Great”, and the other types of species stay the same (Adelie is still Adelie and Chinstrap is still Chinstrap).
<- penguins %>%
penguins_cleaned select(-body_mass_g, -flipper_length_mm) %>%
drop_na() %>%
mutate(species_recoded = case_when(species == "Gentoo" ~ "Great",
TRUE ~ species))
NOTE: This makes it so that the column species_recoded will be “Great” for rows where species is “Gentoo”. The second line (TRUE ~ species) makes it so that all other rows where species is not missing (e.g., where there is a value), species_recoded will equal the same value as species for that row. Or, you can do the final step one by one: mutate(species_recoded = case_when(species == "Adelie" ~ "Adelie", species == "Chinstrap" ~ "Chinstrap", species == "Gentoo" ~ "Great"))
5. Using the describe()
function in the psych
package, look at the variable bill_length_mm
.
describe(penguins_cleaned$bill_length_mm)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 333 43.99 5.47 44.5 43.98 6.97 32.1 59.6 27.5 0.04 -0.9 0.3
Data Visualization Demo
Resources
This website has some helpful info about plotting with ggplot: https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html
ggplot cheat sheet: https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf
R-graph gallery: Help choosing and creating types of plots (NOTE: only use the ggplot section on each page) https://www.r-graph-gallery.com/index.html
Plot 1: Histogram
Let’s look at the distribution of bill_length_mm
with a histogram.
1a. Set up our ggplot:
ggplot(data = penguins_cleaned, aes(x = bill_length_mm)) +
1b. Add our histogram object:
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
NOTE: that fill
changes the color of the fill, color
changes the color of the outline, geom_histogram(color=black)
outlines the boxes in black
Plot 2: Bar Plot
Let’s compare the bill length of each penguin species using a barplot
2a. Set up our ggplot
ggplot(data = penguins_cleaned, aes(x = species, y=bill_length_mm)) +
2b. Add a column object
geom_col(position = "dodge")
Notice that the species are ordered alphabetically. We could change this by creating an ordered factor. e.g., factor(species, levels = c("Chinstrap", "Adelie", "Gentoo"))
Bar charts aren’t the best way to look at data because it doesn’t tell us much about individual data points or the distribution of data.
Plot 3: Box Plot & Violin Plot
Let’s compare the bill length of each penguin species using some other kinds of plots.
3a. Let’s try a box plot.
i. Set up our ggplot
ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm)) +
ii. Add a box object
geom_boxplot()
This gives us some more descriptive stats about the data but we still don’t have a good feel for what the distribution of the data points look like.
3b. Let’s use a violin plot to visualize the data.
i. Set up our ggplot
ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, fill=species)) +
ii. Add a violin object
geom_violin(trim = FALSE) +
iii. Let’s say we want separate plots for each sex…
facet_wrap(~sex) +
iv. Maybe we want a boxplot on top…
`geom_boxplot(width=0.3)
NOTE: This ordering for layers is important! If we add the boxplot layer before the violin plot layer, the boxplot will not be visible:
ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, fill=species)) +
geom_boxplot(width=0.3) +
facet_wrap(~sex) +
geom_violin(trim = FALSE)
Why is that? Because the violin plot is layered on top of it.
A violin plot immediately gives us an idea of the shape of our data. We also see that while the mean bill length of males and females may be similar though the distributions can be quite different.
3c. Now that we are happy with our plot choice, lets customize it further…
ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, color=species)) + geom_violin(trim = FALSE) + geom_boxplot(width=0.3) + facet_wrap(~sex) +
v. Change axis labels
labs(x = "Species", y = "Bill Length (mm)") +
vi. Set title
ggtitle("Bill Length by Species") +
vii. Change the y axis scale to start at 25 and end at 65
ylim(25,65) +
viii. Change the color scheme.
There are a few ways to do this. (We’ll just choose one to demonstrate!)
Some color names are built in:
scale_fill_manual(values = c("gray", "midnightblue", "dodgerblue1")) +
We can give it hex values:
scale_fill_manual(values = c("#999999", "#E69F00", "#26A2C6")) +
We can use color palettes:
scale_fill_brewer(palette = "Dark2") +
ix. Change the overall theme
theme_classic()
In case of interest, this website lists a TON of the color names that R has built in: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
Here, you can see a list of the palettes to use in scale_fill_brewer()
: https://r-graph-gallery.com/38-rcolorbrewers-palettes.html
EXTRA – What happens if we change how we fill
the aes()
in our graph?
Violin plot, no fill:
ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm)) +
geom_violin(trim = FALSE)
Violin plot, fill = species
:
ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, fill=species)) +
geom_violin(trim = FALSE)
Violin plot, fill = sex
:
ggplot(data = penguins_cleaned, aes(x = species, y = bill_length_mm, fill=sex)) +
geom_violin(trim = FALSE)
Why does this happen? Remember, aes()
stands for aesthetic mapping! When we tell R to fill by species, and we already have species as the x variable, we are not asking R to map any additional variables onto our graph. So, it sets a different color for each of the species (which was already our x variable).
But, when we set fill = sex, when species is still the x variable, we are asking R to map a third variable to its representation of the data. Therefore, it splits our species graphs by sex.
Plot 4. Scatter Plot
4a. Now lets look at the relationship between bill length and bill depth for each species.
i. Set up our ggplot
ggplot(penguins_cleaned, aes(y = bill_depth_mm, x=bill_length_mm, color=sex)) +
ii. Add a scatter plot
geom_point() +
iii. Add a regression line
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
4b. Overall, there appears to be a negative relationship between bill length and bill depth, and this is true for males and females. BUT, could there be something we aren’t looking at or capturing? What if we break it up by species?
i. Set up our ggplot
ggplot(penguins_cleaned, aes(y = bill_depth_mm, x=bill_length_mm, color=sex)) +
ii. Add a scatter plot
geom_point() +
iii. Add a regression line
geom_smooth(method = "lm") +
iv. Split by species
facet_wrap(~species)
`geom_smooth()` using formula = 'y ~ x'
What do we see? For each species, the relationship between bill length and bill depth is positive! It seems that because of differences in bill length/depth across species, this relationship appeared negative when examined at the population level, but positive at the species level. This is why graphing (potentially by group) is so important!