R Bootcamp Session 3 Starter Code w/ Key

Author

Sierra Semko Krouse & Emily Rosenthal

Published

June 28, 2023

Set options

options(stringsAsFactors = FALSE)

Warm-up

(1)

Please sign in! Go to: https://tinyurl.com/session3-signin

(2)

Make sure you have downloaded and saved the following three files:

The COVID attitudes data “covid_attitudes_2023.csv” (you should have downloaded this last session)
The penguins data “penguins.csv” (we will use this for the warm up)
The cleaned penguins data “penguins_cleaned.csv”

Note that we are not using the cleaned penguins data right now, but we will be working with it later today.

Make sure all files are saved in the appropriate location!

(3)

Load the penguins data set (the document is titled “penguins.csv”). Name it penguins.

Hint: if it’s not working, check your working directory using the getwd() command. Is it where you want it to be located? If not, set your working directory!

getwd()

[1] "/Users/sierrasemko/Desktop/Bootcamp_2023/session3"

# setwd("/Users/sierrasemko/Desktop/Bootcamp_2023/session3") # replace this path with yours, if you need to set your working directory! 

penguins <- read.csv("../penguins.csv") # remember that you might need to remove "../" from the file name above if your data csv is stored in the same folder as your working directory.

(4)

Look at the structure of your data using the str() command. What types of variables do you have?

str(penguins)

'data.frame':   344 obs. of  9 variables:
 $ penguin          : chr  "p1" "p2" "p3" "p4" ...
 $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : chr  "male" "female" "female" NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

I see character (for example, “penguin” and “species”), numeric (e.g., “bill_length_mm” and “bill_depth_mm”), and integer variables (e.g., “body_mass_g” and “year”) in this data set.

(5)

What is the mean body mass of the penguins (in grams)?

summary(penguins$body_mass_g)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2700    3550    4050    4202    4750    6300       2

The mean body mass (in grams) is 4202.

(6)

How many penguins were measured in each year?

Hint: you will need to make year a factor first.

penguins$year <- factor(penguins$year)

summary(penguins$year)

2007 2008 2009 
 110  114  120

There were 110 measured in 2007, 114 measured in 2008, and 120 measured in 2009.

Loading packages

We want to install the package named psych!

To install this package, there are 2 options:

Go to the “Packages” tab in the bottom right pane of RStudio. Click “Install” and type “psych” into the box and click “Install.”
Instead, you could type install.packages("psych") into the console (be sure that the name of the package is in quotation marks!).

You only need to install the package once! However, you will need to load the package every time you want to use it. You can do this by running this code: library(psych).

Think of the code install.packages("psych") as going to the bookstore and buying a book. Until we take the book off of the shelf and open it, we can’t read it. The code library(psych) is equivalent of opening the book to read (in this metaphor).

# install.packages("psych")

library(psych)

Check again in the “Packages” tab on the bottom-right. If there is a check / tick mark next to the “psych” package, you’re good to go!

Practice

Install and load the tidyverse package

# install.packages("tidyverse")

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.4.0     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

Warning: package 'tidyr' was built under R version 4.0.5

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::%+%()   masks psych::%+%()
✖ ggplot2::alpha() masks psych::alpha()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()

Intro to tidy data

Check out our penguins data:

View(penguins)

The goal for our tidy penguins data frame is:

Only has columns for the variables penguin, species, island, bill_length_mm, bill_depth_mm, sex, year
Only has the observations for the year 2008
Remove penguins from the data frame that have NAs for any cells
Make a new variable called “bill_sum” that is the sum of bill_length and bill_depth
Make species into a factor
Rename the levels of the variable sex from “female” to “f” and “male” to “m”

This is what we want our data to look like after we do these processing steps:

penguins_cleaned <- read.csv("../penguins_cleaned.csv")

View(penguins_cleaned)

Let’s use the tidyverse package to accomplish this and turn our data into tidy data!

Tidyverse

One of the key features of the tidyverse is the pipe operator %>%.

The pipe operator allows you to string together many functions on the same data frame. You make a workﬂow of tasks that you perform sequentially on a data frame.

Check out the session 3 recording for a more in-depth discussion about this from Emily.

Here are the functions in the tidyverse that will allow us to adjust our data so that it becomes tidy and matches the penguins_cleaned dataset.

`select()`

With this function, we can select only our columns of interest: “penguin”, “species”, “island”, “bill_length_mm”, “bill_depth_mm”, “sex”, “year.” Here, we are making changes to the columns.

penguins_selected <- penguins %>% 
  select(penguin, species, island, bill_length_mm, bill_depth_mm, sex, year)

We’re saving our smaller dataset that contains only the columns we want into a dataframe called penguins_selected.

Next, we tell R which dataframe we want to use to create the smaller dataset. For us, that is the penguins dataframe.

We then include the pipe operator, which tells R to expect more from us! We want to add additional functions, specifically, the select() function.

Within select(), we list the columns that we want to select. We don’t need to put quotation marks around them, because we already told R which dataframe we’re using! For that same reason, we also don’t need to use the $ operator.

We can also use select() to remove columns:

penguins_selected2 <- penguins %>% 
  select(-flipper_length_mm, -body_mass_g)

Notice the - before each variable, or column, that we want to remove.

Both of these uses of select() result in the SAME end product!

Learn more functionality of select() here (You can even use select to rename columns!)

`filter()`

We can use filter to “subset” the data to retain just some of the observations.

This is a change to the rows! We are only keeping rows that meet a certain criteria: In this case, we are only keeping observations (rows) that are part of the group that was measured in 2008.

penguins_filtered <- penguins %>% 
  filter(year == 2008)

The == operator is a logical operator. This is notably different from = which equates two things (says x is equivalent to 2, for example). In contrast, == says “if TRUE”.

The code above tells R to filter() within the “year” column if the observation is “2008”.

Logical operators

For more information about all the logical operators you can use in filter, see this site, the Relational and Logical Operators sections. For a quick preview:

== equals, != not equals
> greater than, >= greater than or equal to
< less than, <= less than or equal to
& AND
| OR

Here is another example of how logic could be used with the filter() function:

If you wanted to create a data frame with all penguins from 2007 or 2008 but NOT those from 2009, here are two ways this might be done using logic:

penguins_filtered_no2009 <- penguins %>% 
  filter(year == 2007 | year == 2008)

I am telling R that I want to keep all penguins where the year is 2007 or the year is 2008. This means that all penguins who are not from either 2007 or 2008 are excluded. In other words, this means those from 2009 are excluded.

There’s another way to go about this:

penguins_filtered_no2009.2 <- penguins %>% filter(year != 2009)

Here, I am telling R to put in my new data frame all penguins where the year is NOT 2009. Given the levels of “year”, this means those from 2007 or 2008 are included and those from 2009 are excluded.

These two produce identical data frames with penguins only from 2007 and 2008.

There are a lot more ways you can use filter! Learn more functionality of filter here

`mutate()`

mutate() allows us to make new columns and retain the old ones! We can use this in a lot of different ways:

Let’s make a new column called “bill_sum” that adds the two bill columns:

penguins_mutated <- penguins %>% 
  mutate(bill_sum = bill_length_mm + bill_depth_mm)

In our new dataframe “penguins_mutated”, our new column “bill_sum” is the sum of “bill_length_mm” and “bill_depth_mm”.

Let’s also make the variable species into a factor:

penguins_mutated2 <- penguins %>% 
  mutate(species = factor(species))

Next, we’ll change the values of the cells in the sex column

penguins_mutated3 <- penguins %>% 
  mutate(sex_recoded = case_when(sex == "female" ~ "f", sex == "male" ~ "m"))

Here, we’re using a function called case_when(). Within the function, we tell R that we want to use an existing variable / column called “sex” to create a new variable called “sex_recoded.” As for the values of the new variable (“sex_recoded”), we want to use the existing values in “sex.” Specifically, we tell R to assign “f” to the new column (“sex_recoded”) for every observation of “female” in the “sex” column, and to assign “m” for every observation of “male” in the “sex” column.

In words rather than code, we’re telling R: “when”sex” is female, make “sex_recoded” f, and when “sex” is male, make “sex_recoded” m.

BONUS: Managing NAs:

1. Recoding NAs

Sometimes you will load data sets, and NA’s will be coded in different ways, such as -99 or “.”

You can use mutate() and the na_if() commands to recode these values into NAs to make sure R can recognize them.

THIS IS NOT THE CASE, but theoretically, if the column “body_mass_g” had -99 for its NA values, you could write your code something like this: penguins.new2 <- penguins %>% mutate(body_mass_g.NA = na_if(body_mass_g, -99))

“body_mass_g” is the variable we are looking at -99 is the values of this vector that will be replaced with NA in the new vector “body_mass_g.NA”.

2. Removing NAs

We can also use the function drop_na() to remove any rows that have an NA in any cell. We’ll utilize that in the next step.

As we’ve made these changes, we’ve created a new dataframe every time. Luckily for us, we can use the pipe operator (%>%) to link these functions together and make the changes all at once!

Making one new dataframe!

penguins_tidy <- penguins %>% 
  select(penguin, species, island, bill_length_mm, bill_depth_mm, sex, year) %>% 
  filter(year == 2008) %>% 
  mutate(bill_sum = bill_length_mm + bill_depth_mm,
         species = factor(species),
         sex_recoded = case_when(sex == "female" ~ "f", sex == "male" ~ "m")) %>% 
  drop_na()

As you write your code, test it out along the way! As long as you highlight up to (before, but not including) a pipe operator (%>%), the code will run and you can make sure you don’t have any errors. You can also run the code as you add each new function just to double-check.

To test code as you go, remember that you need to include the first line that says penguins_tidy <- penguins %>% because this is how R knows which dataframe to use. Remember that with pipes, we don’t need to indicate the dataframe to use in each line of code, which is a HUGE time-saver. But, that means that if you try to run only the line of code that starts with mutate(...), for example, you will get an error.

Make sure that your pipe operator is at the END of the line, not at the beginning. The pipe operator tells R to continue to read the next line as part of your “pipe”. When the line of code ends without a pipe, R thinks that you’re done. This is why highlighting everything BUT the %>% allows you to run all preceding code.

Save tidy data as csv

Let’s save a separate csv file with our cleaned data frame so that we don’t need to go through the “tidying” process every time we want to work with the penguins data:

write.csv(penguins_tidy, file = "penguins_cleaned.csv", row.names = FALSE)

We first name the dataframe we want to save (penguins_tidy), then indicate what we want the file to be saved as (“penguins_cleaned.csv”). row.names = FALSE tells R not to save the row numbers that it assigns as a column in the csv (View(penguins_tidy) and look on the far left to see what I’m talking about). Your tidy new csv file will automatically be saved to the same folder that your working directory is set to.

Independent practice

(1)

Load the “covid_attitudes_2023” data. Name this data frame “covid_attitudes”.

covid_attitudes <- read.csv("../covid_attitudes_2023.csv")

If you can, try to do all the ones below as a single call (e.g., only creating one new data frame).

I will list the code to do each portion below the question, but we’ll run the code altogether at the end to save everything into one dataframe.

covid_attitudes_clean <- covid_attitudes %>%

(2)

Remove the variable “Q6.consent”.

select(-Q6.consent) %>%

(3)

Keep only observations from large cities.

filter(Q84.community == "large city") %>%

(4)

Drop NAs.

drop_na() %>%

(5)

Add a new variable called “apprehension_score” that is a composite score of Q18, Q20, and Q21 (e.g., is an average of these three values)

As with most things in R, there are several ways to go about this:

mutate(apprehension_score = rowMeans(across(c(Q18.likely_to_catch_covid, Q20.ability_to_protect_self, Q21.expected_symptom_severity))) %>%

The function rowMeans() finds the mean across multiple columns and within one row. You also need the across() function within that and then c() (concatenate) to give it multiple columns to go across.

Using rowMeans(), you could also use the following code to accomplish this: mutate(apprehension_score = rowMeans(cbind(Q18.likely_to_catch_covid, Q20.ability_to_protect_self, Q21.expected_symptom_severity), na.rm=TRUE))

rowMeans() expects the input to be a dataframe, and the cbind() function fulfills that function also.

Alternatively, you could add together values from the three columns and divide by three: mutate(apprehension_score = (Q18.likely_to_catch_covid + Q20.ability_to_protect_self + Q21.expected_symptom_severity) / 3)) %>%

Choose whichever is easiest for you!

(6)

Recode the levels of a variable that has TRUE and FALSE to 1 for TRUE and 0 for FALSE, and make this variable into a factor.

mutate(Q13_1.trust_doctor_news = case_when(Q13.trust_doctor_news ==TRUE ~ 1, Q13.trust_doctor_news == FALSE ~ 0)) %>%

Since the values of this column are TRUE and FALSE, you can use them directly in case_when() and don’t need to wrap them in quotation marks! It is because these columns are logical, NOT characters!

You could also use the NOT logical operator ! to accomplish this, like so: mutate(Q13_1.trust_doctor_news = case_when(Q13.trust_doctor_news ~ 1, !Q13.trust_doctor_news ~ 0)) %>%

(7)

We want to recode data from one of the likert scales!

We want to convert “Q9.covid_knowledge” into numeric responses:

“Nothing at all” should be 1
“A little” should be 2
“A moderate amount” should be 3
“A lot” should be 4.

Save this as a new column titled “Q9.covid_knowledge_num”.

mutate(Q9.covid_knowledge_num = case_when(Q9.covid_knowledge == "Nothing at all" ~ 1, Q9.covid_knowledge == "A little" ~ 2, Q9.covid_knowledge == "A moderate amount" ~ 3, Q9.covid_knowledge == "A lot" ~ 4))

This one was a bit tricky. First, let’s take a look at the levels of this variable. Run summary(factor(covid_attitudes$Q9.covid_knowledge)) in your console to see the exact names of each level of the factor.

Be careful with how you write the responses, because they must exactly match what is already in the dataframe! How we write out these levels needs to be the same exact spacing/capitalization for R to recognize it!!

Notice that we didn’t end this code with a pipe operator. Why? Because this is the last step! We don’t want to tell R to expect more code from us in this pipe, because we’re all done. Check out the code below to see it all put together.

Now, let’s do it all at once:

covid_attitudes_tidy <- covid_attitudes %>% 
  select(-Q6.consent) %>% 
  filter(Q84.community == "large city") %>%
  drop_na() %>%
  mutate(apprehension_score = rowMeans(across(c(
    Q18.likely_to_catch_covid,
    Q20.ability_to_protect_self,
    Q21.expected_symptom_severity))),
    Q13_1.trust_doctor_news = case_when(
      Q13.trust_doctor_news ~ 1,
      !Q13.trust_doctor_news ~ 0),
    Q9.covid_knowledge_num = case_when(
      Q9.covid_knowledge == "Nothing at all" ~ 1,
      Q9.covid_knowledge == "A little" ~ 2,
      Q9.covid_knowledge == "A moderate amount" ~ 3,
      Q9.covid_knowledge == "A lot" ~ 4))

(8)

Save your data frame:

Hint: Your code should look something like this: write.csv(your_df, file= "add-your-file-name-here.csv", row.names = FALSE)

write.csv(covid_attitudes_tidy, "covid_attitudes_tidy.csv", row.names = FALSE)

Challenge questions – if you have extra time

covid_attitudes_challenge <- covid_attitudes %>%

(1)

Remove the variable Q6.consent

select(-Q6.consent) %>%

(2)

Keep observations from large cities OR suburbs

(Hint: think back to the logical operators we learned earlier today….)

filter(Q84.community == "large city" | Q84.community == "suburb")

Just for fun, let’s do it all at once!

covid_attitudes_challenge <- covid_attitudes %>% 
  select(-Q6.consent) %>%
  filter(Q84.community == "large city" | Q84.community == "suburb")

Bonus: summary statistics

We can use summarise() to get summary statistics for our variables.

Run the following code to see how it works:

attitudes_summary <- covid_attitudes_tidy %>%
  summarise(mean(apprehension_score))

Try this yourself but now get the standard deviation instead of the mean.

Hint: you can use ?summarise if you’re stuck, and/or look up how to do standard deviation in R.

attitudes_summary <- covid_attitudes_tidy %>%
  summarise(sd(apprehension_score))

Now let’s look at the average apprehension score for each age group using another powerful tidyverse command: group_by(). Run the following code:

attitudes_summary <- covid_attitudes_tidy %>%
  # group by age
  group_by(Q40.age) %>%
  # get mean apprehension score for each age group
  summarise(apprehension_mean = mean(apprehension_score)) %>%
  # it's important to ungroup!
  ungroup()

View the new variable in the “attitudes_summary” dataframe. What did it give you?
Try grouping by another variable of interest (or more than one). Don’t forget to ungroup() after you’re done !

Note: Grouping doesn’t alter your data frame, it just changes how it’s listed and how it interacts with the other commands.

Check out the tidy cheat sheet for more tidyverse and data wrangling functionality!