R Bootcamp Session 3 Starter Code w/ Key
Set options
options(stringsAsFactors = FALSE)
Warm-up
(1)
Please sign in! Go to: https://tinyurl.com/session3-signin
(2)
Make sure you have downloaded and saved the following three files:
The COVID attitudes data “covid_attitudes_2023.csv” (you should have downloaded this last session)
The penguins data “penguins.csv” (we will use this for the warm up)
The cleaned penguins data “penguins_cleaned.csv”
Note that we are not using the cleaned penguins data right now, but we will be working with it later today.
Make sure all files are saved in the appropriate location!
(3)
Load the penguins data set (the file is named “penguins.csv”). Name it penguins.
Hint: if it’s not working, check your working directory using the getwd() command. Is it where you want it to be? If not, set your working directory!
getwd()
[1] "/Users/sierrasemko/Desktop/Bootcamp_2023/session3"
# setwd("/Users/sierrasemko/Desktop/Bootcamp_2023/session3") # replace this path with yours, if you need to set your working directory!
penguins <- read.csv("../penguins.csv") # remember that you might need to remove "../" from the file path above if your data csv is stored in the same folder as your working directory.
(4)
Look at the structure of your data using the str() command. What types of variables do you have?
str(penguins)
'data.frame': 344 obs. of 9 variables:
$ penguin : chr "p1" "p2" "p3" "p4" ...
$ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : chr "male" "female" "female" NA ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
I see character (for example, “penguin” and “species”), numeric (e.g., “bill_length_mm” and “bill_depth_mm”), and integer variables (e.g., “body_mass_g” and “year”) in this data set.
(5)
What is the mean body mass of the penguins (in grams)?
summary(penguins$body_mass_g)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2700 3550 4050 4202 4750 6300 2
The mean body mass (in grams) is 4202.
(6)
How many penguins were measured in each year?
Hint: you will need to make year a factor first.
penguins$year <- factor(penguins$year)
summary(penguins$year)
2007 2008 2009
110 114 120
There were 110 measured in 2007, 114 measured in 2008, and 120 measured in 2009.
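Why the hint matters: summary() treats numbers and factors differently. Here is a minimal sketch using a made-up year vector (hypothetical values, not the real penguins data) that also shows table() as another way to count:

```r
# Toy stand-in for the year column (hypothetical values, not the real data)
year <- c(2007, 2007, 2008, 2009, 2009, 2009)

# On a numeric vector, summary() reports quartiles and the mean, not counts
summary(year)

# On a factor, summary() reports one count per level
year_f <- factor(year)
summary(year_f)

# table() counts each distinct value without converting to a factor first
table(year)
```

Either approach works for counting; converting to a factor is also useful later when you want R to treat year as a grouping variable.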
Loading packages
We want to install the package named psych!
To install this package, there are 2 options:
Go to the “Packages” tab in the bottom right pane of RStudio. Click “Install” and type “psych” into the box and click “Install.”
Instead, you could type install.packages("psych") into the console (be sure that the name of the package is in quotation marks!).
You only need to install the package once! However, you will need to load the package every time you want to use it. You can do this by running this code: library(psych).
Think of the code install.packages("psych") as going to the bookstore and buying a book. Until we take the book off of the shelf and open it, we can’t read it. The code library(psych) is the equivalent of opening the book to read (in this metaphor).
# install.packages("psych")
library(psych)
Check again in the “Packages” tab on the bottom-right. If there is a check / tick mark next to the “psych” package, you’re good to go!
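If you would like a script to handle the "install once, load every time" idea on its own, here is one possible sketch. The helper name ensure_package() is made up for this example; requireNamespace(), install.packages(), and library() are standard R functions. The demo call uses "stats", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)                 # "buying the book" (happens once)
  }
  library(pkg, character.only = TRUE)     # "opening the book" (every session)
}

# ensure_package("psych")   # uncomment to use it with psych
ensure_package("stats")     # stats ships with R, so this only loads it
```

This is just a convenience; running install.packages() once by hand and library() at the top of each script does the same job.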
Practice
Install and load the tidyverse package
# install.packages("tidyverse")
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.4.0 ✔ purrr 0.3.4
✔ tibble 3.1.6 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.1
Warning: package 'tidyr' was built under R version 4.0.5
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::%+%() masks psych::%+%()
✖ ggplot2::alpha() masks psych::alpha()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Intro to tidy data
Check out our penguins data:
View(penguins)
The goal for our tidy penguins data frame is:
Only has columns for the variables penguin, species, island, bill_length_mm, bill_depth_mm, sex, year
Only has the observations for the year 2008
Remove penguins from the data frame that have NAs for any cells
Make a new variable called “bill_sum” that is the sum of bill_length_mm and bill_depth_mm
Make species into a factor
Rename the levels of the variable sex from “female” to “f” and “male” to “m”
This is what we want our data to look like after we do these processing steps:
penguins_cleaned <- read.csv("../penguins_cleaned.csv")
View(penguins_cleaned)
Let’s use the tidyverse package to accomplish this and turn our data into tidy data!
Tidyverse
One of the key features of the tidyverse is the pipe operator %>%.
The pipe operator allows you to string together many functions on the same data frame. You make a workflow of tasks that you perform sequentially on a data frame.
Check out the session 3 recording for a more in-depth discussion about this from Emily.
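To see what the pipe buys us, here is a small sketch on a made-up data frame (toy and its columns are hypothetical, not one of the bootcamp files). It runs the same two steps with and without the pipe:

```r
library(dplyr)  # provides %>% along with select() and filter()

# Hypothetical toy data, not the bootcamp file
toy <- data.frame(id   = c("a", "b", "c"),
                  year = c(2007, 2008, 2008),
                  mass = c(3750, 3800, 3250))

# Without the pipe: the functions nest, so you read from the inside out
nested <- select(filter(toy, year == 2008), id, mass)

# With the pipe: the same steps, read from top to bottom
piped <- toy %>%
  filter(year == 2008) %>%
  select(id, mass)

# Both approaches give the same data frame
identical(nested, piped)
```

The piped version does nothing the nested version can't; it is simply easier to read and to extend one step at a time.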
Here are the functions in the tidyverse that will allow us to adjust our data so that it becomes tidy and matches the penguins_cleaned dataset.
select()
With this function, we can select only our columns of interest: “penguin”, “species”, “island”, “bill_length_mm”, “bill_depth_mm”, “sex”, “year.” Here, we are making changes to the columns.
penguins_selected <- penguins %>%
  select(penguin, species, island, bill_length_mm, bill_depth_mm, sex, year)
We’re saving our smaller dataset that contains only the columns we want into a dataframe called penguins_selected.
Next, we tell R which dataframe we want to use to create the smaller dataset. For us, that is the penguins dataframe.
We then include the pipe operator, which tells R to expect more from us! We want to add additional functions, specifically, the select() function.
Within select(), we list the columns that we want to select. We don’t need to put quotation marks around them, because we already told R which dataframe we’re using, so select() looks for those names among its columns! For that same reason, we also don’t need to use the $ operator.
We can also use select() to remove columns:
penguins_selected2 <- penguins %>%
  select(-flipper_length_mm, -body_mass_g)
Notice the - before each variable, or column, that we want to remove.
Both of these uses of select() result in the SAME end product!
Learn more functionality of select() here (You can even use select to rename columns!)
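As a quick illustration of the renaming trick, here is a sketch on a made-up data frame (the new name id is just an example):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(penguin = c("p1", "p2"),
                  species = c("Adelie", "Gentoo"),
                  year    = c(2007, 2008))

# Writing new_name = old_name inside select() keeps the column under a new name
renamed <- toy %>%
  select(id = penguin, species)

names(renamed)
```

Note that select() still drops every column you don't list (here, year). If you only want to rename and keep everything else, dplyr's rename() is the better fit.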
filter()
We can use filter to “subset” the data to retain just some of the observations.
This is a change to the rows! We are only keeping rows that meet a certain criteria: In this case, we are only keeping observations (rows) that are part of the group that was measured in 2008.
penguins_filtered <- penguins %>%
  filter(year == 2008)
The == operator is a comparison operator: it tests whether two values are equal and returns TRUE or FALSE. This is notably different from =, which assigns a value (x = 2 stores 2 in x, for example).
The code above tells R to keep only the rows where the value in the “year” column is 2008.
Logical operators
For more information about all the logical operators you can use in filter, see this site, the Relational and Logical Operators sections. For a quick preview:
== equals, != not equals
> greater than, >= greater than or equal to
< less than, <= less than or equal to
& AND
| OR
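It can help to try these operators on a plain vector before using them inside filter(). A minimal sketch with a made-up year vector (the variable names are just for illustration):

```r
year <- c(2007, 2008, 2009)

is_2008     <- year == 2008                  # FALSE TRUE FALSE
not_2009    <- year != 2009                  # TRUE TRUE FALSE
from_2008   <- year >= 2008                  # FALSE TRUE TRUE
either_year <- year == 2007 | year == 2008   # TRUE TRUE FALSE
between     <- year > 2007 & year < 2009     # FALSE TRUE FALSE
```

Each comparison returns one TRUE or FALSE per element, and filter() keeps the rows where the result is TRUE.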
Here is another example of how logic could be used with the filter() function:
If you wanted to create a data frame with all penguins from 2007 or 2008 but NOT those from 2009, here are two ways this might be done using logic:
penguins_filtered_no2009 <- penguins %>%
  filter(year == 2007 | year == 2008)
I am telling R that I want to keep all penguins where the year is 2007 or the year is 2008. This means that all penguins who are not from either 2007 or 2008 are excluded. In other words, this means those from 2009 are excluded.
There’s another way to go about this:
penguins_filtered_no2009.2 <- penguins %>% filter(year != 2009)
Here, I am telling R to put in my new data frame all penguins where the year is NOT 2009. Given the levels of “year”, this means those from 2007 or 2008 are included and those from 2009 are excluded.
These two produce identical data frames with penguins only from 2007 and 2008.
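You don't have to take that on faith: identical() will confirm it. Here is a sketch on a made-up data frame, which also shows %in% as a third way to write the same filter:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(penguin = c("p1", "p2", "p3", "p4"),
                  year    = c(2007, 2008, 2009, 2008))

keep_0708 <- toy %>% filter(year == 2007 | year == 2008)
drop_09   <- toy %>% filter(year != 2009)

# TRUE when the two data frames match exactly
identical(keep_0708, drop_09)

# %in% is a compact alternative to chaining several == tests with |
keep_in <- toy %>% filter(year %in% c(2007, 2008))
```

The two versions only agree because 2007, 2008, and 2009 are the only years in the data. If a fourth year showed up, year != 2009 would keep it and the other two would not.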
There are a lot more ways you can use filter! Learn more functionality of filter here
mutate()
mutate() allows us to make new columns and retain the old ones! We can use this in a lot of different ways:
Let’s make a new column called “bill_sum” that adds the two bill columns:
penguins_mutated <- penguins %>%
  mutate(bill_sum = bill_length_mm + bill_depth_mm)
In our new dataframe “penguins_mutated”, our new column “bill_sum” is the sum of “bill_length_mm” and “bill_depth_mm”.
Let’s also make the variable species into a factor:
penguins_mutated2 <- penguins %>%
  mutate(species = factor(species))
Next, we’ll change the values of the cells in the sex column:
penguins_mutated3 <- penguins %>%
  mutate(sex_recoded = case_when(sex == "female" ~ "f", sex == "male" ~ "m"))
Here, we’re using a function called case_when(). Within the function, we tell R that we want to use an existing variable / column called “sex” to create a new variable called “sex_recoded.” The values of the new variable are based on the existing values in “sex”: we tell R to assign “f” to “sex_recoded” for every observation of “female” in the “sex” column, and to assign “m” for every observation of “male”.
In words rather than code, we’re telling R: “when “sex” is female, make “sex_recoded” f, and when “sex” is male, make “sex_recoded” m.”
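One thing worth knowing: any value that matches none of the conditions in case_when() ends up as NA in the new column. A small sketch on made-up data (the value "unknown" is hypothetical, added just to show this):

```r
library(dplyr)

# Hypothetical toy data with a missing value and an unexpected label
toy <- data.frame(sex = c("female", "male", NA, "unknown"))

recoded <- toy %>%
  mutate(sex_recoded = case_when(sex == "female" ~ "f",
                                 sex == "male" ~ "m"))

# Values that match no condition (NA and "unknown") become NA
recoded$sex_recoded
```

So after recoding, it is worth checking the new column for unexpected NAs, since they can also signal a typo in one of your conditions.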
BONUS: Managing NAs:
1. Recoding NAs
Sometimes you will load data sets in which missing values are coded in different ways, such as -99 or “.”
You can use mutate() and the na_if() commands to recode these values into NAs to make sure R can recognize them.
THIS IS NOT THE CASE, but theoretically, if the column “body_mass_g” had -99 for its NA values, you could write your code something like this: penguins.new2 <- penguins %>% mutate(body_mass_g.NA = na_if(body_mass_g, -99))
“body_mass_g” is the variable we are looking at, and -99 is the value in that column that will be replaced with NA in the new column “body_mass_g.NA”.
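Here is that same idea as a runnable sketch on a made-up column, so you can see why the recoding matters (the numbers are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data where -99 stands in for "missing"
toy <- data.frame(body_mass_g = c(3750, -99, 3250))

toy_na <- toy %>%
  mutate(body_mass_g.NA = na_if(body_mass_g, -99))

# The placeholder drags the mean down if we leave it in
mean_with_placeholder <- mean(toy_na$body_mass_g)

# After recoding, we can skip the missing value with na.rm = TRUE
mean_recoded <- mean(toy_na$body_mass_g.NA, na.rm = TRUE)   # 3500
```

Until the -99 is recoded, R has no way of knowing it isn't a real measurement.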
2. Removing NAs
We can also use the function drop_na() to remove any rows that have an NA in any cell. We’ll utilize that in the next step.
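A quick sketch of drop_na() on made-up data. The second call shows that you can also name specific columns, in which case only those columns are checked for NAs:

```r
library(dplyr)
library(tidyr)  # drop_na() comes from the tidyr package

# Hypothetical toy data
toy <- data.frame(penguin        = c("p1", "p2", "p3"),
                  bill_length_mm = c(39.1, NA, 40.3),
                  sex            = c("male", "female", NA))

# Drops every row with an NA anywhere: only p1 is left
complete_rows <- toy %>% drop_na()

# Drops rows with an NA in bill_length_mm only: p1 and p3 are left
bill_complete <- toy %>% drop_na(bill_length_mm)
```

Dropping on every column can remove a lot of rows, so it is worth comparing nrow() before and after.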
As we’ve made these changes, we’ve created a new dataframe every time. Luckily for us, we can use the pipe operator (%>%) to link these functions together and make the changes all at once!
Making one new dataframe!
penguins_tidy <- penguins %>%
  select(penguin, species, island, bill_length_mm, bill_depth_mm, sex, year) %>%
  filter(year == 2008) %>%
  mutate(bill_sum = bill_length_mm + bill_depth_mm,
         species = factor(species),
         sex_recoded = case_when(sex == "female" ~ "f", sex == "male" ~ "m")) %>%
  drop_na()
As you write your code, test it out along the way! As long as you highlight up to (before, but not including) a pipe operator (%>%), the code will run and you can make sure you don’t have any errors. You can also run the code as you add each new function just to double-check.
To test code as you go, remember that you need to include the first line that says penguins_tidy <- penguins %>% because this is how R knows which dataframe to use. Remember that with pipes, we don’t need to indicate the dataframe to use in each line of code, which is a HUGE time-saver. But, that means that if you try to run only the line of code that starts with mutate(...), for example, you will get an error.
Make sure that your pipe operator is at the END of the line, not at the beginning. The pipe operator tells R to continue to read the next line as part of your “pipe”. When the line of code ends without a pipe, R thinks that you’re done. This is why highlighting everything BUT the %>% allows you to run all preceding code.
Save tidy data as csv
Let’s save a separate csv file with our cleaned data frame so that we don’t need to go through the “tidying” process every time we want to work with the penguins data:
write.csv(penguins_tidy, file = "penguins_cleaned.csv", row.names = FALSE)
We first name the dataframe we want to save (penguins_tidy), then indicate what we want the file to be saved as (“penguins_cleaned.csv”). row.names = FALSE tells R not to save the row numbers that it assigns as a column in the csv (run View(penguins_tidy) and look on the far left to see what I’m talking about). Your tidy new csv file will automatically be saved to the folder that your working directory is set to.
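To see what row.names = FALSE actually changes, here is a sketch that saves a made-up data frame both ways and reads each file back in. It writes to temporary files (via tempfile()) so it won't clutter your working directory:

```r
# Hypothetical toy data
toy <- data.frame(penguin = c("p1", "p2"), bill_sum = c(57.8, 56.9))

# tempfile() gives a throwaway path, so this sketch doesn't touch your folders
with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(toy, file = with_rows)                         # default: row names are saved
write.csv(toy, file = without_rows, row.names = FALSE)

names(read.csv(with_rows))      # "X" "penguin" "bill_sum"
names(read.csv(without_rows))   # "penguin" "bill_sum"
```

That extra "X" column is the saved row numbers coming back as data, which is why we turn them off.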
Independent practice
(1)
Load the “covid_attitudes_2023” data. Name this data frame “covid_attitudes”.
covid_attitudes <- read.csv("../covid_attitudes_2023.csv")
If you can, try to do all the ones below as a single call (e.g., only creating one new data frame).
I will list the code to do each portion below the question, but we’ll run the code altogether at the end to save everything into one dataframe.
covid_attitudes_clean <- covid_attitudes %>%
(2)
Remove the variable “Q6.consent”.
select(-Q6.consent) %>%
(3)
Keep only observations from large cities.
filter(Q84.community == "large city") %>%
(4)
Drop NAs.
drop_na() %>%
(5)
Add a new variable called “apprehension_score” that is a composite score of Q18, Q20, and Q21 (e.g., is an average of these three values)
As with most things in R, there are several ways to go about this:
mutate(apprehension_score = rowMeans(across(c(Q18.likely_to_catch_covid, Q20.ability_to_protect_self, Q21.expected_symptom_severity)))) %>%
The function rowMeans() finds the mean across multiple columns within each row. You also need the across() function within that, and then c() (concatenate) to give it multiple columns to go across.
- Using rowMeans(), you could also use the following code to accomplish this: mutate(apprehension_score = rowMeans(cbind(Q18.likely_to_catch_covid, Q20.ability_to_protect_self, Q21.expected_symptom_severity), na.rm=TRUE))
rowMeans() expects its input to be a matrix or data frame, and cbind() meets that requirement by binding the three columns into a matrix.
- Alternatively, you could add together values from the three columns and divide by three:
mutate(apprehension_score = (Q18.likely_to_catch_covid + Q20.ability_to_protect_self + Q21.expected_symptom_severity) / 3) %>%
Choose whichever is easiest for you!
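If you want to convince yourself that the three approaches agree, here is a sketch on a made-up data frame. The short column names q18, q20, and q21 are stand-ins for the real survey columns:

```r
library(dplyr)

# Hypothetical stand-in columns; the real survey items have longer names
toy <- data.frame(q18 = c(1, 4, 3),
                  q20 = c(2, 5, 3),
                  q21 = c(3, 3, 3))

scored <- toy %>%
  mutate(via_across = rowMeans(across(c(q18, q20, q21))),
         via_cbind  = rowMeans(cbind(q18, q20, q21)),
         via_sum    = (q18 + q20 + q21) / 3)

scored
```

The approaches only differ when there are NAs: rowMeans(..., na.rm = TRUE) averages whatever values are present, while the add-and-divide version returns NA if any of the three is missing. Since we already ran drop_na(), that difference doesn't come up here.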
(6)
Recode the levels of a variable that has TRUE and FALSE to 1 for TRUE and 0 for FALSE, and make this variable into a factor.
mutate(Q13_1.trust_doctor_news = case_when(Q13.trust_doctor_news == TRUE ~ 1, Q13.trust_doctor_news == FALSE ~ 0)) %>%
Since the values of this column are TRUE and FALSE, you can use them directly in case_when() and don’t need to wrap them in quotation marks! That is because this column is logical, NOT character!
You could also use the NOT logical operator ! to accomplish this, like so: mutate(Q13_1.trust_doctor_news = case_when(Q13.trust_doctor_news ~ 1, !Q13.trust_doctor_news ~ 0)) %>%
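The question also asks for the recoded variable to be a factor. One possible way to add that step, sketched on a made-up logical column (the names trust, trust_num, and trust_fct are hypothetical), is to wrap the new 0/1 column in factor():

```r
library(dplyr)

# Hypothetical logical column standing in for the survey item
toy <- data.frame(trust = c(TRUE, FALSE, TRUE))

recoded <- toy %>%
  mutate(trust_num = case_when(trust ~ 1, !trust ~ 0),   # TRUE -> 1, FALSE -> 0
         trust_fct = factor(trust_num))                  # then make it a factor

levels(recoded$trust_fct)   # "0" "1"
```

This is a sketch of one option rather than the only answer; you could equally call factor() directly around the case_when().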
(7)
We want to recode data from one of the likert scales!
We want to convert “Q9.covid_knowledge” into numeric responses:
“Nothing at all” should be 1
“A little” should be 2
“A moderate amount” should be 3
“A lot” should be 4.
Save this as a new column titled “Q9.covid_knowledge_num”.
mutate(Q9.covid_knowledge_num = case_when(Q9.covid_knowledge == "Nothing at all" ~ 1, Q9.covid_knowledge == "A little" ~ 2, Q9.covid_knowledge == "A moderate amount" ~ 3, Q9.covid_knowledge == "A lot" ~ 4))
This one was a bit tricky. First, let’s take a look at the levels of this variable. Run summary(factor(covid_attitudes$Q9.covid_knowledge)) in your console to see the exact names of each level of the factor.
Be careful with how you write the responses, because they must exactly match what is already in the dataframe! How we write out these levels needs to be the same exact spacing/capitalization for R to recognize it!!
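Here is a small sketch of what goes wrong when a response doesn't match exactly. The data are made up, with a deliberately mis-capitalised "a lot":

```r
library(dplyr)

# Hypothetical responses, including a capitalisation mismatch
toy <- data.frame(knowledge = c("A lot", "a lot", "A little"))

# List the exact strings in the column before writing your conditions
unique(toy$knowledge)

recoded <- toy %>%
  mutate(knowledge_num = case_when(knowledge == "A little" ~ 2,
                                   knowledge == "A lot" ~ 4))

# "a lot" matched nothing, so it quietly became NA
recoded$knowledge_num

# Quick count of values that failed to match
n_unmatched <- sum(is.na(recoded$knowledge_num))
```

R won't throw an error for a mismatch like this, so counting the NAs in the new column is a handy sanity check.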
Notice that we didn’t end the Q9.covid_knowledge_num code with a pipe operator. Why? Because this is the last step! We don’t want to tell R to expect more code from us in this pipe, because we’re all done. Check out the code below to see it all put together.
Now, let’s do it all at once:
covid_attitudes_tidy <- covid_attitudes %>%
  select(-Q6.consent) %>%
  filter(Q84.community == "large city") %>%
  drop_na() %>%
  mutate(apprehension_score = rowMeans(across(c(
           Q18.likely_to_catch_covid,
           Q20.ability_to_protect_self,
           Q21.expected_symptom_severity))),
         Q13_1.trust_doctor_news = case_when(
           Q13.trust_doctor_news ~ 1,
           !Q13.trust_doctor_news ~ 0),
         Q9.covid_knowledge_num = case_when(
           Q9.covid_knowledge == "Nothing at all" ~ 1,
           Q9.covid_knowledge == "A little" ~ 2,
           Q9.covid_knowledge == "A moderate amount" ~ 3,
           Q9.covid_knowledge == "A lot" ~ 4))
(8)
Save your data frame:
Hint: Your code should look something like this: write.csv(your_df, file= "add-your-file-name-here.csv", row.names = FALSE)
write.csv(covid_attitudes_tidy, "covid_attitudes_tidy.csv", row.names = FALSE)
Challenge questions – if you have extra time
covid_attitudes_challenge <- covid_attitudes %>%
(1)
Remove the variable Q6.consent
select(-Q6.consent) %>%
(2)
Keep observations from large cities OR suburbs
(Hint: think back to the logical operators we learned earlier today….)
filter(Q84.community == "large city" | Q84.community == "suburb")
Just for fun, let’s do it all at once!
covid_attitudes_challenge <- covid_attitudes %>%
  select(-Q6.consent) %>%
  filter(Q84.community == "large city" | Q84.community == "suburb")
Bonus: summary statistics
We can use summarise() to get summary statistics for our variables.
Run the following code to see how it works:
attitudes_summary <- covid_attitudes_tidy %>%
  summarise(mean(apprehension_score))
Try this yourself but now get the standard deviation instead of the mean.
Hint: you can use ?summarise if you’re stuck, and/or look up how to do standard deviation in R.
attitudes_summary <- covid_attitudes_tidy %>%
  summarise(sd(apprehension_score))
Now let’s look at the average apprehension score for each age group using another powerful tidyverse command: group_by(). Run the following code:
attitudes_summary <- covid_attitudes_tidy %>%
  # group by age
  group_by(Q40.age) %>%
  # get mean apprehension score for each age group
  summarise(apprehension_mean = mean(apprehension_score)) %>%
  # it's important to ungroup!
  ungroup()
View the new variable in the “attitudes_summary” dataframe. What did it give you?
Try grouping by another variable of interest (or more than one). Don’t forget to ungroup() after you’re done!
Note: Grouping doesn’t alter the data in your data frame; it only changes how the data frame is listed and how later commands, such as summarise(), treat it.
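If you'd like a starting point for grouping by more than one variable, here is a sketch on a made-up data frame (age, community, and score are hypothetical stand-ins for the survey columns). It also shows that summarise() can compute several statistics at once:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(age       = c("18-29", "18-29", "30-44", "30-44"),
                  community = c("large city", "suburb", "large city", "large city"),
                  score     = c(2, 4, 3, 5))

# One row per age group, with several summary statistics
by_age <- toy %>%
  group_by(age) %>%
  summarise(score_mean = mean(score),
            score_sd   = sd(score),
            n          = n()) %>%
  ungroup()

# One row per combination of age and community
by_age_community <- toy %>%
  group_by(age, community) %>%
  summarise(score_mean = mean(score), .groups = "drop")  # "drop" ungroups in the same step
```

The .groups = "drop" argument is an alternative to ending the pipe with ungroup(); either way, the result is no longer grouped.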
Check out the tidy cheat sheet for more tidyverse and data wrangling functionality!