R Bootcamp Session 2 Key

Author

Sierra Semko Krouse & Emily Rosenthal

Published

June 21, 2023

Warm-up

Create three different vectors:

  1. A vector called “names” with three names
names <- c("Sarah", "Robert", "Sally")
  2. A vector named “ages” with three ages of college students
ages <- c(18, 20, 21)
  3. A vector called “year” with three years of college (e.g., Freshman, Sophomore, etc.)
year <- c("freshman", "junior", "senior")
year.factor <- factor(year, 
               levels = c("freshman", "sophomore", "junior", "senior"))

We made a new vector called “year.factor” that is the factorized version of the vector “year”. We make a factor from “year” because there are four possible options/bins for year: freshman, sophomore, junior, or senior. Although we only OBSERVED three of these in the data, all four are possible (e.g., sophomores still exist even if we didn’t get data from one of them).

The difference between our two year vectors is evident when we ask R to print out the contents of each. Below, we can see that “year.factor” has levels, which is the tell-tale sign of a factor:

year.factor
[1] freshman junior   senior  
Levels: freshman sophomore junior senior

whereas the “year” vector is merely a character vector (a vector composed of character items):

year
[1] "freshman" "junior"   "senior"  
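To see the unobserved level directly, here is a minimal sketch that rebuilds the two vectors from the warm-up and counts how many observations fall into each level. It only uses the base R functions class(), levels(), and table(); the name year.counts is just for illustration.

```r
# Rebuild the two vectors from the warm-up
year <- c("freshman", "junior", "senior")
year.factor <- factor(year,
                      levels = c("freshman", "sophomore", "junior", "senior"))

class(year)          # "character"
class(year.factor)   # "factor"

# All four possible levels are stored, even the one we never observed
levels(year.factor)

# Counting per level shows sophomore with a count of 0
year.counts <- table(year.factor)
year.counts
```

The zero count for sophomore is exactly the information a plain character vector cannot carry.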

NOTE: You could do this in one step if you wanted to:

year <- factor(c("freshman", "junior", "senior"),
               levels = c("freshman", "sophomore", "junior", "senior"))

Run the code and check that everything looks correct in the global environment.

Dataframes

Let’s make a dataframe called “students” with our new vectors:

students <- data.frame(names, ages, year)

If you get an error that says something like “arguments imply differing number of rows:”, check that each of the vectors you made has the same number of items – e.g., that you have a list of EXACTLY three names, three ages, and three years. You can’t make a dataframe from vectors of varying lengths.
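If you want to see this error on purpose, the sketch below builds a data frame from vectors of different lengths and captures the message with tryCatch() so the script keeps running. The vector ages_short is a made-up example that is one age short.

```r
names <- c("Sarah", "Robert", "Sally")
ages_short <- c(18, 20)   # hypothetical vector that is missing one age

# Compare the lengths before building the data frame
length(names)
length(ages_short)

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  {
    data.frame(names, ages_short)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

Checking length() on each vector first is usually the quickest way to find which one is off.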

We can look at our dataframe in a new tab:

View(students)

We can also look at our data frame in the global environment. We see that each of our vectors became columns, and R is smart enough to make our column names the names of the vectors!

We can use ‘$’ to examine individual columns within our larger dataset:

students$names
[1] "Sarah"  "Robert" "Sally" 
students$ages
[1] 18 20 21
students$year
[1] freshman junior   senior  
Levels: freshman sophomore junior senior

If we run these lines of code, we see that each column is a vector!
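As a quick check, the sketch below rebuilds the students data frame and asks R what type each column is. The stringsAsFactors = FALSE argument is added here only so the result is the same on older versions of R; mean_age is an illustrative name.

```r
students <- data.frame(
  names = c("Sarah", "Robert", "Sally"),
  ages  = c(18, 20, 21),
  year  = factor(c("freshman", "junior", "senior"),
                 levels = c("freshman", "sophomore", "junior", "senior")),
  stringsAsFactors = FALSE   # keep names as character on older R versions too
)

class(students$names)   # "character"
class(students$ages)    # "numeric"
class(students$year)    # "factor"

# Because a column is a vector, vector functions work on it directly
mean_age <- mean(students$ages)
length(students$names)
```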

Normally, though, we don’t create our own data frames by hand. Instead, we have collected data stored in a file that we want to read in and work with!

But first, before we read in any data, we need to understand the structure of our file directory and understand where our R script “lives” (i.e., which folder it is in), where our data file lives, and where our R script is pointing to.

Working Directory

The working directory is the file folder that R is currently working from. The working directory is also where R will look for any new data file, and where R will default to saving any new files.

There is a command to check where the working directory is currently set:

getwd()
[1] "/Users/sierrasemko/Desktop/Bootcamp_2023/session2"

You can also see what the working directory is by looking at the bar below the tabs that say “Console,” “Terminal,” and “Jobs” and above the console panel.

If your data file is in your current working directory (which often means it is in the same folder as the R script you are working with), you can just write read.csv("name-of-file.csv"). You don’t need to give R any other “address” information about the file because R is already in the right place to look for it!

Some ways to set your working directory:

    1. [When R is closed] - Open your script from the file/library on your computer
    2. Point and click in R Studio – At the top of your screen, go to: Session >> Set working directory >> Choose directory >> select a folder
    3. Use the setwd() function – You would put the absolute file path here. On PC, this would look like:

setwd("C:\\Users\\Emily\\Documents\\R_Bootcamp\\2023_Summer\\Session2")

On Mac, this would look like:

setwd("/Users/sierrasemko/Desktop/Bootcamp_2023/session2")

Note: If you have a PC, you need either two backslashes (\\) or a single forward slash (/) between folder names. This will differ for each person - your file paths will be different than mine! Using an absolute path is tricky if you ever rename or move your files or folders.

Regardless of how you set your working directory, just be careful and make sure you always know where your R script is located, where your data file is located, and where the working directory is.
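One way to stay oriented is to ask R directly where it is and what it can see. The sketch below is only an illustration: it creates a throwaway folder inside R's temporary directory (the names session2_demo and demo.csv are made up), points the working directory there, and then puts everything back.

```r
old_wd <- getwd()                                    # remember where we started

demo_dir <- file.path(tempdir(), "session2_demo")    # hypothetical folder
dir.create(demo_dir, showWarnings = FALSE)

setwd(demo_dir)                                      # move the working directory
getwd()                                              # confirm where R is now

# Save a tiny file; with no folder in the name, it lands in the working directory
write.csv(data.frame(x = 1:3), "demo.csv", row.names = FALSE)

list.files()                                         # what R can see from here
file.exists("demo.csv")                              # TRUE

setwd(old_wd)                                        # go back to where we started
```

With your own project, getwd(), list.files(), and file.exists() are the three calls worth running whenever a file "can't be found".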

If this is a bit confusing, check out this linked video we made on setting your working directory.

Note that you will need to set your working directory EVERY TIME you open an R script!

Load Data

Let’s read in our data and save it in a dataframe called “covid_attitudes”. There are two ways to go about loading a dataset into R Studio:

Option 1) If your data “lives” one level above your working directory (e.g., if your working directory is set to the Session 2 folder, but your data file ‘lives’ one level up in the R Bootcamp 2023 folder).

This would be the case if you have a Bootcamp_2023 folder where the data lives, but within that folder you have subfolders for each workshop session. If your working directory is set to the Session 2 subfolder, for example, you need to indicate to R to look at the folder one level higher to find the data file (e.g., the script / working directory is set to the classroom, but you need it to look in the hallway for your data file if we want to use the analogy we discussed earlier).

Indicate this to R by putting ../ right before the name of the data file, which says, “look one folder higher”:

covid_attitudes <- read.csv("../covid_attitudes_2023.csv")

Option 2) If your working directory is the folder where your data file lives (e.g., it is set directly to your R Bootcamp 2023 folder):

# covid_attitudes <- read.csv("covid_attitudes_2023.csv")

You will only need to do one of these options, and it will all depend on a) where your working directory is currently set, and b) where the data file that you want to read into R is currently stored.
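To make the “look one folder higher” idea concrete, here is a small self-contained sketch. It builds a pretend Bootcamp folder with a session2 subfolder inside R's temporary directory; the folder names and demo_data.csv are invented for the example and are not the workshop files.

```r
old_wd <- getwd()

# Hypothetical layout: Bootcamp_demo/ holds the data, session2/ is the subfolder
parent <- file.path(tempdir(), "Bootcamp_demo")
child  <- file.path(parent, "session2")
dir.create(child, recursive = TRUE, showWarnings = FALSE)

write.csv(data.frame(id = 1:2, answer = c("yes", "no")),
          file.path(parent, "demo_data.csv"), row.names = FALSE)

setwd(child)                           # working directory = the subfolder

in_wd  <- file.exists("demo_data.csv")      # FALSE: not in the working directory
one_up <- file.exists("../demo_data.csv")   # TRUE: it lives one level up

demo <- read.csv("../demo_data.csv")   # same pattern as Option 1

setwd(old_wd)
```

Running file.exists() with and without the ../ is a cheap way to decide which option applies to you.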

The covid_attitudes data frame should now appear in your global environment!

Note: If my CSV file was saved in the same folder as my working directory, I would read it in with this code: covid_attitudes <- read.csv("covid_attitudes_2023.csv"), without the “../”

Bonus/Extra – Reading in SPSS (.sav) Data Files

Some labs have data files saved in SPSS, and they will appear in your folder as .sav files.

We didn’t cover this during the recording, but this might be helpful for you if this applies to your lab!

  1. Install the haven package
# install.packages('haven')

You only need to run this line of code once. After that, this package will be installed (make sure you take out the # before you run it the first time, but no need to run it again once it’s been installed).

  2. Load the package!
# library(haven)

You need to run this every time you run code that loads “.sav” files, and it should be run BEFORE loading in your data.

  3. Make a data frame with your data!
# data_frame_name <- read_sav("data_file_name.sav")

Setting Options

Older versions of R (before 4.0.0) would automatically turn character columns into factors when reading in data, which is not what we want.

So, let’s tell R not to make this assumption when it loads in data!

options(stringsAsFactors = FALSE)
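The same choice can also be made for a single read.csv() call through its stringsAsFactors argument. The sketch below reads a tiny made-up CSV from a text string (not the workshop data) both ways so you can compare the resulting column types.

```r
# A small made-up CSV, supplied as text so no file is needed
csv_text <- "name,year\nSarah,freshman\nRobert,junior\nSally,senior"

as_character <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor    <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_character$year)   # "character"
class(as_factor$year)      # "factor"
```

Reading columns in as characters and converting only the ones you choose with factor() keeps you in control of which variables are treated as categories.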

Examining Data

If you add a question mark before a command, R will open its help page, which explains what it does (look in the bottom right corner of your screen!).

For instance, what do these do?:

?ncol 
?nrow
?str
?summary

Remember: if you want to call a specific vector (column) from your data frame, use the $ operator!

e.g., covid_attitudes$Q74.education – this calls the Q74.education column from our covid_attitudes data frame

On Your Own

(1)

(a)

Use the summary() function to learn about the covid_attitudes dataframe. This should look like: summary(covid_attitudes).

summary(covid_attitudes)
     sub_ID         Q6.consent Q8.covid.info      Q9.covid_knowledge
 Min.   :   1.0   Min.   :1    Length:1020        Length:1020       
 1st Qu.: 255.8   1st Qu.:1    Class :character   Class :character  
 Median : 510.5   Median :1    Mode  :character   Mode  :character  
 Mean   : 510.5   Mean   :1                                         
 3rd Qu.: 765.2   3rd Qu.:1                                         
 Max.   :1020.0   Max.   :1                                         
                  NA's   :47                                        
 Q10.rank_attention_to_news Q13.trust_doctor_news Q13.trust_hospital_news
 Min.   : 0.000             Mode :logical         Mode :logical          
 1st Qu.: 5.000             FALSE:895             FALSE:939              
 Median : 7.000             TRUE :125             TRUE :81               
 Mean   : 6.792                                                          
 3rd Qu.: 8.000                                                          
 Max.   :10.000                                                          
 NA's   :120                                                             
 Q13.trust_cdc_news Q13.trust_government_news Q13.trust_none_of_above
 Mode :logical      Mode :logical             Length:1020            
 FALSE:369          FALSE:937                 Class :character       
 TRUE :651          TRUE :83                  Mode  :character       
                                                                     
                                                                     
                                                                     
                                                                     
 Q14.confidence_us_government Q16.Belief_scientists_understand_covid
 Min.   : 0.000               Min.   : 0.000                        
 1st Qu.: 2.000               1st Qu.: 5.000                        
 Median : 4.000               Median : 6.000                        
 Mean   : 3.968               Mean   : 5.961                        
 3rd Qu.: 6.000               3rd Qu.: 7.000                        
 Max.   :10.000               Max.   :10.000                        
 NA's   :141                  NA's   :137                           
 Q17.concerned_outbreak Q18.likely_to_catch_covid Q20.ability_to_protect_self
 Length:1020            Min.   :  0.00            Min.   :  0.00             
 Class :character       1st Qu.: 20.00            1st Qu.: 61.00             
 Mode  :character       Median : 41.00            Median : 77.00             
                        Mean   : 42.16            Mean   : 73.53             
                        3rd Qu.: 60.00            3rd Qu.: 90.00             
                        Max.   :100.00            Max.   :100.00             
                        NA's   :153               NA's   :150                
 Q21.expected_symptom_severity Q101.confidence_in_authority
 Min.   :1.000                 Min.   :  0.00              
 1st Qu.:2.000                 1st Qu.: 29.00              
 Median :3.000                 Median : 47.00              
 Mean   :2.857                 Mean   : 45.15              
 3rd Qu.:3.000                 3rd Qu.: 62.00              
 Max.   :6.000                 Max.   :100.00              
 NA's   :154                   NA's   :157                 
 Q23.risk_exaggerated Q27.wearing_mask   Q35.take_vaccine.    Q40.age         
 Length:1020          Length:1020        Length:1020        Length:1020       
 Class :character     Class :character   Class :character   Class :character  
 Mode  :character     Mode  :character   Mode  :character   Mode  :character  
                                                                              
                                                                              
                                                                              
                                                                              
  Q41.gender        Q74.education      Q84.community      Q43.health.1.5.
 Length:1020        Length:1020        Length:1020        Min.   :1.000  
 Class :character   Class :character   Class :character   1st Qu.:3.000  
 Mode  :character   Mode  :character   Mode  :character   Median :4.000  
                                                          Mean   :3.552  
                                                          3rd Qu.:4.000  
                                                          Max.   :5.000  
                                                          NA's   :190    
 Q39.flu.vaccine    Q44.contact_covid  Q85.pre.exisiting_conditions
 Length:1020        Length:1020        Length:1020                 
 Class :character   Class :character   Class :character            
 Mode  :character   Mode  :character   Mode  :character            
                                                                   
                                                                   
                                                                   
                                                                   
 Q92.friends.family_preexisiting_conditions
 Length:1020                               
 Class :character                          
 Mode  :character                          
                                           
                                           
                                           
                                           

When called on a data frame, this command gives you summary information for each column.

  • For numeric columns, it gives you mean, median, and quartile information.
  • For character columns, it does not give you much information, just the length and type.
  • For factors, it gives much more information: it shows all the levels (i.e., sub-groups), and also how many observations are in each group.
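The three cases can be compared side by side on a tiny made-up data frame. The name toy and its columns are invented for this sketch; they are not part of covid_attitudes.

```r
toy <- data.frame(
  score     = c(1, 5, NA, 8),
  community = c("suburb", "suburb", NA, "large city"),
  stringsAsFactors = FALSE
)

summary(toy$score)       # numeric: min, quartiles, mean, max, and NA count
summary(toy$community)   # character: only length, class, and mode

# As a factor: one count per level, plus the number of NAs
community_summary <- summary(factor(toy$community))
community_summary
```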

(b)

Now use the str() function in the same way. What does it do? What do you see? What types of data do we have in our data frame?

str(covid_attitudes)
'data.frame':   1020 obs. of  29 variables:
 $ sub_ID                                    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Q6.consent                                : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Q8.covid.info                             : chr  "A moderate amount" "A lot" "A lot" "A moderate amount" ...
 $ Q9.covid_knowledge                        : chr  NA "A little" "A moderate amount" "A moderate amount" ...
 $ Q10.rank_attention_to_news                : int  3 6 8 4 NA 5 8 9 9 NA ...
 $ Q13.trust_doctor_news                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Q13.trust_hospital_news                   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Q13.trust_cdc_news                        : logi  TRUE TRUE TRUE TRUE FALSE TRUE ...
 $ Q13.trust_government_news                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Q13.trust_none_of_above                   : chr  "FALSE" "FALSE" "FALSE" "FALSE" ...
 $ Q14.confidence_us_government              : int  NA 6 5 4 NA 3 5 3 8 NA ...
 $ Q16.Belief_scientists_understand_covid    : int  NA 5 9 3 NA 4 6 8 10 NA ...
 $ Q17.concerned_outbreak                    : chr  "A moderate amount" "A little" "A moderate amount" NA ...
 $ Q18.likely_to_catch_covid                 : int  NA 29 50 9 NA 22 40 69 85 NA ...
 $ Q20.ability_to_protect_self               : int  NA 50 70 49 NA 81 61 70 100 NA ...
 $ Q21.expected_symptom_severity             : int  NA 4 2 3 NA 3 3 3 2 NA ...
 $ Q101.confidence_in_authority              : int  NA 46 40 30 NA 50 60 40 80 NA ...
 $ Q23.risk_exaggerated                      : chr  NA "somewhat disagree" NA "strongly disagree" ...
 $ Q27.wearing_mask                          : chr  NA "sick people" "sick people" "everyone" ...
 $ Q35.take_vaccine.                         : chr  NA "definitely" "probably" "definitely" ...
 $ Q40.age                                   : chr  NA "30-34" "25-29" "50-54" ...
 $ Q41.gender                                : chr  NA "f" "f" "m" ...
 $ Q74.education                             : chr  NA "doctorate" "professional degree" "doctorate" ...
 $ Q84.community                             : chr  NA "large city" "ruralArea" "small city/town" ...
 $ Q43.health.1.5.                           : int  3 3 3 3 NA 4 3 5 3 NA ...
 $ Q39.flu.vaccine                           : chr  "FALSE" "TRUE" "FALSE" "TRUE" ...
 $ Q44.contact_covid                         : chr  "no" "unsure" "no" "no" ...
 $ Q85.pre.exisiting_conditions              : chr  "prefer not to say" "no" "no" "no" ...
 $ Q92.friends.family_preexisiting_conditions: chr  "no" "no" "no" "yes" ...

str stands for “structure”. This function gives you the structure of the data frame, which is similar to the information that the global environment gives you. It shows you all of the columns and what type they are. It also gives you a sample of the data in each column.

This data frame has variables that are integers (a type of numeric variable), characters, and logicals.
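If you would rather have R tally the column types than read them off the str() output, one possible approach is sapply() with class(). The sketch uses a small invented data frame called toy; swapping in covid_attitudes should work the same way.

```r
toy <- data.frame(
  sub_ID = 1:4,
  info   = c("A lot", "A little", NA, "A lot"),
  trusts = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Apply class() to every column; the result is a named character vector
col_types <- sapply(toy, class)
col_types

# Count how many columns there are of each type
table(col_types)
```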

(2)

Use the ncol() function. How many columns are in our data?

ncol(covid_attitudes) # this function gives us the number of columns in the data frame
[1] 29

There are 29 columns.

Note: If you check the output of your str() function or in the global environment, this is equal to the number of variables.

(3)

Use the nrow() function. How many rows are in our data?

nrow(covid_attitudes) # this function gives us the number of rows in our data frame
[1] 1020

There are 1020 rows.

Note: If you check the output of your str() function or in the global environment, this is equal to the number of obs. (which stands for observations)

Also Note: This does not count the row with the column names. R knows that those are the names for the variables/columns.
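Here is a small sketch confirming that the header row is not counted. It reads a made-up three-row CSV from a text string and also shows dim(), a base R function that returns the rows and columns together.

```r
# Four lines of text: one header line plus three rows of data
toy <- read.csv(text = "id,answer\n1,yes\n2,no\n3,yes")

nrow(toy)   # 3, the header line is not counted
ncol(toy)   # 2
dim(toy)    # rows first, then columns
```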

(4)

(a)

Some of these columns should be factors! Let’s turn the education column into a factor. Use the command factor(covid_attitudes$Q74.education) but be sure to save the factor as a column in your dataframe:

covid_attitudes$Q74.education.factor <- factor(covid_attitudes$Q74.education)

When we create a factor, we want to make sure we are creating a new column within our data frame where that can live. To make a new column within our data frame, we have the name of our data frame, our $, and then the name we are making for this new column.

(b)

Pick some other column you think should be a factor and turn it into a factor:

covid_attitudes$Q8.covid.info.factor <- factor(covid_attitudes$Q8.covid.info)
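By default, factor() sorts the levels alphabetically. If the categories have a natural order, you can spell it out with the levels argument, just as in the warm-up. The sketch below uses a made-up toy data frame with the three answer options from Q8; the column names info.factor and info.ordered are hypothetical.

```r
toy <- data.frame(
  info = c("A lot", "A little", NA, "A moderate amount", "A lot"),
  stringsAsFactors = FALSE
)

# Default: levels are sorted alphabetically
toy$info.factor <- factor(toy$info)
levels(toy$info.factor)

# Explicit levels: ordered from least to most information
toy$info.ordered <- factor(
  toy$info,
  levels = c("A little", "A moderate amount", "A lot")
)
levels(toy$info.ordered)

# The original character column is still there, untouched
names(toy)
```

The level order only changes how summaries and plots are arranged; the underlying answers stay the same.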

(5)

Now use the summary() command to run a summary on just the column with your new factor. Does the description change from how it was before? What does it look like now?

summary(covid_attitudes$Q74.education.factor)
       4 year degree        2 year degree            doctorate 
                 159                   76                  112 
 highschool graduate less than highschool  professional degree 
                  91                    7                   86 
        some college                 NA's 
                 301                  188 

Before we made it a factor, the summary() function just told us that it was a character vector and had a length of 1020 (which is the number of observations).

Now that it is a factor, we can see more information! We now see the levels and the number of observations (people) who reported each level of education (and NA).

(6)

Which columns have the most NAs? Use the summary() command to investigate!

summary(covid_attitudes)
     sub_ID         Q6.consent Q8.covid.info      Q9.covid_knowledge
 Min.   :   1.0   Min.   :1    Length:1020        Length:1020       
 1st Qu.: 255.8   1st Qu.:1    Class :character   Class :character  
 Median : 510.5   Median :1    Mode  :character   Mode  :character  
 Mean   : 510.5   Mean   :1                                         
 3rd Qu.: 765.2   3rd Qu.:1                                         
 Max.   :1020.0   Max.   :1                                         
                  NA's   :47                                        
 Q10.rank_attention_to_news Q13.trust_doctor_news Q13.trust_hospital_news
 Min.   : 0.000             Mode :logical         Mode :logical          
 1st Qu.: 5.000             FALSE:895             FALSE:939              
 Median : 7.000             TRUE :125             TRUE :81               
 Mean   : 6.792                                                          
 3rd Qu.: 8.000                                                          
 Max.   :10.000                                                          
 NA's   :120                                                             
 Q13.trust_cdc_news Q13.trust_government_news Q13.trust_none_of_above
 Mode :logical      Mode :logical             Length:1020            
 FALSE:369          FALSE:937                 Class :character       
 TRUE :651          TRUE :83                  Mode  :character       
                                                                     
                                                                     
                                                                     
                                                                     
 Q14.confidence_us_government Q16.Belief_scientists_understand_covid
 Min.   : 0.000               Min.   : 0.000                        
 1st Qu.: 2.000               1st Qu.: 5.000                        
 Median : 4.000               Median : 6.000                        
 Mean   : 3.968               Mean   : 5.961                        
 3rd Qu.: 6.000               3rd Qu.: 7.000                        
 Max.   :10.000               Max.   :10.000                        
 NA's   :141                  NA's   :137                           
 Q17.concerned_outbreak Q18.likely_to_catch_covid Q20.ability_to_protect_self
 Length:1020            Min.   :  0.00            Min.   :  0.00             
 Class :character       1st Qu.: 20.00            1st Qu.: 61.00             
 Mode  :character       Median : 41.00            Median : 77.00             
                        Mean   : 42.16            Mean   : 73.53             
                        3rd Qu.: 60.00            3rd Qu.: 90.00             
                        Max.   :100.00            Max.   :100.00             
                        NA's   :153               NA's   :150                
 Q21.expected_symptom_severity Q101.confidence_in_authority
 Min.   :1.000                 Min.   :  0.00              
 1st Qu.:2.000                 1st Qu.: 29.00              
 Median :3.000                 Median : 47.00              
 Mean   :2.857                 Mean   : 45.15              
 3rd Qu.:3.000                 3rd Qu.: 62.00              
 Max.   :6.000                 Max.   :100.00              
 NA's   :154                   NA's   :157                 
 Q23.risk_exaggerated Q27.wearing_mask   Q35.take_vaccine.    Q40.age         
 Length:1020          Length:1020        Length:1020        Length:1020       
 Class :character     Class :character   Class :character   Class :character  
 Mode  :character     Mode  :character   Mode  :character   Mode  :character  
                                                                              
                                                                              
                                                                              
                                                                              
  Q41.gender        Q74.education      Q84.community      Q43.health.1.5.
 Length:1020        Length:1020        Length:1020        Min.   :1.000  
 Class :character   Class :character   Class :character   1st Qu.:3.000  
 Mode  :character   Mode  :character   Mode  :character   Median :4.000  
                                                          Mean   :3.552  
                                                          3rd Qu.:4.000  
                                                          Max.   :5.000  
                                                          NA's   :190    
 Q39.flu.vaccine    Q44.contact_covid  Q85.pre.exisiting_conditions
 Length:1020        Length:1020        Length:1020                 
 Class :character   Class :character   Class :character            
 Mode  :character   Mode  :character   Mode  :character            
                                                                   
                                                                   
                                                                   
                                                                   
 Q92.friends.family_preexisiting_conditions          Q74.education.factor
 Length:1020                                some college       :301      
 Class :character                            4 year degree     :159      
 Mode  :character                           doctorate          :112      
                                            highschool graduate: 91      
                                            professional degree: 86      
                                            (Other)            : 83      
                                            NA's               :188      
        Q8.covid.info.factor
 A little         : 33      
 A lot            :602      
 A moderate amount:268      
 NA's             :117      
                            
                            
                            

This depends on which columns you made into factors! For example:

  • Q43 has 190 NAs! (It is a numeric variable, so summary() will show its NAs.)

  • Q84 has 188 NAs (but we only know this if we turned it into a factor!)
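If you don't want to convert columns just to count their NAs, one alternative (not used in the session) is to combine is.na() with colSums(), which works for character columns too. The data frame toy below is invented for the sketch.

```r
toy <- data.frame(
  health    = c(3, NA, 4, NA),
  community = c("suburb", NA, "large city", "suburb"),
  consent   = c(1, 1, 1, NA),
  stringsAsFactors = FALSE
)

# is.na() marks every missing cell; colSums() adds them up per column
na_counts <- colSums(is.na(toy))

# Sort so the column with the most NAs comes first
sort(na_counts, decreasing = TRUE)
```

Note that this only counts true NAs, so blank cells that were read in as empty text will not be included.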

(7)

What do you notice about how the NAs are represented in different columns? In summary()? When you View() it?

# summary(covid_attitudes)
# View(covid_attitudes)

I notice that for the column covid_attitudes$Q13.trust_none_of_above, missing data is indicated by a blank cell, whereas for other columns it is indicated by an italicized NA. These italicized NAs come up as “NA’s” in the summary.

Different labs format missing data differently and it is something to keep an eye out for in the future as it’s really important to make sure that it is formatted properly.
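One way to handle blank cells when reading in a file is the na.strings argument of read.csv(), which lists the text values that should be treated as missing. This is a sketch on a tiny made-up CSV, not the workshop file, so check how your own data marks missing values before relying on it.

```r
# Row 2 has a blank cell in the community column
csv_text <- "id,community\n1,suburb\n2,\n3,large city"

# Default: the blank cell in a text column is read as empty text, not NA
default_read <- read.csv(text = csv_text)
sum(is.na(default_read$community))   # 0

# Treat both empty text and the letters NA as missing
blank_as_na <- read.csv(text = csv_text, na.strings = c("", "NA"))
sum(is.na(blank_as_na$community))    # 1
```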

(8)

On average, how likely do people think they are to catch Covid-19? Hint: To answer this question, you first need to find the relevant column: Q18.likely_to_catch_covid.

Then, use summary() to find the mean:

summary(covid_attitudes$Q18.likely_to_catch_covid)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   20.00   41.00   42.16   60.00  100.00     153 

The mean is 42.16, so on average, people think they have a 42.16% chance of catching Covid.
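If you only want the mean, you can also call mean() directly, but you have to tell it to skip the NAs. The vector below is a made-up stand-in for the real column.

```r
likely <- c(0, 20, NA, 100)   # hypothetical responses, one of them missing

mean(likely)                  # NA, because one value is missing
mean(likely, na.rm = TRUE)    # 40, the mean of the three observed values

sum(is.na(likely))            # how many responses were skipped
```

summary() drops the NAs for you and reports how many there were, which is why it returned a mean above without any extra arguments.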

(9)

How many types of living communities are there? To answer this question, we first have to find the relevant column: Q84.community.

If you haven’t already, you will need to turn it into a factor:

covid_attitudes$Q84.community.factor <- factor(covid_attitudes$Q84.community)

Then, use summary() to see the levels of the factor, as well as how many people or observations are in each level:

summary(covid_attitudes$Q84.community.factor)
     large city       ruralArea small city/town          suburb            NA's 
            150              66             260             356             188 

There are 4 types of communities and NA.
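You can also ask R to count the levels for you instead of counting them in the summary() output. The sketch uses a short made-up factor; nlevels() and levels() are base R functions and never count NA as a level.

```r
community <- factor(c("suburb", "large city", NA, "suburb", "ruralArea"))

levels(community)    # the distinct community types
nlevels(community)   # 3, NA is not a level

# table() can show the NA count next to the levels if you ask for it
table(community, useNA = "ifany")
```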

(10)

If you have time: Think of another question you can ask and answer it!