R Bootcamp Session 2 Key
Warm-up
Create three different vectors:
- A vector called “names” with three names
names <- c("Sarah", "Robert", "Sally")
- A vector called “ages” with three ages of college students
ages <- c(18, 20, 21)
- A vector called “year” with three years of college (e.g., Freshman, Sophomore, etc.)
year <- c("freshman", "junior", "senior")
year.factor <- factor(year,
                      levels = c("freshman", "sophomore", "junior", "senior"))
We made a new vector called “year.factor” that is the factorized version of the vector “year”. The reason we make a factor from “year” is because there are four possible options/bins for year - freshman, sophomore, junior, or senior. Although we only OBSERVED three of these in the data, there are four possible options (e.g., sophomores still exist even if we didn’t get data from one of them).
The difference between our two year vectors is evident when we ask R to print out the contents of each. Below, we can see that “year.factor” has levels, which is the tell-tale sign of a factor:
year.factor
[1] freshman junior senior
Levels: freshman sophomore junior senior
whereas the “year” vector is merely a character vector (a vector composed of character items):
year
[1] "freshman" "junior" "senior"
NOTE: You could do this in one step if you wanted to:
year <- factor(c("freshman", "junior", "senior"),
               levels = c("freshman", "sophomore", "junior", "senior"))
Run the code and check that everything looks correct in the global environment.
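To see what the extra level buys us, here is a small sketch using the same warm-up vectors. It compares what levels() and table() report for the factor versus the plain character vector.

```r
# Warm-up vectors: three observed years out of four possible
year <- c("freshman", "junior", "senior")
year.factor <- factor(year,
                      levels = c("freshman", "sophomore", "junior", "senior"))

levels(year.factor)   # all four possible levels, in the order we specified
nlevels(year.factor)  # 4
table(year.factor)    # sophomore is listed with a count of 0
table(year)           # the character vector only knows the three observed values
```

The factor keeps track of the unobserved “sophomore” bin, while the character vector does not know it exists.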
Dataframes
Let’s make a dataframe called “students” with our new vectors:
students <- data.frame(names, ages, year)
If you get an error that says something like “arguments imply differing number of rows:”, check that each of the vectors you made has the same number of items – e.g., that you have a list of EXACTLY three names, three ages, and three years. You can’t make a dataframe from vectors of varying lengths.
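A quick way to diagnose this error is to compare the lengths of the vectors. The sketch below deliberately triggers the error with a hypothetical vector called ages_short (not part of the warm-up) and uses tryCatch() so the script keeps running.

```r
names <- c("Sarah", "Robert", "Sally")
ages_short <- c(18, 20)   # hypothetical vector with only two ages

length(names)       # 3
length(ages_short)  # 2

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  data.frame(names, ages_short),
  error = function(e) conditionMessage(e)
)
msg
```

If the lengths printed by length() do not match, fix the shorter (or longer) vector before calling data.frame().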
We can look at our dataframe in a new tab:
View(students)
We can also look at our data frame in the global environment. We see that each of our vectors became columns, and R is smart enough to make our column names the names of the vectors!
We can use ‘$’ to examine individual columns within our larger dataset:
students$names
[1] "Sarah" "Robert" "Sally"
students$ages
[1] 18 20 21
students$year
[1] freshman junior senior
Levels: freshman sophomore junior senior
If we run these lines of code, we see that each column is a vector!
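Because each column is a vector, anything you can do with a vector also works on a column. The sketch below rebuilds the students data frame from the warm-up (passing stringsAsFactors = FALSE explicitly so it behaves the same in older versions of R) and checks each column.

```r
names <- c("Sarah", "Robert", "Sally")
ages <- c(18, 20, 21)
year <- factor(c("freshman", "junior", "senior"),
               levels = c("freshman", "sophomore", "junior", "senior"))
students <- data.frame(names, ages, year, stringsAsFactors = FALSE)

class(students$names)   # "character"
class(students$ages)    # "numeric"
class(students$year)    # "factor"
length(students$ages)   # 3, one entry per row
mean(students$ages)     # vector functions work directly on a column
```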
Normally, we don’t create our own data frames by hand though because we have collected data that we have stored in a file that we want to read in to work with!
But first, before we read in any data, we need to understand the structure of our file directory and understand where our R script “lives” (i.e., which folder it is in), where our data file lives, and where our R script is pointing to.
Working Directory
The working directory is the file folder that R is currently working from. The working directory is also where R will look for any new data file, and where R will default to saving any new files.
There is a command to check where the working directory is currently set:
getwd()
[1] "/Users/sierrasemko/Desktop/Bootcamp_2023/session2"
You can also see what the working directory is by looking at the bar below the tabs that say “Console,” “Terminal,” and “Jobs” and above the console panel.
If your data file is in your current working directory (which often means it is in the same folder as the R script you are working with), you can just write read.csv("name-of-file.csv"). You don’t need to give R any other “address” information for the file because R is already in the right place to look for it!
Some ways to set your working directory:
- [When R is closed] - Open your script from the file/library on your computer
- Point and click in R Studio – At the top of your screen, go to: Session >> Set working directory >> Choose directory >> select a folder
- Use the setwd() function – You would put the absolute file path here. On PC, this would look like:
setwd("C:\\Users\\Emily\\Documents\\R_Bootcamp\\2023_Summer\\Session2")
On Mac, this would look like:
setwd("/Users/sierrasemko/Desktop/Bootcamp_2023/session2")
Note: If you have a PC, you need either two backslashes (\\) or one forward slash (/) between folder names. This will differ for each person - your file paths will be different than mine! Using an absolute path is tricky because it stops working if you ever rename or move any of the folders in the path.
Regardless of how you set your working directory, just be careful and make sure you always know where your R script is located, where your data file is located, and where the working directory is.
If this is a bit confusing, check out this linked video we made on setting your working directory.
Note that you will need to set your working directory EVERY TIME you open an R script!
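The following sketch shows one way to check your bearings from the console. It uses tempdir() as a stand-in for your own project folder (an assumption made so the code runs on any computer) and restores the original working directory at the end.

```r
old_wd <- getwd()   # remember where we started
old_wd

# tempdir() is a folder that exists on every machine; in real use you would
# put the path to your own session folder here instead
setwd(tempdir())
getwd()             # confirm where R is working from now

list.files()                               # what R can "see" in this folder
file.exists("covid_attitudes_2023.csv")    # is the data file in this folder?

setwd(old_wd)       # go back to where we started
```

If file.exists() returns FALSE, the data file is not in the current working directory, and read.csv() with just the file name will fail.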
Load Data
Let’s read in our data and save it in a dataframe called “covid_attitudes”. There are two ways to go about loading a dataset into R Studio:
Option 1) If your data “lives” one level above your working directory (e.g., if your working directory is set to the Session 2 folder, but your data file ‘lives’ one level up in the R Bootcamp 2023 folder).
This would be the case if you have a Bootcamp_2023 folder where the data lives, but within that folder you have subfolders for each workshop session. If your working directory is set to the Session 2 subfolder, for example, you need to indicate to R to look at the folder one level higher to find the data file (e.g., the script / working directory is set to the classroom, but you need it to look in the hallway for your data file if we want to use the analogy we discussed earlier).
Indicate this to R by putting ../ right before the name of the data file, which says, “look one folder higher”:
covid_attitudes <- read.csv("../covid_attitudes_2023.csv")
Option 2) If your working directory is the folder where your data file lives (e.g., it is set right to your R Bootcamp 2023 folder)
# covid_attitudes <- read.csv("covid_attitudes_2023.csv")
You will only need to do one of these options, and it will all depend on a) where your working directory is currently set, and b) where the data file that you want to read into R is currently stored.
This should appear in your global environment!
Note: If my CSV file was saved in the same folder as my working directory, I would read it in with this code: covid_attitudes <- read.csv("covid_attitudes_2023.csv"), without the “../”
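To see both options side by side without touching your real files, here is a self-contained sketch. The folder names (bootcamp_demo, session2) and the file demo_data.csv are hypothetical stand-ins created inside a temporary folder.

```r
# Build a throwaway folder structure: bootcamp_demo/ holds the data,
# bootcamp_demo/session2/ plays the role of the session subfolder
project <- file.path(tempdir(), "bootcamp_demo")
session <- file.path(project, "session2")
dir.create(session, recursive = TRUE, showWarnings = FALSE)

# Write a tiny stand-in data file into the project folder
write.csv(data.frame(sub_ID = 1:3, score = c(5, 7, 9)),
          file.path(project, "demo_data.csv"), row.names = FALSE)

old_wd <- getwd()
setwd(session)                          # working directory = the subfolder

file.exists("demo_data.csv")            # FALSE: the file is not in session2
file.exists("../demo_data.csv")         # TRUE: it is one level up
demo <- read.csv("../demo_data.csv")    # Option 1 style

setwd(project)                          # working directory = the data's folder
demo_same <- read.csv("demo_data.csv")  # Option 2 style

setwd(old_wd)                           # restore the original working directory
```

Both calls read the same file. The only difference is where R is standing when it looks for it.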
Bonus/Extra – Reading in SPSS (.sav) Data Files
Some labs have data files saved in SPSS, and they will appear in your folder as .sav files.
We didn’t cover this during the recording, but this might be helpful for you if this applies to your lab!
- Install the haven package
# install.packages('haven')
You only need to run this line of code once. After that, this package will be installed (make sure you take out the # before you run it the first time, but no need to run it again once it’s been installed).
- Load the package!
# library(haven)
You need to run this every time you run your code when loading “.sav” files, and this should be run BEFORE loading in your data
- Make a data frame with your data!
# data_frame_name<-read_sav("data_file_name.sav")
Setting Options
Older versions of R would automatically make columns with characters into factors, which is not what we want.
So, let’s tell R not to make this assumption when it loads in data!
options(stringsAsFactors = FALSE)
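The same choice can also be made for a single call, because data.frame() and read.csv() both accept a stringsAsFactors argument. A minimal sketch with a hypothetical toy vector called pets:

```r
pets <- c("cat", "dog", "cat")   # hypothetical toy data

df_chr <- data.frame(pets, stringsAsFactors = FALSE)
df_fct <- data.frame(pets, stringsAsFactors = TRUE)

class(df_chr$pets)   # "character"
class(df_fct$pets)   # "factor"
```

Spelling the argument out in the call makes the behavior visible to anyone reading the script, regardless of their R version or options.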
Examining Data
If you add a question mark before a command, R can provide some information about what it does (look in the bottom right corner of your screen!).
For instance, what do these do?
?ncol
?nrow
?str
?summary
Remember: if you want to call a specific vector (column) from your data frame, use the $ operator!
e.g., covid_attitudes$Q74.education – this calls the Q74.education column from our covid_attitudes data frame
On Your Own
(1)
(a)
Use the summary() function to learn about the covid_attitudes dataframe. This should look like: summary(covid_attitudes).
summary(covid_attitudes)
sub_ID Q6.consent Q8.covid.info Q9.covid_knowledge
Min. : 1.0 Min. :1 Length:1020 Length:1020
1st Qu.: 255.8 1st Qu.:1 Class :character Class :character
Median : 510.5 Median :1 Mode :character Mode :character
Mean : 510.5 Mean :1
3rd Qu.: 765.2 3rd Qu.:1
Max. :1020.0 Max. :1
NA's :47
Q10.rank_attention_to_news Q13.trust_doctor_news Q13.trust_hospital_news
Min. : 0.000 Mode :logical Mode :logical
1st Qu.: 5.000 FALSE:895 FALSE:939
Median : 7.000 TRUE :125 TRUE :81
Mean : 6.792
3rd Qu.: 8.000
Max. :10.000
NA's :120
Q13.trust_cdc_news Q13.trust_government_news Q13.trust_none_of_above
Mode :logical Mode :logical Length:1020
FALSE:369 FALSE:937 Class :character
TRUE :651 TRUE :83 Mode :character
Q14.confidence_us_government Q16.Belief_scientists_understand_covid
Min. : 0.000 Min. : 0.000
1st Qu.: 2.000 1st Qu.: 5.000
Median : 4.000 Median : 6.000
Mean : 3.968 Mean : 5.961
3rd Qu.: 6.000 3rd Qu.: 7.000
Max. :10.000 Max. :10.000
NA's :141 NA's :137
Q17.concerned_outbreak Q18.likely_to_catch_covid Q20.ability_to_protect_self
Length:1020 Min. : 0.00 Min. : 0.00
Class :character 1st Qu.: 20.00 1st Qu.: 61.00
Mode :character Median : 41.00 Median : 77.00
Mean : 42.16 Mean : 73.53
3rd Qu.: 60.00 3rd Qu.: 90.00
Max. :100.00 Max. :100.00
NA's :153 NA's :150
Q21.expected_symptom_severity Q101.confidence_in_authority
Min. :1.000 Min. : 0.00
1st Qu.:2.000 1st Qu.: 29.00
Median :3.000 Median : 47.00
Mean :2.857 Mean : 45.15
3rd Qu.:3.000 3rd Qu.: 62.00
Max. :6.000 Max. :100.00
NA's :154 NA's :157
Q23.risk_exaggerated Q27.wearing_mask Q35.take_vaccine. Q40.age
Length:1020 Length:1020 Length:1020 Length:1020
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Q41.gender Q74.education Q84.community Q43.health.1.5.
Length:1020 Length:1020 Length:1020 Min. :1.000
Class :character Class :character Class :character 1st Qu.:3.000
Mode :character Mode :character Mode :character Median :4.000
Mean :3.552
3rd Qu.:4.000
Max. :5.000
NA's :190
Q39.flu.vaccine Q44.contact_covid Q85.pre.exisiting_conditions
Length:1020 Length:1020 Length:1020
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Q92.friends.family_preexisiting_conditions
Length:1020
Class :character
Mode :character
When called on a data frame, this command gives you summary information for each column.
- For numeric columns, gives you mean, median, and quartile information.
- For character columns, does not give you much information, just the length and type.
- For factors, gives much more information: shows all the levels (i.e., sub-groups), and also how many observations are in each group.
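The three behaviors in the list above can be seen on a tiny example. The data frame toy and its columns are hypothetical and only loosely modeled on the survey.

```r
toy <- data.frame(
  score = c(10, 20, NA, 40),
  community = c("suburb", "large city", "suburb", NA),
  stringsAsFactors = FALSE
)
toy$community.factor <- factor(toy$community)

summary(toy$score)             # quartiles, mean, and a count of NAs
summary(toy$community)         # only Length, Class, and Mode
summary(toy$community.factor)  # counts per level, plus NAs
```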
(b)
Now use the str() function in the same way. What does it do? What do you see? What types of data do we have in our data frame?
str(covid_attitudes)
'data.frame': 1020 obs. of 29 variables:
$ sub_ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Q6.consent : int 1 1 1 1 1 1 1 1 1 1 ...
$ Q8.covid.info : chr "A moderate amount" "A lot" "A lot" "A moderate amount" ...
$ Q9.covid_knowledge : chr NA "A little" "A moderate amount" "A moderate amount" ...
$ Q10.rank_attention_to_news : int 3 6 8 4 NA 5 8 9 9 NA ...
$ Q13.trust_doctor_news : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Q13.trust_hospital_news : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Q13.trust_cdc_news : logi TRUE TRUE TRUE TRUE FALSE TRUE ...
$ Q13.trust_government_news : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Q13.trust_none_of_above : chr "FALSE" "FALSE" "FALSE" "FALSE" ...
$ Q14.confidence_us_government : int NA 6 5 4 NA 3 5 3 8 NA ...
$ Q16.Belief_scientists_understand_covid : int NA 5 9 3 NA 4 6 8 10 NA ...
$ Q17.concerned_outbreak : chr "A moderate amount" "A little" "A moderate amount" NA ...
$ Q18.likely_to_catch_covid : int NA 29 50 9 NA 22 40 69 85 NA ...
$ Q20.ability_to_protect_self : int NA 50 70 49 NA 81 61 70 100 NA ...
$ Q21.expected_symptom_severity : int NA 4 2 3 NA 3 3 3 2 NA ...
$ Q101.confidence_in_authority : int NA 46 40 30 NA 50 60 40 80 NA ...
$ Q23.risk_exaggerated : chr NA "somewhat disagree" NA "strongly disagree" ...
$ Q27.wearing_mask : chr NA "sick people" "sick people" "everyone" ...
$ Q35.take_vaccine. : chr NA "definitely" "probably" "definitely" ...
$ Q40.age : chr NA "30-34" "25-29" "50-54" ...
$ Q41.gender : chr NA "f" "f" "m" ...
$ Q74.education : chr NA "doctorate" "professional degree" "doctorate" ...
$ Q84.community : chr NA "large city" "ruralArea" "small city/town" ...
$ Q43.health.1.5. : int 3 3 3 3 NA 4 3 5 3 NA ...
$ Q39.flu.vaccine : chr "FALSE" "TRUE" "FALSE" "TRUE" ...
$ Q44.contact_covid : chr "no" "unsure" "no" "no" ...
$ Q85.pre.exisiting_conditions : chr "prefer not to say" "no" "no" "no" ...
$ Q92.friends.family_preexisiting_conditions: chr "no" "no" "no" "yes" ...
str stands for “structure”. This function gives you the structure of the data frame, which is similar to the information that the global environment gives you. It shows you all of the columns and what type they are. It also gives you a sample of the data in each column.
This data frame has variables that are integers (a type of numeric variable), characters, and logicals.
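If you only want the column types without the data preview, one option is to apply class() to every column with sapply(). The sketch uses a hypothetical three-column data frame.

```r
toy <- data.frame(
  sub_ID = 1:3,                           # integer
  info   = c("A lot", "A little", NA),    # character
  trusts = c(TRUE, FALSE, TRUE),          # logical
  stringsAsFactors = FALSE
)

col_types <- sapply(toy, class)   # one class name per column
col_types
table(col_types)                  # how many columns of each type
```

On the real data you would call sapply(covid_attitudes, class) in the same way.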
(2)
Use the ncol() function. How many columns are in our data?
ncol(covid_attitudes) # this function gives us the number of columns in the data frame
[1] 29
There are 29 columns.
Note: If you check the output of your str() function or in the global environment, this is equal to the number of variables.
(3)
Use the nrow() function. How many rows are in our data?
nrow(covid_attitudes) # this function gives us the number of rows in our data frame
[1] 1020
There are 1020 rows.
Note: If you check the output of your str() function or in the global environment, this is equal to the number of obs. (which stands for observations)
Also note: This does not count the row with the column names. R knows that those are the names of the variables/columns, not data.
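Both numbers are also available in one call with dim(). A small sketch on a hypothetical data frame:

```r
toy <- data.frame(id = 1:4, age = c(18, 20, 21, 19))

dim(toy)     # rows first, then columns
nrow(toy)    # 4
ncol(toy)    # 2
names(toy)   # the header became column names, so it is not counted as a row
```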
(4)
(a)
Some of these columns should be factors! Let’s turn the education column into a factor. Use the command factor(covid_attitudes$Q74.education), but be sure to save the factor as a column in your dataframe:
covid_attitudes$Q74.education.factor <- factor(covid_attitudes$Q74.education)
When we create a factor, we want to make sure we are creating a new column within our data frame where it can live. To make a new column within our data frame, we write the name of our data frame, then $, and then the name we are giving the new column.
(b)
Pick some other column you think should be a factor and turn it into a factor:
covid_attitudes$Q8.covid.info.factor <- factor(covid_attitudes$Q8.covid.info)
(5)
Now use the summary() command to run a summary on just the column with your new factor. Does the description change from how it was before? What does it look like now?
summary(covid_attitudes$Q74.education.factor)
4 year degree 2 year degree doctorate
159 76 112
highschool graduate less than highschool professional degree
91 7 86
some college NA's
301 188
Before we made it a factor, the summary() function just told us that it was a character vector and had a length of 1020 (which is the number of observations).
Now that it is a factor, we can see more information! We now see the levels and the number of observations (people) who reported each level of education (and NA).
(6)
Which columns have the most NAs? Use the summary() command to investigate!
summary(covid_attitudes)
sub_ID Q6.consent Q8.covid.info Q9.covid_knowledge
Min. : 1.0 Min. :1 Length:1020 Length:1020
1st Qu.: 255.8 1st Qu.:1 Class :character Class :character
Median : 510.5 Median :1 Mode :character Mode :character
Mean : 510.5 Mean :1
3rd Qu.: 765.2 3rd Qu.:1
Max. :1020.0 Max. :1
NA's :47
Q10.rank_attention_to_news Q13.trust_doctor_news Q13.trust_hospital_news
Min. : 0.000 Mode :logical Mode :logical
1st Qu.: 5.000 FALSE:895 FALSE:939
Median : 7.000 TRUE :125 TRUE :81
Mean : 6.792
3rd Qu.: 8.000
Max. :10.000
NA's :120
Q13.trust_cdc_news Q13.trust_government_news Q13.trust_none_of_above
Mode :logical Mode :logical Length:1020
FALSE:369 FALSE:937 Class :character
TRUE :651 TRUE :83 Mode :character
Q14.confidence_us_government Q16.Belief_scientists_understand_covid
Min. : 0.000 Min. : 0.000
1st Qu.: 2.000 1st Qu.: 5.000
Median : 4.000 Median : 6.000
Mean : 3.968 Mean : 5.961
3rd Qu.: 6.000 3rd Qu.: 7.000
Max. :10.000 Max. :10.000
NA's :141 NA's :137
Q17.concerned_outbreak Q18.likely_to_catch_covid Q20.ability_to_protect_self
Length:1020 Min. : 0.00 Min. : 0.00
Class :character 1st Qu.: 20.00 1st Qu.: 61.00
Mode :character Median : 41.00 Median : 77.00
Mean : 42.16 Mean : 73.53
3rd Qu.: 60.00 3rd Qu.: 90.00
Max. :100.00 Max. :100.00
NA's :153 NA's :150
Q21.expected_symptom_severity Q101.confidence_in_authority
Min. :1.000 Min. : 0.00
1st Qu.:2.000 1st Qu.: 29.00
Median :3.000 Median : 47.00
Mean :2.857 Mean : 45.15
3rd Qu.:3.000 3rd Qu.: 62.00
Max. :6.000 Max. :100.00
NA's :154 NA's :157
Q23.risk_exaggerated Q27.wearing_mask Q35.take_vaccine. Q40.age
Length:1020 Length:1020 Length:1020 Length:1020
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Q41.gender Q74.education Q84.community Q43.health.1.5.
Length:1020 Length:1020 Length:1020 Min. :1.000
Class :character Class :character Class :character 1st Qu.:3.000
Mode :character Mode :character Mode :character Median :4.000
Mean :3.552
3rd Qu.:4.000
Max. :5.000
NA's :190
Q39.flu.vaccine Q44.contact_covid Q85.pre.exisiting_conditions
Length:1020 Length:1020 Length:1020
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Q92.friends.family_preexisiting_conditions Q74.education.factor
Length:1020 some college :301
Class :character 4 year degree :159
Mode :character doctorate :112
highschool graduate: 91
professional degree: 86
(Other) : 83
NA's :188
Q8.covid.info.factor
A little : 33
A lot :602
A moderate amount:268
NA's :117
This depends on which columns you made into factors! For example:
Q43 has 190 NAs! (It is a numeric variable, so summary() shows its NAs.)
Q84 has 188 NAs (but we only know this if we turned it into a factor!).
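An alternative that does not depend on which columns are factors is to count the NAs directly with is.na() and colSums(). This is a sketch on a hypothetical toy data frame, not the approach used in the session.

```r
toy <- data.frame(
  health    = c(3, NA, 4, NA),
  community = c("suburb", NA, "large city", "suburb"),
  consent   = c(1, 1, 1, NA),
  stringsAsFactors = FALSE
)

# is.na() marks every missing cell as TRUE; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
sort(na_counts, decreasing = TRUE)   # columns with the most NAs come first
```

On the real data, sort(colSums(is.na(covid_attitudes)), decreasing = TRUE) would rank every column, including the character ones.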
(7)
What do you notice about how the NAs are represented in different columns? In summary()? When you View() it?
# summary(covid_attitudes)
# View(covid_attitudes)
I notice that for the column covid_attitudes$Q13.trust_none_of_above, missing data is indicated by a blank cell, whereas for other columns it is indicated by an italicized NA. Those italicized NAs come up as “NA’s” in the summary.
Different labs format missing data differently and it is something to keep an eye out for in the future as it’s really important to make sure that it is formatted properly.
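One way to handle blank cells when reading data is the na.strings argument of read.csv(), which lists the text values R should treat as missing. The sketch below reads a small made-up CSV from a string (via the text argument) so that it is self-contained.

```r
# A made-up CSV: row 2 has a blank answer, row 4 has a literal NA
csv_text <- "id,answer\n1,yes\n2,\n3,no\n4,NA"

# Default: in a character column, the blank stays an empty string ""
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Treat both blank cells and "NA" as missing
strict_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                        na.strings = c("", "NA"))

sum(is.na(default_read$answer))   # 1: only the literal NA is counted
sum(is.na(strict_read$answer))    # 2: the blank cell is now counted too
```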
(8)
On average, how likely do people think they are to catch Covid-19? Hint: To answer this question, we first have to find the relevant column: Q18.likely_to_catch_covid.
Then, use summary() to find the mean:
summary(covid_attitudes$Q18.likely_to_catch_covid)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 20.00 41.00 42.16 60.00 100.00 153
The mean is 42.16, so on average, people think they have about a 42% chance of catching Covid.
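If you only need the mean, mean() also works, but it returns NA whenever the column contains missing values unless you add na.rm = TRUE. A sketch with hypothetical toy values:

```r
likely <- c(NA, 29, 50, 9, NA, 22)   # hypothetical toy values with two NAs

mean(likely)                 # NA, because missing values are included
mean(likely, na.rm = TRUE)   # 27.5, the mean of the four observed values
```

summary() drops the NAs for you and reports how many there were, which is why it gave a usable mean above.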
(9)
How many types of living communities are there? To answer this question, we first have to find the relevant column: Q84.community.
If you haven’t already, you will need to turn it into a factor:
covid_attitudes$Q84.community.factor <- factor(covid_attitudes$Q84.community)
Then, use summary() to see the levels of the factor, as well as how many people or observations are in each level:
summary(covid_attitudes$Q84.community.factor)
large city ruralArea small city/town suburb NA's
150 66 260 356 188
There are 4 types of communities and NA.
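The number of levels can also be read off directly with nlevels(), which does not count NA as a level. A sketch on a hypothetical toy vector:

```r
community <- c("suburb", "large city", NA, "ruralArea", "suburb", "small city/town")
community.factor <- factor(community)

nlevels(community.factor)                  # 4; NA is not a level
levels(community.factor)                   # the names of the four levels
table(community.factor, useNA = "ifany")   # counts per level, with NAs shown
```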
(10)
If you have time: Think of another question you can ask and answer it!