Cleaning SASE 1 Baseline + BP Data
Pre-cleaning Steps (in Excel)
Baseline Survey
- Removed extra variables from Qualtrics
- Compared raw data downloaded from Qualtrics to CJ’s cleaned data from 2021
- Compared order of variables to ensure match
- Compared PIDs to ensure match
- Removed empty rows
- For participants with multiple Baseline entries, kept the most complete or the last fully completed response (following CJ’s scheme from 2021)
- To deal with data overwritten by changes to the Qualtrics survey, replaced the raw Qualtrics data with CJ’s cleaned data for the applicable participants
- Removed participants with PID in the 400s because they were SASE 2 participants and not meant to be included in the dataset
- Removed “.” entries used for missing values, because otherwise R reads the column in as character rather than numeric
  - As you’ll see below, I didn’t do this very thoroughly in Excel, and just handled the rest here in R!
- In EA data, changed the PID of the entry from 9/20/22 21:10 from 314 to 341
- In EA data, added empty columns to match RRV1 to RSR7.2 in AA data
BP
- Compared RA entries for Night 1 BP
- Deleted discrepant entries (as single values, not entire rows)
Cleaning (in R)
Loading packages
library(readxl)
library(tidyverse)
library(psych)
Reading in data
black <- read_xlsx("Outdated+Unneeded_Data/SASE1_Background-Survey_AA_20231201_CLEAN.xlsx")
white <- read_xlsx("Outdated+Unneeded_Data/SASE1_Background-Survey_EA_20231201_CLEAN.xlsx")
bp <- read_xlsx("Outdated+Unneeded_Data/SASE1_BP_N1_20231201_CLEAN.xlsx")
Merging Black + White baseline data
First need to make sure that the variables / columns are of the same type, and coerce the ones that aren’t
NOTE: These are almost all happening because of “.” entered for missing data. With R, it’s best to leave it blank or input NA (without quotes) if the data are numeric
Luckily, coercion to numeric with as.numeric() just changes the “.” to NA (R prints an “NAs introduced by coercion” warning, which is expected here)
black <- black %>%
  mutate(EED8 = as.numeric(EED8),
         SDO11 = as.numeric(SDO11),
         SDO12 = as.numeric(SDO12),
         RRQ10 = as.numeric(RRQ10),
         RRQ11 = as.numeric(RRQ11),
         RRQ12 = as.numeric(RRQ12),
         RRQ16 = as.numeric(RRQ16))
white <- white %>%
  mutate(PSS8 = as.numeric(PSS8),
         PUM1 = as.numeric(PUM1),
         PUM2 = as.numeric(PUM2),
         PUM3 = as.numeric(PUM3),
         PUM4 = as.numeric(PUM4),
         PUM5 = as.numeric(PUM5),
         PUM6 = as.numeric(PUM6),
         PUM7 = as.numeric(PUM7),
         PUM8 = as.numeric(PUM8),
         PUM9 = as.numeric(PUM9),
         PUM10 = as.numeric(PUM10),
         PUM11 = as.numeric(PUM11),
         PUM12 = as.numeric(PUM12),
         PUM13 = as.numeric(PUM13),
         PUM14 = as.numeric(PUM14),
         MEQ1 = as.numeric(MEQ1),
         MEQ2 = as.numeric(MEQ2),
         MEQ3 = as.numeric(MEQ3),
         MEQ4 = as.numeric(MEQ4),
         MEQ5 = as.numeric(MEQ5),
         MEQ6 = as.numeric(MEQ6),
         MEQ7 = as.numeric(MEQ7),
         MEQ8 = as.numeric(MEQ8),
         MEQ9 = as.numeric(MEQ9),
         MEQ10 = as.numeric(MEQ10),
         MEQ11 = as.numeric(MEQ11),
         MEQ12 = as.numeric(MEQ12),
         MEQ13 = as.numeric(MEQ13),
         MEQ14 = as.numeric(MEQ14),
         MEQ15 = as.numeric(MEQ15),
         MEQ16 = as.numeric(MEQ16),
         MEQ17 = as.numeric(MEQ17),
         MEQ18 = as.numeric(MEQ18),
         MEQ19 = as.numeric(MEQ19))
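The long mutate() calls above could probably be written more compactly. This is only a sketch on made-up toy data (the `toy` tibble and its values are invented for illustration), assuming the item columns follow the numbered naming used above. It uses across() from dplyr and num_range() from tidyselect to pick columns by prefix and number:

```r
library(dplyr)
library(tibble)

# Toy data (invented): "." stands in for missing values, so every
# item column is read in as character.
toy <- tibble(
  PID  = c("301", "302", "303"),
  PSS8 = c("2", ".", "4"),
  PUM1 = c(".", "1", "3"),
  PUM2 = c("5", "2", ".")
)

# Coerce PSS8 and PUM1..PUM2 in one step; "." becomes NA.
# as.numeric() warns "NAs introduced by coercion", which is expected here.
toy_clean <- toy %>%
  mutate(across(c(PSS8, num_range("PUM", 1:2)), as.numeric))

toy_clean
```

On the real data the selection would presumably be c(PSS8, num_range("PUM", 1:14), num_range("MEQ", 1:19)). Unlike a PUM1:PUM14 range, num_range() does not depend on the columns sitting next to each other in the file.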
Renaming the IGC variables in the Black participants dataset so the names are the same across the Black + White data
black <- black %>%
  rename(IGC1 = `IC AA 1`,
         IGC2 = `IC AA 2`,
         IGC3 = `IC AA 3`,
         IGC4 = `IC AA 4`,
         IGC5 = `IC AA 5`,
         IGC6 = `IC AA 6`,
         IGC7 = `IC AA 7`,
         IGC8 = `IC AA 8`,
         IGC9 = `IC AA 9`,
         IGC10 = `IC AA 10`)
NOTE: We have to wrap the old variable names (IC AA 1) in backticks because there are spaces in the variable names
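When many columns share a prefix, the renaming can also be done by pattern instead of one name at a time. A minimal sketch on invented toy data (`toy_black` is hypothetical), assuming every column to rename starts with "IC AA " followed by its item number:

```r
library(dplyr)
library(tibble)

# Toy data (invented) with spaces in the column names, as in the Black dataset.
toy_black <- tibble(
  PID       = c("301", "302"),
  `IC AA 1` = c(1, 2),
  `IC AA 2` = c(3, 4)
)

# Replace the "IC AA " prefix with "IGC" for every matching column.
toy_renamed <- toy_black %>%
  rename_with(~ sub("^IC AA ", "IGC", .x), .cols = starts_with("IC AA "))

names(toy_renamed)
```

Because starts_with() takes the prefix as a quoted string, no backticks are needed here.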
Adding a race indicator variable to each dataframe
black$RACE <- "Black"
white$RACE <- "White"
Now, merge the two Baseline data frames
baseline <- bind_rows(black, white)
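bind_rows() stops with an error when the same column is character in one data frame and numeric in the other, which is why the types had to be aligned first. The check can be done programmatically instead of by eye. This is a sketch on invented toy data; `type_mismatches()` is a hypothetical helper written for this example, not part of any package:

```r
library(dplyr)
library(tibble)

# Toy data (invented): EED8 is numeric in one frame, character in the other.
toy_black <- tibble(PID = c("301", "302"), EED8 = c(1, NA), RACE = "Black")
toy_white <- tibble(PID = c("201", "202"), EED8 = c("3", "."), RACE = "White")

# Hypothetical helper: names of shared columns whose (first) class differs.
type_mismatches <- function(a, b) {
  shared  <- intersect(names(a), names(b))
  class_a <- vapply(a[shared], function(x) class(x)[1], character(1))
  class_b <- vapply(b[shared], function(x) class(x)[1], character(1))
  shared[class_a != class_b]
}

mismatched <- type_mismatches(toy_black, toy_white)

# Coerce the offending columns in the character-typed frame, then bind.
toy_white <- toy_white %>%
  mutate(across(all_of(mismatched), ~ suppressWarnings(as.numeric(.x))))

toy_baseline <- bind_rows(toy_black, toy_white)
```

This sketch assumes the mismatched columns are the character ones in `toy_white`. On the real data it is worth looking at `mismatched` before coercing anything.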
Merging Baseline data with BP data
For some reason, the PID column in the Baseline data is character, so I’m going to change it to numeric
baseline <- baseline %>%
  mutate(PID = as.numeric(PID))
Then merge!
data <- full_join(baseline, bp, by = "PID")
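The full_join() function keeps ALL rows from both datasets; where a PID appears in only one of them, the columns from the other dataset are filled with NA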
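Because a full join silently keeps unmatched rows, it can help to list which PIDs appear in only one dataset. A minimal sketch on invented toy data (the `SBP` column name and all values are hypothetical), using anti_join() from dplyr:

```r
library(dplyr)
library(tibble)

# Toy data (invented): PID 301 has no BP entry, PID 304 has no baseline entry.
toy_baseline <- tibble(PID = c(301, 302, 303), PSS1 = c(1, 2, 3))
toy_bp       <- tibble(PID = c(302, 303, 304), SBP = c(118, 125, 131))

# Full join keeps all four PIDs, with NA where a dataset has no match.
toy_data <- full_join(toy_baseline, toy_bp, by = "PID")

# Rows present in one dataset but not in the other.
baseline_only <- anti_join(toy_baseline, toy_bp, by = "PID")
bp_only       <- anti_join(toy_bp, toy_baseline, by = "PID")
```

If either anti-join returns rows that should have matched, a PID typo like the 314/341 one fixed in Excel is a likely cause.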
Scoring scales
PSS
First reverse-score items based on information in Codebook
data$PSS4R <- dplyr::recode(data$PSS4, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS5R <- dplyr::recode(data$PSS5, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS6R <- dplyr::recode(data$PSS6, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS7R <- dplyr::recode(data$PSS7, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS9R <- dplyr::recode(data$PSS9, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS10R <- dplyr::recode(data$PSS10, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
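For a 0 to 4 response scale, the recode mapping above is the same as subtracting each response from 4. A sketch of that shortcut on invented toy data, assuming every response really is 0, 1, 2, 3, 4, or missing (it is worth checking the range first, since an out-of-range value would be passed through the arithmetic without complaint):

```r
library(dplyr)
library(tibble)

# Toy data (invented) for two of the reverse-scored items.
toy <- tibble(
  PSS4 = c(0, 1, 4, NA),
  PSS5 = c(2, 3, 0, 1)
)

# Reverse-score: 0 <-> 4, 1 <-> 3, 2 stays 2; NA stays NA.
# .names = "{.col}R" creates PSS4R, PSS5R alongside the originals.
toy <- toy %>%
  mutate(across(c(PSS4, PSS5), ~ 4 - .x, .names = "{.col}R"))

toy
```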
Then sum the items
data$PSS_SUM <- data$PSS1 + data$PSS2 + data$PSS3 + data$PSS4R + data$PSS5R + data$PSS6R + data$PSS7R + data$PSS8 + data$PSS9R + data$PSS10R
EED
Sum the items
data$EED_SUM <- data$EED1 + data$EED2 + data$EED3 + data$EED4 + data$EED5 + data$EED6 + data$EED7 + data$EED8 + data$EED9
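Adding items with + gives NA for any participant who is missing even one item. The same sum can be written with rowSums(), which behaves identically by default and makes it easy to count missing items per participant. A sketch on invented toy data with only three items (`EED_N_MISSING` is a hypothetical column name):

```r
library(tibble)

# Toy data (invented): the second participant skipped EED2.
toy <- tibble(
  EED1 = c(1, 2, 3),
  EED2 = c(0, NA, 2),
  EED3 = c(4, 1, 1)
)

eed_items <- paste0("EED", 1:3)

# Same result as EED1 + EED2 + EED3: NA if any item is missing.
toy$EED_SUM <- rowSums(toy[, eed_items])

# How many items each participant left blank.
toy$EED_N_MISSING <- rowSums(is.na(toy[, eed_items]))

toy
```

Whether to score participants with missing items (for example by prorating) is a decision for the Codebook; this sketch only makes the missingness visible.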
Save the dataframe(s)
I’m first going to save the combined baseline data
write_csv(baseline, "SASE1_Baseline-Combined_20231201.csv")
Then I’ll save the full dataset
write_csv(data, "SASE1_Baseline+BP_20231201.csv")
🎉 ET VOILA!