Cleaning SASE 1 Baseline + BP Data

Author

Sierra Semko Krouse

Published

December 1, 2023

Pre-cleaning Steps (in Excel)

Baseline Survey

  1. Removed extra variables from Qualtrics
  2. Compared raw data downloaded from Qualtrics to CJ’s cleaned data from 2021
    1. Compared order of variables to ensure match
    2. Compared PIDs to ensure match
  3. Removed empty rows
  4. For participants with multiple Baseline entries, kept the most complete response OR the last fully completed one (following CJ’s scheme from 2021)
  5. Because later changes to the Qualtrics survey overwrote some data, replaced the raw Qualtrics data with CJ’s cleaned data for the affected participants
  6. Removed participants with PID in the 400s because they were SASE 2 participants and not meant to be included in the dataset
  7. Removed “.” entered for missing values, because otherwise R reads the whole column in as character rather than numeric
    • As you’ll see below, I didn’t do this very thoroughly in Excel, so the remaining cases are handled here in R!
  8. In EA data, changed entry from 9/20/22 21:10 with PID 314 to PID 341
  9. In EA data, added empty columns to match RRV1 to RSR7.2 in AA data
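
Step 4 was done by hand in Excel, but the same rule could be sketched in R. This is only an illustration on toy data, not the procedure used for this dataset: the columns EndDate and Q1 to Q3 are hypothetical, and "most complete" is assumed to mean "most non-missing answers", with ties going to the latest response.

```r
library(dplyr)

# Toy data: EndDate and Q1-Q3 are hypothetical names, not columns from the SASE 1 files
raw <- tibble(
  PID     = c(301, 301, 302, 303, 303),
  EndDate = as.Date(c("2022-09-01", "2022-09-05", "2022-09-02",
                      "2022-09-03", "2022-09-04")),
  Q1      = c(1, 2, 3, 4, 4),
  Q2      = c(NA, 2, 3, 4, 5),
  Q3      = c(NA, 2, NA, 4, 5)
)

# Count how many items each response answered
raw$n_answered <- rowSums(!is.na(raw[, c("Q1", "Q2", "Q3")]))

# Within each PID, keep the response with the most answers; ties go to the latest one
deduped <- raw %>%
  group_by(PID) %>%
  arrange(desc(n_answered), desc(EndDate), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup() %>%
  select(-n_answered)

deduped
```

Doing this in code rather than by hand leaves a record of which response was kept for each participant.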

BP

  1. Compared RA entries for Night 1 BP
  2. Deleted discrepant entries (as single values, not entire rows)

Cleaning (in R)

Loading packages

library(readxl)
library(tidyverse)
library(psych)

Reading in data

black <- read_xlsx("Outdated+Unneeded_Data/SASE1_Background-Survey_AA_20231201_CLEAN.xlsx")
white <- read_xlsx("Outdated+Unneeded_Data/SASE1_Background-Survey_EA_20231201_CLEAN.xlsx")
bp <- read_xlsx("Outdated+Unneeded_Data/SASE1_BP_N1_20231201_CLEAN.xlsx")
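
As mentioned in pre-cleaning step 7, a “.” for missing data makes R read a column as character. Another way to handle that is to declare “.” as a missing-value code when reading the file. The sketch below uses a toy CSV and read_csv() as a stand-in, because an .xlsx file can't be built inline; read_xlsx() takes an `na` argument that works the same way. This is an alternative, not what was run for this report.

```r
library(readr)

# Toy CSV standing in for a spreadsheet that uses "." for missing values
toy <- I("PID,EED8\n301,3\n302,.\n303,5\n")

# Treat "." (plus blanks and "NA") as missing while reading
with_na <- read_csv(toy, na = c("", "NA", "."), show_col_types = FALSE)

class(with_na$EED8)   # numeric, with the "." row read as NA
with_na

# The equivalent call for the Excel files in this report would look like:
# black <- read_xlsx("Outdated+Unneeded_Data/SASE1_Background-Survey_AA_20231201_CLEAN.xlsx",
#                    na = c("", "."))
```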

Merging Black + White baseline data

First, we need to make sure that matching columns are the same type in both datasets, and coerce the ones that aren’t

NOTE: Almost all of these type mismatches come from “.” entered for missing data. With R, it’s best to leave the cell blank or input NA (without quotes) if the data are numeric

Luckily, coercion to numeric with as.numeric() just changes the “.” to NA (R warns that NAs were introduced by coercion, which is expected here)

black <- black %>% 
  mutate(EED8 = as.numeric(EED8),
         SDO11 = as.numeric(SDO11),
         SDO12 = as.numeric(SDO12),
         RRQ10 = as.numeric(RRQ10),
         RRQ11 = as.numeric(RRQ11),
         RRQ12 = as.numeric(RRQ12),
         RRQ16 = as.numeric(RRQ16))

white <- white %>% 
  mutate(PSS8 = as.numeric(PSS8),
         PUM1 = as.numeric(PUM1),
         PUM2 = as.numeric(PUM2),
         PUM3 = as.numeric(PUM3),
         PUM4 = as.numeric(PUM4),
         PUM5 = as.numeric(PUM5),
         PUM6 = as.numeric(PUM6),
         PUM7 = as.numeric(PUM7),
         PUM8 = as.numeric(PUM8),
         PUM9 = as.numeric(PUM9),
         PUM10 = as.numeric(PUM10),
         PUM11 = as.numeric(PUM11),
         PUM12 = as.numeric(PUM12),
         PUM13 = as.numeric(PUM13),
         PUM14 = as.numeric(PUM14),
         MEQ1 = as.numeric(MEQ1),
         MEQ2 = as.numeric(MEQ2),
         MEQ3 = as.numeric(MEQ3),
         MEQ4 = as.numeric(MEQ4),
         MEQ5 = as.numeric(MEQ5),
         MEQ6 = as.numeric(MEQ6),
         MEQ7 = as.numeric(MEQ7),
         MEQ8 = as.numeric(MEQ8),
         MEQ9 = as.numeric(MEQ9),
         MEQ10 = as.numeric(MEQ10),
         MEQ11 = as.numeric(MEQ11),
         MEQ12 = as.numeric(MEQ12),
         MEQ13 = as.numeric(MEQ13),
         MEQ14 = as.numeric(MEQ14),
         MEQ15 = as.numeric(MEQ15),
         MEQ16 = as.numeric(MEQ16),
         MEQ17 = as.numeric(MEQ17),
         MEQ18 = as.numeric(MEQ18),
         MEQ19 = as.numeric(MEQ19))
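
The long mutate() calls above can be written more compactly with across(). Below is a minimal sketch on toy data, not the code that was run for this report. The toy only has a few of the columns, so the ranges are short; for the real EA data they would be 1:14 for PUM and 1:19 for MEQ.

```r
library(dplyr)

# Toy stand-in for the EA data, with "." entered for missing values
toy <- tibble(
  PID  = c("301", "302"),
  PSS8 = c("2", "."),
  PUM1 = c(".", "4"),
  PUM2 = c("1", "3"),
  MEQ1 = c("5", ".")
)

# num_range("PUM", 1:2) selects PUM1 and PUM2 by name.
# as.numeric() turns "." into NA and warns about the coercion.
converted <- toy %>%
  mutate(across(c(PSS8, num_range("PUM", 1:2), num_range("MEQ", 1)),
                as.numeric))

converted
```

One thing to watch: num_range() silently skips names that don't exist, so a misspelled prefix converts nothing instead of raising an error.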

Renaming the IC AA variables in the Black participants dataset (to IGC1 through IGC10) so the names are the same across the Black + White data

black <- black %>% 
  rename(IGC1 = `IC AA 1`,
         IGC2 = `IC AA 2`,
         IGC3 = `IC AA 3`,
         IGC4 = `IC AA 4`,
         IGC5 = `IC AA 5`,
         IGC6 = `IC AA 6`,
         IGC7 = `IC AA 7`,
         IGC8 = `IC AA 8`,
         IGC9 = `IC AA 9`,
         IGC10 = `IC AA 10`)

NOTE: We have to wrap the old variable names (e.g., IC AA 1) in backticks because they contain spaces
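
Since all ten renames follow the same pattern, they could also be done in one step with rename_with(). A minimal sketch on toy data with only two of the ten columns, offered as an alternative rather than the code used here:

```r
library(dplyr)
library(stringr)

# Toy data with two of the "IC AA" columns
toy <- tibble(PID = c(301, 302), `IC AA 1` = c(1, 2), `IC AA 10` = c(3, 4))

# Replace the "IC AA " prefix with "IGC" in every column that starts with it
renamed <- toy %>%
  rename_with(~ str_replace(.x, "^IC AA ", "IGC"),
              .cols = starts_with("IC AA "))

names(renamed)
```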

Adding a race indicator variable to each dataframe

black$RACE <- "Black"
white$RACE <- "White"

Now, merge the two Baseline data frames

baseline <- bind_rows(black, white)
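
bind_rows() stops with an error if the same column is character in one data frame and numeric in the other, which is why the coercion step was needed. A quick way to find such columns before binding is sketched below on toy data; the helper col_type() is made up for this illustration.

```r
library(dplyr)

# Toy data: EED8 is character in `a` (because of a ".") and numeric in `b`
a <- tibble(PID = c("301", "302"), EED8 = c("3", "."), PSS1 = c(1, 2))
b <- tibble(PID = c("341", "342"), EED8 = c(2, 4),     PSS1 = c(0, 3))

# Hypothetical helper: first class of every column, as a named character vector
col_type <- function(df) vapply(df, function(x) class(x)[1], character(1))

# Columns present in both data frames whose types differ
shared     <- intersect(names(a), names(b))
mismatched <- shared[col_type(a)[shared] != col_type(b)[shared]]
mismatched

# After coercing the mismatched column, the bind works
a$EED8   <- as.numeric(a$EED8)   # "." becomes NA, with a warning
combined <- bind_rows(a, b)
```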

Merging Baseline data with BP data

For some reason, the PID column in the Baseline data is character, so I’m going to change it to numeric so the join on PID works

baseline <- baseline %>% 
  mutate(PID = as.numeric(PID))

Then merge!

data <- full_join(baseline, bp, by = "PID")

The full_join() function keeps ALL rows from both datasets; a PID that appears in only one of them gets NA in the columns that come from the other
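
Because a full join keeps unmatched rows, it can be worth checking which PIDs appear in only one dataset. A minimal sketch with anti_join() on toy data; the SBP column is a hypothetical name, not one taken from the BP file.

```r
library(dplyr)

# Toy data: PID 301 has no BP record, PID 304 has no baseline record
baseline_toy <- tibble(PID = c(301, 302, 303), RACE = c("Black", "Black", "White"))
bp_toy       <- tibble(PID = c(302, 303, 304), SBP = c(118, 124, 131))

joined <- full_join(baseline_toy, bp_toy, by = "PID")

# Rows in one dataset with no match in the other
no_bp       <- anti_join(baseline_toy, bp_toy, by = "PID")
no_baseline <- anti_join(bp_toy, baseline_toy, by = "PID")

joined
no_bp
no_baseline
```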

Scoring scales

PSS

First, reverse-score items based on the information in the Codebook

data$PSS4R <- dplyr::recode(data$PSS4, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS5R <- dplyr::recode(data$PSS5, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS6R <- dplyr::recode(data$PSS6, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS7R <- dplyr::recode(data$PSS7, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS9R <- dplyr::recode(data$PSS9, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
data$PSS10R <- dplyr::recode(data$PSS10, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0)
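
On a 0 to 4 scale, reverse-scoring is the same as subtracting the response from 4. The sketch below checks on toy data that this gives the same result as the recode() calls above, including for missing values. It's an equivalent shortcut, not a change to the scoring.

```r
library(dplyr)

# Toy responses covering every scale point plus a missing value
toy <- tibble(PSS4 = c(0, 1, 2, 3, 4, NA))

toy <- toy %>%
  mutate(
    PSS4R_recode = dplyr::recode(PSS4, `0` = 4, `1` = 3, `2` = 2, `3` = 1, `4` = 0),
    PSS4R_arith  = 4 - PSS4   # reversed score = scale maximum - response
  )

all.equal(toy$PSS4R_recode, toy$PSS4R_arith)
```

The two differ for out-of-range values: recode() leaves an unexpected value such as 7 as it is, while 4 - 7 gives -3, so it's worth checking the range of the items first.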

Then sum the items

data$PSS_SUM <- data$PSS1 + data$PSS2 + data$PSS3 + data$PSS4R + data$PSS5R + data$PSS6R + data$PSS7R + data$PSS8 + data$PSS9R + data$PSS10R

EED

Sum the items

data$EED_SUM <- data$EED1 + data$EED2 + data$EED3 + data$EED4 + data$EED5 + data$EED6 + data$EED7 + data$EED8 + data$EED9
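
Note that adding items with + gives NA for any participant who is missing even one item, for both PSS_SUM and EED_SUM. The sketch below shows this on toy data with three of the EED items, along with a count of missing items per participant. The names EED_SUM_ROWS and EED_N_MISSING are made up for this illustration.

```r
# Toy data: the second participant skipped EED2
toy   <- data.frame(EED1 = c(1, 2, 3), EED2 = c(2, NA, 3), EED3 = c(3, 2, 3))
items <- c("EED1", "EED2", "EED3")

# Same approach as above: NA if any item is missing
toy$EED_SUM <- toy$EED1 + toy$EED2 + toy$EED3

# rowSums() gives the same values without typing out every item
toy$EED_SUM_ROWS <- unname(rowSums(toy[, items]))

# Number of missing items per participant, handy for spotting who has an NA sum
toy$EED_N_MISSING <- unname(rowSums(is.na(toy[, items])))

toy
```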

Save the dataframe(s)

I’m first going to save the combined baseline data

write_csv(baseline, "SASE1_Baseline-Combined_20231201.csv")

Then I’ll save the full dataset

write_csv(data, "SASE1_Baseline+BP_20231201.csv")

🎉 ET VOILA!