Modified

December 4, 2024

This page describes the process of data gathering, cleaning, and visualization.

Gathering

We use a Google Sheet to store the by-study data:

https://docs.google.com/spreadsheets/d/1UFZkbh9oU4JHpYsrkDQcNmDyqD4J-qB74dhyMzIkqKs/edit#gid=0

Note

There are no identifiable data here at the moment, so a Google Sheet is a viable option.

Later on, when we start contacting authors, we will need to restrict access to that information for privacy reasons.

Important

We need a process for managing who has edit access.

The googledrive and googlesheets4 packages provide a convenient way to access documents stored on Google Drive.

Download from Google as CSV

Code
if (!dir.exists(params$data_dir)) {
  message("Creating missing ",  params$data_dir, ".")
  dir.create(params$data_dir)
}

if (params$update_data) {
  if (params$use_sysenv_creds) {
    google_creds <- Sys.getenv("GMAIL_SURVEY")
    if (google_creds != "") {
      options(gargle_oauth_email = google_creds)
      googledrive::drive_auth()
    } else {
      message("No Google account information stored in `.Renviron`.")
      message("Add authorized Google account name to `.Renviron` using `usethis::edit_r_environ()`.")
    }
  }

  this_sheet <- googlesheets4::read_sheet(ss = params$google_data_url,
                            sheet = params$sheet_name)
  out_fn <- file.path(params$data_dir, params$data_fn)
  readr::write_csv(this_sheet, out_fn)
  message("Data updated: ", out_fn)
} else {
  message("Using stored data.")
}
✔ Reading from "Legacy Project Acuity Data: By Paper".
✔ Range ''by_paper''.
Data updated: data/csv/by-paper.csv

The data file has been saved in comma-separated values (CSV) format in a dedicated directory, data/csv/.

Open CSV

Next we load the data file.

Code
acuity_df <-
  readr::read_csv(file.path(params$data_dir, "by-paper.csv"), show_col_types = FALSE)

We’ll show the column (variable) names since these will be part of our data dictionary.

Code
acuity_cols <- names(acuity_df)
acuity_cols
 [1] "author_first"         "citation"             "pub_year"            
 [4] "fig_table"            "age_mos"              "age_grp_rog"         
 [7] "binoc_monoc"          "n_participants"       "distance_cm"         
[10] "start_card_cyc_deg"   "mean_acuity_cyc_deg"  "lower_limit_cyc_deg" 
[13] "closest_card_cyc_deg" "upper_limit_cyc_deg"  "country"             
[16] "card_type"           

Create data dictionary

We’ll start by creating a data dictionary so that we can refer to it later in our cleaning and data analysis. We do this by creating a data frame or ‘tibble’ because this is a convenient format for manipulating the information.

Code
acuity_data_dict <- tibble::tibble(col_name = names(acuity_df))

Now, we write a short description of each variable in the data file.

Code
acuity_data_dict <- acuity_data_dict |>
  dplyr::mutate(col_desc = c("Last name of 1st author",
                             "Full APA format citation",
                             "Paper publication year",
                             "Source in paper",
                             "Reported age range in mos",
                             "Age in mos as conformed by ROG",
                             "Participants tested monocularly or binocularly",
                             "Number of participants in group",
                             "Testing distance in cm",
                             "Starting card in cyc/deg",
                             "Mean (group) acuity in cyc/deg",
                             "Estimated lower limit of acuity in cyc/deg",
                             "Teller Acuity Card closest equivalent to this lower limit",
                             "Estimated upper limit of acuity in cyc/deg",
                             "Country where data were collected",
                             "TAC-I or TAC-II"))

acuity_data_dict |>
  knitr::kable(format = 'html') |>
  kableExtra::kable_classic()
col_name              col_desc
author_first          Last name of 1st author
citation              Full APA format citation
pub_year              Paper publication year
fig_table             Source in paper
age_mos               Reported age range in mos
age_grp_rog           Age in mos as conformed by ROG
binoc_monoc           Participants tested monocularly or binocularly
n_participants        Number of participants in group
distance_cm           Testing distance in cm
start_card_cyc_deg    Starting card in cyc/deg
mean_acuity_cyc_deg   Mean (group) acuity in cyc/deg
lower_limit_cyc_deg   Estimated lower limit of acuity in cyc/deg
closest_card_cyc_deg  Teller Acuity Card closest equivalent to this lower limit
upper_limit_cyc_deg   Estimated upper limit of acuity in cyc/deg
country               Country where data were collected
card_type             TAC-I or TAC-II
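
The descriptions above are matched to columns purely by position, so a reordered or newly added column in the sheet would mislabel variables without any warning. One possible safeguard, sketched here in base R on a hypothetical three-column subset (the names `col_desc_lookup` and `data_dict_sketch` are illustrative, not part of the project code), is to key each description by column name and stop when a column has none.

```r
# Hypothetical subset of the column names, for illustration only
col_names <- c("author_first", "pub_year", "country")

# Descriptions keyed by column name rather than by position;
# the order here deliberately differs from the column order
col_desc_lookup <- c(
  pub_year = "Paper publication year",
  country = "Country where data were collected",
  author_first = "Last name of 1st author"
)

# Stop early if any column has no description
missing_desc <- setdiff(col_names, names(col_desc_lookup))
if (length(missing_desc) > 0) {
  stop("No description for: ", paste(missing_desc, collapse = ", "))
}

# Look up each description by name so rows cannot be misaligned
data_dict_sketch <- data.frame(
  col_name = col_names,
  col_desc = unname(col_desc_lookup[col_names]),
  stringsAsFactors = FALSE
)
data_dict_sketch
```

With this layout, adding a column to the sheet without describing it produces an error instead of a shifted dictionary.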

Data visualization

Important

Rick Gilmore decided to take the mean of the age range reported in the Xiang et al. (2021) data and store it in a new variable, age_grp_rog, used strictly for visualization purposes.
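
As a rough illustration of that step, the midpoint of a reported age range could be computed as below. The input strings are hypothetical; the actual format of age_mos in the sheet may differ, and `range_midpoint()` is an illustrative helper rather than the code used to create age_grp_rog.

```r
# Hypothetical age-range strings in months; the real sheet may use another format
age_range_mos <- c("1-2", "5-6", "12")

# Mean of all numbers found in each string.
# A single reported age is returned unchanged.
range_midpoint <- function(x) {
  nums <- regmatches(x, gregexpr("[0-9]+\\.?[0-9]*", x))
  vapply(nums, function(v) mean(as.numeric(v)), numeric(1))
}

age_midpoint_mos <- range_midpoint(age_range_mos)
age_midpoint_mos
```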

We are still in the early phases of the project (as of 2024-12-04 14:21:01.897708), but it is good to start sketching the data visualizations we will eventually want to see.

Code
library(ggplot2)
acuity_df |>
  ggplot() +
  aes(
    x = age_grp_rog,
    y = mean_acuity_cyc_deg,
    color = country,
    shape = author_first
  ) +
  geom_point() +
  geom_smooth() +
  facet_grid(cols = vars(binoc_monoc))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 3
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 3
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 3.2363e-17
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at 3
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 3
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
number 3.2363e-17
Warning: The shape palette can deal with a maximum of 6 discrete values because more
than 6 becomes difficult to discriminate
ℹ you have requested 8 values. Consider specifying shapes manually if you need
  that many have them.
Warning: Removed 29 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 4.1: Developmental time course of mean acuity as assessed by Teller Acuity Cards
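
The shape-palette warning above suggests specifying shapes manually when more than six groups are mapped to shape. A minimal sketch of that approach follows, using a small made-up data frame (`demo_df`) in place of acuity_df; the eight shape codes are an arbitrary choice, not a project decision.

```r
library(ggplot2)

# Made-up stand-in for acuity_df with eight first-author labels
demo_df <- data.frame(
  age_grp_rog = rep(c(3, 6, 9, 12), times = 2),
  mean_acuity_cyc_deg = c(2.1, 4.8, 6.5, 8.0, 2.4, 5.1, 6.9, 8.3),
  author_first = paste0("Author", 1:8)
)

# Supply one shape code per group so no points are dropped
shape_plot <- ggplot(demo_df) +
  aes(x = age_grp_rog, y = mean_acuity_cyc_deg, shape = author_first) +
  geom_point() +
  scale_shape_manual(values = c(0, 1, 2, 5, 6, 15, 16, 17))

shape_plot
```

An alternative would be to drop the shape mapping and distinguish papers some other way, for example by faceting.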

By-individual data

The Gilmore lab has some archival data that we can potentially use in this project. The following represents Rick Gilmore’s work to gather, clean, and visualize these data.

Gathering

The de-identified archival data are stored in a Google sheet accessed by the lab Google account.

First, we must authenticate to Google to access the relevant file and download it.

Code
options(gargle_oauth_email = "psubrainlab@gmail.com")
googledrive::drive_auth()

Then we download the relevant file.

Code
googledrive::drive_download(
  "vep-session-log",
  path = file.path(params$data_dir, "by-participant.csv"),
  type = "csv",
  overwrite = TRUE
)
File downloaded:
• 'vep-session-log' <id: 1Y9GoJU6EFUxcNmbWOLr52enpkuCyg_wSFv3pA-SWm4Q>
Saved locally as:
• 'data/csv/by-participant.csv'

Unlike the Google sheet newly created for the by-study data, this one requires a lot of cleaning.

Code
gilmore_archival_df <-
  readr::read_csv(file.path(params$data_dir, "by-participant.csv"),
                  show_col_types = FALSE)
names(gilmore_archival_df)
 [1] "Participant Number in an IRB year"  "IRB Count"                         
 [3] "Date"                               "Time"                              
 [5] "Session-Leader"                     "Sex"                               
 [7] "Participant-ID"                     "DOB"                               
 [9] "infant"                             "Stimset-name"                      
[11] "Backed up to project folder on Box" "Export-Status"                     
[13] "Snellen-Acuity"                     "Teller Acuity Cards"               
[15] "Stereo-Acuity"                      "Zipped on Drive"                   
[17] "Age at test"                        "keep-session"                      
[19] "Comments"                          

We’ll keep Date, Time, Sex, DOB, Teller Acuity Cards, and Age at test.

Code
gilmore_archival_df <- gilmore_archival_df |>
  dplyr::select(Date, Time, Sex, DOB, `Teller Acuity Cards`, `Age at test`)

Then, let’s keep only the rows where we have TAC data. First we inspect the unique values, then we drop the missing and "not interested" entries.

Code
with(gilmore_archival_df, unique(`Teller Acuity Cards`))
 [1] "20/94 6.4 cyc/deg @ 55cm"                                   
 [2] "20/190 3.1 cyc/deg @ 55cm"                                  
 [3] "20/470 1.3 cyc/deg @ 55cm"                                  
 [4] "20/380 1.6 cyc/deg @ 55cm - difficulty attending toward end"
 [5] "20/130 4.7 cyc/deg @ 55cm"                                  
 [6] NA                                                           
 [7] "not interested"                                             
 [8] "20/94 6.54 cyc/deg @ 55cm"                                  
 [9] "20/130 4/7 cyc/deg @ 55cm"                                  
[10] "20/94 6.40 cyc/deg @ 55cm"                                  
[11] "20/170 3.6 cyc/deg @84cm"                                   
[12] "20/190 3.1 cyc/deg @55cm"                                   
[13] "20/94 6.5 cyc/deg @ 55cm"                                   
Code
gilmore_archival_df <- gilmore_archival_df |>
  dplyr::filter(!is.na(`Teller Acuity Cards`),
                `Teller Acuity Cards` != "not interested")

dim(gilmore_archival_df)
[1] 34  6
Note

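This file illustrates how making data FAIR (findable, accessible, interoperable, and reusable) from the outset can save work.

This one is not too hard to parse, but it could have been better planned: several values are packed into a single free-text field.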

We’ll extract the viewing distance with a regular expression.

Code
gilmore_archival_df <- gilmore_archival_df |>
  dplyr::mutate(view_dist_cm = stringr::str_extract(`Teller Acuity Cards`, "[0-9]{2}cm")) |>
  dplyr::mutate(view_dist_cm = stringr::str_remove(view_dist_cm, "cm")) # remove 'cm'
gilmore_archival_df$view_dist_cm
 [1] "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55"
[16] "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "84"
[31] "55" "55" "55" "55"
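
The pattern above assumes a two-digit distance, which holds for these data. As a hedged variation, the sketch below uses a lookahead so that distances of any length are captured and converts the result to integer. The first two strings come from the unique values listed earlier; the third is a hypothetical three-digit case added only to show the difference.

```r
tac_notes <- c(
  "20/94 6.4 cyc/deg @ 55cm",
  "20/170 3.6 cyc/deg @84cm",
  "20/670 0.9 cyc/deg @ 100cm" # hypothetical entry, not in the data
)

# One or more digits that are followed by optional whitespace and "cm"
view_dist_cm <- as.integer(
  stringr::str_extract(tac_notes, "[0-9]+(?=\\s*cm)")
)
view_dist_cm

# For comparison, the two-digit pattern truncates the hypothetical entry to "00"
two_digit <- stringr::str_remove(
  stringr::str_extract(tac_notes, "[0-9]{2}cm"), "cm"
)
two_digit
```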

Similarly, we’ll extract the acuity in cyc/deg using a regular expression.

Code
gilmore_archival_df <- gilmore_archival_df |>
  # add 'cyc' to separate cyc/deg from Snellen acuity
  dplyr::mutate(acuity_cyc_deg = stringr::str_extract(`Teller Acuity Cards`, "[0-9]{1}[\\./]{1}[0-9]+ cyc")) |>
  dplyr::mutate(acuity_cyc_deg = stringr::str_remove(acuity_cyc_deg, " cyc")) |>
  dplyr::mutate(acuity_cyc_deg = stringr::str_replace(acuity_cyc_deg, "/", "."))

gilmore_archival_df$acuity_cyc_deg
 [1] "6.4"  "3.1"  "1.3"  "3.1"  "3.1"  "1.6"  "4.7"  "3.1"  "3.1"  "3.1" 
[11] "6.4"  "4.7"  "3.1"  "4.7"  "3.1"  "4.7"  "4.7"  "6.4"  "4.7"  "4.7" 
[21] "6.54" "4.7"  "4.7"  "6.40" "4.7"  "4.7"  "4.7"  "3.1"  "3.1"  "3.6" 
[31] "4.7"  "3.1"  "6.5"  "3.1" 
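
The extracted values are still character strings, and one of them was recorded as "4/7" rather than "4.7". The following sketch, run on three of the unique values listed earlier, shows one way to convert to numeric while keeping a flag for entries that needed the slash repaired; the variable names are illustrative.

```r
tac_notes <- c(
  "20/94 6.4 cyc/deg @ 55cm",
  "20/130 4/7 cyc/deg @ 55cm",
  "20/94 6.54 cyc/deg @ 55cm"
)

# A digit, a period or slash, then digits, immediately before " cyc"
acuity_raw <- stringr::str_extract(tac_notes, "[0-9][./][0-9]+(?= cyc)")

# Record which entries used a slash so the repair stays auditable
slash_typo <- stringr::str_detect(acuity_raw, "/")

# Repair the slash and convert to numeric
acuity_cyc_deg <- as.numeric(stringr::str_replace(acuity_raw, "/", "."))

data.frame(acuity_raw, slash_typo, acuity_cyc_deg)
```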

Now, let’s look at the age at test.

Code
gilmore_archival_df$`Age at test`
 [1] "0 Years, 6 Months, 25 Days" "0 Years, 5 Months, 28 Days"
 [3] "0 Years, 5 Months, 8 Days"  "0 Years, 6 Months, 14 Days"
 [5] "0 Years, 4 Months, 27 Days" "0 Years, 4 Months, 2 Days" 
 [7] "0 Years, 4 Months, 21 Days" "0 Years, 4 Months, 27 Days"
 [9] "0 Years, 3 Months, 7 Days"  "0 Years, 4 Months, 10 Days"
[11] "0 Years, 6 Months, 24 Days" "0 Years, 7 Months, 26 Days"
[13] "0 Years, 8 Months, 8 Days"  "0 Years, 8 Months, 21 Days"
[15] "0 Years, 8 Months, 2 Days"  "0 Years, 8 Months, 2 Days" 
[17] "0 Years, 7 Months, 18 Days" "0 Years, 7 Months, 27 Days"
[19] "0 Years, 8 Months, 7 Days"  "0 Years, 7 Months, 12 Days"
[21] "0 Years, 5 Months, 26 Days" "0 Years, 8 Months, 3 Days" 
[23] "0 Years, 5 Months, 18 Days" "0 Years, 6 Months, 25 Days"
[25] "0 Years, 8 Months, 14 Days" "0 Years, 4 Months, 17 Days"
[27] "0 Years, 8 Months, 7 Days"  "0 Years, 8 Months, 7 Days" 
[29] "0 Years, 8 Months, 0 Days"  "0 Years, 6 Months, 2 Days" 
[31] "0 Years, 7 Months, 29 Days" "0 Years, 7 Months, 13 Days"
[33] "0 Years, 6 Months, 5 Days"  "0 Years, 5 Months, 20 Days"

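These strings would need to be parsed before they could be used as numbers. Instead, let’s see what it looks like to compute age at test directly from the test date and the date of birth.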

Code
gilmore_archival_df <- gilmore_archival_df |>
  dplyr::mutate(age_at_test_days = lubridate::mdy(Date) - lubridate::mdy(DOB))

gilmore_archival_df$age_at_test_days
Time differences in days
 [1] 208 181 161 198 149 125 143 149  99 132 207 239 252 265 245 245 231 242 252
[20] 225 179 247 171 209 259 139 249 250 243 186 241 225 186 171

That seems reasonable for now.
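
The date-based ages can be cross-checked against the recorded `Age at test` strings. The sketch below parses two of those strings (the first and ninth entries shown earlier) into approximate days, assuming 365.25 days per year and 30.44 days per month; both constants and the helper `parse_age_days()` are assumptions made for illustration. The corresponding date-based values above are 208 and 99 days, so agreement within a day or two is the most that should be expected.

```r
age_strings <- c(
  "0 Years, 6 Months, 25 Days",
  "0 Years, 3 Months, 7 Days"
)

# Pull out the three integers (years, months, days) and convert to days
# using average year and month lengths, so the result is approximate
parse_age_days <- function(x) {
  parts <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 365.25 + p[2] * 30.44 + p[3]
  }, numeric(1))
}

approx_age_days <- parse_age_days(age_strings)
round(approx_age_days)
```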

Let’s see if we can plot these data.

Code
gilmore_archival_df |>
  dplyr::mutate(acuity_cyc_deg = as.numeric(acuity_cyc_deg)) |>
  ggplot() +
  aes(x = age_at_test_days, y = acuity_cyc_deg, color = Sex) +
  geom_point() +
  geom_smooth(method = "lm") +
  #theme_classic() +
  theme(legend.position = "bottom", legend.title = element_blank()) 
Don't know how to automatically pick scale for object of type <difftime>.
Defaulting to continuous.
`geom_smooth()` using formula = 'y ~ x'
Figure 4.2: Individual participant Teller Acuity Card thresholds from archival Gilmore lab data
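
The `<difftime>` message above appears because subtracting two dates returns a `difftime` object rather than a plain number. A small sketch of converting such a difference to numeric days (and, approximately, to months) before plotting is shown here. The two dates are hypothetical, and the 30.44 days-per-month divisor is an assumption.

```r
# Hypothetical test date and date of birth in month/day/year format
test_date <- lubridate::mdy("07/01/2015")
dob <- lubridate::mdy("01/01/2015")

# Date subtraction gives a difftime; convert explicitly to numeric days
age_days <- as.numeric(test_date - dob, units = "days")

# Approximate age in months, using an average month length
age_mos_approx <- age_days / 30.44

age_days
age_mos_approx
```

Mapping a numeric column such as `age_days` to the x axis lets ggplot2 choose a continuous scale without the message.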
Note

Before I stop, I’m going to add the by-participant data file to a .gitignore file, just to be extra careful.
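
One way to do this from R is sketched below. `add_to_gitignore()` is a hypothetical helper written for this illustration; it appends an entry only if it is not already present. The demonstration writes to a temporary file so that no real .gitignore is modified.

```r
# Append an entry to a .gitignore-style file unless it is already listed
add_to_gitignore <- function(entry, gitignore_path = ".gitignore") {
  existing <- if (file.exists(gitignore_path)) {
    readLines(gitignore_path, warn = FALSE)
  } else {
    character(0)
  }
  already_listed <- entry %in% existing
  if (!already_listed) {
    writeLines(c(existing, entry), gitignore_path)
  }
  invisible(already_listed)
}

# Demonstrate on a temporary file; calling twice adds the entry only once
demo_path <- tempfile()
add_to_gitignore("data/csv/by-participant.csv", demo_path)
add_to_gitignore("data/csv/by-participant.csv", demo_path)
gitignore_lines <- readLines(demo_path)
gitignore_lines
```

Within a project that already uses the usethis package, `usethis::use_git_ignore()` serves a similar purpose.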

Xiang, Y., Long, E., Liu, Z., Li, X., Lin, Z., Zhu, Y., … Lin, H. (2021). Study to establish visual acuity norms with Teller Acuity Cards II for infants from southern China. Eye, 35(10), 2787–2792. https://doi.org/10.1038/s41433-020-01314-y