Overview

Modified

April 15, 2025

This page describes the process of data gathering, cleaning, and visualization.

Gathering

We use a Google Sheet to store the by-study data:

https://docs.google.com/spreadsheets/d/1UFZkbh9oU4JHpYsrkDQcNmDyqD4J-qB74dhyMzIkqKs/edit#gid=0

Note

There are no identifiable data here at the moment, so a Google Sheet is a viable option.

Later on, when we start contacting authors, we will need to restrict access to that information for privacy reasons.

Important

We need a process for managing who has edit access.

The googledrive and googlesheets4 packages provide a convenient way to access documents stored on Google Drive.

Download from Google as CSV

Code

if (!dir.exists(params$data_dir)) {
  message("Creating missing ", params$data_dir, ".")
  # recursive = TRUE also creates any missing parent directory (e.g., data/)
  dir.create(params$data_dir, recursive = TRUE)
}

if (params$update_data) {
  if (params$use_sysenv_creds) {
    google_creds <- Sys.getenv("GMAIL_SURVEY")
    if (google_creds != "") {
      options(gargle_oauth_email = google_creds)
      googledrive::drive_auth()
    } else {
      message("No Google account information stored in `.Renviron`.")
      message("Add authorized Google account name to `.Renviron` using `usethis::edit_r_environ()`.")
    }
  }

  this_sheet <- googlesheets4::read_sheet(ss = params$google_data_url,
                            sheet = params$sheet_name)
  out_fn <- file.path(params$data_dir, params$data_fn)
  readr::write_csv(this_sheet, out_fn)
  message("Data updated: ", out_fn)
} else {
  message("Using stored data.")
}

✔ Reading from "Legacy Project Acuity Data: By Paper".

✔ Range ''typical_group''.

Data updated: data/csv/typical_group.csv

The data have been saved as a comma-separated values (CSV) file in a dedicated directory, data/csv/.
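The download chunk above depends on several document parameters (`params$...`) that are not shown on this page. The sketch below is a hypothetical stand-in for them: the directory, file name, and sheet name are read off the console output above, while the URL, `update_data`, and `use_sysenv_creds` values are assumptions. In a rendered report these would normally be declared under `params:` in the YAML header rather than built by hand.

```r
# Hypothetical stand-in for the report's `params`; values marked "assumed"
# are not confirmed anywhere on this page.
params <- list(
  data_dir = "data/csv",           # from the "Data updated:" message
  data_fn = "typical_group.csv",   # from the "Data updated:" message
  sheet_name = "typical_group",    # from the reported range
  # assumed: the by-study sheet linked at the top of this section
  google_data_url = "https://docs.google.com/spreadsheets/d/1UFZkbh9oU4JHpYsrkDQcNmDyqD4J-qB74dhyMzIkqKs/edit#gid=0",
  update_data = FALSE,             # assumed: reuse the stored CSV by default
  use_sysenv_creds = TRUE          # assumed: read the account from GMAIL_SURVEY
)

# The path the download chunk would write to
out_fn <- file.path(params$data_dir, params$data_fn)
out_fn
```

Keeping these values in one place means the download and the later `read_csv()` call can both build the file path from `params` instead of repeating a file name.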

Open CSV

Next we load the data file.

Code

acuity_df <-
  readr::read_csv(file.path(params$data_dir, "by-paper.csv"), show_col_types = FALSE)

We’ll show the column (variable) names, since these will be part of our data dictionary.

Code

acuity_cols <- names(acuity_df)
acuity_cols

 [1] "author_first"         "citation"             "pub_year"            
 [4] "fig_table"            "age_mos"              "age_grp_rog"         
 [7] "binoc_monoc"          "typ"                  "n_participants"      
[10] "distance_cm"          "start_card_cyc_deg"   "mean_acuity_cyc_deg" 
[13] "lower_limit_cyc_deg"  "closest_card_cyc_deg" "upper_limit_cyc_deg" 
[16] "country"              "card_type"

Create data dictionary

We’ll start by creating a data dictionary so that we can refer to it later in our cleaning and data analysis. We do this by creating a data frame or ‘tibble’ because this is a convenient format for manipulating the information.

Code

acuity_data_dict <- tibble::tibble(col_name = names(acuity_df))

Now, we write a short description of each variable in the data file.

Code

acuity_data_dict <- acuity_data_dict |>
  dplyr::mutate(col_desc = c("Last name of 1st author",
                             "Full APA format citation",
                             "Paper publication year",
                             "Source in paper",
                             "Reported age range in mos",
                             "Age in mos as conformed by ROG",
                             "Participants tested monocularly or binocularly",
                             "Typical or atypically developing",
                             "Number of participants in group",
                             "Testing distance in cm",
                             "Starting card in cyc/deg",
                             "Mean (group) acuity in cyc/deg",
                             "Estimated lower limit of acuity in cyc/deg",
                             "Teller Acuity Card closest equivalent to this lower limit",
                             "Estimated upper limit of acuity in cyc/deg",
                             "Country where data were collected",
                             "TAC-I or TAC-II"))

acuity_data_dict |>
  knitr::kable(format = 'html') |>
  kableExtra::kable_classic()

col_name	col_desc
author_first	Last name of 1st author
citation	Full APA format citation
pub_year	Paper publication year
fig_table	Source in paper
age_mos	Reported age range in mos
age_grp_rog	Age in mos as conformed by ROG
binoc_monoc	Participants tested monocularly or binocularly
typ	Typical or atypically developing
n_participants	Number of participants in group
distance_cm	Testing distance in cm
start_card_cyc_deg	Starting card in cyc/deg
mean_acuity_cyc_deg	Mean (group) acuity in cyc/deg
lower_limit_cyc_deg	Estimated lower limit of acuity in cyc/deg
closest_card_cyc_deg	Teller Acuity Card closest equivalent to this lower limit
upper_limit_cyc_deg	Estimated upper limit of acuity in cyc/deg
country	Country where data were collected
card_type	TAC-I or TAC-II
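The descriptions above are matched to columns purely by position, so a reordered or newly added column in the sheet would silently mislabel variables. A minimal sketch of one way to guard against that is to key the descriptions by column name and check for mismatches. Here `build_data_dict()` is a hypothetical helper, and `col_names` is a three-column stand-in for `names(acuity_df)`.

```r
# Hypothetical helper: build a data dictionary from a *named* description vector
build_data_dict <- function(col_names, col_desc) {
  missing_desc <- setdiff(col_names, names(col_desc))
  unused_desc <- setdiff(names(col_desc), col_names)
  if (length(missing_desc) > 0 || length(unused_desc) > 0) {
    stop("Dictionary mismatch. Missing: ", toString(missing_desc),
         "; unused: ", toString(unused_desc))
  }
  # Index by name so the order of `col_desc` does not matter
  data.frame(col_name = col_names,
             col_desc = unname(col_desc[col_names]),
             stringsAsFactors = FALSE)
}

# Stand-in for names(acuity_df); only three columns shown
col_names <- c("author_first", "citation", "pub_year")

# Descriptions deliberately listed in a different order
col_desc <- c(pub_year = "Paper publication year",
              author_first = "Last name of 1st author",
              citation = "Full APA format citation")

dict <- build_data_dict(col_names, col_desc)
dict
```

The function stops with an error if a column lacks a description or a description refers to a column that no longer exists.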

Data visualization

Important

Rick Gilmore decided to take the mean of the age range reported in the Xiang et al. (2021) data and create a new variable, age_grp_rog, strictly for visualization purposes.

We are still in the early phases of the project (as of 2025-04-22 16:35:40.396537), but it is good to start sketching the data visualizations we will eventually want to see.

Code

library(ggplot2)
acuity_df |>
  ggplot() +
  aes(
    x = age_grp_rog,
    y = mean_acuity_cyc_deg,
    color = country
  ) +
  geom_point() +
  geom_smooth() +
  facet_grid(cols = vars(binoc_monoc))

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 3

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 3

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 3.2363e-17

Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at 3

Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 3

Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
number 3.2363e-17

Figure 4.1: Developmental time course of mean acuity as assessed by Teller Acuity Cards

Total number of participants.

Code

acuity_df |>
  # summarise() collapses to a single row; na.rm = TRUE skips missing counts
  dplyr::summarise(n_participants_tot = sum(n_participants, na.rm = TRUE))

# A tibble: 1 × 1
  n_participants_tot
               <dbl>
1               3991

Code

acuity_df |>
  dplyr::group_by(age_grp_rog) |>
  # summarise() returns one row per age group (mutate() repeated each row)
  dplyr::summarise(min_acuity = min(mean_acuity_cyc_deg),
                   max_acuity = max(mean_acuity_cyc_deg),
                   max_minus_min = max_acuity - min_acuity,
                   .groups = "drop") |>
  dplyr::arrange(age_grp_rog) |>
  kableExtra::kable(format='html') |>
  kableExtra::kable_classic()

age_grp_rog	min_acuity	max_acuity	max_minus_min
0.0000000	0.70	1.39	0.69
0.2307692	0.39	0.65	0.26
0.5000000	0.66	0.66	0.00
0.9230769	0.60	0.76	0.16
1.0000000	0.80	1.30	0.50
1.5000000	1.11	1.11	0.00
2.0000000	1.40	2.31	0.91
2.5000000	1.18	2.16	0.98
2.7692308	1.05	1.76	0.71
3.0000000	2.60	4.10	1.50
4.0000000	1.97	5.48	3.51
5.0000000	2.32	4.30	1.98
5.5000000	3.25	3.25	0.00
5.5384615	4.02	5.00	0.98
6.0000000	4.70	7.80	3.10
7.5000000	5.72	6.33	0.61
8.0000000	4.37	9.81	5.44
8.5000000	2.88	2.88	0.00
9.0000000	4.00	9.60	5.60
10.0000000	10.88	11.59	0.71
10.5000000	5.58	6.43	0.85
11.0000000	3.73	4.80	1.07
11.0769231	6.22	7.05	0.83
12.0000000	6.30	17.30	11.00
13.5000000	5.98	6.74	0.76
14.0000000	4.62	13.04	8.42
14.5000000	9.80	9.80	0.00
15.5000000	4.04	4.04	0.00
16.0000000	10.07	13.08	3.01
16.5000000	6.56	7.34	0.78
17.0000000	4.37	4.37	0.00
18.0000000	8.59	12.39	3.80
18.5000000	9.70	9.70	0.00
19.5000000	7.54	7.57	0.03
20.0000000	4.39	13.81	9.42
22.0000000	12.09	14.76	2.67
22.5000000	7.37	9.02	1.65
23.0000000	6.91	7.35	0.44
24.0000000	9.57	21.08	11.51
25.5000000	10.71	10.96	0.25
26.0000000	5.91	16.66	10.75
27.5000000	13.00	13.00	0.00
28.0000000	12.79	15.28	2.49
28.5000000	9.71	12.08	2.37
29.0000000	7.95	10.68	2.73
30.0000000	11.52	23.40	11.88
31.5000000	12.41	12.80	0.39
32.0000000	8.77	17.36	8.59
33.5000000	26.00	26.00	0.00
34.0000000	14.97	19.19	4.22
35.0000000	10.75	12.01	1.26
36.0000000	14.98	27.70	12.72
37.5000000	11.81	12.60	0.79
48.0000000	24.81	24.81	0.00

Code

binoc <- acuity_df |>
  dplyr::filter(binoc_monoc == "binoc")
monoc <- acuity_df |>
  dplyr::filter(binoc_monoc == "monoc")

lm_b <- lm(mean_acuity_cyc_deg ~ age_grp_rog, data = binoc)

lm_m <- lm(mean_acuity_cyc_deg ~ age_grp_rog, data = monoc)

summary(lm_b)


Call:
lm(formula = mean_acuity_cyc_deg ~ age_grp_rog, data = binoc)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0682 -1.3940 -0.5624  1.3503  9.9061 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.73305    0.46733   3.708 0.000337 ***
age_grp_rog  0.47174    0.02844  16.587  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.131 on 104 degrees of freedom
Multiple R-squared:  0.7257,    Adjusted R-squared:  0.723 
F-statistic: 275.1 on 1 and 104 DF,  p-value: < 2.2e-16

Code

summary(lm_m)


Call:
lm(formula = mean_acuity_cyc_deg ~ age_grp_rog, data = monoc)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2832 -1.5766 -0.1751  1.7396  7.1450 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.16669    0.74315   2.916  0.00538 ** 
age_grp_rog  0.34717    0.03356  10.346 8.22e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.755 on 48 degrees of freedom
Multiple R-squared:  0.6904,    Adjusted R-squared:  0.684 
F-statistic:   107 on 1 and 48 DF,  p-value: 8.215e-14
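The two separate fits above give different slopes for binocular and monocular testing, but they do not test whether that difference is reliable. One possible follow-up, sketched here on made-up toy data (not the project's values), is a single model with an age-by-condition interaction. The slopes used to generate the toy data loosely echo the estimates above; everything else is an assumption for illustration.

```r
# Toy data for illustration only; NOT the project's data
set.seed(1)
toy <- data.frame(
  age_grp_rog = rep(c(1, 3, 6, 12, 24, 36), times = 2),
  binoc_monoc = rep(c("binoc", "monoc"), each = 6)
)
toy$mean_acuity_cyc_deg <- with(
  toy,
  2 + ifelse(binoc_monoc == "binoc", 0.47, 0.35) * age_grp_rog +
    rnorm(nrow(toy), sd = 0.5)
)

# The interaction term estimates how much the monocular slope differs
# from the binocular (reference-level) slope
lm_both <- lm(mean_acuity_cyc_deg ~ age_grp_rog * binoc_monoc, data = toy)
coef(summary(lm_both))
```

In this parameterization the `age_grp_rog:binoc_monocmonoc` row is the slope difference between conditions.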

By-individual data

The Gilmore lab has some archival data that we can potentially use in this project. The following represents Rick Gilmore’s work to gather, clean, and visualize these data.

Gathering

The de-identified archival data are stored in a Google Sheet accessed by the lab Google account.

First, we must authenticate to Google to access the relevant file and download it.

Code

options(gargle_oauth_email = "psubrainlab@gmail.com")
googledrive::drive_auth()

Then we download the relevant file.

Code

googledrive::drive_download(
  "vep-session-log",
  path = file.path(params$data_dir, "by-participant.csv"),
  type = "csv",
  overwrite = TRUE
)

Auto-refreshing stale OAuth token.

File downloaded:

• 'vep-session-log' <id: 1Y9GoJU6EFUxcNmbWOLr52enpkuCyg_wSFv3pA-SWm4Q>

Saved locally as:

• 'data/csv/by-participant.csv'

Unlike the Google sheet newly created for the by-study data, this one requires a lot of cleaning.

Code

gilmore_archival_df <-
  readr::read_csv(file.path(params$data_dir, "by-participant.csv"),
                  show_col_types = FALSE)
names(gilmore_archival_df)

 [1] "Participant Number in an IRB year"  "IRB Count"                         
 [3] "Date"                               "Time"                              
 [5] "Session-Leader"                     "Sex"                               
 [7] "Participant-ID"                     "DOB"                               
 [9] "infant"                             "Stimset-name"                      
[11] "Backed up to project folder on Box" "Export-Status"                     
[13] "Snellen-Acuity"                     "Teller Acuity Cards"               
[15] "Stereo-Acuity"                      "Zipped on Drive"                   
[17] "Age at test"                        "keep-session"                      
[19] "Comments"

We’ll keep Date, Time, Sex, DOB, Teller Acuity Cards, and Age at test.

Code

gilmore_archival_df <- gilmore_archival_df |>
  dplyr::select(Date, Time, Sex, DOB, `Teller Acuity Cards`, `Age at test`)

Then, let’s keep only the rows where we have Teller Acuity Cards (TAC) data.

Code

with(gilmore_archival_df, unique(`Teller Acuity Cards`))

 [1] "20/94 6.4 cyc/deg @ 55cm"                                   
 [2] "20/190 3.1 cyc/deg @ 55cm"                                  
 [3] "20/470 1.3 cyc/deg @ 55cm"                                  
 [4] "20/380 1.6 cyc/deg @ 55cm - difficulty attending toward end"
 [5] "20/130 4.7 cyc/deg @ 55cm"                                  
 [6] NA                                                           
 [7] "not interested"                                             
 [8] "20/94 6.54 cyc/deg @ 55cm"                                  
 [9] "20/130 4/7 cyc/deg @ 55cm"                                  
[10] "20/94 6.40 cyc/deg @ 55cm"                                  
[11] "20/170 3.6 cyc/deg @84cm"                                   
[12] "20/190 3.1 cyc/deg @55cm"                                   
[13] "20/94 6.5 cyc/deg @ 55cm"

Code

gilmore_archival_df <- gilmore_archival_df |>
  dplyr::filter(!is.na(`Teller Acuity Cards`),
                `Teller Acuity Cards` != "not interested")

dim(gilmore_archival_df)

[1] 34  6

Note

This file illustrates how making data FAIR from the outset can save work.

This one is not too terribly hard to parse, but it could have been better planned.

We’ll extract the viewing distance with a regular expression.

Code

gilmore_archival_df <- gilmore_archival_df |>
  dplyr::mutate(view_dist_cm = stringr::str_extract(`Teller Acuity Cards`, "[0-9]{2}cm")) |>
  dplyr::mutate(view_dist_cm = stringr::str_remove(view_dist_cm, "cm")) # remove 'cm'
gilmore_archival_df$view_dist_cm

 [1] "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55"
[16] "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "55" "84"
[31] "55" "55" "55" "55"
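The pattern used above, "[0-9]{2}cm", works for these data, but it assumes exactly two digits written directly against "cm" and leaves the result as text. A slightly more tolerant base-R sketch is shown below, using a look-ahead so that any number of digits and an optional space before "cm" are accepted. The first two strings come from the unique values listed earlier; the last one, with a three-digit distance, is a hypothetical entry added to show the difference.

```r
tac_notes <- c(
  "20/94 6.4 cyc/deg @ 55cm",
  "20/170 3.6 cyc/deg @84cm",
  "not interested",
  "20/94 6.4 cyc/deg @ 100 cm"   # hypothetical entry, not in the sheet
)

# Digits that are followed (after optional whitespace) by "cm"
m <- regexpr("[0-9]+(?=\\s*cm)", tac_notes, perl = TRUE)

# Start from NA so entries without a distance stay missing
view_dist_cm <- rep(NA_real_, length(tac_notes))
view_dist_cm[m > 0] <- as.numeric(regmatches(tac_notes, m))
view_dist_cm
```

Converting to numeric at this step avoids a separate `as.numeric()` call before plotting or modeling.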

Similarly, we’ll extract the acuity in cyc/deg using a regular expression.

Code

gilmore_archival_df <- gilmore_archival_df |>
  # add 'cyc' to separate cyc/deg from Snellen acuity
  dplyr::mutate(acuity_cyc_deg = stringr::str_extract(`Teller Acuity Cards`, "[0-9]{1}[\\./]{1}[0-9]+ cyc")) |>
  dplyr::mutate(acuity_cyc_deg = stringr::str_remove(acuity_cyc_deg, " cyc")) |>
  dplyr::mutate(acuity_cyc_deg = stringr::str_replace(acuity_cyc_deg, "/", "."))

gilmore_archival_df$acuity_cyc_deg

 [1] "6.4"  "3.1"  "1.3"  "3.1"  "3.1"  "1.6"  "4.7"  "3.1"  "3.1"  "3.1" 
[11] "6.4"  "4.7"  "3.1"  "4.7"  "3.1"  "4.7"  "4.7"  "6.4"  "4.7"  "4.7" 
[21] "6.54" "4.7"  "4.7"  "6.40" "4.7"  "4.7"  "4.7"  "3.1"  "3.1"  "3.6" 
[31] "4.7"  "3.1"  "6.5"  "3.1"
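Because the acuity field mixes Snellen fractions, cyc/deg values, and at least one typo ("4/7" for "4.7"), it can help to wrap the extraction in a small function and check it against the awkward cases. The sketch below is a base-R variant of the steps above, not the code used in the report; `parse_acuity_cyc_deg()` is a hypothetical name, and the test strings are taken from the unique values listed earlier.

```r
# Hypothetical helper: pull the cyc/deg value out of a free-text TAC entry
parse_acuity_cyc_deg <- function(x) {
  # Number directly before " cyc"; "/" is tolerated in place of "."
  m <- regexpr("[0-9]+[./]?[0-9]* cyc", x)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > 0
  out[hit] <- regmatches(x, m)
  out <- sub(" cyc$", "", out)                  # drop the unit marker
  as.numeric(sub("/", ".", out, fixed = TRUE))  # repair "4/7" -> "4.7"
}

tricky <- c(
  "20/94 6.4 cyc/deg @ 55cm",
  "20/130 4/7 cyc/deg @ 55cm",
  "20/94 6.54 cyc/deg @ 55cm",
  "not interested",
  NA
)
parse_acuity_cyc_deg(tricky)
```

Anchoring the match on " cyc" is what keeps the Snellen fraction (for example "20/94") from being picked up instead.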

Now, let’s look at the age at test.

Code

gilmore_archival_df$`Age at test`

 [1] "0 Years, 6 Months, 25 Days" "0 Years, 5 Months, 28 Days"
 [3] "0 Years, 5 Months, 8 Days"  "0 Years, 6 Months, 14 Days"
 [5] "0 Years, 4 Months, 27 Days" "0 Years, 4 Months, 2 Days" 
 [7] "0 Years, 4 Months, 21 Days" "0 Years, 4 Months, 27 Days"
 [9] "0 Years, 3 Months, 7 Days"  "0 Years, 4 Months, 10 Days"
[11] "0 Years, 6 Months, 24 Days" "0 Years, 7 Months, 26 Days"
[13] "0 Years, 8 Months, 8 Days"  "0 Years, 8 Months, 21 Days"
[15] "0 Years, 8 Months, 2 Days"  "0 Years, 8 Months, 2 Days" 
[17] "0 Years, 7 Months, 18 Days" "0 Years, 7 Months, 27 Days"
[19] "0 Years, 8 Months, 7 Days"  "0 Years, 7 Months, 12 Days"
[21] "0 Years, 5 Months, 26 Days" "0 Years, 8 Months, 3 Days" 
[23] "0 Years, 5 Months, 18 Days" "0 Years, 6 Months, 25 Days"
[25] "0 Years, 8 Months, 14 Days" "0 Years, 4 Months, 17 Days"
[27] "0 Years, 8 Months, 7 Days"  "0 Years, 8 Months, 7 Days" 
[29] "0 Years, 8 Months, 0 Days"  "0 Years, 6 Months, 2 Days" 
[31] "0 Years, 7 Months, 29 Days" "0 Years, 7 Months, 13 Days"
[33] "0 Years, 6 Months, 5 Days"  "0 Years, 5 Months, 20 Days"

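These strings would need to be parsed before they could be used as numbers. Instead, let’s see what it looks like to compute age at test directly from the test date and the date of birth.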

Code

gilmore_archival_df <- gilmore_archival_df |>
  dplyr::mutate(age_at_test_days = lubridate::mdy(Date) - lubridate::mdy(DOB))

gilmore_archival_df$age_at_test_days

Time differences in days
 [1] 208 181 161 198 149 125 143 149  99 132 207 239 252 265 245 245 231 242 252
[20] 225 179 247 171 209 259 139 249 250 243 186 241 225 186 171

That seems reasonable for now.
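One way to check that the date arithmetic is reasonable is to compare it with the recorded "Age at test" strings. The sketch below parses the first three strings shown above into approximate days and compares them with the first three date-derived values. The conversion factors (365.25 days per year, 30.4375 days per month) and the 3-day tolerance are assumptions, and `age_string_to_days()` is a hypothetical helper.

```r
# Hypothetical helper: "0 Years, 6 Months, 25 Days" -> approximate age in days
age_string_to_days <- function(x) {
  parts <- regmatches(x, gregexpr("[0-9]+", x))  # years, months, days
  vapply(parts, function(p) {
    if (length(p) != 3) return(NA_real_)
    p <- as.numeric(p)
    p[1] * 365.25 + p[2] * 30.4375 + p[3]
  }, numeric(1))
}

# First three entries from the output above
age_strings <- c("0 Years, 6 Months, 25 Days",
                 "0 Years, 5 Months, 28 Days",
                 "0 Years, 5 Months, 8 Days")
age_from_dates <- c(208, 181, 161)

approx_days <- age_string_to_days(age_strings)
agrees <- abs(approx_days - age_from_dates) < 3  # assumed tolerance in days
agrees
```

Rows that disagree by more than a few days would be worth checking by hand for a mistyped date.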

Let’s see if we can plot these data.

Code

gilmore_archival_df |>
  dplyr::mutate(
    acuity_cyc_deg = as.numeric(acuity_cyc_deg),
    # convert <difftime> to plain numbers so ggplot2 can pick a continuous scale
    age_at_test_days = as.numeric(age_at_test_days, units = "days")
  ) |>
  ggplot() +
  aes(x = age_at_test_days, y = acuity_cyc_deg, color = Sex) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "bottom", legend.title = element_blank())

`geom_smooth()` using formula = 'y ~ x'

Figure 4.2: Individual participant Teller Acuity Card thresholds from archival Gilmore lab data

Note

Before I stop, I’m going to add the by-participant data file to a .gitignore file, just to be extra careful.
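Adding the file to .gitignore can be done by hand or from R. The sketch below is one minimal base-R way to do it without duplicating entries; `add_to_gitignore()` is a hypothetical helper, and the example writes to a temporary file rather than the project's real .gitignore. The usethis package also offers `usethis::use_git_ignore()` for the same purpose.

```r
# Hypothetical helper: append an entry to a .gitignore file if it is not there yet
add_to_gitignore <- function(entry, path = ".gitignore") {
  existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character(0)
  already_listed <- entry %in% existing
  if (!already_listed) {
    writeLines(c(existing, entry), path)
  }
  invisible(!already_listed)  # TRUE if the entry was added
}

# Demonstrate on a temporary file, not the real .gitignore
tmp <- tempfile()
add_to_gitignore("data/csv/by-participant.csv", path = tmp)
add_to_gitignore("data/csv/by-participant.csv", path = tmp)  # no-op the second time
readLines(tmp)
```

Note that .gitignore only affects untracked files; if the CSV had already been committed, it would also need to be removed from the repository's index.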

Xiang, Y., Long, E., Liu, Z., Li, X., Lin, Z., Zhu, Y., … Lin, H. (2021). Study to establish visual acuity norms with Teller Acuity Cards II for infants from southern China. Eye, 35(10), 2787–2792. https://doi.org/10.1038/s41433-020-01314-y