This is a dashboard for the data collection and cleaning process.

All code is “folded” by default. Select “Show All Code” from the menu at the upper right to reveal the code chunks.

This page was last rendered on 2024-04-18 11:25:27.935212.

Set-up

We load ggplot2 so that the plotting commands below can be written without the package prefix.

Code
library(ggplot2)

Download

The data are stored in a Google Sheet, which we re-download if params$update_data == TRUE. Otherwise, we use a locally stored data file.
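
This report is parameterized: params is supplied by the document's YAML header. A minimal sketch of the shape the code below assumes (field names come from the code; the values shown are illustrative placeholders, not the project's actual settings):

Code
params <- list(
  update_data      = FALSE,          # re-download from Google Sheets?
  use_sysenv_creds = TRUE,           # look up the Google account in .Renviron?
  data_dir         = "data",         # directory holding the CSV snapshot
  data_fn          = "papers.csv",   # placeholder file name
  google_data_url  = "<google-sheet-url-or-id>",  # placeholder
  sheet_name       = "<sheet-tab-name>"           # placeholder
)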

Code
if (!dir.exists(params$data_dir)) {
  message("Creating missing ",  params$data_dir, ".")
  dir.create(params$data_dir)
}

project_ss <- params$google_data_url

if (params$update_data) {
  if (params$use_sysenv_creds) {
    google_creds <- Sys.getenv("GMAIL_SURVEY")
    if (google_creds != "") {
      options(gargle_oauth_email = google_creds)
      googledrive::drive_auth()
    } else {
      message("No Google account information stored in `.Renviron`.")
      message("Add authorized Google account name to `.Renviron` using `usethist::edit_r_environ()`.")
    }
  }

  papers_data <- googlesheets4::read_sheet(ss = project_ss,
                                           sheet = params$sheet_name)
  out_fn <- file.path(params$data_dir, params$data_fn)
  readr::write_csv(papers_data, out_fn)
  message("Data updated: ", out_fn)
} else {
  message("Using stored data.")
  papers_data <- readr::read_csv(file.path(params$data_dir, params$data_fn),
                                 show_col_types = FALSE)
}
Using stored data.
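
For the authenticated branch above, GMAIL_SURVEY must name an authorized Google account. A minimal sketch of the setup this code assumes; the address is a placeholder, not a real credential:

Code
# .Renviron -- open with usethis::edit_r_environ() and add a line like:
# GMAIL_SURVEY=lab.account@gmail.com
Sys.getenv("GMAIL_SURVEY")  # after restarting R, this should return the account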

Synced Paperpile file from GitHub

We have configured Paperpile to sync .bib-formatted files directly with this repo on GitHub. The files can be found at src/data/*paperpile*.bib.
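
As a quick check that the synced files are present locally, one can list them; this sketch assumes the working directory resolves them under data/, as in the import commands below:

Code
list.files("data", pattern = "paperpile.*\\.bib$", full.names = TRUE)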

We import data/paperpile-tac-has-pdf.bib and data/paperpile-tac-no-pdf.bib separately; add a variable indicating whether or not we have a PDF; then join the two data frames.

Code
refs_w_pdf <- bib2df::bib2df("data/paperpile-tac-has-pdf.bib", separate_names = TRUE)
Some BibTeX entries may have been dropped.
            The result could be malformed.
            Review the .bib file and make sure every single entry starts
            with a '@'.
Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
  tibble, or `as.data.frame()` to convert to a data frame.
ℹ The deprecated feature was likely used in the bib2df package.
  Please report the issue to the authors.
Column `YEAR` contains character strings.
              No coercion to numeric applied.
Code
refs_w_pdf <- refs_w_pdf |>
  dplyr::mutate(pdf = TRUE)

We have 314 papers with PDFs to process.

Code
refs_no_pdf <- bib2df::bib2df("data/paperpile-tac-no-pdf.bib", separate_names = TRUE)
Some BibTeX entries may have been dropped.
            The result could be malformed.
            Review the .bib file and make sure every single entry starts
            with a '@'.
Column `YEAR` contains character strings.
              No coercion to numeric applied.
Code
refs_no_pdf <- refs_no_pdf |>
  dplyr::mutate(pdf = FALSE)

We have 438 papers without PDFs to process. In a separate workflow, we will try to access these papers via the PSU Libraries and other sources.

Code
refs_all <- dplyr::full_join(refs_w_pdf, refs_no_pdf)
Joining with `by = join_by(CATEGORY, BIBTEXKEY, ADDRESS, ANNOTE, AUTHOR,
BOOKTITLE, CHAPTER, CROSSREF, EDITION, EDITOR, HOWPUBLISHED, INSTITUTION,
JOURNAL, KEY, MONTH, NOTE, NUMBER, ORGANIZATION, PAGES, PUBLISHER, SCHOOL,
SERIES, TITLE, TYPE, VOLUME, YEAR, URL, KEYWORDS, LANGUAGE, ISSN, PMID, DOI,
PMC, COPYRIGHT, ISBN, pdf)`

Clean

The author and editor fields are imported as lists. We need to collapse these into character strings before writing the data back to Google Sheets.

Code
# Collapse an AUTHOR entry (a data frame of parsed names) into a single
# string, with names separated by "; ".
make_author_list <- function(df) {
  unlist(df$full_name) |> paste(collapse = "; ")
}

# Editors are often absent, so guard against NULL or all-NA entries.
make_editor_list <- function(df) {
  if (is.null(df) || all(is.na(df$full_name))) {
    ""
  } else {
    unlist(df$full_name) |> paste(collapse = "; ")
  }
}

authors_string <- purrr::map(refs_all$AUTHOR, make_author_list) |>
  purrr::list_c()

editors_string <- purrr::map(refs_all$EDITOR, make_editor_list) |>
  purrr::list_c()

new_refs_all <- refs_all |>
  dplyr::mutate(authors = authors_string,
                editors = editors_string) |>
  dplyr::select(-c("AUTHOR", "ANNOTE", "EDITOR"))

Upload cleaned data

We then push the cleaned data back to Google Sheets for further analysis and processing.

Warning

To avoid overwriting data, we push the cleaned data to a new sheet rather than back to the original one.
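
A cheap additional safeguard, sketched here, is to assert that the target tab differs from the tab we read from before writing; googlesheets4::sheet_names() lists a spreadsheet's existing tabs:

Code
stopifnot(params$sheet_name != "from_paperpile_via_github_cleaned")
googlesheets4::sheet_names(project_ss)  # inspect existing tabs before writing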

Code
new_refs_all |> 
  googlesheets4::sheet_write(project_ss, "from_paperpile_via_github_cleaned")
! Using an auto-discovered, cached token.
  To suppress this message, modify your code or options to clearly consent to
  the use of a cached token.
  See gargle's "Non-interactive auth" vignette for more details:
  <https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for
  'rick.o.gilmore@gmail.com'.
✔ Writing to "Legacy Project Acuity Data: By Paper".
✔ Writing to sheet 'from_paperpile_via_github_cleaned'.

Visualize

Papers by publication date

The following plot uses the new has-pdf/no-pdf export workflow that syncs Paperpile directly to GitHub.

Code
refs_all |>
  ggplot(aes(x = YEAR, fill = pdf)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Figure 1: Papers by publication year

There are 752 papers in our Paperpile. Of these, 314 have PDFs.
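
These counts presumably come straight from the joined table; a minimal sketch of how to compute them:

Code
nrow(refs_all)     # total papers: 752
sum(refs_all$pdf)  # papers with a PDF: 314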

Extracted tables

This section summarizes our progress in capturing data tables from these articles.

Code
img_folder <- googledrive::drive_find(type = "folder", q = "name contains 'legacy'")
ℹ Suitable tokens found in the cache, associated with these emails:
• 'psubrainlab@gmail.com'
• 'rick.o.gilmore@gmail.com'
  Defaulting to the first email.
! Using an auto-discovered, cached token.
  To suppress this message, modify your code or options to clearly consent to
  the use of a cached token.
  See gargle's "Non-interactive auth" vignette for more details:
  <https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googledrive package is using a cached token for 'psubrainlab@gmail.com'.
Code
img_df <- googledrive::drive_ls(img_folder)

# Recover the paper_id embedded in each image file name
img_df <- img_df |>
  dplyr::mutate(paper_id = stringr::str_extract(name, "[a-zA-Z0-9]+\\-[a-z]{2}"))
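
To see what the regular expression recovers, here it is applied to a hypothetical file name:

Code
stringr::str_extract("wright1975-aa_table_01.png", "[a-zA-Z0-9]+\\-[a-z]{2}")
#> [1] "wright1975-aa"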

Now that we have recovered the paper_id from each file name, we can compute some summaries.

Code
n_tables <- nrow(img_df)                     # each image file is one captured table
n_papers <- length(unique(img_df$paper_id))  # distinct papers represented

We have processed 60 papers and 159 tables as of 2024-04-18 11:25:35.731348.

Papers entered by analyst

We use the from_paperpile_via_github tab to keep track of our work, so we first import this sheet.

Code
papers_progress_data <- googlesheets4::read_sheet(ss = project_ss,
                                                  sheet = "from_paperpile_via_github")
✔ Reading from "Legacy Project Acuity Data: By Paper".
✔ Range ''from_paperpile_via_github''.

Here is a table of the papers processed by each analyst.

Code
xtabs(formula = ~ open_attempt_by, data = papers_progress_data)
open_attempt_by
bhb jmd nlc 
 53  20   4 

Here is a table of the number of captured figures:

Code
papers_progress_data |>
  dplyr::filter(!is.na(number_of_captured_figs),
                !is.na(open_attempt_by)) |>
  dplyr::group_by(open_attempt_by) |>
  dplyr::summarise(n_figs = sum(number_of_captured_figs)) |>
  knitr::kable("html")
open_attempt_by n_figs
bbb                110
jmd                 50
nlc                  1