Dashboard

Modified

December 4, 2024

This is a dashboard for the data collection and cleaning process.

All code is “folded” by default. Select “Show All Code” from the menu at the upper right to reveal the code chunks.

This page was last rendered on 2025-04-22 16:35:26.795147.

Set-up

We load ggplot2 to make the following plot commands easier to type.

Code

library(ggplot2)

Download

The data are stored in a Google sheet that we download again if params$update_data == TRUE. Otherwise, we make use of a stored data file.

Code

if (!dir.exists(params$data_dir)) {
  message("Creating missing ",  params$data_dir, ".")
  dir.create(params$data_dir)
}

project_ss <- params$google_data_url

if (params$update_data) {
  if (params$use_sysenv_creds) {
    google_creds <- Sys.getenv("GMAIL_SURVEY")
    if (google_creds != "") {
      options(gargle_oauth_email = google_creds)
      googledrive::drive_auth()
    } else {
      message("No Google account information stored in `.Renviron`.")
      message("Add authorized Google account name to `.Renviron` using `usethist::edit_r_environ()`.")
    }
  }

  papers_data <- googlesheets4::read_sheet(ss = project_ss,
                            sheet = params$sheet_name)
  out_fn <- file.path(params$data_dir, params$data_fn)
  readr::write_csv(papers_data, out_fn)
  message("Data updated: ", out_fn)
} else {
  message("Using stored data.")
  papers_data <- readr::read_csv(file.path(params$data_dir, params$data_fn),
                                 show_col_types = FALSE)
}

✔ Reading from "Legacy Project Acuity Data: By Paper".

✔ Range ''paper_data''.

Data updated: data/csv/paper-sources.csv

Synched Paperpile file from GitHub

We have configured Paperpile to synch a .bib formatted file directly with this repo on GitHub. The files can be found here: src/data/*paperpile*.bib.

We import data/paperpile-tac-has-pdf.bib and data/paperpile-tac-no-pdf.bib separately; add a variable indicating whether we have or do not have a PDF; then, join the two data data frames.

Code

refs_w_pdf <- bib2df::bib2df("data/paperpile-tac-has-pdf.bib", separate_names = TRUE)

Some BibTeX entries may have been dropped.
            The result could be malformed.
            Review the .bib file and make sure every single entry starts
            with a '@'.

Column `YEAR` contains character strings.
              No coercion to numeric applied.

Code

refs_w_pdf <- refs_w_pdf |>
  dplyr::mutate(pdf = TRUE)

We have 432 papers with PDFs to process.

Code

refs_no_pdf <- bib2df::bib2df("data/paperpile-tac-no-pdf.bib", separate_names = TRUE)

Some BibTeX entries may have been dropped.
            The result could be malformed.
            Review the .bib file and make sure every single entry starts
            with a '@'.

Column `YEAR` contains character strings.
              No coercion to numeric applied.

Code

refs_no_pdf <- refs_no_pdf |>
  dplyr::mutate(pdf = FALSE)

We have 421 papers without PDFs to process. In a separate workflow, we will try to access these papers via the PSU Libraries and other sources.

Code

refs_all <- dplyr::full_join(refs_w_pdf, refs_no_pdf)

Joining with `by = join_by(CATEGORY, BIBTEXKEY, ADDRESS, ANNOTE, AUTHOR,
BOOKTITLE, CHAPTER, CROSSREF, EDITION, EDITOR, HOWPUBLISHED, INSTITUTION,
JOURNAL, KEY, MONTH, NOTE, NUMBER, ORGANIZATION, PAGES, PUBLISHER, SCHOOL,
SERIES, TITLE, TYPE, VOLUME, YEAR, JOURNALTITLE, ISSUE, DATE, DOI, PMID, ISSN,
URL, LANGUAGE, KEYWORDS, PMC, URLDATE, ORIGTITLE, LOCATION, EVENTTITLE, VENUE,
ISBN, PAGETOTAL, pdf)`

Clean

The author and editor fields are imported as lists. We need to merge these into character strings to re-import the data back into Google Sheets.

Code

# Create function to change AUTHOR list to a string array
make_author_list <- function(df) {
  unlist(df$full_name) |> paste(collapse = "; ")
}

make_editor_list <- function(df) {
  if (is.na(df$full_name)) {
    ""
  } else {
  unlist(df$full_name) |> paste(collapse = "; ")    
  }
}

authors_string <- purrr::map(refs_all$AUTHOR, make_author_list) |> 
  purrr::list_c()

editors_string <- purrr::map(refs_all$EDITOR, make_author_list) |> 
  purrr::list_c()

refs_all <- refs_all |>
    dplyr::mutate(YEAR2 = stringr::str_extract(DATE, "^[0-9]{4}"))

new_refs_all <- refs_all |>
  dplyr::mutate(authors = authors_string,
                editors = editors_string) |>
  dplyr::select(-c("AUTHOR", "ANNOTE", "EDITOR"))

Upload cleaned

We the push the cleaned data back to Google Sheets for further analysis and processing.

Warning

We do not push the cleaned data back to the original sheet but to a new one to avoid overwriting data.

Code

new_refs_all |> 
  googlesheets4::sheet_write(project_ss, "from_paperpile_via_github_cleaned")

✔ Writing to "Legacy Project Acuity Data: By Paper".

✔ Writing to sheet 'from_paperpile_via_github_cleaned'.

Visualize

Papers by publication date

The following uses the new has-pdf/no-pdf export workflow from Paperpile directly to Github.

Code

refs_all |>
  dplyr::filter(!is.na(YEAR2)) |>
  ggplot() +
  aes(x = YEAR2, fill = pdf) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

There are 853 papers in our Paperpile. Of these 432 have PDFs.

Extracted tables

This section extracts data about our progress in capturing data tables from these articles.

Code

img_folder <- googledrive::drive_find(type = "folder", q = "name contains 'legacy'")

Auto-refreshing stale OAuth token.

Code

img_df <- googledrive::drive_ls(img_folder)

img_df <- img_df |>
  dplyr::mutate(paper_id = stringr::str_extract_all(name, "[a-zA-Z0-9]+\\-[a-z]{2}"))

Now that we have re-extracted the paper_id, we can do some summaries.

Code

n_tables <- dim(img_df)[1]
n_papers <- length(unique(img_df$paper_id))

We have processed 84 papers and 230 tables as of 2025-04-22 16:35:35.050994.

Papers entered by analyst

We use the from_paperpile_via_github tab to keep track of our work. So, we first import this sheet.

Code

papers_progress_data <- googlesheets4::read_sheet(ss = project_ss,
                            sheet = "from_paperpile_via_github")

✔ Reading from "Legacy Project Acuity Data: By Paper".

✔ Range ''from_paperpile_via_github''.

New names:
• `` -> `...1`
• `` -> `...16`

Here is a table of the papers processed by each analyst.

Code

xtabs(formula = ~ open_attempt_by, data = papers_progress_data)

open_attempt_by
ars bhb hal jmd kgm mro nlc rog  sh trw 
 10  73  17  27   1   2   5  12   2  43

Here is a table of the number of captured figures:

Warning

As of 2024-09-05, we do not render this table because number_of_captured_figs is a non-numeric list.

Code

# 2024-09-05 do not evaluate because number_of_captured_figs is a non-numeric list
papers_progress_data |>
  dplyr::filter(!is.na(number_of_captured_figs),
                !is.na(open_attempt_by)) |>
  dplyr::group_by(open_attempt_by) |>
  dplyr::summarise(n_figs = sum(number_of_captured_figs)) |>
  knitr::kable("html")