This is a dashboard for the data collection and cleaning process.
All code is “folded” by default. Select “Show All Code” from the menu at the upper right to reveal the code chunks.
This page was last rendered on 2024-12-04 15:01:48.706818.
We load ggplot2 to make the following plot commands easier to type.
library(ggplot2)
The data are stored in a Google Sheet that we download again if params$update_data == TRUE. Otherwise, we make use of a stored data file.
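For reference, the chunk below relies on a set of document parameters (params). A hypothetical list along these lines, with illustrative values only (the project's actual defaults live in the document's YAML header and may differ), can be defined when running the chunks interactively outside of rendering:

# Illustrative parameter values only; not the project's actual defaults
params <- list(
  update_data = FALSE,
  use_sysenv_creds = FALSE,
  google_data_url = "https://docs.google.com/spreadsheets/d/<sheet-id>",
  sheet_name = "paper_data",
  data_dir = "data/csv",
  data_fn = "paper-sources.csv"
)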
if (!dir.exists(params$data_dir)) {
  message("Creating missing ", params$data_dir, ".")
  dir.create(params$data_dir)
}

project_ss <- params$google_data_url

if (params$update_data) {
  if (params$use_sysenv_creds) {
    google_creds <- Sys.getenv("GMAIL_SURVEY")
    if (google_creds != "") {
      options(gargle_oauth_email = google_creds)
      googledrive::drive_auth()
    } else {
      message("No Google account information stored in `.Renviron`.")
      message("Add authorized Google account name to `.Renviron` using `usethis::edit_r_environ()`.")
    }
  }

  papers_data <- googlesheets4::read_sheet(ss = project_ss,
                                           sheet = params$sheet_name)
  out_fn <- file.path(params$data_dir, params$data_fn)
  readr::write_csv(papers_data, out_fn)
  message("Data updated: ", out_fn)
} else {
  message("Using stored data.")
  papers_data <- readr::read_csv(file.path(params$data_dir, params$data_fn),
                                 show_col_types = FALSE)
}
✔ Reading from "Legacy Project Acuity Data: By Paper".
✔ Range 'paper_data'.
Data updated: data/csv/paper-sources.csv
We have configured Paperpile to sync a .bib-formatted file directly with this repo on GitHub. The files can be found here: src/data/*paperpile*.bib.
We import data/paperpile-tac-has-pdf.bib and data/paperpile-tac-no-pdf.bib separately, add a variable indicating whether or not we have a PDF, and then join the two data frames.
<- bib2df::bib2df("data/paperpile-tac-has-pdf.bib", separate_names = TRUE) refs_w_pdf
Some BibTeX entries may have been dropped.
The result could be malformed.
Review the .bib file and make sure every single entry starts
with a '@'.
Column `YEAR` contains character strings.
No coercion to numeric applied.
refs_w_pdf <- refs_w_pdf |>
  dplyr::mutate(pdf = TRUE)
We have 316 papers with PDFs to process.
<- bib2df::bib2df("data/paperpile-tac-no-pdf.bib", separate_names = TRUE) refs_no_pdf
Some BibTeX entries may have been dropped.
The result could be malformed.
Review the .bib file and make sure every single entry starts
with a '@'.
Column `YEAR` contains character strings.
No coercion to numeric applied.
refs_no_pdf <- refs_no_pdf |>
  dplyr::mutate(pdf = FALSE)
We have 429 papers without PDFs to process. In a separate workflow, we will try to access these papers via the PSU Libraries and other sources.
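The import warnings above recommend checking that every entry in the .bib files starts with '@'. As a point of reference, a minimal well-formed entry (hypothetical, not taken from the project files) parses cleanly:

# Write a tiny, hypothetical BibTeX entry to a temporary file and parse it;
# every entry must begin with '@' for bib2df to keep it
tmp_bib <- tempfile(fileext = ".bib")
writeLines(c(
  "@article{example2024,",
  "  author = {Doe, Jane and Roe, Richard},",
  "  title = {An Example Title},",
  "  journal = {Example Journal},",
  "  year = {2024}",
  "}"
), tmp_bib)
bib2df::bib2df(tmp_bib, separate_names = TRUE)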
refs_all <- dplyr::full_join(refs_w_pdf, refs_no_pdf)
Joining with `by = join_by(CATEGORY, BIBTEXKEY, ADDRESS, ANNOTE, AUTHOR,
BOOKTITLE, CHAPTER, CROSSREF, EDITION, EDITOR, HOWPUBLISHED, INSTITUTION,
JOURNAL, KEY, MONTH, NOTE, NUMBER, ORGANIZATION, PAGES, PUBLISHER, SCHOOL,
SERIES, TITLE, TYPE, VOLUME, YEAR, JOURNALTITLE, ISSUE, DATE, DOI, PMID, ISSN,
URL, LANGUAGE, URLDATE, KEYWORDS, PMC, LOCATION, ISBN, ORIGTITLE, pdf)`
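Because the two data frames share all of their columns, including pdf, the join keys listed above cover every field, so the full join effectively stacks the two reference sets rather than matching rows. A toy example (made-up keys) illustrates the behavior:

# Toy data: all shared columns (including pdf) are join keys, so no rows match
# and the full join returns the union of both sets
a <- data.frame(BIBTEXKEY = c("x1", "x2"), pdf = TRUE)
b <- data.frame(BIBTEXKEY = "x3", pdf = FALSE)
dplyr::full_join(a, b, by = c("BIBTEXKEY", "pdf"))
#>   BIBTEXKEY   pdf
#> 1        x1  TRUE
#> 2        x2  TRUE
#> 3        x3 FALSE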
The author and editor fields are imported as lists. We need to merge these into character strings to re-import the data back into Google Sheets.
# Create functions to collapse the AUTHOR and EDITOR name lists into single strings
make_author_list <- function(df) {
  unlist(df$full_name) |> paste(collapse = "; ")
}

make_editor_list <- function(df) {
  if (is.na(df$full_name)) {
    ""
  } else {
    unlist(df$full_name) |> paste(collapse = "; ")
  }
}
authors_string <- purrr::map(refs_all$AUTHOR, make_author_list) |>
  purrr::list_c()

editors_string <- purrr::map(refs_all$EDITOR, make_author_list) |>
  purrr::list_c()

refs_all <- refs_all |>
  dplyr::mutate(YEAR2 = stringr::str_extract(DATE, "^[0-9]{4}"))

new_refs_all <- refs_all |>
  dplyr::mutate(authors = authors_string,
                editors = editors_string) |>
  dplyr::select(-c("AUTHOR", "ANNOTE", "EDITOR"))
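To illustrate, a hypothetical AUTHOR entry, structured with the full_name column the helper expects, collapses to a single semicolon-separated string; the names below are invented for the example:

# Hypothetical parsed-names data frame, as produced with separate_names = TRUE
example_authors <- data.frame(
  full_name = c("Jane Doe", "Richard Roe")
)
make_author_list(example_authors)
#> [1] "Jane Doe; Richard Roe"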
We then push the cleaned data back to Google Sheets for further analysis and processing. We do not push to the original sheet but to a new one, to avoid overwriting data.
new_refs_all |>
  googlesheets4::sheet_write(project_ss, "from_paperpile_via_github_cleaned")
✔ Writing to "Legacy Project Acuity Data: By Paper".
✔ Writing to sheet 'from_paperpile_via_github_cleaned'.
The following plot uses the new has-pdf/no-pdf export workflow from Paperpile directly to GitHub.
refs_all |>
  dplyr::filter(!is.na(YEAR2)) |>
  ggplot() +
  aes(x = YEAR2, fill = pdf) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
There are 745 papers in our Paperpile. Of these, 316 have PDFs.
This section extracts data about our progress in capturing data tables from these articles.
img_folder <- googledrive::drive_find(type = "folder", q = "name contains 'legacy'")
img_df <- googledrive::drive_ls(img_folder)

img_df <- img_df |>
  dplyr::mutate(paper_id = stringr::str_extract_all(name, "[a-zA-Z0-9]+\\-[a-z]{2}"))
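As a quick check of the paper_id pattern, applying it to a hypothetical image-file name (not an actual file in the Drive folder) pulls out the identifier:

# Hypothetical file name; the pattern grabs the alphanumeric stem plus the
# two lowercase letters after the hyphen
stringr::str_extract_all("snellen1862-aa_table01.png", "[a-zA-Z0-9]+\\-[a-z]{2}")
#> [[1]]
#> [1] "snellen1862-aa"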
Now that we have re-extracted the paper_id, we can do some summaries.
n_tables <- dim(img_df)[1]
n_papers <- length(unique(img_df$paper_id))
We have processed 84 papers and 230 tables as of 2024-12-04 15:01:56.531004.
We use the from_paperpile_via_github tab to keep track of our work. So, we first import this sheet.
papers_progress_data <- googlesheets4::read_sheet(ss = project_ss,
                                                  sheet = "from_paperpile_via_github")
✔ Reading from "Legacy Project Acuity Data: By Paper".
✔ Range 'from_paperpile_via_github'.
New names:
• `` -> `...1`
Here is a table of the papers processed by each analyst.
xtabs(formula = ~ open_attempt_by, data = papers_progress_data)
open_attempt_by
ars bhb hal jmd nlc sh trw
 10  74  13  27   6  1  32
Here is a table of the number of captured figures. As of 2024-09-05, we do not render this table because number_of_captured_figs is a non-numeric list.
# 2024-09-05 do not evaluate because number_of_captured_figs is a non-numeric list
papers_progress_data |>
  dplyr::filter(!is.na(number_of_captured_figs),
                !is.na(open_attempt_by)) |>
  dplyr::group_by(open_attempt_by) |>
  dplyr::summarise(n_figs = sum(number_of_captured_figs)) |>
  knitr::kable("html")