---
title: "Dashboard"
format:
html:
code-fold: true
code-tools: true
toc: true
params:
data_dir: "data/csv"
update_data: FALSE
use_sysenv_creds: TRUE
google_data_url: "https://docs.google.com/spreadsheets/d/1UFZkbh9oU4JHpYsrkDQcNmDyqD4J-qB74dhyMzIkqKs/edit?usp=sharing"
sheet_name: "paper_data"
data_fn: "paper-sources.csv"
---
This is a dashboard for the data collection and cleaning process.
All code is "folded" by default.
Select "Show All Code" from the menu at the upper right to reveal the code chunks.
This page was last rendered on `{r} Sys.time()`.
## Set-up
We load `ggplot2` to make the following plot commands easier to type.
```{r}
library(ggplot2)
```
## Download
The data are stored in a Google sheet that we download again if `params$update_data == TRUE`.
Otherwise, we make use of a stored data file.
```{r}
# Create the data directory if it does not already exist
if (!dir.exists(params$data_dir)) {
  message("Creating missing ", params$data_dir, ".")
  dir.create(params$data_dir)
}

project_ss <- params$google_data_url

if (params$update_data) {
  if (params$use_sysenv_creds) {
    # Look up the authorized Google account stored in .Renviron
    google_creds <- Sys.getenv("GMAIL_SURVEY")
    if (google_creds != "") {
      options(gargle_oauth_email = google_creds)
      googledrive::drive_auth()
    } else {
      message("No Google account information stored in `.Renviron`.")
      message("Add an authorized Google account name to `.Renviron` using `usethis::edit_r_environ()`.")
    }
  }
  papers_data <- googlesheets4::read_sheet(ss = project_ss,
                                           sheet = params$sheet_name)
  out_fn <- file.path(params$data_dir, params$data_fn)
  readr::write_csv(papers_data, out_fn)
  message("Data updated: ", out_fn)
} else {
  message("Using stored data.")
  papers_data <- readr::read_csv(file.path(params$data_dir, params$data_fn),
                                 show_col_types = FALSE)
}
```
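For reference, here is a minimal sketch of how the `GMAIL_SURVEY` entry gets into `.Renviron`; the address shown is a placeholder, not a real credential.

```{r}
#| eval: false
# Run once, interactively; opens .Renviron for editing (requires usethis)
usethis::edit_r_environ()

# ...then add a line like the following (placeholder address) and restart R:
# GMAIL_SURVEY=your.name@gmail.com

# The chunk above can then find the account name:
Sys.getenv("GMAIL_SURVEY")
```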
### Synced Paperpile file from GitHub
We have configured Paperpile to sync a `.bib`-formatted file directly with this repo on GitHub.
The files can be found here: `src/data/*paperpile*.bib`.
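As a quick check, we can list the synced files from R; this sketch assumes the working directory is the one that contains the `data/` folder used below.

```{r}
#| eval: false
# List the Paperpile-synced BibTeX files
list.files("data", pattern = "paperpile.*\\.bib$")
```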
We import `data/paperpile-tac-has-pdf.bib` and `data/paperpile-tac-no-pdf.bib` separately; add a variable indicating whether or not we have a PDF; and then join the two data frames.
```{r}
refs_w_pdf <- bib2df::bib2df("data/paperpile-tac-has-pdf.bib", separate_names = TRUE)
refs_w_pdf <- refs_w_pdf |>
  dplyr::mutate(pdf = TRUE)
```
We have `{r} dim(refs_w_pdf)[1]` papers with PDFs to process.
```{r}
refs_no_pdf <- bib2df::bib2df("data/paperpile-tac-no-pdf.bib", separate_names = TRUE)
refs_no_pdf <- refs_no_pdf |>
  dplyr::mutate(pdf = FALSE)
```
We have `{r} dim(refs_no_pdf)[1]` papers *without* PDFs to process.
In a separate workflow, we will try to access these papers via the PSU Libraries and other sources.
```{r}
refs_all <- dplyr::full_join(refs_w_pdf, refs_no_pdf)
```
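The `full_join()` call prints a message naming the columns it joins on. Here is a sketch of an equivalent call that names the key columns explicitly (and so silences the message), assuming we want to join on all shared columns:

```{r}
#| eval: false
# Same result as the default join, with the key made explicit
refs_all <- dplyr::full_join(refs_w_pdf, refs_no_pdf,
                             by = intersect(names(refs_w_pdf), names(refs_no_pdf)))
```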
## Clean
The author and editor fields are imported as lists.
We need to merge these into character strings to re-import the data back into Google Sheets.
```{r}
# Collapse an AUTHOR name-list data frame into one semicolon-separated string
make_author_list <- function(df) {
  unlist(df$full_name) |> paste(collapse = "; ")
}

# Like make_author_list(), but returns "" for entries with no editors
make_editor_list <- function(df) {
  if (!is.data.frame(df) || all(is.na(df$full_name))) {
    ""
  } else {
    unlist(df$full_name) |> paste(collapse = "; ")
  }
}

authors_string <- purrr::map(refs_all$AUTHOR, make_author_list) |>
  purrr::list_c()
editors_string <- purrr::map(refs_all$EDITOR, make_editor_list) |>
  purrr::list_c()

new_refs_all <- refs_all |>
  dplyr::mutate(authors = authors_string,
                editors = editors_string) |>
  dplyr::select(-c("AUTHOR", "ANNOTE", "EDITOR"))
```
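A quick, optional sanity check that the list columns were flattened as intended:

```{r}
#| eval: false
# Both new columns should now be plain character vectors,
# e.g., "Smith, Jane; Doe, John"
stopifnot(is.character(new_refs_all$authors),
          is.character(new_refs_all$editors))
```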
## Upload cleaned
We then push the cleaned data back to Google Sheets for further analysis and processing.
::: {.callout-warning}
We do not push the cleaned data back to the original sheet but to a new one to avoid overwriting data.
:::
```{r}
new_refs_all |>
  googlesheets4::sheet_write(project_ss, "from_paperpile_via_github_cleaned")
```
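To confirm the upload, we can list the tabs in the target spreadsheet and check that the new sheet name appears (a sketch, run interactively):

```{r}
#| eval: false
# The new tab should be listed alongside the original sheets
googlesheets4::sheet_names(project_ss)
```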
## Visualize
### Papers by publication date
The following plot uses the new has-pdf/no-pdf export workflow from Paperpile directly to GitHub.
```{r}
#| label: fig-papers-by-pub-year-paperpile
#| fig-cap: "Papers by publication year"
refs_all |>
  ggplot() +
  aes(x = YEAR, fill = pdf) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```
There are `{r} dim(refs_all)[1]` papers in our Paperpile. Of these `{r} sum(refs_all$pdf)` have PDFs.
## Extracted tables
This section extracts data about our progress in capturing data tables from these articles.
```{r}
# Find the Google Drive folder holding the captured table images
img_folder <- googledrive::drive_find(type = "folder", q = "name contains 'legacy'")
img_df <- googledrive::drive_ls(img_folder)

# Recover each paper's ID from the image file name
img_df <- img_df |>
  dplyr::mutate(paper_id = stringr::str_extract_all(name, "[a-zA-Z0-9]+\\-[a-z]{2}"))
```
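To illustrate what the pattern matches, here is a sketch using a made-up file name; the real file names presumably follow the same `<key>-<initials>` convention that the regular expression targets.

```{r}
#| eval: false
# Hypothetical file name; the pattern pulls out strings like "smith1975-ab"
stringr::str_extract_all("smith1975-ab-table-01.png", "[a-zA-Z0-9]+\\-[a-z]{2}")
```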
Now that we have re-extracted the `paper_id`, we can do some summaries.
```{r}
n_tables <- dim(img_df)[1]
n_papers <- length(unique(img_df$paper_id))
```
We have processed `{r} n_papers` papers and `{r} n_tables` tables as of `{r} Sys.time()`.
### Papers entered by analyst
We use the `from_paperpile_via_github` tab to keep track of our work.
So, we first import this sheet.
```{r}
papers_progress_data <- googlesheets4::read_sheet(ss = project_ss,
                                                  sheet = "from_paperpile_via_github")
```
Here is a table of the papers processed by each analyst.
```{r}
xtabs(formula = ~ open_attempt_by, data = papers_progress_data)
```
Here is a table of the number of captured figures:
```{r}
papers_progress_data |>
  dplyr::filter(!is.na(number_of_captured_figs),
                !is.na(open_attempt_by)) |>
  dplyr::group_by(open_attempt_by) |>
  dplyr::summarise(n_figs = sum(number_of_captured_figs)) |>
  knitr::kable("html")
```
<!-- ### Papers by open status -->
<!-- #### On PSU network -->
<!-- ```{r} -->
<!-- #| label: fig-openable-papers -->
<!-- #| fig-cap: "Papers with openable URLs" -->
<!-- papers_data |> -->
<!-- ggplot() + -->
<!-- aes(x = url_openable) + -->
<!-- geom_bar() -->
<!-- ``` -->
<!-- #### Access via PSU Libraries -->