Google Scholar is not perfect. Some might even call it worse than that: it is also an opaque outgrowth of a for-profit tech giant that one (read ‘I’) might want to spend less time with. Even beyond such idealistic musings, the lack of polish and development makes Google Scholar’s status as a side project quite apparent. Who knows how long it’ll be around in its current, useful form.
Thankfully, there is an alternative: OpenAlex is a relatively new, open-source scholarly indexing project. It contains all the data we need to build a drop-in replacement for the Google Scholar page.
The quality of its database and author disambiguation has been
rapidly improving over recent months. By now, I believe it’s good enough
that this blog post has merit, and I am hopeful that it is going to get
even better.
1. Get your author and works objects
OpenAlex disambiguates authors using its own internal IDs. However, it also links existing ORCIDs, so you can use yours to quickly access your OpenAlex profile:
library(tidyverse)
library(jsonlite)
library(kableExtra)
library(rorcid)
library(WikidataR)
orcid <- "0000-0003-3943-1476"
full_profile <- fromJSON(paste0("https://api.openalex.org/authors/orcid:", orcid))
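The object behaves like a nested list; two fields that we will lean on later are worth a quick peek:

full_profile[["works_api_url"]] # API URL for this author's works (used next)
full_profile[["summary_stats"]][["h_index"]] # lifetime h-index (used in section 4)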
The output is a JSON object with summary information about institutions, works and citations. It also contains an API URL for all the works associated with the selected author. The default works API call returns only 25 results per page; we can increase that to a maximum of 200 by adding &per-page=200 to the call. For the excessively productive (or old, apologies) among you, a short loop returns more than 200 works:
full_works <- fromJSON(
paste0(
full_profile[["works_api_url"]], "&per-page=200&page=1&sort=publication_date:desc"
)
)
total_pages <- ceiling(full_works[["meta"]][["count"]] / 200)
if (total_pages > 1) {
for (i in 2:total_pages) { # page 1 is already loaded, so start at page 2
full_works$results <- full_works$results |>
bind_rows(
fromJSON(
paste0(
full_profile[["works_api_url"]], "&per-page=200&page=", i, "&sort=publication_date:desc"
)
)$results
)
}
}
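As an optional sanity check, the merged result count should now match the total that the API reports in the same meta field we used above; a one-liner sketch:

stopifnot(nrow(full_works$results) == full_works[["meta"]][["count"]])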
2. Listing publications
The works JSON is remarkably detailed and, thankfully, regular. Plucking out the bits we need is straightforward. The data below is pretty self-explanatory, with one exception: primary_location.version describes the “best” accessible version of the article. For pre-publication peer-reviewed articles in traditional journals it returns publishedVersion, for preprints it returns submittedVersion, allowing us to distinguish those two article types.
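If you want to see the version values in your own data first, a quick tabulation works (assuming, as the code below does, that jsonlite parsed primary_location into a data frame):

table(full_works[["results"]][["primary_location"]][["version"]], useNA = "ifany")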
works <- tibble(
authorship = full_works[["results"]][["authorships"]], # extract authors
authors_long = map(authorship, c(2, 2)), # extract author names
authors_short = map(authors_long, WikidataR::initials), # shorten first names
author_list = map(authors_short, \(x) str_c(x, collapse = ", ")), # collapse authors into a list
title = full_works[["results"]][["title"]], # extract titles
doi = full_works[["results"]][["doi"]], # extract DOIs
journal = full_works[["results"]][["primary_location"]][["source"]][["display_name"]], # extract journal names
version = full_works[["results"]][["primary_location"]][["version"]], # extract "best" accessible versions
type = case_when(version == "submittedVersion" ~ "Preprint", TRUE ~ "Journal"), # distinguish preprints
pub_year = full_works[["results"]][["publication_year"]], # extract online publication years
date = ymd(full_works[["results"]][["publication_date"]]), # extract online publication dates
cited_by = full_works[["results"]][["cited_by_count"]], # extract citation counts
cites_by_year = full_works[["results"]][["counts_by_year"]], # extract citations by year
)
Then we can paste together full citations (I leave out volume, issue
and page numbers here, but you can easily add those in the same
way):
works <- works |>
mutate(
item = paste0( # paste together the full citations
author_list, ". ",
pub_year, ". ",
title, ". ",
"<i>", journal, ".</i> ",
"[", str_remove(doi, fixed("https://doi.org/")), "](", doi, ")"
)
)
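If you do want volume, issue and page numbers, OpenAlex exposes them in each work's biblio field; a minimal sketch, assuming the field names from the OpenAlex documentation (volume, issue, first_page, last_page):

# sketch: pull bibliographic details from the "biblio" field;
# works still has one row per result here, so the columns align
biblio <- full_works[["results"]][["biblio"]]
works <- works |>
  mutate(volume = biblio$volume, issue = biblio$issue)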
In my publication list there are two theses and one correction. I’ll
filter those out.
works <- works |>
drop_na(doi) |> # drop items without a doi
arrange(date) |> # arrange by date so we drop the right items in the next line
distinct(title, journal, .keep_all = TRUE) # drop later items with duplicate names (e.g. corrections)
Now we just need to highlight the name of the selected author in each publication, sort the list and make a table:
short_name <- WikidataR::initials(full_profile[["display_name"]])
work_table <- works |>
# find the author's name (as a fixed string, not a regex) and put it in bold face
mutate(item = str_replace(item, fixed(short_name), paste0("<b>", short_name, "</b>"))) |>
arrange(desc(date)) |>
select(
"Reference" = item,
"Citations" = cited_by,
"Year" = pub_year,
type
)
Peer-reviewed
work_table |>
filter(type == "Journal") |>
select(-type) |>
kable(
escape = FALSE
) |>
kable_styling(
full_width = FALSE
)
| Reference | Citations | Year |
|---|---|---|
| E Pesquet, L Blaschek, J Takahashi, M Yamamoto, A Champagne, Nuoendagula, E Subbotina, C Dimotakis, Z Bascik, S Kajita. 2023. Bulk and In Situ Quantification of Coniferaldehyde Residues in Lignin. Methods in molecular biology. 10.1007/978-1-0716-3477-6_14 | 0 | 2023 |
| G Pedersen, L Blaschek, K Frandsen, L Noack, S Persson. 2023. Cellulose synthesis in land plants. Molecular Plant. 10.1016/j.molp.2022.12.015 | 8 | 2023 |
| L Blaschek, E Murozuka, H Serk, D Ménard, E Pesquet. 2022. Different combinations of laccase paralogs nonredundantly control the amount and composition of lignin in specific cell types and cell wall layers in Arabidopsis. The Plant Cell. 10.1093/plcell/koac344 | 12 | 2022 |
| D Ménard, L Blaschek, K Kriechbaum, C Lee, H Serk, C Zhu, A Lyubartsev, Nuoendagula, Z Bacsik, L Bergström, A Mathew, S Kajita, E Pesquet. 2022. Plant biomechanics and resilience to environmental changes are controlled by specific lignin chemistries in each vascular cell type and morphotype. The Plant Cell. 10.1093/plcell/koac284 | 7 | 2022 |
| L Blaschek, E Pesquet. 2021. Phenoloxidases in Plants—How Structural Diversity Enables Functional Specificity. Frontiers in Plant Science. 10.3389/fpls.2021.754601 | 22 | 2021 |
| M Yamamoto, L Blaschek, E Subbotina, S Kajita, E Pesquet. 2020. Importance of Lignin Coniferaldehyde Residues for Plant Properties and Sustainable Uses. ChemSusChem. 10.1002/cssc.202001242 | 11 | 2020 |
| L Blaschek, Nuoendagula, Z Bacsik, S Kajita, E Pesquet. 2020. Determining the Genetic Regulation and Coordination of Lignification in Stem Tissues of Arabidopsis Using Semiquantitative Raman Microspectroscopy. ACS Sustainable Chemistry & Engineering. 10.1021/acssuschemeng.0c00194 | 15 | 2020 |
| L Blaschek, A Champagne, C Dimotakis, Nuoendagula, R Decou, S Hishiyama, S Kratzer, S Kajita, E Pesquet. 2020. Cellular and Genetic Regulation of Coniferaldehyde Incorporation in Lignin of Herbaceous and Woody Plants by Quantitative Wiesner Staining. Frontiers in Plant Science. 10.3389/fpls.2020.00109 | 22 | 2020 |
Preprints
work_table |>
filter(type == "Preprint") |>
select(-type) |>
kable(
escape = FALSE
) |>
kable_styling(
full_width = FALSE
)
| Reference | Citations | Year |
|---|---|---|
| L Blaschek, E Murozuka, D Ménard, E Pesquet. 2022. Different combinations of laccase paralogs non-redundantly control the lignin amount and composition of specific cell types and cell wall layers in Arabidopsis. bioRxiv (Cold Spring Harbor Laboratory). 10.1101/2022.05.04.490011 | 1 | 2022 |
| D Ménard, L Blaschek, K Kriechbaum, C Lee, H Serk, C Zhu, A Lyubartsev, Nuoendagula, Z Bacsik, L Bergström, A Mathew, S Kajita, E Pesquet. 2021. Specific and dynamic lignification at the cell-type level controls plant physiology and adaptability. bioRxiv (Cold Spring Harbor Laboratory). 10.1101/2021.06.12.447240 | 3 | 2021 |
3. Plot trends
Now that we have our publication list, we can add some figures to plot output and citations over the years. Let’s start with cumulative articles by year, split into peer-reviewed articles and preprints:
work_timeline <- works |>
select(doi, pub_year, type) |>
count(pub_year, type, name = "works") |>
complete(pub_year, type, fill = list(works = 0)) |>
group_by(type) |>
arrange(pub_year) |>
mutate(works_cum = cumsum(works))
ggplot(
work_timeline,
aes(
x = pub_year,
y = works_cum,
colour = type
)
) +
geom_line() +
geom_point(
shape = 21,
size = 2,
stroke = 0.8,
fill = "white"
) +
annotate(
"text",
label = "Works — cumulative",
x = min(work_timeline$pub_year) - 0.5,
y = max(work_timeline$works_cum),
size = 14 / (14 / 5),
hjust = 0,
vjust = 0,
colour = "black"
) +
scale_colour_manual(values = c("#275d95", "#e8c245")) +
coord_cartesian(clip = "off") +
theme_minimal(base_size = 14) +
theme(
legend.title = element_blank(),
axis.title = element_blank(),
legend.position = c(0.2, 0.8),
legend.text = element_text(colour = "black")
)
And then, emulating the Google Scholar page, yearly citations:
cite_timeline <- tibble(
year = full_profile[["counts_by_year"]][["year"]],
cited_by_count = full_profile[["counts_by_year"]][["cited_by_count"]]
) |>
arrange(year)
ggplot(
cite_timeline,
aes(
x = year,
y = cited_by_count
)
) +
geom_col(
fill = "#e8c245"
) +
annotate(
"text",
label = "Citations — yearly",
x = min(cite_timeline$year) - 0.5,
y = max(cite_timeline$cited_by_count),
size = 14 / (14 / 5),
hjust = 0,
vjust = 0,
colour = "black"
) +
coord_cartesian(clip = "off") +
theme_minimal(base_size = 14) +
theme(axis.title = element_blank())
4. Metrics
Lastly, we can add some metrics. We already have our total citation counts. To pull even with the Google Scholar page, let’s also compute everything for just the last five years. We’ll need to manually calculate the h-index for that timespan, so we’ll define a little function for it:
h_index <- function(cites) {
  if (max(cites) == 0) { # no citations at all: h = 0
    return(0)
  }
  cites <- cites[order(cites, decreasing = TRUE)] # sort citation counts, highest first
  tail(which(cites >= seq_along(cites)), 1) # largest rank h with at least h citations
}
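To check that it behaves as expected: four papers cited 10, 5, 3 and 1 times give an h-index of 3, because exactly three of them have at least three citations each.

h_index(c(10, 5, 3, 1)) # 3: three works have >= 3 citations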
cites_fiveyear <- works |>
select(doi, pub_year, cites_by_year) |>
unnest(cites_by_year) |>
filter(year > year(now()) - 5) |>
group_by(doi) |>
summarise(cited_by_count = sum(cited_by_count))
h_fiveyear <- h_index(cites_fiveyear$cited_by_count)
Bonus: completed peer reviews
Going beyond both OpenAlex and Google Scholar, we can also include verified peer reviews, as long as they appear in your ORCID record. You’ll need to set up ORCID authorization first; check ?orcid_auth for a how-to.
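For reference, a minimal sketch of that setup; the token value is a hypothetical placeholder, and rorcid reads it from the ORCID_TOKEN environment variable (set it in your .Renviron for persistence):

# sketch: authorize rorcid with a personal access token
# (placeholder value; get a real token via your ORCID account)
Sys.setenv(ORCID_TOKEN = "xxxx-xxxx-xxxx-xxxx")
orcid_auth() # should print "Bearer xxxx-xxxx-xxxx-xxxx"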
full_peer_reviews <- orcid_peer_reviews(orcid)[[orcid]][["group"]][["peer-review-group"]] |>
  unlist()
peer_review_tbl <- tibble( # build the review tibble once, then summarise it twice
  review = full_peer_reviews[grepl(
    "peer-review-summary.external-ids.external-id.external-id-value", names(full_peer_reviews)
  )],
  year = as.numeric(full_peer_reviews[grepl( # unlist() yields characters, so coerce the years
    "peer-review-summary.completion-date.year.value", names(full_peer_reviews)
  )])
)
peer_reviews <- peer_review_tbl |>
  summarise(count = n())
peer_reviews_fiveyear <- peer_review_tbl |>
  filter(year > year(now()) - 5) |>
  summarise(count = n())
Finally, let’s compile all those values into one tibble and plot them
(we could use kable here, too, but I prefer the control of doing it in
ggplot2):
metrics <- tibble(
metric = ordered(
rep(c("Citations", "h index", "i10 index", "Journal articles", "Preprints", "Peer reviews"), 2),
levels = rev(c("Citations", "h index", "i10 index", "Journal articles", "Preprints", "Peer reviews"))
),
span = ordered(
c(rep(c("Total", paste0("Since ", year(now()) - 5)), each = 6)),
levels = c("Total", paste0("Since ", year(now()) - 5))
),
value = c(
# total
sum(cite_timeline$cited_by_count),
full_profile[["summary_stats"]][["h_index"]],
full_profile[["summary_stats"]][["i10_index"]],
max(work_timeline$works_cum[work_timeline$type == "Journal"]),
max(work_timeline$works_cum[work_timeline$type == "Preprint"]),
peer_reviews$count,
# five-year
sum(cite_timeline$cited_by_count[cite_timeline$year > year(now()) - 5]),
h_fiveyear,
length(cites_fiveyear$cited_by_count[cites_fiveyear$cited_by_count >= 10]),
sum(work_timeline$works[work_timeline$pub_year > year(now()) - 5 & work_timeline$type == "Journal"]), # index by the works' own years
sum(work_timeline$works[work_timeline$pub_year > year(now()) - 5 & work_timeline$type == "Preprint"]),
peer_reviews_fiveyear$count
)
)
ggplot(
metrics,
aes(
x = metric,
y = span,
label = value
)
) +
geom_text(
size = (14 / (14 / 5)),
hjust = 1,
colour = "black"
) +
geom_vline(
xintercept = c(0.5, 6.5),
colour = "black"
) +
coord_flip(
clip = "off"
) +
scale_y_discrete(position = "right", expand = expansion(add = c(0.5, 0))) +
scale_x_discrete(expand = expansion(add = 0.3)) +
theme_minimal(base_size = 14) +
theme(
axis.text.x.top = element_text(hjust = 1),
axis.title = element_blank()
)
If you have questions or comments, find me on Mastodon or shoot me an email!