Google Scholar is not perfect. Some might even call it worse:
It is also an opaque outgrowth of a for-profit tech giant that one (read ‘I’) might want to spend less time with. Even beyond such idealistic musings, the lack of polish and development of Google Scholar makes its status as a side project quite apparent. Who knows how long it’ll be around in its current, useful form?
Thankfully, there is an alternative: OpenAlex is a relatively new, open-source scholarly indexing project. It contains all the data we need to build a drop-in replacement of the Google Scholar page:
The quality of its database and author disambiguation has been rapidly improving over recent months. By now, I believe it’s good enough that this blog post has merit, and I am hopeful that it is going to get even better.
The works JSON is remarkably detailed and, thankfully, regular. Plucking out the bits we need is straightforward. The data below is pretty self-explanatory, with one exception: primary_location.version describes the “best” accessible version of the article. For pre-publication peer-reviewed articles in traditional journals, it returns publishedVersion; for preprints, it returns submittedVersion, allowing us to distinguish those two article types.
works <- tibble(
  authorship = full_works[["results"]][["authorships"]], # extract authors
  authors_long = map(authorship, c(2, 2)), # extract author names
  authors_short = map(authors_long, WikidataR::initials), # shorten first names
  author_list = map(authors_short, \(x) str_c(x, collapse = ", ")), # collapse authors into a list
  title = full_works[["results"]][["title"]], # extract titles
  doi = full_works[["results"]][["doi"]], # extract DOIs
  journal = full_works[["results"]][["primary_location"]][["source"]][["display_name"]], # extract journal names
  version = full_works[["results"]][["primary_location"]][["version"]], # extract "best" accessible versions
  type = case_when(version == "submittedVersion" ~ "Preprint", TRUE ~ "Journal"), # distinguish preprints
  pub_year = full_works[["results"]][["publication_year"]], # extract online publication years
  date = ymd(full_works[["results"]][["publication_date"]]), # extract online publication dates
  cited_by = full_works[["results"]][["cited_by_count"]], # extract citation counts
  cites_by_year = full_works[["results"]][["counts_by_year"]] # extract citations by year
)
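To see how the version field ends up as an article type, here is a minimal sketch on a made-up tibble (the three values are purely illustrative, and `toy_versions` is a name I invented for this example). It applies the same `case_when()` rule as above:

```r
library(dplyr)

# made-up version values: one published article, one preprint, one missing
toy_versions <- tibble(
  version = c("publishedVersion", "submittedVersion", NA)
) |>
  # anything that is not explicitly a submitted version counts as "Journal"
  mutate(type = case_when(version == "submittedVersion" ~ "Preprint", TRUE ~ "Journal"))

toy_versions
```

Note that a missing version falls through to "Journal" here, because `case_when()` treats an `NA` condition like `FALSE`. If your record contains items without a version, you may want to check them by hand.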
Then we can paste together full citations (I leave out volume, issue and page numbers here, but you can easily add those in the same way):
works <- works |>
  mutate(
    item = paste0( # paste together the full citations
      author_list, ". ",
      pub_year, ". ",
      title, ". ",
      "<i>", journal, ".</i> ",
      "[", str_remove(doi, fixed("https://doi.org/")), "](", doi, ")"
    )
  )
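As a quick illustration of what a pasted citation looks like, here is a small sketch for a single entry, using one of the references from the list further down. The variable names (`toy_doi`, `toy_item`) are only for this example:

```r
library(stringr)

toy_doi <- "https://doi.org/10.1016/j.molp.2022.12.015"

toy_item <- paste0(
  "G Pedersen, L Blaschek, K Frandsen, L Noack, S Persson", ". ",
  2023, ". ",
  "Cellulose synthesis in land plants", ". ",
  "<i>", "Molecular Plant", ".</i> ",
  # link text is the bare DOI, link target is the full URL
  "[", str_remove(toy_doi, fixed("https://doi.org/")), "](", toy_doi, ")"
)

toy_item
```

Wrapping the prefix in `fixed()` makes stringr match it literally, so the dots in the URL are not interpreted as regex wildcards.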
In my publication list there are two theses and one correction. I’ll filter those out.
works <- works |>
  drop_na(doi) |> # drop items without a doi
  arrange(date) |> # arrange by date so we drop the right items in the next line
  distinct(title, journal, .keep_all = TRUE) # drop later items with duplicate names (e.g. corrections)
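Why the sorting matters is easiest to see on a toy example. The sketch below uses entirely made-up records (hypothetical titles and DOIs): an article, a later correction with the same title and journal, and a thesis without a DOI.

```r
library(dplyr)
library(tidyr)

# made-up records; the correction is deliberately listed first
toy_works <- tibble(
  title = c("Paper A", "Paper A", "Thesis B"),
  journal = c("Journal X", "Journal X", NA),
  doi = c("https://doi.org/10.0000/a-corr", "https://doi.org/10.0000/a", NA),
  date = as.Date(c("2021-09-15", "2021-03-01", "2020-06-01"))
)

toy_filtered <- toy_works |>
  drop_na(doi) |> # removes the thesis
  arrange(date) |> # earliest record first
  distinct(title, journal, .keep_all = TRUE) # keeps the first row per title and journal

toy_filtered
```

`distinct()` keeps the first row it encounters for each combination, so after sorting by date the original article survives and the correction is dropped.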
Now we just need to highlight the name of the selected author in each publication, sort the list and make a table:
short_name <- WikidataR::initials(full_profile[["display_name"]])

work_table <- works |>
  # find author name and put it in bold face
  mutate(item = str_replace(item, short_name, paste0("<b>", short_name, "</b>"))) |>
  arrange(desc(date)) |>
  select(
    "Reference" = item,
    "Citations" = cited_by,
    "Year" = pub_year,
    type
  )
work_table |>
  filter(type == "Journal") |>
  select(-type) |>
  kable(
    escape = FALSE
  ) |>
  kable_styling(
    full_width = FALSE
  )
Reference | Citations | Year |
---|---|---|
E Pesquet, L Blaschek, J Takahashi, M Yamamoto, A Champagne, Nuoendagula, E Subbotina, C Dimotakis, Z Bascik, S Kajita. 2023. Bulk and In Situ Quantification of Coniferaldehyde Residues in Lignin. Methods in molecular biology. 10.1007/978-1-0716-3477-6_14 | 0 | 2023 |
G Pedersen, L Blaschek, K Frandsen, L Noack, S Persson. 2023. Cellulose synthesis in land plants. Molecular Plant. 10.1016/j.molp.2022.12.015 | 8 | 2023 |
L Blaschek, E Murozuka, H Serk, D Ménard, E Pesquet. 2022. Different combinations of laccase paralogs nonredundantly control the amount and composition of lignin in specific cell types and cell wall layers in Arabidopsis. The Plant Cell. 10.1093/plcell/koac344 | 12 | 2022 |
D Ménard, L Blaschek, K Kriechbaum, C Lee, H Serk, C Zhu, A Lyubartsev, Nuoendagula, Z Bacsik, L Bergström, A Mathew, S Kajita, E Pesquet. 2022. Plant biomechanics and resilience to environmental changes are controlled by specific lignin chemistries in each vascular cell type and morphotype. The Plant Cell. 10.1093/plcell/koac284 | 7 | 2022 |
L Blaschek, E Pesquet. 2021. Phenoloxidases in Plants—How Structural Diversity Enables Functional Specificity. Frontiers in Plant Science. 10.3389/fpls.2021.754601 | 22 | 2021 |
M Yamamoto, L Blaschek, E Subbotina, S Kajita, E Pesquet. 2020. Importance of Lignin Coniferaldehyde Residues for Plant Properties and Sustainable Uses. ChemSusChem. 10.1002/cssc.202001242 | 11 | 2020 |
L Blaschek, Nuoendagula, Z Bacsik, S Kajita, E Pesquet. 2020. Determining the Genetic Regulation and Coordination of Lignification in Stem Tissues of Arabidopsis Using Semiquantitative Raman Microspectroscopy. ACS Sustainable Chemistry & Engineering. 10.1021/acssuschemeng.0c00194 | 15 | 2020 |
L Blaschek, A Champagne, C Dimotakis, Nuoendagula, R Decou, S Hishiyama, S Kratzer, S Kajita, E Pesquet. 2020. Cellular and Genetic Regulation of Coniferaldehyde Incorporation in Lignin of Herbaceous and Woody Plants by Quantitative Wiesner Staining. Frontiers in Plant Science. 10.3389/fpls.2020.00109 | 22 | 2020 |
work_table |>
  filter(type == "Preprint") |>
  select(-type) |>
  kable(
    escape = FALSE
  ) |>
  kable_styling(
    full_width = FALSE
  )
Reference | Citations | Year |
---|---|---|
L Blaschek, E Murozuka, D Ménard, E Pesquet. 2022. Different combinations of laccase paralogs non-redundantly control the lignin amount and composition of specific cell types and cell wall layers in Arabidopsis. bioRxiv (Cold Spring Harbor Laboratory). 10.1101/2022.05.04.490011 | 1 | 2022 |
D Ménard, L Blaschek, K Kriechbaum, C Lee, H Serk, C Zhu, A Lyubartsev, Nuoendagula, Z Bacsik, L Bergström, A Mathew, S Kajita, E Pesquet. 2021. Specific and dynamic lignification at the cell-type level controls plant physiology and adaptability. bioRxiv (Cold Spring Harbor Laboratory). 10.1101/2021.06.12.447240 | 3 | 2021 |
Now that we have our publication list, we can add some figures to plot output and citations over the years. Let’s start with cumulative articles by year, split into peer-reviewed articles and preprints:
work_timeline <- works |>
  select(doi, pub_year, type) |>
  count(pub_year, type, name = "works") |>
  complete(pub_year, type, fill = list(works = 0)) |>
  group_by(type) |>
  arrange(pub_year) |>
  mutate(works_cum = cumsum(works))

ggplot(
  work_timeline,
  aes(
    x = pub_year,
    y = works_cum,
    colour = type
  )
) +
  geom_line() +
  geom_point(
    shape = 21,
    size = 2,
    stroke = 0.8,
    fill = "white"
  ) +
  annotate(
    "text",
    label = "Works — cumulative",
    x = min(work_timeline$pub_year) - 0.5,
    y = max(work_timeline$works_cum),
    size = 14 / (14 / 5),
    hjust = 0,
    vjust = 0,
    colour = "black"
  ) +
  scale_colour_manual(values = c("#275d95", "#e8c245")) +
  coord_cartesian(clip = "off") +
  theme_minimal(base_size = 14) +
  theme(
    legend.title = element_blank(),
    axis.title = element_blank(),
    legend.position = c(0.2, 0.8),
    legend.text = element_text(colour = "black")
  )
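The `complete()` step in the timeline is easy to overlook, so here is a minimal sketch of what it does, on made-up data (four hypothetical works; `toy` and `toy_timeline` are names invented for this example):

```r
library(dplyr)
library(tidyr)

# made-up works: two journal articles in 2020, a preprint in 2021, an article in 2022
toy <- tibble(
  pub_year = c(2020, 2020, 2021, 2022),
  type = c("Journal", "Journal", "Preprint", "Journal")
)

toy_timeline <- toy |>
  count(pub_year, type, name = "works") |>
  # add the year/type combinations that have no works, with a count of 0
  complete(pub_year, type, fill = list(works = 0)) |>
  group_by(type) |>
  arrange(pub_year) |>
  mutate(works_cum = cumsum(works)) |>
  ungroup()

toy_timeline
```

Without `complete()`, the journal line would have no point for 2021 and the preprint line none for 2020 and 2022. Keep in mind that `complete()` only combines values that already occur in the data, so a year without any works at all would still be missing from the plot.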
And then, emulating the Google Scholar page, yearly citations:
cite_timeline <- tibble(
  year = full_profile[["counts_by_year"]][["year"]],
  cited_by_count = full_profile[["counts_by_year"]][["cited_by_count"]]
) |>
  arrange(year)

ggplot(
  cite_timeline,
  aes(
    x = year,
    y = cited_by_count
  )
) +
  geom_col(
    fill = "#e8c245"
  ) +
  annotate(
    "text",
    label = "Citations — yearly",
    x = min(cite_timeline$year) - 0.5,
    y = max(cite_timeline$cited_by_count),
    size = 14 / (14 / 5),
    hjust = 0,
    vjust = 0,
    colour = "black"
  ) +
  coord_cartesian(clip = "off") +
  theme_minimal(base_size = 14) +
  theme(axis.title = element_blank())
Lastly, we can add some metrics. We already have our total citation counts. To pull even with the Google Scholar page, let’s add the data within the last five years. We’ll need to manually calculate the h-index in that timespan, so we’ll define a little function for that:
h_index <- function(cites) {
  if (max(cites) == 0) {
    return(0)
  }
  cites <- cites[order(cites, decreasing = TRUE)]
  tail(which(cites >= seq_along(cites)), 1)
}

cites_fiveyear <- works |>
  select(doi, pub_year, cites_by_year) |>
  unnest(cites_by_year) |>
  filter(year > year(now()) - 5) |>
  group_by(doi) |>
  summarise(cited_by_count = sum(cited_by_count))

h_fiveyear <- h_index(cites_fiveyear$cited_by_count)
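As a sanity check of the h-index logic, here is a small self-contained sketch with made-up citation counts (the numbers are purely illustrative, not my actual record). It repeats the function so that it runs on its own:

```r
# h-index: the largest h such that h works have at least h citations each
h_index <- function(cites) {
  if (max(cites) == 0) {
    return(0)
  }
  cites <- cites[order(cites, decreasing = TRUE)]
  tail(which(cites >= seq_along(cites)), 1)
}

# made-up citation counts for five works
example_cites <- c(3, 10, 4, 8, 5)

# sorted: 10, 8, 5, 4, 3; the fourth work still has >= 4 citations, the fifth has < 5
h_example <- h_index(example_cites)

# i10 index on the same toy data: number of works with at least 10 citations
i10_example <- sum(example_cites >= 10)
```

With these numbers, `h_example` is 4 and `i10_example` is 1. The early return covers the case where nothing has been cited yet, for which `which()` would otherwise return an empty vector.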
Going beyond both OpenAlex and Google Scholar, we can also include verified peer reviews, as long as they appear in your ORCID. You’ll need to set up ORCID authorization; check ?orcid_auth for a how-to.
full_peer_reviews <- orcid_peer_reviews(orcid)[[orcid]][["group"]][["peer-review-group"]] |>
  unlist()

peer_reviews <- tibble(
  review = full_peer_reviews[grepl(
    "peer-review-summary.external-ids.external-id.external-id-value", names(full_peer_reviews)
  )],
  year = full_peer_reviews[grepl(
    "peer-review-summary.completion-date.year.value", names(full_peer_reviews)
  )]
) |>
  summarise(count = n())

peer_reviews_fiveyear <- tibble(
  review = full_peer_reviews[grepl(
    "peer-review-summary.external-ids.external-id.external-id-value", names(full_peer_reviews)
  )],
  year = full_peer_reviews[grepl(
    "peer-review-summary.completion-date.year.value", names(full_peer_reviews)
  )]
) |>
  filter(year > year(now()) - 5) |>
  summarise(count = n())
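The `unlist()` and `grepl()` combination may look cryptic, so here is a sketch of the idea on a hypothetical, heavily trimmed stand-in for the ORCID response. The real object is larger and may be shaped differently; only the element names are taken from the patterns used above.

```r
# hypothetical stand-in for two peer-review records
toy_reviews <- list(
  list(`peer-review-summary` = list(
    `completion-date` = list(year = list(value = "2019")),
    `external-ids` = list(`external-id` = list(`external-id-value` = "review-1"))
  )),
  list(`peer-review-summary` = list(
    `completion-date` = list(year = list(value = "2023")),
    `external-ids` = list(`external-id` = list(`external-id-value` = "review-2"))
  ))
)

# unlist() flattens the nesting and joins the element names with dots
toy_flat <- unlist(toy_reviews)
names(toy_flat)

# pick out all values whose flattened name matches the year path
toy_years <- toy_flat[grepl("peer-review-summary.completion-date.year.value", names(toy_flat))]

# unlist() returns character values, so convert before doing arithmetic
toy_years_int <- as.integer(toy_years)
```

Because everything comes back as character, an explicit `as.integer()` is the safer choice whenever the years are compared against numbers.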
Finally, let’s compile all those values into one tibble and plot them (we could use kable here, too, but I prefer the control of doing it in ggplot2):
metrics <- tibble(
  metric = ordered(
    rep(c("Citations", "h index", "i10 index", "Journal articles", "Preprints", "Peer reviews"), 2),
    levels = rev(c("Citations", "h index", "i10 index", "Journal articles", "Preprints", "Peer reviews"))
  ),
  span = ordered(
    c(rep(c("Total", paste0("Since ", year(now()) - 5)), each = 6)),
    levels = c("Total", paste0("Since ", year(now()) - 5))
  ),
  value = c(
    # total
    sum(cite_timeline$cited_by_count),
    full_profile[["summary_stats"]][["h_index"]],
    full_profile[["summary_stats"]][["i10_index"]],
    max(work_timeline$works_cum[work_timeline$type == "Journal"]),
    max(work_timeline$works_cum[work_timeline$type == "Preprint"]),
    peer_reviews$count,
    # five-year
    sum(cite_timeline$cited_by_count[cite_timeline$year > year(now()) - 5]),
    h_fiveyear,
    length(cites_fiveyear$cited_by_count[cites_fiveyear$cited_by_count >= 10]),
    # index work_timeline by its own year column, not by cite_timeline$year
    sum(work_timeline$works[work_timeline$pub_year > year(now()) - 5 & work_timeline$type == "Journal"]),
    sum(work_timeline$works[work_timeline$pub_year > year(now()) - 5 & work_timeline$type == "Preprint"]),
    peer_reviews_fiveyear$count
  )
)
ggplot(
  metrics,
  aes(
    x = metric,
    y = span,
    label = value
  )
) +
  geom_text(
    size = (14 / (14 / 5)),
    hjust = 1,
    colour = "black"
  ) +
  geom_vline(
    xintercept = c(0.5, 6.5),
    colour = "black"
  ) +
  coord_flip(
    clip = "off"
  ) +
  scale_y_discrete(position = "right", expand = expansion(add = c(0.5, 0))) +
  scale_x_discrete(expand = expansion(add = 0.3)) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x.top = element_text(hjust = 1),
    axis.title = element_blank()
  )
If you have questions or comments, find me on Mastodon or shoot me a mail!