28 changes: 28 additions & 0 deletions .github/workflows/draft-pdf.yml
@@ -0,0 +1,28 @@
name: Draft PDF
on:
  push:
    paths:
      - paper/**
      - .github/workflows/draft-pdf.yml

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: paper.md
      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper.pdf
24 changes: 24 additions & 0 deletions paper.bib
@@ -0,0 +1,24 @@
@article{Filazzola:2022,
title = {A call for clean code to effectively communicate science},
volume = {13},
url = {https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13961},
doi = {10.1111/2041-210X.13961},
abstract = {Abstract Effective coding is fundamental to the study of biology. Computation underpins most research, and reproducible science can be promoted through clean coding practices. Clean coding is crafting code design, syntax and nomenclature in a manner that maximizes the potential to communicate its intent with other scientists. However, computational biologists are not software engineers, and many of our coding practices have developed ad hoc without formal training, often creating difficult-to-read code for others. Hard-to-understand code can thus be limiting our efficiency and ability to communicate as scientists with one another. The purpose of this paper is to provide a primer on some of the practices associated with crafting clean code by synthesizing a transformative text in software engineering along with recent articles on coding practices in computational biology. We review past recommendations to provide a series of best practices that transform coding into a human-accessible form of communication. Three common themes shared in this synthesis are the following: (a) code has value and you are responsible for its organization to enable clear communication, (b) use a formatting style to guide writing code that is easily understandable and consistent and (c) apply abstraction to emphasize important elements and declutter. While many of the provided practices and recommendations were developed with computational biologists in mind, we believe there is wider applicability to any biologist undertaking work in data management or statistical analyses. Clean code is thus a crucial step forward in resolving some of the crisis in reproducibility for science.},
number = {10},
journal = {Methods in Ecology and Evolution},
author = {Filazzola, Alessandro and Lortie, CJ},
year = {2022},
note = {\_eprint: https://besjournals.onlinelibrary.wiley.com/doi/pdf/10.1111/2041-210X.13961},
keywords = {open science, principles, programming, replication, reproducibility, science communication, transparency},
pages = {2119--2128},
}

@misc{Wilson:2021,
title = {Task {Interruption} in {Software} {Development} {Projects}},
url = {https://neverworkintheory.org/2021/08/09/task-interruption-in-software-development-projects.html},
urldate = {2025-05-04},
journal = {It Will Never Work in Theory},
author = {Wilson, Greg},
month = aug,
year = {2021},
}
137 changes: 137 additions & 0 deletions paper.md
@@ -0,0 +1,137 @@
---
title: 'annotater: Enhancing library load calls in R'
tags:
  - R
  - Reproducibility
  - code comments
  - versioning
  - packages
authors:
  - name: Luis D. Verde Arregoitia
    orcid: 0000-0001-9520-6543
    affiliation: 1
  - name: Juan Cruz Rodríguez
    affiliation: 2
affiliations:
  - name: Laboratorio de Macroecología Evolutiva, Red de Biología Evolutiva, Instituto de Ecología, A.C., Carretera Antigua a Coatepec 351, Col. El Haya, Xalapa, 91073, Veracruz, Mexico
    index: 1
  - name: FAMAF, Universidad Nacional de Córdoba, Argentina
    index: 2
date: 4 May 2025
bibliography: paper.bib
---

# Summary

Packages extend the capabilities of a programming language, and working with source code and scripts rather than interactively lets us document and repeat our workflows. However, in the R ecosystem, the sheer number and diversity of existing packages can be overwhelming, and the purpose of individual packages or their role in a project can become unclear. One practical way to restore this context is to add information directly to scripts using code comments. Code comments are annotations within code that are written for the human reader, not the machine, and that provide additional information or clarity about what is being executed [@Filazzola:2022].

`annotater` is an R package for automated commenting of library load calls in R scripts and in text-based formats that allow embedded code blocks, such as R Markdown and Quarto (`.Rmd` and `.qmd` files, respectively).


# Statement of need

The functions in `annotater` address an unmet need in R: making it easier to understand which packages a script loads and why. Most scripts load numerous packages, which may not have self-explanatory names and are often loaded without mentioning their purpose, source, or which specific functions and datasets are actually used. This lack of explicit information does not imply bad coding practices, but adding useful information as unobtrusive comments can lead to self-documented and understandable code, ultimately improving individual and collaborative workflows.

When opening a script, the role of a loaded package may not be evident, requiring manual investigation. This might mean interrupting our work to check the documentation or search the web to understand more about the loaded packages. This context switching [@Wilson:2021] can slow down code review and collaboration and reduce productivity, especially when there are many dependencies or when code is shared between users with different backgrounds and personal 'dialect' preferences (e.g., users of different package 'families' for data manipulation, spatial data work, or statistical modeling frameworks).

In addition, tracking the exact versions and sources of loaded packages is important for ensuring the reproducibility of analyses and results. For example, using the stable vs. development version of a package might mean the difference between a workflow succeeding or failing. Manually noting this information can be tedious and error-prone, but `annotater` functions can record the source and version of each package as installed on a user's machine. This approach does not guarantee the automatic recreation of the original execution environment and is not meant to replace existing tools that create comprehensive reproducible environments, such as renv, Docker, or Nix.

Lastly, identifying which parts of a script rely on specific packages and their components can be challenging, making it harder to refactor code, manage dependencies, or identify unused packages.



# Features and examples

Upon installation, R packages already include useful details that we can leverage to automate the creation of these informative comments. These annotations can be particularly useful when sharing code with others, because they provide immediate context about why each package is being used and for what purpose. The code in a script can also be examined programmatically so that the functions, methods, or datasets being used from each package can be added as comments.


Code can be annotated interactively using the package functions or through addins in the RStudio IDE.

The following annotations are supported. The code blocks below show the output of the different features on small scripts.

- Add package titles

``` r
library(brms) # Bayesian Regression Models using 'Stan'
library(caper) # Comparative Analyses of Phylogenetics and Evolution in R
library(readr) # Read Rectangular Text Data
library(picante) # Integrating Phylogenies and Ecology
```
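As a rough illustration of where such titles can come from, the sketch below reads the `Title` field of an installed package's DESCRIPTION file with `utils::packageDescription()`. The helpers `pkg_title()` and `annotate_title()` are hypothetical names used only for this example; they are not `annotater` functions and do not necessarily reflect how the package is implemented.

``` r
# Hypothetical helper (not part of annotater): look up the Title field
# recorded in an installed package's DESCRIPTION file.
pkg_title <- function(pkg) {
  title <- suppressWarnings(utils::packageDescription(pkg, fields = "Title"))
  if (is.na(title)) {
    return("not installed")
  }
  # Titles may wrap across lines in DESCRIPTION; collapse the whitespace
  gsub("\\s+", " ", title)
}

# Hypothetical helper: build an annotated load call for one package
annotate_title <- function(pkg) {
  sprintf("library(%s) # %s", pkg, pkg_title(pkg))
}

# Uses a package that ships with R so the example runs anywhere
annotate_title("stats")
```

Because the lookup only touches the local library, the resulting comment reflects whatever version of the package is installed on the machine running the code.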

- Add package installation sources and versions. Supports various sources including CRAN, GitHub, GitLab, Bioconductor, Posit Package Manager (RSPM), and R-universe.


``` r
library(brms) # [github::paul-buerkner/brms] v2.22.11
library(caper) # CRAN v1.0.3
library(readr) # Posit RSPM v2.1.5
library(picante) # CRAN v1.8.2
```
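One possible way to derive a source and version string from a local installation is sketched below. It assumes that the installer recorded a `Repository` field (as CRAN-like repositories usually do) or `RemoteType`, `RemoteUsername`, and `RemoteRepo` fields (as tools such as `remotes` usually do for GitHub installs) in the package DESCRIPTION. The helper `pkg_source_version()` is hypothetical, handles only a few cases, and is not the `annotater` implementation.

``` r
# Hypothetical helper (not annotater's implementation): describe where an
# installed package came from and which version is installed.
pkg_source_version <- function(pkg) {
  desc <- suppressWarnings(utils::packageDescription(pkg))
  if (!inherits(desc, "packageDescription")) {
    return("not installed")
  }
  # Assumption: these DESCRIPTION fields were written at install time
  remote_type <- desc[["RemoteType"]]
  repository  <- desc[["Repository"]]
  src <- if (identical(remote_type, "github")) {
    sprintf("[github::%s/%s]", desc[["RemoteUsername"]], desc[["RemoteRepo"]])
  } else if (!is.null(repository)) {
    repository
  } else if (identical(desc[["Priority"]], "base")) {
    "base R"
  } else {
    "unknown source"
  }
  sprintf("%s v%s", src, desc[["Version"]])
}

# Uses a package that ships with R so the example runs anywhere
sprintf("library(stats) # %s", pkg_source_version("stats"))
```

The sketch deliberately falls back to "unknown source" rather than guessing, since a missing field says nothing reliable about how a package was installed.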

- Identify functions and datasets being used from each package


``` r
# functions
library(brms) # No used functions found
library(caper) # No used functions found
library(readr) # read_csv
library(picante) # df2vec

dat <- read_csv("mdata.csv")
df2vec(dat, colID = "Y1")

```

``` r
# data
library(caper) # shorebird.data
library(readr) # No loaded datasets found
library(picante) # No loaded datasets found

data(shorebird)
hist(shorebird.data$F.Mass)
```
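A minimal sketch of how function usage could be detected is shown below: the script is parsed with `parse()`, function-call tokens are extracted with `utils::getParseData()`, and the names are matched against the exports of a package. The helper `used_functions()` is a hypothetical name for this example. It ignores datasets, namespaced calls, and name clashes between packages, so it should be read as an illustration of the idea rather than as the approach `annotater` takes.

``` r
# Hypothetical helper (not annotater's implementation): list the functions
# called in a script that are exported by a given package.
used_functions <- function(code, pkg) {
  # keep.source = TRUE is needed so that parse data is available
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  called <- unique(tokens$text[tokens$token == "SYMBOL_FUNCTION_CALL"])
  sort(intersect(called, getNamespaceExports(pkg)))
}

# Small script using packages that ship with R
script <- c(
  "library(stats)",
  "library(tools)",
  "x <- c(2, 4, 9)",
  "median(x)",
  "sd(x)"
)

used_functions(script, "stats") # "median" "sd"
used_functions(script, "tools") # no matches: character(0)
```

An empty result corresponds to the "No used functions found" annotation in the output above.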

- Compatible with both `library()` and `p_load()` calls when loading packages with `pacman`

``` r
# add source and version to pacman call
library(readr) # Posit RSPM v2.1.5
pacman::p_load(
  caper, # CRAN v1.0.3
  picante # CRAN v1.8.2
)
```

- Expand popular metapackages into their loaded components. Will change `library(tidyverse)` into:

``` r
####
library(ggplot2)
library(tibble)
library(tidyr)
library(readr)
library(purrr)
library(dplyr)
library(stringr)
library(forcats)
library(lubridate)
####
```
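The expansion step can be approximated with a simple line-based replacement, sketched below. The component list is hard-coded from the example above, and `expand_tidyverse()` is a hypothetical helper for illustration only. It handles a bare `library(tidyverse)` line and drops anything else on that line, which the real package may treat differently.

``` r
# Core tidyverse packages, in the order listed in the example above.
# Hard-coded here as an assumption for illustration.
tidyverse_core <- c(
  "ggplot2", "tibble", "tidyr", "readr", "purrr",
  "dplyr", "stringr", "forcats", "lubridate"
)

# Hypothetical helper: replace each `library(tidyverse)` line with one
# load call per component package, wrapped in `####` marker lines.
expand_tidyverse <- function(lines) {
  is_meta <- grepl("^\\s*library\\(\\s*[\"']?tidyverse[\"']?\\s*\\)", lines)
  expanded <- lapply(seq_along(lines), function(i) {
    if (is_meta[i]) {
      c("####", sprintf("library(%s)", tidyverse_core), "####")
    } else {
      lines[i]
    }
  })
  unlist(expanded)
}

script <- c("library(tidyverse)", "library(picante)")
writeLines(expand_tidyverse(script))
```

Listing the components explicitly makes each one available for the title, version, and usage annotations described earlier.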

In its development version, `annotater` supports adding R and RStudio versions, platform, and operating system to the beginning of a script.
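For orientation, a header of this kind can be assembled from base R alone, as in the sketch below. The function name `session_header()` and the comment layout are assumptions made for this example and do not reproduce the output of the development version. The RStudio version is omitted because it is only available from inside the IDE.

``` r
# Hypothetical helper: build comment lines describing the R session
session_header <- function() {
  info <- Sys.info()
  c(
    sprintf("# %s", R.version.string),
    sprintf("# Platform: %s", R.version$platform),
    sprintf("# OS: %s %s", info[["sysname"]], info[["release"]])
  )
}

writeLines(session_header())
```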

# Concluding remarks

`annotater` is available on GitHub, CRAN, and R-universe and has a dedicated website for documentation (https://annotater.liomys.mx/). Since its release on CRAN, `annotater` has been downloaded over 12,000 times. Community adoption can be inferred from public code searches for patterns commonly generated by `annotater`. For example, GitHub queries for strings such as "`) # CRAN v`", "`) # Create elegant`", and "`) # A grammar`", which typically annotate library calls for packages like `dplyr` and `ggplot2`, return over 1,000 results. These matches suggest that many users rely on `annotater` to automatically append version and title comments to their package load calls.

It is worth noting that Large Language Model (LLM) tools can now generate inline explanations for loaded packages. However, `annotater` represents a more parsimonious approach with distinct practical advantages. `annotater` runs locally in R, requiring no internet access, incurring no usage fees, and eliminating the need for setting up local models or managing API keys. Most importantly, package information is obtained directly from users' installations, avoiding issues related to the source and copyright of training data for external LLM tools.

The `annotater` package offers a non-invasive way to automatically add informative comments alongside package load calls. By annotating scripts with package titles, repository sources, versions, and the functions and datasets in use, `annotater` improves code clarity and records information that supports reproducibility and maintenance.

# Acknowledgements

We acknowledge the [LatinR](https://latinr.org/) community for ongoing feedback and promotion of the package.

# References