
Simplify Mountain Wetlands harvester once MaRESS exposes DOIs #244

@nuest

Context

The bespoke MaRESS harvester (added in #192, commit `add Mountain Wetlands Repository (MaRESS) harvester`) takes a complicated path because every record currently returned by the public API has `DOI=null` and an empty `url`:

  • All 234 records as of 2026-04 have those fields blank — see the investigation in Harvest Mountain Wetlands Repository & introduce collections #192.
  • We compensate by calling `build_openalex_fields(title, doi=None, author=)` and post-classifying the result into `provenance.openalex_match.status` ∈ {`verified`, `candidate`, `none`}. When verified, we extract the DOI from `openalex_ids` and persist it on the Work.
  • This works, but it leaves a noticeable fraction of records at `candidate` or `none` because matching on title and author alone is fragile, especially for older records (1990s) and short titles.
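The current DOI recovery step can be sketched roughly as follows. This is a minimal illustration, not the harvester's actual code: it assumes `openalex_ids` is a mapping whose `doi` entry holds a URL-form DOI (as OpenAlex returns in a Work's `ids`), and the helper names `doi_from_openalex_ids` and `doi_to_persist` are hypothetical.

```python
from typing import Mapping, Optional


def doi_from_openalex_ids(openalex_ids: Mapping[str, str]) -> Optional[str]:
    """Return the bare DOI from an OpenAlex-style ids mapping, or None.

    Assumption: the mapping has a "doi" key such as
    "https://doi.org/10.1234/abc"; the real field shape may differ.
    """
    raw = openalex_ids.get("doi")
    if not raw:
        return None
    # Same cleanup the issue mentions: keep what follows "doi.org/".
    doi = raw.split("doi.org/", 1)[-1].strip()
    return doi or None


def doi_to_persist(status: str, openalex_ids: Mapping[str, str]) -> Optional[str]:
    """Only a 'verified' match is trusted enough to store its DOI on the Work."""
    if status != "verified":
        return None
    return doi_from_openalex_ids(openalex_ids)
```

With this shape, `candidate` and `none` matches never contribute a DOI, which is why weak title+author matches leave works without one.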

The MaRESS maintainers have indicated that DOIs will likely be added to the API records soon. Once that lands, the harvester can be simplified considerably.

What to revisit when DOIs land

  • Use DOI as the primary OpenAlex match key — pass `build_openalex_fields(title, doi=<api_doi>)` and rely on the existing DOI-match strategy. Title+author becomes a fallback, not the primary signal.
  • Skip OpenAlex matching entirely when both DOI and authors come from the API — there is no extra metadata to recover and the API call is just rate-limit pressure.
  • Persist DOIs from the API directly, not via the OpenAlex `openalex_ids` round-trip. Drop the `raw.split('doi.org/', 1)[-1]` cleanup once we can trust the raw value.
  • Backfill existing harvested works whose `provenance.openalex_match.status` is `candidate` or `none`: re-run enrichment against the now-DOI-bearing records and upgrade matches where possible. The verbatim API record stored in `provenance.harvest.original_record` is the input — no re-fetch needed.
  • Drop the "et al." / null-firstName special-casing in `_mwr_first_author_surname` if DOI matching makes the surname signal redundant.
  • Document the simplified flow in MANAGE.md under "Mountain Wetlands Repository (MaRESS) — `mountain-wetlands` source type".
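The decision logic implied by the list above might look like the sketch below. It is an assumption-laden illustration: `plan_openalex_match` and `needs_backfill` are hypothetical helpers, the returned strategy names are invented for this example, and only the `provenance.openalex_match.status` path is taken from the existing data model described in this issue.

```python
from typing import Mapping, Optional, Sequence


def plan_openalex_match(api_doi: Optional[str], authors: Sequence[str]) -> str:
    """Pick a matching strategy for one API record.

    Returns one of three hypothetical labels:
      "skip"         - DOI and authors both come from the API, nothing to recover
      "doi"          - match on the API-provided DOI
      "title_author" - fall back to the fragile title+author match
    """
    doi = (api_doi or "").strip()
    if doi and authors:
        return "skip"
    if doi:
        return "doi"
    return "title_author"


def needs_backfill(provenance: Mapping[str, Mapping[str, str]]) -> bool:
    """True for works whose earlier match was weak and is worth retrying."""
    status = provenance.get("openalex_match", {}).get("status")
    return status in {"candidate", "none"}
```

For example, `plan_openalex_match("10.1234/abc", ["Smith"])` yields `"skip"`, while a record that still has `DOI=null` falls through to `"title_author"`, so the harvester keeps working during a partial rollout of DOIs.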

How to verify the API is ready

Run `curl -s 'https://andes.mountain-wetlands-repository.info/api/v1/items/?limit=500&scope=all' | jq '[.data[] | select(.DOI != null)] | length'` and check the result is a meaningful fraction of `.count`. If most records carry a DOI, this issue is actionable.
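The same check could be scripted in Python, for instance for a periodic job. This is a sketch under the assumption that the response has the shape the `jq` filter implies (a `data` list of records and a top-level `count`); `doi_coverage` and `fetch_items` are hypothetical names, and no threshold is hard-coded because the issue does not define one.

```python
import json
import urllib.request
from typing import Any, Mapping

# URL taken from the curl command above.
API_URL = (
    "https://andes.mountain-wetlands-repository.info"
    "/api/v1/items/?limit=500&scope=all"
)


def doi_coverage(payload: Mapping[str, Any]) -> float:
    """Fraction of records whose DOI is not null, relative to `count`.

    Mirrors `select(.DOI != null)`: a missing key counts as null,
    while an empty string would still count as present.
    """
    records = payload.get("data", [])
    total = payload.get("count") or len(records)
    if not total:
        return 0.0
    with_doi = sum(1 for record in records if record.get("DOI") is not None)
    return with_doi / total


def fetch_items(url: str = API_URL) -> Mapping[str, Any]:
    """Download and decode the item list (performs a network request)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)
```

Calling `doi_coverage(fetch_items())` returns a value between 0 and 1; whether that counts as "most records" is left to the person triaging the issue.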

Out of scope

  • Replacing the harvester architecture itself — even with DOIs, the Zotero-shaped `study_sites` list still differs from OAI-PMH/RSS/Crossref enough to warrant a bespoke harvester.
  • Authenticated API access (BibTeX export, Zotero sync endpoints) — separate question, separate issue if pursued.

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)
