Adds Zenodo data archival functionality #249

Open
nuest wants to merge 12 commits into GeoinformationSystems:main from nuest:feature/zenodo-deposit

nuest commented May 11, 2026

Enables automatic data package generation and deposition to Zenodo, ensuring long-term preservation and citability for OPTIMAP data.

This PR introduces:

Data Package Generation

  • A render_zenodo command that builds essential artifacts: a dynamic README.md (generated from a Jinja2 template with live statistics and source information), an optimap-main.zip containing the project's source code snapshot, and a zenodo_dynamic.json file for flexible metadata updates.
  • This process also manages versioning via last_version.txt.

Zenodo Deposition Management

  • A deposit_zenodo command for updating existing Zenodo draft depositions. It merges metadata without overwriting protected identifiers such as DOIs, and deletes the draft's existing files before uploading new ones.
  • A combined zenodo_deposit command simplifies the workflow by executing both the rendering and deposition steps sequentially.
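The metadata merge described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name `merge_metadata`, the `PROTECTED_FIELDS` set, and the sample values are all hypothetical.

```python
from copy import deepcopy

# Hypothetical set of fields that must survive a merge untouched.
PROTECTED_FIELDS = {"doi", "prereserve_doi", "conceptdoi"}


def merge_metadata(existing, incoming, patch_fields):
    """Return a copy of `existing` with only `patch_fields` taken from `incoming`."""
    merged = deepcopy(existing)
    for field in patch_fields:
        if field in PROTECTED_FIELDS:
            continue  # never overwrite identifiers that Zenodo assigned
        if field in incoming:
            merged[field] = deepcopy(incoming[field])
    return merged


existing = {"title": "OPTIMAP data", "doi": "10.5072/zenodo.1", "keywords": ["ORI"]}
incoming = {
    "doi": "10.9999/wrong",
    "keywords": ["ORI", "Open Research Information"],
    "version": "v2",
}
merged = merge_metadata(existing, incoming, ["doi", "keywords", "version"])
```

Fields outside `patch_fields` (here `title`) are left exactly as the draft already has them, which is what keeps the merge non-destructive.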

Logging, Monitoring, and Notifications

  • A new ZenodoDepositionLog model records every deposition attempt, tracking status, uploaded files, total size, duration, and any errors encountered.
  • The log is viewable in the Django admin, with the details of each deposition attempt.
  • Staff users receive email notifications detailing the outcome of each deposition, including direct links to the Zenodo draft for review.
  • The public /data page now shows the latest successful Zenodo deposition, with environment-aware display (sandbox in DEBUG, production otherwise).

Streamlined Administration

  • An admin action "Trigger Zenodo Deposition" is available for Works, allowing a full render and deposit cycle to be initiated directly from the admin interface.

Configuration

  • New settings (ZENODO_API_TOKEN, ZENODO_SANDBOX_DEPOSITION_ID, ZENODO_API_BASE) are introduced for flexible environment configuration, supported by a tests/.env.template.

Enhanced Testing

  • Includes unit tests for the rendering and deposition logic, plus integration tests that run against the real Zenodo sandbox API to cover the end-to-end workflow.

Relates to #63

nuest and others added 12 commits May 11, 2026 12:24
Implements functionality to deposit OPTIMAP data to Zenodo by creating/updating draft records. This feature enables automated archival and versioning of research data for long-term preservation and citation.

Features:
- Two Django management commands:
  - `render_zenodo`: Generates metadata files and data archives
  - `deposit_zenodo`: Uploads files and merges metadata to Zenodo drafts
- Updates existing drafts only (requires deposition ID)
- Never publishes automatically - manual approval required in Zenodo UI
- Uploads: README.md, optimap-main.zip, latest GeoJSON and GeoPackage files
- Merges metadata non-destructively without overwriting stable fields
- Configurable via environment variables (ZENODO_API_TOKEN, etc.)
- Comprehensive test coverage for rendering and deposition

New files:
- works/management/commands/deposit_zenodo.py - Upload to Zenodo
- works/management/commands/render_zenodo.py - Generate metadata/archives
- works/templates/README.md.j2 - Jinja2 template for README
- data/README.md, data/last_version.txt, data/zenodo_dynamic.json
- tests/test_deposit_zenodo.py - Deposition tests
- tests/test_render_zenodo.py - Render tests

Modified files:
- .gitignore - Ignore Zenodo artifacts
- optimap/settings.py - Add Zenodo configuration
- requirements.txt - Add zenodo-client, markdown, jinja2 dependencies

This implementation is adapted from PR #214 to work with the refactored
codebase (publications/ → works/ directory structure).

Closes ifgi#63

Co-authored-by: BharatVe <bharatveauli@live.com>
Co-authored-by: BharatVe <150399011+BharatVe@users.noreply.github.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Adds comprehensive integration test suite for Zenodo deposition functionality
with support for testing against the actual Zenodo sandbox API.

Changes:
- Fixed model references in tests (Publication → Work, publications → works)
- Added tests/.env.template with configuration instructions
- Created test_zenodo_integration.py with tagged integration tests
- Tests can run against real Zenodo sandbox API with proper credentials
- Added .env file to .gitignore to protect secrets

Test categories:
- Unit tests: Mock-based tests (existing)
- Integration tests: Real API tests (new, tagged as 'integration')
- Full deposit tests: End-to-end upload tests (tagged as 'slow' and 'upload')

Usage:
  # Run only unit tests (no API calls):
  python manage.py test tests.test_deposit_zenodo tests.test_render_zenodo

  # Run integration tests (requires tests/.env):
  python manage.py test tests.test_zenodo_integration

  # Run specific test tags:
  python manage.py test --tag=integration
  python manage.py test --exclude-tag=slow

Setup:
  1. Copy tests/.env.template to tests/.env
  2. Add Zenodo sandbox API token from https://sandbox.zenodo.org
  3. Create a draft deposition and add its ID to .env
  4. Run: python manage.py test tests.test_zenodo_integration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Implements automated data archival to Zenodo for long-term preservation and citability.

- Introduces a new `zenodo` app with functions for rendering metadata, depositing data, and managing Zenodo records.
- Creates new management commands (`render_zenodo`, `deposit_zenodo`, and `zenodo_deposit`) for simplified workflow.
- Adds a new `ZenodoDepositionLog` model to track deposition history and status.
- Enhances the Django admin interface with actions to trigger depositions and view logs.
- Includes comprehensive documentation in `README.md` on setting up and using the Zenodo integration.

Refs ifgi#63.

- untrack data/README.md, data/zenodo_dynamic.json, data/last_version.txt
  (sandbox render output from local runs leaked into the branch); extend
  .gitignore to cover them plus CSV dump variants
- fix the README.md.j2 sources loop — was unpacking dicts as (label, url)
  tuples so every entry rendered as "[name](url)" with no newline between
  items; iterate over Source dicts properly
- switch tests/test_deposit_zenodo.py and tests/test_render_zenodo.py from
  unittest.TestCase to django.test.TestCase so the in-test
  ZenodoDepositionLog.save() and ORM-created Source rows hit a real test
  DB instead of crashing (deposit) or polluting the dev DB (render)
- refresh the 0009 migration header timestamp
- CHANGELOG entry under Unreleased describing the deposit groundwork

Refs ifgi#63 (item 5).

The render step now overwrites `related_identifiers` on every invocation
with the three live download endpoints on optimap.science
(geojson / geopackage / csv), derived from settings.BASE_URL + the URL
config. Any stale identifiers from a previous render (e.g. localhost
URLs left over from a dev run) are discarded, so a deposit can never
publish links that only work on a developer's machine.

Each entry uses scheme=url, relation=isSupplementTo, resource_type=dataset.
Source-level "describes" entries land in a follow-up commit.
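
A rough sketch of the overwrite behaviour, assuming the three endpoints can be expressed as paths relative to the base URL. The paths in `DOWNLOAD_PATHS` and both function names are hypothetical; the real paths come from the project's URL config.

```python
from urllib.parse import urljoin

# Hypothetical download paths, for illustration only.
DOWNLOAD_PATHS = {
    "geojson": "/download/geojson/",
    "geopackage": "/download/geopackage/",
    "csv": "/download/csv/",
}


def build_download_identifiers(base_url, paths=DOWNLOAD_PATHS):
    """Build one related_identifiers entry per download endpoint."""
    base = base_url.rstrip("/") + "/"
    return [
        {
            "scheme": "url",
            "identifier": urljoin(base, path.lstrip("/")),
            "relation": "isSupplementTo",
            "resource_type": "dataset",
        }
        for path in paths.values()
    ]


def render_metadata(metadata, base_url):
    """Replace related_identifiers wholesale so stale entries cannot survive."""
    updated = dict(metadata)
    updated["related_identifiers"] = build_download_identifiers(base_url)
    return updated


stale = {"title": "OPTIMAP", "related_identifiers": [{"identifier": "http://localhost:8000/x"}]}
fresh = render_metadata(stale, "https://optimap.science")
```

Assigning the list rather than extending it is what drops localhost leftovers from an earlier render.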

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refs ifgi#63 (item 6 / 2025-07-14 comment).

Per harvested Source, the render step now adds a related_identifiers
entry with relation=describes, resource_type=publication — wording
straight from nuest's 2025-07-14 comment ("This record describes
Journal X"). Scheme picked in order:

  1. issn   — Source.issn_l (linking ISSN)
  2. url    — Source.homepage_url canonicalised
  3. url    — Source.url_field canonicalised

Self-references to optimap.science are skipped (the portal isn't a
journal it describes), and duplicates collapse on the resolved
(scheme, identifier) pair so two Source rows pointing at the same
journal collapse to one entry.
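
The selection order and de-duplication could look roughly like this. It is a sketch under assumptions: the `Source` dataclass stands in for the Django model, `canonicalise` uses a guessed canonical form (lowercase scheme and host, no fragment, no trailing slash), and a source whose chosen URL points at the portal is skipped entirely.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


@dataclass
class Source:
    """Stand-in for the Django model; only the three fields named above."""
    issn_l: Optional[str] = None
    homepage_url: Optional[str] = None
    url_field: Optional[str] = None


def canonicalise(url):
    """Assumed canonical form: lowercase scheme/host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), parts.query, "")
    )


def pick_identifier(source):
    """Return (scheme, identifier) following the issn > homepage_url > url_field order."""
    if source.issn_l:
        return ("issn", source.issn_l)
    for candidate in (source.homepage_url, source.url_field):
        if candidate:
            return ("url", canonicalise(candidate))
    return None


def describes_entries(sources, portal_host="optimap.science"):
    seen = set()
    entries = []
    for source in sources:
        picked = pick_identifier(source)
        if picked is None:
            continue
        scheme, identifier = picked
        if scheme == "url" and urlsplit(identifier).netloc == portal_host:
            continue  # self-reference to the portal
        if picked in seen:
            continue  # another Source row already resolved to this journal
        seen.add(picked)
        entries.append(
            {
                "scheme": scheme,
                "identifier": identifier,
                "relation": "describes",
                "resource_type": "publication",
            }
        )
    return entries
```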

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refs ifgi#63 (item 4).

The deposit's file list now covers every output of regenerate_data_dumps:
geojson, geojson.gz, gpkg, csv, and csv.gz. Previously only geojson(.gz)
and gpkg shipped — CSV (issue #206) had been added on main but no one
told Zenodo about it.

The helper now also picks the newest cycle by timestamp when several
co-exist in the same dir, so a deposit can't ship a stale .gpkg next
to a fresh .geojson. README.md and optimap-main.zip still come from
data_dir (where render writes them); data dumps prefer data_dir first
(tests / single-dir layouts) and fall back to /tmp/optimap_cache (the
default cache dir for production regenerate runs). dump_dir is a
parameter so other callers can override.
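
A sketch of the newest-cycle selection and the directory fallback. The timestamp format in the file names (`YYYYMMDDTHHMMSS`) is an assumption, as are the names `newest_cycle` and `find_dumps`; only the `optimap_data_dump_` prefix, the extensions, and the `/tmp/optimap_cache` default come from the commit messages.

```python
import re
from pathlib import Path

# Assumed naming scheme: optimap_data_dump_<YYYYMMDDTHHMMSS>.<ext>
DUMP_RE = re.compile(
    r"^optimap_data_dump_(?P<stamp>\d{8}T\d{6})\.(?P<ext>geojson\.gz|geojson|gpkg|csv\.gz|csv)$"
)


def newest_cycle(names):
    """Group dump file names by timestamp and return only the newest group."""
    cycles = {}
    for name in names:
        match = DUMP_RE.match(name)
        if match:
            cycles.setdefault(match["stamp"], []).append(name)
    if not cycles:
        return []
    # Fixed-width timestamps sort correctly as plain strings.
    return sorted(cycles[max(cycles)])


def find_dumps(data_dir, dump_dir="/tmp/optimap_cache"):
    """Prefer data_dir; fall back to dump_dir when data_dir holds no dumps."""
    for directory in (Path(data_dir), Path(dump_dir)):
        if directory.is_dir():
            found = newest_cycle(p.name for p in directory.iterdir())
            if found:
                return [directory / name for name in found]
    return []
```

Selecting a whole cycle at once, rather than the newest file per extension, is what prevents mixing a stale `.gpkg` with a fresh `.geojson`.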

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refs ifgi#63 (last checklist item).

The render step previously swallowed every error from `git archive HEAD`
and then wrote a 0-byte optimap-main.zip as a "fallback", so a missing
git binary, a non-repo working directory, or a `CalledProcessError`
would all produce an empty zip that the deposit then uploaded to Zenodo
under a "success" status.

Now:

- FileNotFoundError (`git` not on PATH) → RuntimeError with a clear hint.
- CalledProcessError → RuntimeError including the exit code and stderr.
- subprocess.run exits 0 but the file is missing or 0 bytes →
  RuntimeError with the stderr (covers SIGPIPE / corrupt repo / empty
  tree cases).

The tests are adjusted to write a small non-empty stub zip in the
patched subprocess.run, and gain two new cases for the failure paths.
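
The three failure paths can be illustrated with a small helper. This is a sketch, not the PR's code: `build_source_zip` and its injectable `run` parameter are hypothetical, and the error wording is made up.

```python
import subprocess
from pathlib import Path


def build_source_zip(repo_dir, out_path, run=subprocess.run):
    """Run `git archive HEAD` and fail loudly instead of leaving an empty zip."""
    out_path = Path(out_path)
    cmd = ["git", "archive", "--format=zip", "-o", str(out_path), "HEAD"]
    try:
        result = run(cmd, cwd=str(repo_dir), check=True, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # Typically means the `git` executable is not on PATH.
        raise RuntimeError("git executable not found on PATH; cannot build source zip") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"git archive failed with exit code {exc.returncode}: {exc.stderr}"
        ) from exc
    # Exit code 0 is not enough: the archive must exist and be non-empty.
    if not out_path.is_file() or out_path.stat().st_size == 0:
        raise RuntimeError(
            f"git archive exited 0 but {out_path.name} is missing or empty: {result.stderr}"
        )
    return out_path
```

Passing `run` as a parameter is just one way to make the failure paths testable without a real repository; patching `subprocess.run`, as the commit describes, works equally well.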

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refs ifgi#63 (comment 2025-07-14, comment 2025-07-21).

README codebook expands to cover every Work field that ends up in the
data dumps — including the ones added since the original Zenodo branch
landed: `type`, `authors`, `keywords`, `topics`, `bok_concepts`,
`placename`, `country_code`, `volume`/`issue`/`first_page`/`last_page`,
`openalex_*`. A short note up front states that the same field names
appear verbatim as GeoJSON `Feature.properties`, CSV column headers and
GeoPackage attribute columns, with CSV using `WKT` for geometry.

Default keywords now include `Open Research Information` alongside `ORI`
so the record is findable under either label, per the issue comment.

A new `additional_descriptions[type=notes]` entry documents the
CC0-1.0 / GPL-3.0 license split with the actual file scopes — README
+ optimap_data_dump_*.{geojson,geojson.gz,gpkg,csv,csv.gz} under CC0,
optimap-main.zip under GPL-3.0. Default `patch_fields` in
`deposit_to_zenodo` (and the deposit_zenodo command) is extended so
the note actually gets pushed.

The render test now copies the real README.md.j2 from the source tree
into the tmp project root instead of using a tiny stub, so codebook
and prose assertions exercise the production template.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refs ifgi#63 (2025-08-21 issue comment, Q2 decision).

Renders now include a structured `grants` block with the two OPTIMAP
grant IDs in OpenAIRE format:

  - OPTIMETA: 10.13039/501100002347::16TOA028B (BMBF)
  - KOMET:    10.13039/501100002347::16KOA009A (BMFTR)

NFDI4Earth is deliberately excluded per the August comment.

Zenodo's curated grants vocabulary doesn't cover every grant — when the
metadata PUT returns 400 mentioning `grants`, the deposit now retries
once with `grants` removed and prepends a free-text "Funding: …"
paragraph to `metadata.notes`, so the funding info is still discoverable
even if Zenodo can't resolve the IDs structurally. The fallback is
recorded on ZenodoDepositionLog.notes for the admin email.
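
The retry-once behaviour might be sketched like this. The exception class `MetadataRejected`, the helper `put_with_grants_fallback`, and the injected `put_metadata` callable are hypothetical stand-ins for the real HTTP call and its error handling.

```python
class MetadataRejected(Exception):
    """Hypothetical error raised by `put_metadata` on a non-2xx response."""

    def __init__(self, status_code, body):
        super().__init__(f"{status_code}: {body}")
        self.status_code = status_code
        self.body = body


def put_with_grants_fallback(metadata, put_metadata, funding_text):
    """PUT metadata; on a 400 that mentions grants, retry once without them.

    Returns (metadata_that_was_accepted, fell_back).
    """
    try:
        put_metadata(metadata)
        return metadata, False
    except MetadataRejected as exc:
        grants_rejected = exc.status_code == 400 and "grants" in exc.body
        if not grants_rejected or "grants" not in metadata:
            raise  # unrelated failure: do not mask it

    retry = {key: value for key, value in metadata.items() if key != "grants"}
    existing_notes = retry.get("notes", "")
    # Keep the funding information discoverable as free text.
    retry["notes"] = f"Funding: {funding_text}\n\n{existing_notes}".strip()
    put_metadata(retry)  # a second failure propagates to the caller
    return retry, True
```

The returned flag is what a caller could record on the log entry so the admin email mentions the fallback.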

`grants` is added to the default `--patch` list on `deposit_zenodo`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…file

Refs ifgi#63.

The version counter (v1, v2, v3, …) is now read from the latest
successful ZenodoDepositionLog row for the current api_base instead of
data/last_version.txt. The file had three problems:

  - it lived in the project tree but was never committed, so a fresh
    checkout silently restarted at v1
  - sandbox and production runs shared the same counter, so a stream of
    sandbox renders would jump production's next version into double
    digits
  - a failed deposit still bumped the file, burning a version number
    that never reached Zenodo

The new logic filters ZenodoDepositionLog by (status='success',
api_base=…), takes the latest `version`, and emits N+1. Sandbox and
production increment independently. Failed deposits don't advance the
counter. render_zenodo_package gains an optional api_base argument with
the same env/settings cascade as deposit_to_zenodo.
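
The counter logic can be sketched without the ORM by treating log rows as plain dicts. The row keys (`status`, `api_base`, `version`, `finished_at`), the `vN` string format, and the notion of "latest" as the row with the greatest `finished_at` are assumptions made for this illustration.

```python
def next_version(log_rows, api_base):
    """Return the next version label for `api_base`, counting only successes."""
    successes = [
        row
        for row in log_rows
        if row["status"] == "success" and row["api_base"] == api_base
    ]
    if not successes:
        return "v1"  # first deposit against this API base
    latest = max(successes, key=lambda row: row["finished_at"])
    number = int(latest["version"].lstrip("v"))
    return f"v{number + 1}"


SANDBOX = "https://sandbox.zenodo.org/api"
PRODUCTION = "https://zenodo.org/api"

rows = [
    {"status": "success", "api_base": SANDBOX, "version": "v7", "finished_at": 30},
    {"status": "success", "api_base": PRODUCTION, "version": "v2", "finished_at": 10},
    {"status": "failed", "api_base": PRODUCTION, "version": "v3", "finished_at": 20},
]
```

With these rows the production counter ignores both the sandbox history and the failed `v3` attempt, which are two of the three problems listed above; the third (a fresh checkout restarting at v1) goes away because the state lives in the database rather than in the working tree.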

deposit_to_zenodo now reads log_entry.version from the rendered
zenodo_dynamic.json instead of the tracking file. The model and
migration help_text are updated to match; .gitignore drops the now-
obsolete data/last_version.txt entry; the integration tests stop
seeding the file too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refs ifgi#63

Make deposition_id optional in deposit_to_zenodo(): if not passed, fall
back to the latest successful ZenodoDepositionLog for the same api_base;
if there is no prior log either, bootstrap a fresh draft via
POST /deposit/depositions. When the resolved record is already published
(submitted=true + state="done"), POST .../actions/newversion and switch
to the new draft from links.latest_draft before uploading. The admin
action and both management commands drop their "no deposition ID"
guards.
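
The resolution order can be expressed as a small decision function. This is a sketch: `plan_deposition`, the action labels, and the injected `fetch_record` callable are hypothetical, and the HTTP calls named in the comments are the ones mentioned in the paragraph above.

```python
def plan_deposition(deposition_id, last_successful_id, fetch_record):
    """Decide which draft to upload to; returns (action, deposition_id)."""
    resolved = deposition_id if deposition_id is not None else last_successful_id
    if resolved is None:
        # No explicit ID and no prior successful log: POST /deposit/depositions.
        return ("create_draft", None)

    record = fetch_record(resolved)
    if record.get("submitted") is True and record.get("state") == "done":
        # Already published: POST .../actions/newversion, then follow
        # links.latest_draft to reach the new draft.
        return ("new_version", resolved)

    # Still an editable draft: upload to it directly.
    return ("use_draft", resolved)
```

Keeping the decision separate from the HTTP calls makes each branch easy to unit test with a stub `fetch_record`.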

Wrap the full cycle (regenerate dumps → render package → deposit) in
works.tasks.run_zenodo_deposition and add a `schedule_zenodo_deposit`
management command that idempotently registers it as a yearly Django-Q
schedule for Dec 31 23:59. Publishing remains manual.