Adds Zenodo data archival functionality #249
Open
nuest wants to merge 12 commits into
Implements functionality to deposit OPTIMAP data to Zenodo by creating/updating draft records. This feature enables automated archival and versioning of research data for long-term preservation and citation.

Features:
- Two Django management commands:
  - `render_zenodo`: Generates metadata files and data archives
  - `deposit_zenodo`: Uploads files and merges metadata to Zenodo drafts
- Updates existing drafts only (requires deposition ID)
- Never publishes automatically - manual approval required in Zenodo UI
- Uploads: README.md, optimap-main.zip, latest GeoJSON and GeoPackage files
- Merges metadata non-destructively without overwriting stable fields
- Configurable via environment variables (ZENODO_API_TOKEN, etc.)
- Comprehensive test coverage for rendering and deposition

New files:
- works/management/commands/deposit_zenodo.py - Upload to Zenodo
- works/management/commands/render_zenodo.py - Generate metadata/archives
- works/templates/README.md.j2 - Jinja2 template for README
- data/README.md, data/last_version.txt, data/zenodo_dynamic.json
- tests/test_deposit_zenodo.py - Deposition tests
- tests/test_render_zenodo.py - Render tests

Modified files:
- .gitignore - Ignore Zenodo artifacts
- optimap/settings.py - Add Zenodo configuration
- requirements.txt - Add zenodo-client, markdown, jinja2 dependencies

This implementation is adapted from PR #214 to work with the refactored codebase (publications/ → works/ directory structure).

Closes ifgi#63

Co-authored-by: BharatVe <bharatveauli@live.com>
Co-authored-by: BharatVe <150399011+BharatVe@users.noreply.github.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
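The non-destructive metadata merge mentioned in this commit could look roughly like the sketch below. It is an illustration only, not the code in `deposit_zenodo.py`: the function name `merge_metadata` and the default field list are assumptions, and the only idea taken from the PR is that a fixed set of patch fields is copied over while everything else on the draft (title, DOI, and so on) is left alone.

```python
from copy import deepcopy

# Hypothetical default: the fields the deposit step is allowed to overwrite.
DEFAULT_PATCH_FIELDS = ("description", "version", "keywords", "related_identifiers")


def merge_metadata(existing, incoming, patch_fields=DEFAULT_PATCH_FIELDS):
    """Return a copy of `existing` where only `patch_fields` come from `incoming`.

    Fields outside `patch_fields` are never touched, so stable values already
    present on the Zenodo draft survive the merge.
    """
    merged = deepcopy(existing)
    for field in patch_fields:
        if field in incoming:
            merged[field] = deepcopy(incoming[field])
    return merged
```

Returning a new dict instead of mutating `existing` keeps the original draft metadata available for logging or for a retry.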
Adds comprehensive integration test suite for Zenodo deposition functionality with support for testing against the actual Zenodo sandbox API.

Changes:
- Fixed model references in tests (Publication → Work, publications → works)
- Added tests/.env.template with configuration instructions
- Created test_zenodo_integration.py with tagged integration tests
- Tests can run against real Zenodo sandbox API with proper credentials
- Added .env file to .gitignore to protect secrets

Test categories:
- Unit tests: Mock-based tests (existing)
- Integration tests: Real API tests (new, tagged as 'integration')
- Full deposit tests: End-to-end upload tests (tagged as 'slow' and 'upload')

Usage:
  # Run only unit tests (no API calls):
  python manage.py test tests.test_deposit_zenodo tests.test_render_zenodo

  # Run integration tests (requires tests/.env):
  python manage.py test tests.test_zenodo_integration

  # Run specific test tags:
  python manage.py test --tag=integration
  python manage.py test --exclude-tag=slow

Setup:
1. Copy tests/.env.template to tests/.env
2. Add Zenodo sandbox API token from https://sandbox.zenodo.org
3. Create a draft deposition and add its ID to .env
4. Run: python manage.py test tests.test_zenodo_integration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implements automated data archival to Zenodo for long-term preservation and citability.

- Introduces a new `zenodo` app with functions for rendering metadata, depositing data, and managing Zenodo records.
- Creates new management commands (`render_zenodo`, `deposit_zenodo`, and `zenodo_deposit`) for simplified workflow.
- Adds a new `ZenodoDepositionLog` model to track deposition history and status.
- Enhances the Django admin interface with actions to trigger depositions and view logs.
- Includes comprehensive documentation in `README.md` on setting up and using the Zenodo integration.
Refs ifgi#63.

- untrack data/README.md, data/zenodo_dynamic.json, data/last_version.txt (sandbox render output from local runs leaked into the branch); extend .gitignore to cover them plus CSV dump variants
- fix the README.md.j2 sources loop: it was unpacking dicts as (label, url) tuples so every entry rendered as "[name](url)" with no newline between items; iterate over Source dicts properly
- switch tests/test_deposit_zenodo.py and tests/test_render_zenodo.py from unittest.TestCase to django.test.TestCase so the in-test ZenodoDepositionLog.save() and ORM-created Source rows hit a real test DB instead of crashing (deposit) or polluting the dev DB (render)
- refresh the 0009 migration header timestamp
- CHANGELOG entry under Unreleased describing the deposit groundwork
Refs ifgi#63 (item 5). The render step now overwrites `related_identifiers` on every invocation with the three live download endpoints on optimap.science (geojson / geopackage / csv), derived from settings.BASE_URL + the URL config. Any stale identifiers from a previous render (e.g. localhost URLs left over from a dev run) are discarded, so a deposit can never publish links that only work on a developer's machine. Each entry uses scheme=url, relation=isSupplementTo, resource_type=dataset. Source-level "describes" entries land in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
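A minimal sketch of the overwrite behaviour described in this commit, under stated assumptions: the helper names are invented for illustration, and the download paths passed in are placeholders, since the commit only says they are derived from `settings.BASE_URL` plus the URL config. The scheme, relation, and resource type values are the ones named above.

```python
def download_related_identifiers(base_url, paths):
    """Build one related_identifiers entry per download endpoint."""
    root = base_url.rstrip("/")
    return [
        {
            "scheme": "url",
            "identifier": root + "/" + path.lstrip("/"),
            "relation": "isSupplementTo",
            "resource_type": "dataset",
        }
        for path in paths
    ]


def refresh_related_identifiers(metadata, base_url, paths):
    """Return a copy of `metadata` whose related_identifiers are replaced.

    The list is overwritten rather than extended, so identifiers left over
    from an earlier render (for example localhost URLs) cannot survive.
    """
    updated = dict(metadata)
    updated["related_identifiers"] = download_related_identifiers(base_url, paths)
    return updated
```

Replacing the whole list is the simplest way to guarantee that stale entries are dropped, at the cost of requiring any other identifiers to be re-added on every render.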
Refs ifgi#63 (item 6 / 2025-07-14 comment).

Per harvested Source, the render step now adds a related_identifiers entry with relation=describes, resource_type=publication, with wording straight from nuest's 2025-07-14 comment ("This record describes Journal X"). Scheme picked in order:

1. issn: Source.issn_l (linking ISSN)
2. url: Source.homepage_url canonicalised
3. url: Source.url_field canonicalised

Self-references to optimap.science are skipped (the portal isn't a journal it describes), and duplicates collapse on the resolved (scheme, identifier) pair so two Source rows pointing at the same journal collapse to one entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
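The selection order and de-duplication described above can be sketched as follows. Sources are modelled as plain dicts instead of Django model instances, the function names are hypothetical, and the canonicalisation shown (lowercased scheme and host, trailing slash dropped, port and query discarded) is an assumption, because the commit does not spell out the exact rules. The sketch also assumes that a self-referencing URL falls through to the next candidate field.

```python
from urllib.parse import urlsplit


def canonicalise(url):
    """Lowercase scheme and host and drop a trailing slash (assumed rules)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    return f"{parts.scheme.lower()}://{host}{parts.path.rstrip('/')}"


def source_identifier(source, portal_host="optimap.science"):
    """Pick (scheme, identifier) in the order issn_l, homepage_url, url_field."""
    if source.get("issn_l"):
        return ("issn", source["issn_l"])
    for key in ("homepage_url", "url_field"):
        url = source.get(key)
        if not url:
            continue
        canonical = canonicalise(url)
        if urlsplit(canonical).hostname == portal_host:
            continue  # self-reference to the portal, not a described journal
        return ("url", canonical)
    return None


def describes_entries(sources):
    """Build 'describes' entries, collapsing duplicate (scheme, identifier) pairs."""
    seen = set()
    entries = []
    for source in sources:
        resolved = source_identifier(source)
        if resolved is None or resolved in seen:
            continue
        seen.add(resolved)
        scheme, identifier = resolved
        entries.append({
            "scheme": scheme,
            "identifier": identifier,
            "relation": "describes",
            "resource_type": "publication",
        })
    return entries
```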
Refs ifgi#63 (item 4). The deposit's file list now covers every output of regenerate_data_dumps: geojson, geojson.gz, gpkg, csv, and csv.gz. Previously only geojson(.gz) and gpkg shipped — CSV (issue #206) had been added on main but no one told Zenodo about it. The helper now also picks the newest cycle by timestamp when several co-exist in the same dir, so a deposit can't ship a stale .gpkg next to a fresh .geojson. README.md and optimap-main.zip still come from data_dir (where render writes them); data dumps prefer data_dir first (tests / single-dir layouts) and fall back to /tmp/optimap_cache (the default cache dir for production regenerate runs). dump_dir is a parameter so other callers can override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
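One way to pick the newest dump cycle and apply the directory fallback is sketched below. The file name pattern `optimap_data_dump_<YYYYMMDDTHHMMSS>.<ext>` is an assumption (the PR only shows the `optimap_data_dump_*` prefix), and both function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme: optimap_data_dump_<YYYYMMDDTHHMMSS>.<ext>
DUMP_PATTERN = re.compile(
    r"^optimap_data_dump_(?P<stamp>\d{8}T\d{6})\."
    r"(?P<ext>geojson\.gz|geojson|gpkg|csv\.gz|csv)$"
)


def newest_dump_cycle(directory):
    """Return the dump files that share the newest timestamp in `directory`."""
    cycles = {}
    for path in Path(directory).iterdir():
        match = DUMP_PATTERN.match(path.name)
        if match:
            cycles.setdefault(match["stamp"], []).append(path)
    if not cycles:
        return []
    newest = max(cycles)  # fixed-width timestamps sort correctly as strings
    return sorted(cycles[newest])


def find_dump_files(data_dir, dump_dir="/tmp/optimap_cache"):
    """Prefer dumps in data_dir, then fall back to the cache directory."""
    for directory in (data_dir, dump_dir):
        if Path(directory).is_dir():
            files = newest_dump_cycle(directory)
            if files:
                return files
    return []
```

Grouping by timestamp before selecting means that files from different cycles are never mixed, which is the property the commit message is after.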
Refs ifgi#63 (last checklist item).

The render step previously swallowed every error from `git archive HEAD` and then wrote a 0-byte optimap-main.zip as a "fallback", so a missing git binary, a non-repo working directory, or a `CalledProcessError` would all produce an empty zip that the deposit then uploaded to Zenodo under a "success" status.

Now:
- FileNotFoundError (`git` not on PATH) → RuntimeError with a clear hint.
- CalledProcessError → RuntimeError including the exit code and stderr.
- subprocess.run exits 0 but the file is missing or 0 bytes → RuntimeError with the stderr (covers SIGPIPE / corrupt repo / empty tree cases).

The tests are adjusted to write a small non-empty stub zip in the patched subprocess.run, and gain two new cases for the failure paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
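The three failure paths listed in this commit map onto `subprocess.run` roughly as in the sketch below. The function name `build_source_zip` and the error wording are illustrative assumptions; only the error classes and the "fail instead of writing an empty zip" behaviour come from the commit message.

```python
import subprocess
from pathlib import Path


def build_source_zip(project_root, zip_path):
    """Run `git archive HEAD` and raise instead of leaving an empty zip behind."""
    zip_path = Path(zip_path)
    command = ["git", "archive", "--format=zip", f"--output={zip_path}", "HEAD"]
    try:
        result = subprocess.run(
            command, cwd=project_root, check=True, capture_output=True, text=True
        )
    except FileNotFoundError as exc:
        # Raised when the git executable cannot be found (or cwd is missing).
        raise RuntimeError(
            "git was not found on PATH; install git to build the source archive"
        ) from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"git archive failed with exit code {exc.returncode}: {exc.stderr}"
        ) from exc
    # Exit code 0 is not enough: the archive must exist and be non-empty.
    if not zip_path.exists() or zip_path.stat().st_size == 0:
        raise RuntimeError(f"git archive produced no usable zip: {result.stderr}")
    return zip_path
```

Raising in every failure case lets the deposit step stop before upload, so a "success" log entry always corresponds to a real archive.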
Refs ifgi#63 (comment 2025-07-14, comment 2025-07-21). README codebook expands to cover every Work field that ends up in the data dumps — including the ones added since the original Zenodo branch landed: `type`, `authors`, `keywords`, `topics`, `bok_concepts`, `placename`, `country_code`, `volume`/`issue`/`first_page`/`last_page`, `openalex_*`. A short note up front states that the same field names appear verbatim as GeoJSON `Feature.properties`, CSV column headers and GeoPackage attribute columns, with CSV using `WKT` for geometry. Default keywords now include `Open Research Information` alongside `ORI` so the record is findable under either label, per the issue comment. A new `additional_descriptions[type=notes]` entry documents the CC0-1.0 / GPL-3.0 license split with the actual file scopes — README + optimap_data_dump_*.{geojson,geojson.gz,gpkg,csv,csv.gz} under CC0, optimap-main.zip under GPL-3.0. Default `patch_fields` in `deposit_to_zenodo` (and the deposit_zenodo command) is extended so the note actually gets pushed. The render test now copies the real README.md.j2 from the source tree into the tmp project root instead of using a tiny stub, so codebook and prose assertions exercise the production template. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refs ifgi#63 (2025-08-21 issue comment, Q2 decision).

Renders now include a structured `grants` block with the two OPTIMAP grant IDs in OpenAIRE format:

- OPTIMETA: 10.13039/501100002347::16TOA028B (BMBF)
- KOMET: 10.13039/501100002347::16KOA009A (BMFTR)

NFDI4Earth is deliberately excluded per the August comment.

Zenodo's curated grants vocabulary doesn't cover every grant. When the metadata PUT returns 400 mentioning `grants`, the deposit now retries once with `grants` removed and prepends a free-text "Funding: …" paragraph to `metadata.notes`, so the funding info is still discoverable even if Zenodo can't resolve the IDs structurally. The fallback is recorded on ZenodoDepositionLog.notes for the admin email.

`grants` is added to the default `--patch` list on `deposit_zenodo`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
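The retry-once fallback can be sketched independently of any HTTP library by passing in the function that performs the metadata PUT. Everything named here is an assumption made for illustration: the `put` callable is taken to return a `(status_code, response_text)` pair, and the generated "Funding:" text simply lists the grant IDs, since the actual paragraph wording is not given in the commit.

```python
OPTIMAP_GRANTS = [
    {"id": "10.13039/501100002347::16TOA028B"},  # OPTIMETA
    {"id": "10.13039/501100002347::16KOA009A"},  # KOMET
]


def put_metadata_with_grants_fallback(put, metadata):
    """Send metadata; on a grants-related 400, retry once without `grants`.

    Returns (status_code, metadata_actually_sent, used_fallback).
    """
    status, text = put(metadata)
    if status == 400 and "grants" in text and "grants" in metadata:
        fallback = {key: value for key, value in metadata.items() if key != "grants"}
        # Illustrative wording only; the real paragraph text is not shown in the PR.
        funding = "Funding: " + "; ".join(grant["id"] for grant in metadata["grants"])
        notes = fallback.get("notes", "")
        fallback["notes"] = funding + ("\n\n" + notes if notes else "")
        status, text = put(fallback)
        return status, fallback, True
    return status, metadata, False
```

The third return value gives the caller something to record, for example in a log entry's notes, when the structured grants had to be dropped.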
…file

Refs ifgi#63.

The version counter (v1, v2, v3, …) is now read from the latest successful ZenodoDepositionLog row for the current api_base instead of data/last_version.txt. The file had three problems:

- it lived in the project tree but was never committed, so a fresh checkout silently restarted at v1
- sandbox and production runs shared the same counter, so a stream of sandbox renders would jump production's next version into double digits
- a failed deposit still bumped the file, burning a version number that never reached Zenodo

The new logic filters ZenodoDepositionLog by (status='success', api_base=…), takes the latest `version`, and emits N+1. Sandbox and production increment independently. Failed deposits don't advance the counter.

render_zenodo_package gains an optional api_base argument with the same env/settings cascade as deposit_to_zenodo. deposit_to_zenodo now reads log_entry.version from the rendered zenodo_dynamic.json instead of the tracking file. The model and migration help_text are updated to match; .gitignore drops the now-obsolete data/last_version.txt entry; the integration tests stop seeding the file too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
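The counter logic described above, reduced to plain Python over a list of log rows. This is a sketch under assumptions: rows are dicts rather than `ZenodoDepositionLog` instances, versions are stored as strings such as "v3", and the `created_at` field used to find the latest row is a hypothetical name.

```python
def next_version(log_rows, api_base):
    """Return the next version label for `api_base` based on successful deposits.

    Only rows with status 'success' and a matching api_base are considered,
    so failed deposits and the other environment never advance the counter.
    """
    matching = [
        row for row in log_rows
        if row["status"] == "success" and row["api_base"] == api_base
    ]
    if not matching:
        return "v1"
    latest = max(matching, key=lambda row: row["created_at"])
    current = int(latest["version"].lstrip("v"))
    return f"v{current + 1}"
```

In a Django setting the same filter and ordering would normally be expressed as a queryset; the list-based form is used here only to keep the example self-contained.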
Refs ifgi#63

Make deposition_id optional in deposit_to_zenodo(): if not passed, fall back to the latest successful ZenodoDepositionLog for the same api_base; if there is no prior log either, bootstrap a fresh draft via POST /deposit/depositions. When the resolved record is already published (submitted=true + state="done"), POST .../actions/newversion and switch to the new draft from links.latest_draft before uploading. The admin action and both management commands drop their "no deposition ID" guards.

Wrap the full cycle (regenerate dumps → render package → deposit) in works.tasks.run_zenodo_deposition and add a `schedule_zenodo_deposit` management command that idempotently registers it as a yearly Django-Q schedule for Dec 31 23:59. Publishing remains manual.
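The resolution order for the draft (explicit ID, then last successful log entry, then a fresh draft, with a new version when the record is already published) can be sketched as below. The `api` object and its `get`, `create`, and `new_version` methods are a hypothetical thin wrapper around the three Zenodo calls named in the commit, and extracting the draft ID from the last path segment of `links.latest_draft` is an assumption about that URL's shape.

```python
def resolve_draft_id(api, deposition_id=None, last_successful_id=None):
    """Return the ID of an editable draft to upload into.

    `api` is assumed to expose:
      create()          -> dict  (POST /deposit/depositions)
      get(dep_id)       -> dict  (GET the deposition)
      new_version(id)   -> dict  (POST .../actions/newversion)
    """
    target = deposition_id if deposition_id is not None else last_successful_id
    if target is None:
        # No explicit ID and no prior successful deposit: bootstrap a draft.
        return api.create()["id"]

    record = api.get(target)
    if record.get("submitted") and record.get("state") == "done":
        # Already published: open a new version and switch to its draft.
        response = api.new_version(target)
        draft_url = response["links"]["latest_draft"]
        return int(draft_url.rstrip("/").rsplit("/", 1)[-1])
    return record["id"]
```

Nothing in this flow publishes a record; it only decides which draft receives the uploads.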
Enables automatic data package generation and deposition to Zenodo, ensuring long-term preservation and citability for OPTIMAP data.
This comprehensive feature introduces:
Data Package Generation
- A `render_zenodo` command that builds essential artifacts: a dynamic `README.md` (generated from a Jinja2 template with live statistics and source information), an `optimap-main.zip` containing the project's source code snapshot, and a `zenodo_dynamic.json` file for flexible metadata updates.
- `last_version.txt`.

Zenodo Deposition Management
- A `deposit_zenodo` command for updating existing Zenodo draft depositions. It intelligently merges metadata, protecting crucial identifiers like DOIs, and ensures a clean slate by deleting previous files before uploading new ones.
- A `zenodo_deposit` command simplifies the workflow by executing both the rendering and deposition steps sequentially.

Logging, Monitoring, and Notifications
- A `ZenodoDepositionLog` model records every deposition attempt, tracking status, uploaded files, total size, duration, and any errors encountered.
- The `/data` public page now prominently displays information about the latest successful Zenodo deposition, with environment-aware display (sandbox in DEBUG, production otherwise).

Streamlined Administration

Configuration
- Environment variables (`ZENODO_API_TOKEN`, `ZENODO_SANDBOX_DEPOSITION_ID`, `ZENODO_API_BASE`) are introduced for flexible environment configuration, supported by a `tests/.env.template`.

Enhanced Testing
Relates to #63