
MDF Connect v2 backend: streaming, curation, DOI minting, and Globus Search #133

Open
blaiszik wants to merge 19 commits into prod from v2-backend-curation

@blaiszik
Contributor

Summary

  • Complete v2 backend built on FastAPI + Mangum (single Lambda), deployed via SAM to AWS
  • Full publication pipeline: submit → pending_curation → approved (DOI minted) → published (Globus Search indexed)
  • Streaming endpoints: create, append (file upload to Globus HTTPS), close, snapshot (stream → dataset)
  • Dataset versioning with DOI inheritance and version-specific DOIs
  • Curation workflow with approve/reject, curator guards, and ownership enforcement
  • Real DataCite DOI minting (test credentials) and Globus Search ingest (test index)
  • Security hardening: path traversal protection, ownership checks, rate limiting, input size limits, structured logging
  • Cost-optimized: right-sized Lambda/concurrency, bounded log retention, capped search scans
  • Async job dispatch via SQS with inline/SQLite modes for testing
  • Comprehensive test suites: hardening, integration, publish pipeline, versioning, async jobs
  • SAM template with dev/staging/prod configs, deploy.sh with quick-deploy and teardown
  • Production config uses test credentials for initial deployment
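
The publication pipeline in the summary can be read as a small state machine. The following is a minimal sketch, not the code in this PR: the `Status` enum, the `ALLOWED` table, and `transition()` are hypothetical names, and treating `rejected` and `published` as terminal states is an assumption.

```python
from enum import Enum


class Status(str, Enum):
    PENDING_CURATION = "pending_curation"
    APPROVED = "approved"
    REJECTED = "rejected"
    PUBLISHED = "published"


# Allowed moves: submit lands in pending_curation, a curator approves or
# rejects, and an approved submission becomes published once indexed.
# Assumption: rejected and published have no outgoing transitions.
ALLOWED = {
    Status.PENDING_CURATION: {Status.APPROVED, Status.REJECTED},
    Status.APPROVED: {Status.PUBLISHED},
    Status.REJECTED: set(),
    Status.PUBLISHED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

A guard like this would reject, for example, a jump straight from `pending_curation` to `published` without curator approval.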

Infrastructure

| Resource | Description |
| --- | --- |
| template.yaml | SAM template: Lambda, API Gateway, DynamoDB, SQS, S3, CloudWatch |
| deploy.sh | Deploy script: dev, staging, prod, quick, local, teardown, logs |
| samconfig.toml | Per-environment config (dev, staging, prod) |

Verified on staging

Health check, stream CRUD, file upload via Globus HTTPS, snapshot, repo publish, DOI minting, Globus Search ingest, dataset cards, citations — all passing E2E.

Test plan

  • cd aws && python -m pytest v2/test_v2_*.py -v — all unit/integration tests pass
  • ./deploy.sh staging — staging stack deploys cleanly
  • ./deploy.sh prod — prod stack deploys with test credentials
  • Verify health endpoint on prod API URL
  • Submit test dataset through prod and verify DOI + search ingest

🤖 Generated with Claude Code

blaiszik and others added 19 commits January 31, 2026 22:52
New features:
- Streaming API for automated lab data ingestion
  - POST /stream/create, /stream/:id/upload, /stream/:id/close
  - Support for local, Globus HTTPS, and S3 storage backends
  - File preview without full download (CSV stats, JSON structure)

- Server-side curation workflow
  - GET /curation/pending - List submissions awaiting review
  - POST /curation/:id/approve - Approve with DOI minting
  - POST /curation/:id/reject - Reject with reason
  - Full curation history tracking

- DOI minting via DataCite API
  - Mock client for local development
  - Real client for production deployment

- Simplified Globus Flow
  - Removed curation steps (now handled by server)
  - Keeps: email notification, file transfer, user notification

- Deployment tooling
  - deploy.sh for local dev and AWS SAM deployment
  - SAM template for Lambda + API Gateway

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…t, and curation

- Add publication orchestration: approval triggers DOI mint + Globus Search ingest + status update
- Fix status transitions: submissions land as pending_curation (not submitted)
- Add data source URL validation on submit (globus://, https://, stream://)
- Wire DataCite credentials into SAM template, samconfig, and deploy script
- Add DataCite test_connection() diagnostic method
- Add GlobusSearchClient + MockGlobusSearchClient with factory pattern
- Update search to try Globus Search first, fallback to DynamoDB scan
- Add Search index UUID params and USE_MOCK_SEARCH to both Lambda functions
- Refactor app into FastAPI router modules with auth, middleware, models
- Add async job system (inline/SQS/SQLite) with publish_submission job type
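
For illustration, the data source URL validation could look roughly like the sketch below. Only the three accepted schemes come from this commit; the function name `validate_data_source` and the rule that a URL must have something after the scheme are assumptions.

```python
from urllib.parse import urlparse

# Schemes listed in the commit message above.
ALLOWED_SCHEMES = {"globus", "https", "stream"}


def validate_data_source(url: str) -> str:
    """Return the URL unchanged if it uses an accepted scheme, else raise."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported data source scheme: {parsed.scheme!r}")
    # Assumption: a bare scheme such as "https://" is not a usable source.
    if not (parsed.netloc or parsed.path):
        raise ValueError("data source URL has no location")
    return url
```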

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add test_v2_publish_pipeline.py: 19 tests covering status transitions,
  data source validation, inline/async publish pipeline, and mock search client
- Fix test_v2_async_jobs.py: update doi_job → publish_job, remove stale
  status/update call (submissions now land as pending_curation directly)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two bugs prevented search ingest from working:
1. oauth2_client_credentials_tokens() was called without requested_scopes,
   so no search.api.globus.org token was obtained (access_token was None)
2. License Pydantic model was passed directly into GMetaEntry content,
   causing JSON serialization failure in the Globus SDK

Also: clean up samconfig.toml duplicates, add search index UUIDs, add
configurable CORS origins for prod, fix two pre-existing test bugs, and
add operational scripts (search index permissions, DataCite SSM setup,
search token diagnostics).
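
The second bug is a general one: a model object placed inside a payload is not JSON-serializable until it is converted to plain data. The sketch below shows the pattern with the standard library only. The `License` dataclass is a stand-in for the Pydantic model, `asdict()` plays the role that a dump-to-dict call would play on a Pydantic model, and `build_content` is a hypothetical helper, not the function in this PR.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class License:
    """Stand-in for the Pydantic License model (assumption)."""
    name: str
    url: str


def build_content(title: str, license: License) -> dict:
    """Build entry content holding only JSON-serializable values."""
    # Convert the model to a plain dict before embedding it.
    return {"title": title, "license": asdict(license)}


lic = License("CC-BY-4.0", "https://creativecommons.org/licenses/by/4.0/")
payload = json.dumps(build_content("Example dataset", lic))
```

Passing `lic` itself as the `license` value would make `json.dumps` raise `TypeError`, which mirrors the failure described above.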

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c DOIs

- Add dataset_doi field to SQLite schema (store.py) with migration
- Propagate dataset_doi from prior published versions on update submissions
- Refactor _mint_doi_for_submission for version-aware DOI logic:
  - First version: mint dataset DOI (stored as both doi and dataset_doi)
  - Subsequent + mint_doi=False: inherit dataset_doi, update DataCite metadata
  - Subsequent + mint_doi=True: mint version DOI with -v{ver} suffix,
    add IsVersionOf/HasVersion relatedIdentifiers
- Add related_identifiers support and update_metadata() to DataCiteClient
- Add dataset_doi and version_count to Globus Search GMetaEntry
- dc.doi falls back to dataset_doi when no version-specific DOI
- 13 new tests covering full lifecycle, DOI propagation, search index,
  mock DataCite, and curation logic
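
The three DOI cases above can be summarized as a small decision function. This is a sketch under assumptions, not the body of `_mint_doi_for_submission`: `DoiPlan` and `plan_doi` are hypothetical names, the example DOI prefix is made up, and appending `-v{ver}` to the dataset DOI is one reading of the suffix rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DoiPlan:
    action: str       # "mint_dataset", "inherit", or "mint_version"
    doi: str          # DOI recorded on this version
    dataset_doi: str  # DOI shared by every version of the dataset


def plan_doi(new_doi: str, version: str,
             prior_dataset_doi: Optional[str], mint_doi: bool) -> DoiPlan:
    """Choose the DOI action for a submission being published."""
    if prior_dataset_doi is None:
        # First version: the new DOI is both doi and dataset_doi.
        return DoiPlan("mint_dataset", new_doi, new_doi)
    if not mint_doi:
        # Later version without its own DOI: inherit the dataset DOI.
        return DoiPlan("inherit", prior_dataset_doi, prior_dataset_doi)
    # Later version with its own DOI (assumed to be dataset DOI + suffix).
    return DoiPlan("mint_version", f"{prior_dataset_doi}-v{version}",
                   prior_dataset_doi)
```

For example, `plan_doi("unused", "2.0", "10.12345/abc", True)` yields a version DOI of `10.12345/abc-v2.0` while `dataset_doi` stays `10.12345/abc`.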

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Support domains (List[str]) for scientific domain categorization and
external_doi/external_url/external_source for tracking provenance of
externally-published datasets imported into MDF. Fields round-trip
through submit → status and are indexed in Globus Search.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ignore

Set up prod samconfig.toml with test DataCite credentials (Globus.TEST)
and test Globus Search index for initial production stack deployment.
Add .aws-sam/ and .DS_Store to .gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cover dev/staging/prod deployment, SSM parameters, switching to
real credentials, quick deploy, local dev, tests, and architecture
overview. Note that prod currently uses test credentials.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Server-side:
- Add 24h deadline to Globus TransferData so tasks auto-cancel
- Replace async JOB_CHECK_TRANSFER with inline status check on /status endpoint
- Add JOB_CLEANUP_TRANSFERS scheduled job (EventBridge, every 6h) for stale transfer ACL cleanup
- Store transfer_initiated_at for age tracking
- Add scan_by_transfer_status to DynamoDB/SQLite store
- Handle EventBridge events in async worker Lambda
- Bump async worker timeout default to 120s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

- Add version, previous_version, root_version, latest fields to DatasetMetadata
- Add download_url field to DatasetMetadata
- Propagate versioning and download_url into search index entries
- Include latest and root_version in search result formatting
- Fix v1 migration to handle organizations list (plural) from mdf block

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Demo showcases: global config, direct publish with Globus HTTPS upload,
status from config memory, repo mode, pending/approve, and search.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire CURATOR_GROUP_IDS and REQUIRED_GROUP_MEMBERSHIP into SAM template
env vars for both Lambdas. Add require_submitter dependency on /submit
endpoint that checks Globus group membership in production and bypasses
in dev-auth mode. Dev overrides set both to empty string so local dev
and tests are unaffected.
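
The submitter check described above reduces to a small predicate. The sketch below is illustrative only: `may_submit` is a hypothetical name, and treating an empty `REQUIRED_GROUP_MEMBERSHIP` value as "no restriction" is an assumption based on the dev overrides.

```python
from typing import Iterable


def may_submit(user_group_ids: Iterable[str], required_group_id: str,
               dev_auth: bool = False) -> bool:
    """Return True if the caller is allowed to use /submit."""
    # Dev-auth mode bypasses the check; an empty setting disables it
    # (assumption, matching the dev overrides described above).
    if dev_auth or not required_group_id:
        return True
    return required_group_id in set(user_group_ids)
```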

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mps minor

New data_sources in an update → major version bump (1.0 → 2.0).
Metadata-only update → minor bump (1.0 → 1.1) with data_sources
inherited from the prior version automatically.
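
As a sketch of the bump rule (the function name `next_version` and the assumption that versions are always `major.minor` strings are mine, not taken from the code):

```python
def next_version(previous: str, has_new_data_sources: bool) -> str:
    """Compute the next version string from the prior one."""
    major, minor = (int(part) for part in previous.split("."))
    if has_new_data_sources:
        # New data: major bump, minor resets to 0.
        return f"{major + 1}.0"
    # Metadata-only update: minor bump.
    return f"{major}.{minor + 1}"
```

With this rule, `next_version("1.0", True)` gives `"2.0"` and `next_version("1.0", False)` gives `"1.1"`.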

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stores the size of the auto-generated zip archive alongside download_url
for display in dataset cards and clone flows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Search:
- Add faceted search via Globus Search post_search (year, org, author, keyword, domain)
- Add filter query params to GET /search endpoint
- Fix total field returning page size instead of global count
- Add unauthenticated read client for public MDF index
- Return full result fields: keywords, description, doi, publication_year, domains, license, size_bytes, file_count
- Add batch_ingest() using GMetaList (10x faster than per-record ingest)
- Add ingest_converted_datasets.py script

Card/detail endpoints:
- Add ?version=latest support to /card/{source_id} and /detail/{slug}
- Fix /detail/{slug} to respect ?version= query param (was silently ignored)
- Return full description (was truncated at 300 chars)
- Return all keywords (was capped at 10)
- Remove view_count increment from citation endpoint (citation != page view)

Submissions:
- Fix edit_metadata: new minor version inherits editor's user_id, not original submitter
  (ensures editor can modify the new version they created)

Docs:
- Add FRONTEND_API.md with full API reference for frontend consumers
- Update v2.md with faceted search, slug behavior, and versioning notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add email_utils.py: AWS SES notifications for new submission (curators),
  approved/published (submitter), and rejected (submitter with feedback)
- Wire notify_curators_new_submission() into POST /submit
- Wire notify_submitter_rejected() into POST /curation/{id}/reject
- Wire notify_submitter_approved() into async_jobs._process_publish_submission()
  so the email fires when status actually reaches "published" (covers both
  inline dev mode and SQS async prod mode)
- Curation review URL links to CURATION_PORTAL_URL (not per-dataset path)
- Add EnableEmails SAM parameter (default: false) — when false, SES_FROM_EMAIL
  is blanked in both Lambdas so emails are silently skipped; set to true only
  on prod once SES sandbox access is lifted in the company account
- Add SesFromEmail, CuratorEmails, PortalUrl, CurationPortalUrl SAM parameters
- Add ses:SendEmail IAM policy to both ApiFunction and AsyncWorkerFunction
- Update v2.md: email notifications section, production deploy checklist,
  env vars table, fix stale view_count note (citation no longer counted)
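
The "silently skipped" behavior can be sketched as a guard around the send call. This is not the code in `email_utils.py`: `send_if_enabled` and the injected `send` callable are hypothetical, and only the `SES_FROM_EMAIL` variable name comes from this commit.

```python
import os
from typing import Callable, Mapping


def send_if_enabled(send: Callable[[str, str, str], None], to: str,
                    subject: str,
                    env: Mapping[str, str] = os.environ) -> bool:
    """Send an email only when a sender address is configured."""
    sender = env.get("SES_FROM_EMAIL", "").strip()
    if not sender:
        # Blank sender (EnableEmails=false): skip without raising.
        return False
    send(sender, to, subject)
    return True
```

Injecting `send` keeps the guard testable without calling SES.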

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _escape() now handles " → &quot; for safe use in HTML attributes
- Escape URLs in href attributes (_cta_button, footer link)
- Rejection email links to /submissions dashboard instead of
  /detail/{source_id} which 404s for non-published datasets
- Fix stale docstring URLs (app. → www., /curate → /curation)
- Remove unused Optional import
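
To illustrate why quote escaping matters inside `href` attributes, here is a minimal sketch using the standard library's `html.escape` rather than the PR's own `_escape()`. The `escape` and `cta_button` functions below are hypothetical stand-ins for the helpers named above.

```python
import html


def escape(value: str) -> str:
    """Escape &, <, >, and quotes so the value is safe in attributes."""
    return html.escape(value, quote=True)


def cta_button(url: str, label: str) -> str:
    """Render a link with both the URL and the label escaped."""
    return f'<a href="{escape(url)}">{escape(label)}</a>'
```

Without quote escaping, a `"` in the URL would close the attribute early and let the rest of the value be parsed as markup.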

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>