MDF Connect v2 backend: streaming, curation, DOI minting, and Globus Search #133
Open
New features:

- Streaming API for automated lab data ingestion
  - POST /stream/create, /stream/:id/upload, /stream/:id/close
  - Support for local, Globus HTTPS, and S3 storage backends
  - File preview without full download (CSV stats, JSON structure)
- Server-side curation workflow
  - GET /curation/pending - List submissions awaiting review
  - POST /curation/:id/approve - Approve with DOI minting
  - POST /curation/:id/reject - Reject with reason
  - Full curation history tracking
- DOI minting via DataCite API
  - Mock client for local development
  - Real client for production deployment
- Simplified Globus Flow
  - Removed curation steps (now handled by server)
  - Keeps: email notification, file transfer, user notification
- Deployment tooling
  - deploy.sh for local dev and AWS SAM deployment
  - SAM template for Lambda + API Gateway

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…t, and curation

- Add publication orchestration: approval triggers DOI mint + Globus Search ingest + status update
- Fix status transitions: submissions land as pending_curation (not submitted)
- Add data source URL validation on submit (globus://, https://, stream://)
- Wire DataCite credentials into SAM template, samconfig, and deploy script
- Add DataCite test_connection() diagnostic method
- Add GlobusSearchClient + MockGlobusSearchClient with factory pattern
- Update search to try Globus Search first, then fall back to a DynamoDB scan
- Add Search index UUID params and USE_MOCK_SEARCH to both Lambda functions
- Refactor app into FastAPI router modules with auth, middleware, models
- Add async job system (inline/SQS/SQLite) with publish_submission job type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
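
The data source validation listed in this commit could look roughly like the sketch below. Only the three accepted schemes (globus://, https://, stream://) come from the commit message; the function name `validate_data_source` and the rule that something must follow the scheme are assumptions made for illustration, not the PR's actual code.

```python
from urllib.parse import urlsplit

# Schemes named in the commit message; everything else here is illustrative.
ALLOWED_SCHEMES = {"globus", "https", "stream"}


def validate_data_source(url: str) -> str:
    """Return the URL unchanged if it uses an accepted scheme, else raise ValueError.

    Hypothetical helper: the real validator's name and exact rules are not
    shown in the PR text.
    """
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported data source scheme: {url!r}")
    # Assumption: a bare "https://" with nothing after it is not a usable source.
    if not (parts.netloc or parts.path.strip("/")):
        raise ValueError(f"data source URL has no location: {url!r}")
    return url
```

Rejecting unknown schemes at submit time keeps bad sources from reaching the transfer step, where a failure would be slower to surface.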
- Add test_v2_publish_pipeline.py: 19 tests covering status transitions, data source validation, inline/async publish pipeline, and mock search client
- Fix test_v2_async_jobs.py: update doi_job → publish_job, remove stale status/update call (submissions now land as pending_curation directly)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two bugs prevented search ingest from working:

1. oauth2_client_credentials_tokens() was called without requested_scopes, so no search.api.globus.org token was obtained (access_token was None)
2. License Pydantic model was passed directly into GMetaEntry content, causing JSON serialization failure in the Globus SDK

Also: clean up samconfig.toml duplicates, add search index UUIDs, add configurable CORS origins for prod, fix two pre-existing test bugs, and add operational scripts (search index permissions, DataCite SSM setup, search token diagnostics).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c DOIs
- Add dataset_doi field to SQLite schema (store.py) with migration
- Propagate dataset_doi from prior published versions on update submissions
- Refactor _mint_doi_for_submission for version-aware DOI logic:
  - First version: mint dataset DOI (stored as both doi and dataset_doi)
  - Subsequent + mint_doi=False: inherit dataset_doi, update DataCite metadata
  - Subsequent + mint_doi=True: mint version DOI with -v{ver} suffix, add IsVersionOf/HasVersion relatedIdentifiers
- Add related_identifiers support and update_metadata() to DataCiteClient
- Add dataset_doi and version_count to Globus Search GMetaEntry
  - dc.doi falls back to dataset_doi when no version-specific DOI
- 13 new tests covering full lifecycle, DOI propagation, search index, mock DataCite, and curation logic
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
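
The three DOI cases described in this commit can be sketched as a small decision function. This is an illustration under stated assumptions, not the body of _mint_doi_for_submission: the names `DoiPlan`, `plan_doi`, and `display_doi`, the action labels, and the example prefix 10.1234 are all hypothetical. The sketch also assumes IsVersionOf is recorded on the version DOI's record, with the reciprocal HasVersion added to the dataset DOI's record separately.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DoiPlan:
    action: str                      # "mint_dataset", "update_metadata", or "mint_version"
    doi: Optional[str]               # version-specific DOI, if one exists
    dataset_doi: str                 # DOI shared by every version of the dataset
    related: Tuple[Tuple[str, str], ...] = ()  # (relationType, DOI) pairs for the version record


def plan_doi(candidate_doi: str, version: str,
             prior_dataset_doi: Optional[str], mint_doi: bool) -> DoiPlan:
    """Decide what to do with DataCite for one submission (hypothetical helper)."""
    if prior_dataset_doi is None:
        # First version: the new DOI serves as both doi and dataset_doi.
        return DoiPlan("mint_dataset", candidate_doi, candidate_doi)
    if not mint_doi:
        # Later version without its own DOI: inherit and refresh metadata only.
        return DoiPlan("update_metadata", None, prior_dataset_doi)
    # Later version with its own DOI: dataset DOI plus the -v{ver} suffix.
    version_doi = f"{prior_dataset_doi}-v{version}"
    return DoiPlan("mint_version", version_doi, prior_dataset_doi,
                   (("IsVersionOf", prior_dataset_doi),))


def display_doi(plan: DoiPlan) -> str:
    """Mirror the dc.doi fallback: prefer the version DOI, else the dataset DOI."""
    return plan.doi or plan.dataset_doi
```

Keeping the decision separate from the DataCite calls makes each branch easy to test against a mock client.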
Support domains (List[str]) for scientific domain categorization and external_doi/external_url/external_source for tracking provenance of externally-published datasets imported into MDF. Fields round-trip through submit → status and are indexed in Globus Search.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ignore

Set up prod samconfig.toml with test DataCite credentials (Globus.TEST) and test Globus Search index for initial production stack deployment. Add .aws-sam/ and .DS_Store to .gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cover dev/staging/prod deployment, SSM parameters, switching to real credentials, quick deploy, local dev, tests, and architecture overview. Note that prod currently uses test credentials.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Server-side:

- Add 24h deadline to Globus TransferData so tasks auto-cancel
- Replace async JOB_CHECK_TRANSFER with inline status check on /status endpoint
- Add JOB_CLEANUP_TRANSFERS scheduled job (EventBridge, every 6h) for stale transfer ACL cleanup
- Store transfer_initiated_at for age tracking
- Add scan_by_transfer_status to DynamoDB/SQLite store
- Handle EventBridge events in async worker Lambda
- Bump async worker timeout default to 120s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
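
The age tracking that drives the cleanup job might be sketched as follows. The commit does not say what age counts as stale or how transfer_initiated_at is stored, so this sketch assumes a 24-hour threshold (matching the transfer deadline) and an ISO 8601 timestamp string; the names `STALE_AFTER` and `is_stale` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: a transfer is stale once it outlives the 24h Globus deadline.
STALE_AFTER = timedelta(hours=24)


def is_stale(transfer_initiated_at: str, now: Optional[datetime] = None) -> bool:
    """Return True if the transfer started more than STALE_AFTER ago.

    Hypothetical helper; assumes the stored value is an ISO 8601 string.
    """
    started = datetime.fromisoformat(transfer_initiated_at)
    if started.tzinfo is None:
        # Assumption: naive timestamps were written in UTC.
        started = started.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return current - started > STALE_AFTER
```

A scheduled job could apply this check to each record returned by scan_by_transfer_status before removing the transfer's ACL.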
…tion

- Add version, previous_version, root_version, latest fields to DatasetMetadata
- Add download_url field to DatasetMetadata
- Propagate versioning and download_url into search index entries
- Include latest and root_version in search result formatting
- Fix v1 migration to handle organizations list (plural) from mdf block

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Demo showcases: global config, direct publish with Globus HTTPS upload, status from config memory, repo mode, pending/approve, and search.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire CURATOR_GROUP_IDS and REQUIRED_GROUP_MEMBERSHIP into SAM template env vars for both Lambdas. Add require_submitter dependency on /submit endpoint that checks Globus group membership in production and bypasses in dev-auth mode. Dev overrides set both to empty string so local dev and tests are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mps minor

New data_sources in an update → major version bump (1.0 → 2.0). Metadata-only update → minor bump (1.0 → 1.1) with data_sources inherited from the prior version automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
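
The bump rule in this commit can be illustrated with a short sketch. It assumes versions are "major.minor" strings, as in the 1.0 → 2.0 and 1.0 → 1.1 examples; the function names `next_version` and `plan_update` are hypothetical and not taken from the PR.

```python
from typing import List, Optional, Sequence, Tuple


def next_version(prior_version: str, new_data_sources: Optional[Sequence[str]]) -> str:
    """Major bump when new data sources are supplied, minor bump otherwise."""
    major, minor = (int(part) for part in prior_version.split("."))
    if new_data_sources:
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"


def plan_update(prior_version: str,
                prior_data_sources: Sequence[str],
                new_data_sources: Optional[Sequence[str]] = None) -> Tuple[str, List[str]]:
    """Return the new version and the data sources it should carry.

    Hypothetical helper: a metadata-only update inherits the prior sources.
    """
    version = next_version(prior_version, new_data_sources)
    sources = list(new_data_sources) if new_data_sources else list(prior_data_sources)
    return version, sources
```

Parsing the components as integers keeps 1.9 → 1.10 correct, which a float-based bump would not.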
Stores the size of the auto-generated zip archive alongside download_url for display in dataset cards and clone flows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Search:
- Add faceted search via Globus Search post_search (year, org, author, keyword, domain)
- Add filter query params to GET /search endpoint
- Fix total field returning page size instead of global count
- Add unauthenticated read client for public MDF index
- Return full result fields: keywords, description, doi, publication_year, domains, license, size_bytes, file_count
- Add batch_ingest() using GMetaList (10x faster than per-record ingest)
- Add ingest_converted_datasets.py script
Card/detail endpoints:
- Add ?version=latest support to /card/{source_id} and /detail/{slug}
- Fix /detail/{slug} to respect ?version= query param (was silently ignored)
- Return full description (was truncated at 300 chars)
- Return all keywords (was capped at 10)
- Remove view_count increment from citation endpoint (citation != page view)
Submissions:
- Fix edit_metadata: new minor version inherits editor's user_id, not original submitter
(ensures editor can modify the new version they created)
Docs:
- Add FRONTEND_API.md with full API reference for frontend consumers
- Update v2.md with faceted search, slug behavior, and versioning notes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add email_utils.py: AWS SES notifications for new submission (curators),
approved/published (submitter), and rejected (submitter with feedback)
- Wire notify_curators_new_submission() into POST /submit
- Wire notify_submitter_rejected() into POST /curation/{id}/reject
- Wire notify_submitter_approved() into async_jobs._process_publish_submission()
so the email fires when status actually reaches "published" (covers both
inline dev mode and SQS async prod mode)
- Curation review URL links to CURATION_PORTAL_URL (not per-dataset path)
- Add EnableEmails SAM parameter (default: false) — when false, SES_FROM_EMAIL
is blanked in both Lambdas so emails are silently skipped; set to true only
on prod once SES sandbox access is lifted in the company account
- Add SesFromEmail, CuratorEmails, PortalUrl, CurationPortalUrl SAM parameters
- Add ses:SendEmail IAM policy to both ApiFunction and AsyncWorkerFunction
- Update v2.md: email notifications section, production deploy checklist,
env vars table, fix stale view_count note (citation no longer counted)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
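
The "silently skipped" behaviour described for EnableEmails can be sketched as a guard around the send call. Only the SES_FROM_EMAIL variable name comes from the commit; the function `send_if_enabled` and the injected `deliver` callable (standing in for the real SES call) are assumptions made so the example runs without AWS access.

```python
import os
from typing import Callable, Sequence


def send_if_enabled(to: Sequence[str], subject: str, body: str,
                    deliver: Callable[..., None]) -> bool:
    """Send a notification only when a sender address is configured.

    Hypothetical helper: `deliver` stands in for the actual SES client call.
    Returns True if the message was handed to `deliver`, False if skipped.
    """
    sender = os.environ.get("SES_FROM_EMAIL", "").strip()
    if not sender or not to:
        # Blank sender (EnableEmails=false) or no recipients: skip quietly.
        return False
    deliver(sender=sender, to=list(to), subject=subject, body=body)
    return True
```

Gating on a blank sender address means dev and staging stacks need no separate code path; the same Lambda code simply does nothing.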
- _escape() now handles " → &quot; for safe use in HTML attributes
- Escape URLs in href attributes (_cta_button, footer link)
- Rejection email links to /submissions dashboard instead of
/detail/{source_id} which 404s for non-published datasets
- Fix stale docstring URLs (app. → www., /curate → /curation)
- Remove unused Optional import
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Infrastructure
template.yaml, deploy.sh, samconfig.toml
Verified on staging
Health check, stream CRUD, file upload via Globus HTTPS, snapshot, repo publish, DOI minting, Globus Search ingest, dataset cards, citations — all passing E2E.
Test plan
- `cd aws && python -m pytest v2/test_v2_*.py -v`: all unit/integration tests pass
- `./deploy.sh staging`: staging stack deploys cleanly
- `./deploy.sh prod`: prod stack deploys with test credentials

🤖 Generated with Claude Code