Summary
We need to establish a structured, documented design process for overhauling the simulation job orchestration architecture. The goals of this effort are:
- Real-time partial results: let researchers inspect intermediate results while a simulation is still running, so misconfigurations are caught early instead of after hours of waiting.
- Horizontal scalability: enable a single job to span multiple machines in parallel, so production deployments make full use of the available compute on both HPC clusters and cloud environments.
- Optimized result transport: stop serializing multi-gigabyte result files to JSON and routing them through a message broker; store results in S3-compatible object storage using efficient binary formats (Parquet, ROOT) instead, eliminating this I/O bottleneck.
- Optimized result merge: replace the pure-Python result averaging with high-performance numerical libraries (NumPy, Polars), reducing merge time and memory usage.
- Enhanced stability and control: integrate production-grade retry logic, robust centralized logging, and improved error handling.
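To make the result-transport goal more concrete, the sketch below compares the size of the same list of floats serialized as JSON text and as packed 64-bit binary values. It uses the standard library's `struct` module purely as a stand-in for a real binary format; Parquet and ROOT add columnar layout and compression on top, so this illustrates the text-encoding overhead under that assumption rather than benchmarking the proposed formats. The array size, the seed, and the helper names are arbitrary.

```python
import json
import random
import struct


def encode_json(values):
    """Serialize floats the way a JSON message payload would carry them."""
    return json.dumps(values).encode("utf-8")


def encode_binary(values):
    """Pack floats as little-endian IEEE 754 doubles, 8 bytes each."""
    return struct.pack(f"<{len(values)}d", *values)


def decode_binary(payload):
    """Inverse of encode_binary."""
    count = len(payload) // 8
    return list(struct.unpack(f"<{count}d", payload))


random.seed(0)  # arbitrary seed, only for repeatability
values = [random.random() for _ in range(10_000)]

json_payload = encode_json(values)
binary_payload = encode_binary(values)

print(f"JSON:   {len(json_payload)} bytes")
print(f"binary: {len(binary_payload)} bytes")
```

The binary payload is always exactly 8 bytes per value, while the JSON size depends on how many digits each number needs when written as text.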
This is a multi-month, multi-person effort that will span primarily yaptide/yaptide and this documentation repo, with possible touches to yaptide/deploy and yaptide/ui.
This issue also establishes a reusable pattern for future similar architectural processes (e.g. a UI architecture rework), so the structure and conventions chosen here should be general enough to serve as a template.
Why document the design process here
The for_developers repo is described as "All documentation on code from all repositories" and already has architecture/ and backend/ sections. Rather than creating yet another repo in an already-scattered organization (17 repos), we keep all architectural reasoning in one place that is:
- Human-browsable: rendered on yaptide.github.io/for_developers
- LLM-ingestible: plain Markdown/MDX files in git, with structured YAML frontmatter, easily consumed by Copilot, Claude, MCP agents, or any future AI tooling
- Version-controlled: full history of how and why decisions were made
- Cross-referenced: links to issues across yaptide/yaptide, yaptide/deploy, etc.
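Proposed documentation structure
All documents for this rework live under src/content/docs/rework-orchestration/, split into context/, adr/, design/, and research/ subdirectories.
This pattern should be reusable. A future UI rework would live under src/content/docs/rework-ui/ with the same context/, adr/, design/, research/ substructure.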
File conventions
Each file should use YAML frontmatter for metadata:
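As a hedged illustration of how tooling could consume such metadata, the sketch below splits a file into its frontmatter and body. The field names (`title`, `status`) and the sample file are assumptions made for the example, and the parser only understands flat `key: value` lines; it is not a YAML parser, so a real checker would more likely rely on a YAML library.

```python
def split_frontmatter(text):
    """Return (metadata, body) for a document that starts with a '---' block.

    Only flat `key: value` lines are understood; this is not a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    metadata = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            body = "\n".join(lines[index + 1:])
            return metadata, body
        if ":" in line:
            key, _, value = line.partition(":")
            metadata[key.strip()] = value.strip()
    raise ValueError("frontmatter block is not closed with '---'")


# Hypothetical ADR file; the field names are illustrative only.
example = """---
title: ADR-001 S3 for partial results
status: proposed
---
## Context
What is the problem?
"""

metadata, body = split_frontmatter(example)
print(metadata["status"])
```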
ADR template (based on MADR)
Each ADR should follow this structure:
## Context
What is the problem? What forces are at play?
## Decision Drivers
- Driver 1
- Driver 2
## Considered Options
1. Option A
2. Option B
3. Option C
## Decision Outcome
Chosen option: "Option X", because ...
## Consequences
### Positive
- ...
### Negative
- ...
### Neutral
- ...
Sidebar integration
Add a sidebar entry for the rework-orchestration section to astro.config.mjs.
Plan of action
Phase 0: Bootstrap the structure (this issue)
- Create the directory structure under src/content/docs/rework-orchestration/
- Add the index.mdx vision document with the goals listed above
- Add the ADR index (adr/index.mdx) with an empty registry table
- Add the ADR template as adr/_template.mdx (not rendered, used as copy-paste source)
- Add the sidebar entry in astro.config.mjs
- Create a tracking issue in yaptide/yaptide and an org-level GitHub Project to coordinate cross-repo work
- Agree on labels to use across repos (e.g. rework-orchestration, partial-results, horizontal-scaling)
Phase 1: Capture the current state (AI-assisted)
Use Copilot / deep-research agents to analyze the existing codebase and populate the context/ section:
- current-architecture.mdx: Have an AI agent trace the full simulation lifecycle through yaptide/yaptide source code (Celery chord setup, worker communication, merge logic, DB storage). Cross-reference with the existing architecture/overview.md and architecture/data-flow.md. Document what the code actually does today, not just what the docs say.
- current-bottlenecks.mdx: Identify and quantify pain points: JSON serialization overhead for large results, pure-Python merge performance, Redis broker message size limits, lack of partial result visibility. Include concrete numbers where possible (file sizes, timings, memory usage).
- deployment-constraints.mdx: Document the realities of each deployment target: PLGrid/SLURM (no persistent services, SSH-only access, shared filesystem), cloud/Docker Compose (full control, can add services like MinIO). Capture what is and isn't possible in each environment.
- user-requirements.mdx: Gather input from researchers: what do they actually want to see during a running simulation? How quickly? What decisions would partial results enable? What is "good enough"?
Phase 2: Explore options and make decisions (AI-assisted reasoning)
For each major decision, conduct a structured AI-assisted research session:
- Research session: result storage. Discuss S3/MinIO vs Redis vs shared filesystem vs database BLOBs for partial results. Consider both deployment targets. Save the session log to research/, distill the outcome into adr/adr-001-s3-for-partial-results.mdx.
- Research session: binary format. Compare Parquet vs ROOT vs HDF5 vs msgpack for result serialization. Evaluate: write speed, read speed, compression ratio, Python ecosystem support, browser-side readability. Distill into adr/adr-002-binary-format-selection.mdx.
- Research session: merge algorithm. Profile the current pure-Python merge. Benchmark NumPy, Polars, and Dask alternatives on representative data. Distill into adr/adr-003-merge-algorithm.mdx.
- Research session: horizontal scaling. Explore how a single job can span multiple machines: Celery multi-node vs Dask distributed vs custom coordination. Consider the implications for both SLURM and cloud. Distill into adr/adr-005-horizontal-scaling-model.mdx.
- Research session: retry and logging. Distill into adr/adr-004-retry-and-logging.mdx.
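For the merge-algorithm session, a small harness that checks every candidate against a reference before timing it can keep the comparison honest. The sketch below is one possible shape for it, assuming that a merge is a per-bin average across task results and that each result is a flat list of floats. Both candidates are standard-library stand-ins so the sketch runs anywhere; NumPy, Polars, or Dask implementations would be appended to the candidate list. The function names and the synthetic data are illustrative, not taken from the yaptide code.

```python
import timeit
from statistics import fmean


def merge_loop(results):
    """Average per-bin values across task results with explicit loops."""
    merged = [0.0] * len(results[0])
    for result in results:
        for i, value in enumerate(result):
            merged[i] += value
    return [total / len(results) for total in merged]


def merge_columns(results):
    """Same average, computed column-wise with zip and statistics.fmean."""
    return [fmean(column) for column in zip(*results)]


def compare(candidates, results, repeat=3):
    """Verify each candidate against the first one, then time it."""
    reference = candidates[0](results)
    timings = {}
    for func in candidates:
        output = func(results)
        assert len(output) == len(reference)
        assert all(abs(a - b) < 1e-9 for a, b in zip(reference, output))
        runs = timeit.repeat(lambda: func(results), number=1, repeat=repeat)
        timings[func.__name__] = min(runs)  # best of `repeat` runs, in seconds
    return timings


# Synthetic stand-in for representative data: 8 tasks, 1000 bins each.
results = [[float(task + b) for b in range(1000)] for task in range(8)]
timings = compare([merge_loop, merge_columns], results)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Timings from synthetic data like this say little about production behaviour; the session should rerun the harness on real result files and also record peak memory.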
Each research session should:
- Start with a clear question and relevant context files fed to the AI
- Explore at least 3 options with pros/cons
- End with a recommendation
- Be saved as a research/session-NNN-*.mdx file for auditability
- Result in a corresponding ADR with status proposed
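The naming rule for session files can be checked mechanically. The following sketch assumes that NNN is a zero-padded three-digit sequence number and that the rest of the name is a lowercase hyphenated slug; both are guesses at the intended convention, and the helper names are hypothetical.

```python
import re

# Assumption: NNN is three digits and the slug is lowercase words joined by
# hyphens. Adjust the pattern if the agreed convention differs.
SESSION_NAME = re.compile(
    r"^research/session-(\d{3})-([a-z0-9]+(?:-[a-z0-9]+)*)\.mdx$"
)


def parse_session_path(path):
    """Return (number, slug) for a valid session path, or None."""
    match = SESSION_NAME.match(path)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def next_session_number(paths):
    """Suggest the next free sequence number given existing session paths."""
    parsed = (parse_session_path(path) for path in paths)
    numbers = [item[0] for item in parsed if item is not None]
    return max(numbers, default=0) + 1


existing = [
    "research/session-001-result-storage.mdx",
    "research/session-002-binary-format.mdx",
]
print(f"research/session-{next_session_number(existing):03d}-merge-algorithm.mdx")
```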
Phase 3: Detailed design
Based on accepted ADRs, write detailed design documents:
- design/partial-results-streaming.mdx: Worker → S3 upload flow, API endpoints for partial result retrieval, UI polling/streaming strategy
- design/horizontal-scaling-model.mdx: Job fan-out across nodes, task coordination, failure handling
- design/result-merge-pipeline.mdx: New merge algorithm, data format in/out, memory budget, benchmarks
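The failure handling called for in the scaling design, together with the retry-logic goal, could start from something like the sketch below: a retry helper with exponential backoff. It is a generic pattern, not the project's implementation; the helper name, its parameters, and the flaky upload function are all hypothetical. Celery also has built-in task retry options, which may make a hand-written helper unnecessary for Celery tasks.

```python
import time


def retry(func, attempts=3, base_delay=0.5, sleep=time.sleep,
          retry_on=(Exception,)):
    """Call func until it succeeds or the attempts are exhausted.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    If every attempt fails, the last exception is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Usage: a hypothetical flaky upload that succeeds on the third call.
calls = {"count": 0}


def flaky_upload():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "uploaded"


delays = []  # record requested delays instead of actually sleeping
result = retry(flaky_upload, attempts=5, base_delay=0.1, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a production version would probably also add jitter and log each failed attempt.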
Phase 4: Implementation tracking
- Create sub-issues in yaptide/yaptide for each implementation work package, linked to the corresponding ADR and design document
- Track progress in the org-level GitHub Project
- Update ADR statuses as implementation proceeds (proposed → accepted, or → rejected with reasoning)
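The status rules above can be written down as a small transition table, which a lint script could use to reject invalid status changes. This is a sketch under assumptions: the statuses are the ones named in this plan (including "superseded" from the retrospective phase), but the exact set of allowed transitions and the function name are illustrative.

```python
# Statuses named in this plan; the transition table itself is an assumption.
ALLOWED_TRANSITIONS = {
    "proposed": {"accepted", "rejected"},
    "accepted": {"superseded"},
    "rejected": set(),
    "superseded": set(),
}


def change_status(current, new, reason=None):
    """Return the new status if the transition is allowed, else raise."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current!r}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move an ADR from {current!r} to {new!r}")
    if new == "rejected" and not reason:
        raise ValueError("a rejection must come with reasoning")
    return new


print(change_status("proposed", "accepted"))
print(change_status("proposed", "rejected", reason="benchmarks favoured option B"))
```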
Phase 5: Retrospective
- After the main implementation is complete, update the vision document with outcomes
- Mark ADRs that were superseded during implementation
- Write a retrospective on the design process itself: what worked, what to improve for the next rework (e.g. UI)
Relation to other issues
- Replace ASCII diagrams with Mermaid #45: ASCII diagrams in docs should be replaced with Mermaid; the new rework-orchestration/ docs should use Mermaid from the start
- Developer onboarding page #48: the orchestration rework docs will serve as onboarding material for contributors to this effort
Notes on AI-assisted workflow
The research/ folder is key to making this process work with AI tools. Each session file should:
- State the question being investigated
- List the context files that were provided to the AI (with repo links)
- Capture the key insights and reasoning, not raw transcripts
- End with a clear conclusion or set of options
- Use plain Markdown with structured headings, which makes the file itself useful as context for future AI sessions
This creates a compounding knowledge base: each new AI session can be given the vision doc + relevant context files + previous research sessions, enabling increasingly informed reasoning over time.