Establish an architectural design process for Simulation Orchestration Overhaul

## Summary

We need to establish a structured, documented design process for overhauling the simulation job orchestration architecture. The goals of this effort are:

1. **Real-time partial results** — let researchers inspect intermediate results while a simulation is still running, so misconfigurations are caught early instead of after hours of waiting.
2. **Horizontal scalability** — enable a single job to span multiple machines in parallel, so production deployments fully leverage available compute on both HPC clusters and cloud environments.
3. **Optimized result transport** — stop serializing multi-gigabyte result files to JSON and routing them through a message broker; instead, store results in S3-compatible object storage using efficient binary formats (Parquet, ROOT), removing this I/O bottleneck.
4. **Optimized result merge** — replace the pure-Python result averaging with high-performance numerical libraries (NumPy, Polars), reducing merge time and memory usage.
5. **Enhanced stability and control** — integrate production-grade retry logic, robust centralized logging, and improved error handling.

This is a multi-month, multi-person effort that will span primarily `yaptide/yaptide` and this documentation repo, with possible touches to `yaptide/deploy` and `yaptide/ui`.

This issue also establishes **a reusable pattern** for future similar architectural processes (e.g. a UI architecture rework), so the structure and conventions chosen here should be general enough to serve as a template.

## Why document the design process here

The `for_developers` repo is described as _"All documentation on code from all repositories"_ and already has `architecture/` and `backend/` sections. Rather than creating yet another repo in an already-scattered organization (17 repos), we keep all architectural reasoning in one place that is:

- **Human-browsable** — rendered on `yaptide.github.io/for_developers`
- **LLM-ingestible** — plain Markdown/MDX files in git, with structured YAML frontmatter, easily consumed by Copilot, Claude, MCP agents, or any future AI tooling
- **Version-controlled** — full history of how and why decisions were made
- **Cross-referenced** — links to issues across `yaptide/yaptide`, `yaptide/deploy`, etc.

## Proposed documentation structure

```text
src/content/docs/
├── rework-orchestration/                    # ← this effort
│   ├── index.mdx                            # Vision & goals (the "why")
│   ├── context/
│   │   ├── current-architecture.mdx         # As-is analysis (from architecture/overview.md & data-flow.md)
│   │   ├── current-bottlenecks.mdx          # Profiling data: JSON serialization, pure-Python merge, broker limits
│   │   ├── deployment-constraints.mdx       # HPC (PLGrid/SLURM) vs cloud (Docker Compose) resource realities
│   │   └── user-requirements.mdx            # What researchers actually need from partial results & monitoring
│   ├── adr/
│   │   ├── index.mdx                        # ADR registry: table of all decisions with status
│   │   ├── adr-001-s3-for-partial-results.mdx
│   │   ├── adr-002-binary-format-selection.mdx
│   │   ├── adr-003-merge-algorithm.mdx
│   │   ├── adr-004-retry-and-logging.mdx
│   │   ├── adr-005-horizontal-scaling-model.mdx
│   │   └── ...
│   ├── design/
│   │   ├── partial-results-streaming.mdx    # Detailed design: how workers push partial results to S3
│   │   ├── horizontal-scaling-model.mdx     # How a job fans out across nodes
│   │   └── result-merge-pipeline.mdx        # NumPy/Polars merge architecture
│   └── research/
│       ├── session-001-data-flow-analysis.mdx
│       ├── session-002-s3-vs-redis-tradeoffs.mdx
│       └── ...                              # AI-assisted reasoning session logs
```

This pattern should be reusable. A future UI rework would live under `src/content/docs/rework-ui/` with the same `context/`, `adr/`, `design/`, `research/` substructure.

### File conventions

Each file should use YAML frontmatter for metadata:

```yaml
---
title: "ADR-001: Use S3-compatible storage for partial simulation results"
status: proposed          # one of: proposed | accepted | rejected | superseded
date: 2026-04-15
authors: [author1, author2]
tags: [storage, partial-results, s3, minio]
related-issues:
  - yaptide/yaptide#NNN
  - yaptide/deploy#NNN
supersedes: null
superseded-by: null
---
```
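
These conventions can be checked mechanically, for example in CI. The following is a minimal sketch rather than existing tooling: `REQUIRED_KEYS`, `ALLOWED_STATUSES`, and `check_frontmatter` are hypothetical names, the choice of which keys are mandatory is an assumption, and the function expects the frontmatter to have already been parsed into a dictionary by a YAML loader that turns unquoted ISO dates into `datetime.date` objects.

```python
from datetime import date

# Hypothetical rule set mirroring the frontmatter example above.
REQUIRED_KEYS = {"title", "status", "date", "authors", "tags"}
ALLOWED_STATUSES = {"proposed", "accepted", "rejected", "superseded"}


def check_frontmatter(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    missing = sorted(REQUIRED_KEYS - set(meta))
    if missing:
        problems.append(f"missing keys: {', '.join(missing)}")

    status = meta.get("status")
    if status is not None and status not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {status!r}")

    # A superseded ADR should point at the ADR that replaced it.
    if status == "superseded" and not meta.get("superseded-by"):
        problems.append("status is 'superseded' but 'superseded-by' is empty")

    # Assumes the YAML loader already converted the value to a date object.
    if "date" in meta and not isinstance(meta["date"], date):
        problems.append("date must be an ISO date (YYYY-MM-DD)")

    return problems
```

Returning a list of problems instead of raising on the first one lets a single run report everything wrong with a file.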

### ADR template (based on MADR)

Each ADR should follow this structure:

```markdown
## Context

What is the problem? What forces are at play?

## Decision Drivers

- Driver 1
- Driver 2

## Considered Options

1. Option A
2. Option B
3. Option C

## Decision Outcome

Chosen option: "Option X", because ...

## Consequences

### Positive
- ...

### Negative
- ...

### Neutral
- ...
```
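
To keep ADRs aligned with this template, a small check could flag files that are missing one of the top-level sections. This is a sketch under assumptions: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and the check only looks at level-2 headings (`##`), so the `### Positive` / `### Negative` / `### Neutral` subsections are not enforced.

```python
import re

# Section headings taken from the template above.
REQUIRED_SECTIONS = [
    "Context",
    "Decision Drivers",
    "Considered Options",
    "Decision Outcome",
    "Consequences",
]


def missing_sections(markdown: str) -> list[str]:
    """Return the required level-2 headings that do not appear in `markdown`."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^## +(.+)$", markdown, re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

An empty return value means the ADR body contains every required section; the order of sections is not checked.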

## Sidebar integration

Add to `astro.config.mjs`:

```javascript
{
  label: "Orchestration Rework",
  collapsed: false,
  items: [
    { label: "Vision", slug: "rework-orchestration" },
    {
      label: "Context & Constraints",
      autogenerate: { directory: "rework-orchestration/context" },
    },
    {
      label: "Architecture Decisions",
      autogenerate: { directory: "rework-orchestration/adr" },
    },
    {
      label: "Design Documents",
      autogenerate: { directory: "rework-orchestration/design" },
    },
    {
      label: "Research Sessions",
      autogenerate: { directory: "rework-orchestration/research" },
    },
  ],
},
```

## Plan of action

### Phase 0 — Bootstrap the structure (this issue)

- [ ] Create the directory structure under `src/content/docs/rework-orchestration/`
- [ ] Add the `index.mdx` vision document with the goals listed above
- [ ] Add the ADR index (`adr/index.mdx`) with an empty registry table
- [ ] Add the ADR template as `adr/_template.mdx` (not rendered, used as copy-paste source)
- [ ] Add the sidebar entry in `astro.config.mjs`
- [ ] Create a tracking issue in `yaptide/yaptide` and an org-level GitHub Project to coordinate cross-repo work
- [ ] Agree on labels to use across repos (e.g. `rework-orchestration`, `partial-results`, `horizontal-scaling`)
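
The registry table in `adr/index.mdx` could later be generated from ADR metadata instead of being maintained by hand. The sketch below is illustrative only: `render_registry` is a hypothetical helper, and the dictionary keys (`number`, `title`, `status`, `date`) and the column layout are assumptions, not an agreed format.

```python
def render_registry(adrs: list[dict]) -> str:
    """Render ADR metadata as a Markdown table, sorted by ADR number."""
    lines = [
        "| ADR | Title | Status | Date |",
        "| --- | --- | --- | --- |",
    ]
    for adr in sorted(adrs, key=lambda item: item["number"]):
        lines.append(
            f"| ADR-{adr['number']:03d} | {adr['title']} "
            f"| {adr['status']} | {adr['date']} |"
        )
    return "\n".join(lines)
```

Called with an empty list, the function returns only the header and separator rows, which corresponds to the empty registry table needed in Phase 0.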

### Phase 1 — Capture the current state (AI-assisted)

Use Copilot / deep-research agents to analyze the existing codebase and populate the `context/` section:

- [ ] **`current-architecture.mdx`** — Have an AI agent trace the full simulation lifecycle through `yaptide/yaptide` source code (Celery chord setup, worker communication, merge logic, DB storage). Cross-reference with the existing [`architecture/overview.md`](https://github.com/yaptide/for_developers/blob/main/src/content/docs/architecture/overview.md) and [`architecture/data-flow.md`](https://github.com/yaptide/for_developers/blob/main/src/content/docs/architecture/data-flow.md). Document what the code *actually does* today, not just what the docs say.
- [ ] **`current-bottlenecks.mdx`** — Identify and quantify pain points: JSON serialization overhead for large results, pure-Python merge performance, Redis broker message size limits, lack of partial result visibility. Include concrete numbers where possible (file sizes, timings, memory usage).
- [ ] **`deployment-constraints.mdx`** — Document the realities of each deployment target: PLGrid/SLURM (no persistent services, SSH-only access, shared filesystem), cloud/Docker Compose (full control, can add services like MinIO). Capture what is and isn't possible in each environment.
- [ ] **`user-requirements.mdx`** — Gather input from researchers: what do they actually want to see during a running simulation? How quickly? What decisions would partial results enable? What is "good enough"?
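
For the "concrete numbers" requested in `current-bottlenecks.mdx`, a small harness can give a first impression of JSON overhead before profiling the real pipeline. The following sketch uses only the standard library and makes several assumptions: the data is synthetic random floats rather than real simulation output, raw float64 bytes stand in for a binary format (this is neither Parquet nor ROOT), and `compare_encodings` is a hypothetical name.

```python
import json
import random
import time
from array import array


def compare_encodings(n_values: int, seed: int = 0) -> dict:
    """Encode synthetic floats as JSON text and as raw float64 bytes."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n_values)]

    start = time.perf_counter()
    json_bytes = json.dumps(values).encode("utf-8")
    json_seconds = time.perf_counter() - start

    start = time.perf_counter()
    # Uncompressed C doubles: 8 bytes per value on common platforms.
    raw_bytes = array("d", values).tobytes()
    raw_seconds = time.perf_counter() - start

    return {
        "json_bytes": len(json_bytes),
        "raw_bytes": len(raw_bytes),
        "json_seconds": json_seconds,
        "raw_seconds": raw_seconds,
    }


if __name__ == "__main__":
    print(compare_encodings(1_000_000))
```

Timings from a harness like this are only indicative; numbers recorded in the context document should come from representative result files on the actual deployment targets.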

### Phase 2 — Explore options and make decisions (AI-assisted reasoning)

For each major decision, conduct a structured AI-assisted research session:

- [ ] **Research session: result storage** — Discuss S3/MinIO vs Redis vs shared filesystem vs database BLOBs for partial results. Consider both deployment targets. Save the session log to `research/`, distill the outcome into `adr/adr-001-s3-for-partial-results.mdx`.
- [ ] **Research session: binary format** — Compare Parquet vs ROOT vs HDF5 vs msgpack for result serialization. Evaluate: write speed, read speed, compression ratio, Python ecosystem support, browser-side readability. Distill into `adr/adr-002-binary-format-selection.mdx`.
- [ ] **Research session: merge algorithm** — Profile current pure-Python merge. Benchmark NumPy, Polars, and Dask alternatives on representative data. Distill into `adr/adr-003-merge-algorithm.mdx`.
- [ ] **Research session: retry & logging** — Evaluate Celery retry patterns, dead-letter queues, structured logging (structlog), centralized log aggregation. Distill into `adr/adr-004-retry-and-logging.mdx`.
- [ ] **Research session: horizontal scaling** — Explore how a single job can span multiple machines. Celery multi-node vs Dask distributed vs custom coordination. Implications for both SLURM and cloud. Distill into `adr/adr-005-horizontal-scaling-model.mdx`.

Each research session should:
1. Start with a clear question and relevant context files fed to the AI
2. Explore at least 3 options with pros/cons
3. End with a recommendation
4. Be saved as a `research/session-NNN-*.mdx` file for auditability
5. Result in a corresponding ADR with status `proposed`
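
As a reference point for the retry & logging session, the sketch below shows one common retry pattern: exponential backoff with full jitter. It is library-agnostic and is not the project's implementation; `retry_with_backoff` and its parameters are hypothetical. Celery provides task options in this area (such as `retry_backoff` and `retry_jitter`), and whether to rely on those or on custom logic is exactly what the session should decide.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call `func` until it succeeds or `max_attempts` calls have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay cap grows as base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the cap so that many
            # workers failing together do not retry in lockstep.
            sleep(cap * rng())
```

The `sleep` and `rng` parameters exist so the behaviour can be tested deterministically. A production version would normally catch only errors known to be transient instead of every `Exception`.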

### Phase 3 — Detailed design

Based on accepted ADRs, write detailed design documents:

- [ ] **`design/partial-results-streaming.mdx`** — Worker → S3 upload flow, API endpoints for partial result retrieval, UI polling/streaming strategy
- [ ] **`design/horizontal-scaling-model.mdx`** — Job fan-out across nodes, task coordination, failure handling
- [ ] **`design/result-merge-pipeline.mdx`** — New merge algorithm, data format in/out, memory budget, benchmarks
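
One idea the merge pipeline design could evaluate is an incremental average: each worker's result is folded into a running mean as it arrives, so memory stays bounded to one array and the current mean doubles as a partial result. The sketch below is an illustration under assumptions, not the existing merge logic: `RunningMean` is a hypothetical class, every worker is weighted equally, and all results are flat arrays of the same length. It is pure Python for portability; with NumPy the inner loop would become a single whole-array update of the form `mean += (x - mean) / k`.

```python
from array import array


class RunningMean:
    """Element-wise mean over equally sized result arrays, updated one at a time."""

    def __init__(self, size: int) -> None:
        self.count = 0
        self.mean = array("d", [0.0] * size)

    def add(self, values) -> None:
        """Fold one worker's result into the running mean."""
        if len(values) != len(self.mean):
            raise ValueError("result arrays must have the same length")
        self.count += 1
        k = self.count
        for i, x in enumerate(values):
            # mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
            self.mean[i] += (x - self.mean[i]) / k
```

If workers simulate different numbers of primaries, an unweighted mean is no longer appropriate and the update would need per-worker weights; the design document should state which case applies.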

### Phase 4 — Implementation tracking

- [ ] Create sub-issues in `yaptide/yaptide` for each implementation work package, linked to the corresponding ADR and design document
- [ ] Track progress in the org-level GitHub Project
- [ ] Update ADR statuses as implementation proceeds (proposed → accepted, or → rejected with reasoning)

### Phase 5 — Retrospective

- [ ] After the main implementation is complete, update the vision document with outcomes
- [ ] Mark ADRs that were superseded during implementation
- [ ] Write a retrospective on the design process itself — what worked, what to improve for the next rework (e.g. UI)

## Relation to other issues

- #45 — ASCII diagrams in docs should be replaced with Mermaid; the new `rework-orchestration/` docs should use Mermaid from the start
- #48 — Developer onboarding page; the orchestration rework docs will serve as onboarding material for contributors to this effort
- #8 — Check if anything should be included from other repos; this process will explicitly link to `yaptide/yaptide`, `yaptide/deploy`, etc.

## Notes on AI-assisted workflow

The `research/` folder is key to making this process work with AI tools. Each session file should:
- State the **question** being investigated
- List the **context files** that were provided to the AI (with repo links)
- Capture the **key insights and reasoning**, not raw transcripts
- End with a **clear conclusion or set of options**
- Use plain Markdown with structured headings — this makes the file itself useful as context for future AI sessions

This creates a compounding knowledge base: each new AI session can be given the vision doc + relevant context files + previous research sessions, enabling increasingly informed reasoning over time.
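
To keep session files uniform, a small helper could generate the file name and an empty skeleton. This is a sketch only: `session_filename`, `session_skeleton`, and the exact heading names in `SECTIONS` are hypothetical, chosen to follow the `research/session-NNN-*.mdx` naming and the bullet points above.

```python
import re

# Headings derived from the session requirements listed above (assumed wording).
SECTIONS = [
    "Question",
    "Context files",
    "Key insights and reasoning",
    "Conclusion",
]


def session_filename(number: int, topic: str) -> str:
    """Build a file name such as 'session-002-s3-vs-redis-tradeoffs.mdx'."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"session-{number:03d}-{slug}.mdx"


def session_skeleton(title: str) -> str:
    """Return frontmatter plus one empty section per required heading."""
    parts = [f'---\ntitle: "{title}"\n---\n']
    parts += [f"## {name}\n" for name in SECTIONS]
    return "\n".join(parts)
```

For example, `session_filename(2, "S3 vs Redis tradeoffs")` yields the same name as the second session file in the proposed directory tree.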
