Establish an architectural design process for Simulation Orchestration Overhaul #66

@grzanka

Summary

We need to establish a structured, documented design process for overhauling the simulation job orchestration architecture. The goals of this effort are:

  1. Real-time partial results — let researchers inspect intermediate results while a simulation is still running, so misconfigurations are caught early instead of after hours of waiting.
  2. Horizontal scalability — enable a single job to span multiple machines in parallel, so production deployments fully leverage available compute on both HPC clusters and cloud environments.
  3. Optimized result transport — stop serializing multi-gigabyte result files to JSON and routing them through the message broker; instead, write results in efficient binary formats (Parquet, ROOT) to S3-compatible object storage, eliminating this I/O bottleneck.
  4. Optimized result merge — replace the pure-Python result averaging with high-performance numerical libraries (NumPy, Polars), reducing merge time and memory usage.
  5. Enhanced stability and control — integrate production-grade retry logic, robust centralized logging, and improved error handling.
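
As a rough illustration of goal 4, the sketch below compares an element-by-element Python average with a vectorized NumPy equivalent. It assumes each task returns a flat list of per-bin values and that tasks are weighted (for example by the number of primary particles each one simulated); the function names and the data are hypothetical, and the actual yaptide merge logic may differ.

```python
import numpy as np


def merge_pure_python(partials, counts):
    """Reference version: element-by-element weighted average in plain Python."""
    total = sum(counts)
    merged = [0.0] * len(partials[0])
    for values, n in zip(partials, counts):
        for i, v in enumerate(values):
            merged[i] += v * n / total
    return merged


def merge_numpy(partials, counts):
    """Same weighted average, vectorized over a (tasks, bins) array."""
    stacked = np.asarray(partials, dtype=np.float64)
    weights = np.asarray(counts, dtype=np.float64)
    return np.average(stacked, axis=0, weights=weights)


# Hypothetical data: 3 tasks, 4 scoring bins each, weighted by task size.
partials = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]]
counts = [1000, 1000, 2000]

merged = merge_numpy(partials, counts)
print(merged)  # [2. 2. 2. 2.]
```

Both functions compute the same result; the NumPy version replaces the nested Python loops with a single array operation, which is where the time and memory savings on large results would be expected to come from.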

This is a multi-month, multi-person effort that will primarily span yaptide/yaptide and this documentation repo, with possible changes to yaptide/deploy and yaptide/ui.

This issue also establishes a reusable pattern for future similar architectural processes (e.g. a UI architecture rework), so the structure and conventions chosen here should be general enough to serve as a template.

Why document the design process here

The for_developers repo is described as "All documentation on code from all repositories" and already has architecture/ and backend/ sections. Rather than creating yet another repo in an already-scattered organization (17 repos), we keep all architectural reasoning in one place that is:

  • Human-browsable — rendered on yaptide.github.io/for_developers
  • LLM-ingestible — plain Markdown/MDX files in git, with structured YAML frontmatter, easily consumed by Copilot, Claude, MCP agents, or any future AI tooling
  • Version-controlled — full history of how and why decisions were made
  • Cross-referenced — links to issues across yaptide/yaptide, yaptide/deploy, etc.

Proposed documentation structure

src/content/docs/
├── rework-orchestration/                    # ← this effort
│   ├── index.mdx                            # Vision & goals (the "why")
│   ├── context/
│   │   ├── current-architecture.mdx         # As-is analysis (from architecture/overview.md & data-flow.md)
│   │   ├── current-bottlenecks.mdx          # Profiling data: JSON serialization, pure-Python merge, broker limits
│   │   ├── deployment-constraints.mdx       # HPC (PLGrid/SLURM) vs cloud (Docker Compose) resource realities
│   │   └── user-requirements.mdx            # What researchers actually need from partial results & monitoring
│   ├── adr/
│   │   ├── index.mdx                        # ADR registry: table of all decisions with status
│   │   ├── adr-001-s3-for-partial-results.mdx
│   │   ├── adr-002-binary-format-selection.mdx
│   │   ├── adr-003-merge-algorithm.mdx
│   │   ├── adr-004-retry-and-logging.mdx
│   │   ├── adr-005-horizontal-scaling-model.mdx
│   │   └── ...
│   ├── design/
│   │   ├── partial-results-streaming.mdx    # Detailed design: how workers push partial results to S3
│   │   ├── horizontal-scaling-model.mdx     # How a job fans out across nodes
│   │   └── result-merge-pipeline.mdx        # NumPy/Polars merge architecture
│   └── research/
│       ├── session-001-data-flow-analysis.mdx
│       ├── session-002-s3-vs-redis-tradeoffs.mdx
│       └── ...                              # AI-assisted reasoning session logs

This pattern should be reusable. A future UI rework would live under src/content/docs/rework-ui/ with the same context/, adr/, design/, research/ substructure.

File conventions

Each file should use YAML frontmatter for metadata:

---
title: "ADR-001: Use S3-compatible storage for partial simulation results"
status: proposed          # one of: proposed, accepted, rejected, superseded
date: 2026-04-15
authors: [author1, author2]
tags: [storage, partial-results, s3, minio]
related-issues:
  - yaptide/yaptide#NNN
  - yaptide/deploy#NNN
supersedes: null
superseded-by: null
---
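
A convention like this is easy to check automatically. The sketch below is one possible check, not part of the proposal: it reads the `status` field from a file's leading frontmatter and verifies it against the four states named above. It uses a hand-rolled, line-based reader that only understands flat `key: value` pairs; a real check would more likely use a proper YAML parser. The names `read_status` and `is_valid_adr` are hypothetical.

```python
import re

ALLOWED_STATUSES = {"proposed", "accepted", "superseded", "rejected"}

# Frontmatter is the block between the first two "---" lines of the file.
FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def read_status(mdx_text):
    """Return the top-level `status` value from the frontmatter, or None."""
    match = FRONTMATTER_RE.match(mdx_text)
    if match is None:
        return None
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key == "status":
            # Drop a trailing "# comment", whitespace, and optional quotes.
            return value.split("#", 1)[0].strip().strip("\"'")
    return None


def is_valid_adr(mdx_text):
    """True if the file has frontmatter with a recognized status."""
    return read_status(mdx_text) in ALLOWED_STATUSES


example = """---
title: "ADR-001: Use S3-compatible storage for partial simulation results"
status: proposed  # lifecycle state
date: 2026-04-15
---

## Context
"""
print(read_status(example))  # proposed
```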

ADR template (based on MADR)

Each ADR should follow this structure:

## Context

What is the problem? What forces are at play?

## Decision Drivers

- Driver 1
- Driver 2

## Considered Options

1. Option A
2. Option B
3. Option C

## Decision Outcome

Chosen option: "Option X", because ...

## Consequences

### Positive
- ...

### Negative
- ...

### Neutral
- ...
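
To keep ADRs consistent with this template, a small check could confirm that the required headings are present and in order. The following is a minimal sketch under that assumption; the function name `missing_sections` is hypothetical, and the third-level headings under Consequences are deliberately not checked.

```python
REQUIRED_SECTIONS = [
    "## Context",
    "## Decision Drivers",
    "## Considered Options",
    "## Decision Outcome",
    "## Consequences",
]


def missing_sections(adr_body):
    """Return required headings that are absent or appear out of order."""
    # Only second-level headings count; "### Positive" etc. are skipped.
    headings = [line.strip() for line in adr_body.splitlines() if line.startswith("## ")]
    missing = []
    position = 0
    for section in REQUIRED_SECTIONS:
        try:
            # Search only after the previously matched heading to enforce order.
            position = headings.index(section, position) + 1
        except ValueError:
            missing.append(section)
    return missing


draft = (
    "## Context\n\nText.\n\n"
    "## Considered Options\n\n1. A\n\n"
    "## Decision Outcome\n\n"
    "## Consequences\n\n### Positive\n- ...\n"
)
print(missing_sections(draft))  # ['## Decision Drivers']
```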

Sidebar integration

Add to the sidebar array in astro.config.mjs:

{
  label: "Orchestration Rework",
  collapsed: false,
  items: [
    { label: "Vision", slug: "rework-orchestration" },
    {
      label: "Context & Constraints",
      autogenerate: { directory: "rework-orchestration/context" },
    },
    {
      label: "Architecture Decisions",
      autogenerate: { directory: "rework-orchestration/adr" },
    },
    {
      label: "Design Documents",
      autogenerate: { directory: "rework-orchestration/design" },
    },
    {
      label: "Research Sessions",
      autogenerate: { directory: "rework-orchestration/research" },
    },
  ],
},

Plan of action

Phase 0 — Bootstrap the structure (this issue)

  • Create the directory structure under src/content/docs/rework-orchestration/
  • Add the index.mdx vision document with the goals listed above
  • Add the ADR index (adr/index.mdx) with an empty registry table
  • Add the ADR template as adr/_template.mdx (not rendered, used as copy-paste source)
  • Add the sidebar entry in astro.config.mjs
  • Create a tracking issue in yaptide/yaptide and an org-level GitHub Project to coordinate cross-repo work
  • Agree on labels to use across repos (e.g. rework-orchestration, partial-results, horizontal-scaling)

Phase 1 — Capture the current state (AI-assisted)

Use Copilot / deep-research agents to analyze the existing codebase and populate the context/ section:

  • current-architecture.mdx — Have an AI agent trace the full simulation lifecycle through yaptide/yaptide source code (Celery chord setup, worker communication, merge logic, DB storage). Cross-reference with the existing architecture/overview.md and architecture/data-flow.md. Document what the code actually does today, not just what the docs say.
  • current-bottlenecks.mdx — Identify and quantify pain points: JSON serialization overhead for large results, pure-Python merge performance, Redis broker message size limits, lack of partial result visibility. Include concrete numbers where possible (file sizes, timings, memory usage).
  • deployment-constraints.mdx — Document the realities of each deployment target: PLGrid/SLURM (no persistent services, SSH-only access, shared filesystem), cloud/Docker Compose (full control, can add services like MinIO). Capture what is and isn't possible in each environment.
  • user-requirements.mdx — Gather input from researchers: what do they actually want to see during a running simulation? How quickly? What decisions would partial results enable? What is "good enough"?

Phase 2 — Explore options and make decisions (AI-assisted reasoning)

For each major decision, conduct a structured AI-assisted research session:

  • Research session: result storage — Discuss S3/MinIO vs Redis vs shared filesystem vs database BLOBs for partial results. Consider both deployment targets. Save the session log to research/, distill the outcome into adr/adr-001-s3-for-partial-results.mdx.
  • Research session: binary format — Compare Parquet vs ROOT vs HDF5 vs msgpack for result serialization. Evaluate: write speed, read speed, compression ratio, Python ecosystem support, browser-side readability. Distill into adr/adr-002-binary-format-selection.mdx.
  • Research session: merge algorithm — Profile current pure-Python merge. Benchmark NumPy, Polars, and Dask alternatives on representative data. Distill into adr/adr-003-merge-algorithm.mdx.
  • Research session: horizontal scaling — Explore how a single job can span multiple machines. Celery multi-node vs Dask distributed vs custom coordination. Implications for both SLURM and cloud. Distill into adr/adr-005-horizontal-scaling-model.mdx.
  • Research session: retry & logging — Evaluate Celery retry patterns, dead-letter queues, structured logging (structlog), centralized log aggregation. Distill into adr/adr-004-retry-and-logging.mdx.
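
As background for the retry & logging session, the sketch below shows the general shape of retrying with exponential backoff and jitter in plain Python. It is library-independent and illustrative only: Celery provides its own retry facilities, which the session would evaluate instead. The `retry` helper, its parameters, and `flaky_upload` are hypothetical.

```python
import random
import time


def retry(func, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n (capped, jittered), then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so workers do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))


calls = {"n": 0}


def flaky_upload():
    """Hypothetical operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


waits = []
result = retry(flaky_upload, sleep=waits.append)  # record delays instead of sleeping
print(result, calls["n"], len(waits))  # ok 3 2
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, which matters if retry behavior is to be covered by unit tests.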

Each research session should:

  1. Start with a clear question and relevant context files fed to the AI
  2. Explore at least 3 options with pros/cons
  3. End with a recommendation
  4. Be saved as a research/session-NNN-*.mdx file for auditability
  5. Result in a corresponding ADR with status proposed

Phase 3 — Detailed design

Based on accepted ADRs, write detailed design documents:

  • design/partial-results-streaming.mdx — Worker → S3 upload flow, API endpoints for partial result retrieval, UI polling/streaming strategy
  • design/horizontal-scaling-model.mdx — Job fan-out across nodes, task coordination, failure handling
  • design/result-merge-pipeline.mdx — New merge algorithm, data format in/out, memory budget, benchmarks
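
One small piece of the fan-out design can be sketched already: dividing a job's workload into near-equal chunks, one per task. The example assumes a job is sized by a total count of primary particles, which is an assumption of this sketch rather than something decided here; `split_primaries` is a hypothetical name.

```python
def split_primaries(total_primaries, n_tasks):
    """Split a particle count into n_tasks near-equal chunks that sum to the total."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    base, remainder = divmod(total_primaries, n_tasks)
    # The first `remainder` tasks each take one extra primary.
    return [base + 1 if i < remainder else base for i in range(n_tasks)]


chunks = split_primaries(1_000_003, 4)
print(chunks)  # [250001, 250001, 250001, 250000]
```

Because the chunk sizes can differ, any later merge step would need to weight each task's result by its chunk size rather than take a plain mean.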

Phase 4 — Implementation tracking

  • Create sub-issues in yaptide/yaptide for each implementation work package, linked to the corresponding ADR and design document
  • Track progress in the org-level GitHub Project
  • Update ADR statuses as implementation proceeds (proposed → accepted, or → rejected with reasoning)

Phase 5 — Retrospective

  • After the main implementation is complete, update the vision document with outcomes
  • Mark ADRs that were superseded during implementation
  • Write a retrospective on the design process itself — what worked, what to improve for the next rework (e.g. UI)

Relation to other issues

Notes on AI-assisted workflow

The research/ folder is key to making this process work with AI tools. Each session file should:

  • State the question being investigated
  • List the context files that were provided to the AI (with repo links)
  • Capture the key insights and reasoning, not raw transcripts
  • End with a clear conclusion or set of options
  • Use plain Markdown with structured headings — this makes the file itself useful as context for future AI sessions

This creates a compounding knowledge base: each new AI session can be given the vision doc + relevant context files + previous research sessions, enabling increasingly informed reasoning over time.
