[Docs] Onboarding notebooks (1/n): expr foundation#584

Open
jhinpan wants to merge 4 commits into ROCm:main from jhinpan:docs/notebooks-573-expr-foundation

Conversation

@jhinpan
Contributor

@jhinpan jhinpan commented May 28, 2026

Summary

First PR (1/n) toward the onboarding notebook series requested in #573. Rather than jumping straight to vector-add + Layout 101, this set builds the flydsl.expr foundation bottom-up and stops before layout algebra (deferred to a follow-up series), so the later layout material rests on solid primitives.

Four notebooks in examples/notebooks/:

| # | Notebook | Topic |
|---|----------|-------|
| 00 | 00_hello_flydsl | the @flyc.kernel / @flyc.jit trace model; reading dumped IR (FLYDSL_DUMP_IR) |
| 01 | 01_numeric_types | scalar type system (ints, floats, bf16/fp8), casts, promotion, Constexpr vs runtime |
| 02 | 02_struct | @fx.struct aggregate value types and their C-style memory layout |
| 03 | 03_universal_ops | target-agnostic Universal* atoms + a fully-universal vector-add capstone (validated vs torch) |
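
For readers unfamiliar with what "C-style memory layout" implies for notebook 02, here is a rough analogy using Python's standard `ctypes` module. It is not FlyDSL code, and it does not claim that `@fx.struct` follows exactly these rules; the `Pair` struct and its field names are hypothetical, picked only to make the padding visible.

```python
import ctypes


class Pair(ctypes.Structure):
    # Hypothetical fields (not from the notebooks), ordered to force padding.
    _fields_ = [
        ("flag", ctypes.c_uint8),    # 1 byte
        ("value", ctypes.c_float),   # 4 bytes, must start on a 4-byte boundary
        ("count", ctypes.c_int16),   # 2 bytes
    ]


# Byte offset of each field inside the struct.
offsets = {name: getattr(Pair, name).offset for name, _ in Pair._fields_}
size = ctypes.sizeof(Pair)        # total size, including tail padding
align = ctypes.alignment(Pair)    # alignment of the whole struct

print(offsets, size, align)
```

Three bytes of padding follow `flag` so that `value` is 4-byte aligned, and two more bytes pad the tail so arrays of `Pair` stay aligned. Whether `@fx.struct` reports the same numbers is for the notebook's layout queries to confirm.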

Emphasis throughout is arch-neutrality: the capstone moves data with UniversalCopy32b (no rocdl/CDNA-specific atoms), and the IR peek shows the !fly.universal_copy<32> op before the convert-fly-to-rocdl pass specializes it.

Notes

  • Run-verified end-to-end on an MI350X (gfx950); committed with outputs cleared for clean diffs (re-run to populate).
  • Notebooks need wurlitzer (pip install jupyter wurlitzer) to show GPU printf inline — Jupyter doesn't capture device stdout on its own. See examples/notebooks/README.md.
  • Deferred to follow-ups to keep this PR small/reviewable: nbsphinx docs rendering, an nbmake execute-check CI job, and the layout/MMA notebook series.
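
On the wurlitzer note above: the reason Jupyter misses device printf is that such output goes to the process-level file descriptor 1 rather than through Python's `sys.stdout`. The sketch below uses only the standard library to show the kind of fd-level redirection involved. It is a simplified illustration, not wurlitzer's actual implementation, and `low_level_print` is a hypothetical stand-in for a GPU printf.

```python
import os
import sys
import tempfile


def capture_fd1(fn):
    """Run fn() and return everything written to OS-level file descriptor 1."""
    sys.stdout.flush()                 # do not mix in buffered Python output
    saved = os.dup(1)                  # remember the real stdout
    with tempfile.TemporaryFile() as tmp:
        os.dup2(tmp.fileno(), 1)       # point fd 1 at the temp file
        try:
            fn()
        finally:
            os.dup2(saved, 1)          # always restore the real stdout
            os.close(saved)
        tmp.seek(0)
        return tmp.read().decode()


def low_level_print():
    # Hypothetical stand-in for device printf: bypasses sys.stdout entirely.
    os.write(1, b"hello from fd 1\n")


captured = capture_fd1(low_level_print)
print(repr(captured))
```

In the notebooks themselves the README's wurlitzer setup is the thing to use; this only shows why replacing `sys.stdout` alone would not be enough.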

Test plan

  • All four notebooks execute top-to-bottom with no errors on gfx950
  • Capstone matches torch (torch.allclose)
  • Reviewer: confirm location (examples/notebooks/) and whether to wire nbmake CI now or in the follow-up
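
For reviewers who want to see what the `torch.allclose` check in the test plan amounts to, here is a dependency-free sketch of the same criterion, `|actual - expected| <= atol + rtol * |expected|`, with `torch.allclose`'s default tolerances. The input data is made up, and `result` is a stand-in for the kernel output rather than anything read back from a GPU.

```python
def allclose(actual, expected, rtol=1e-5, atol=1e-8):
    """Element-wise closeness check using the same formula as torch.allclose."""
    return len(actual) == len(expected) and all(
        abs(a - b) <= atol + rtol * abs(b) for a, b in zip(actual, expected)
    )


# Made-up inputs for a vector add.
a = [0.5 * i for i in range(8)]
b = [2.0 - 0.25 * i for i in range(8)]
reference = [x + y for x, y in zip(a, b)]

result = list(reference)          # stand-in for the values the kernel wrote to C
matches = allclose(result, reference)

perturbed = list(result)
perturbed[3] += 1e-3              # an error well outside the tolerance
still_matches = allclose(perturbed, reference)

print(matches, still_matches)
```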

Refs #573.

🤖 Generated with Claude Code

jhinpan and others added 3 commits May 28, 2026 07:41
Interactive, bottom-up onboarding notebooks for the flydsl.expr foundation,
bridging a newcomer to the existing examples/. This first set (1/n) covers:

- 00_hello_flydsl     - the @flyc.kernel / @flyc.jit trace model; reading dumped IR
- 01_numeric_types    - the scalar type system (ints, floats, bf16/fp8), casts,
                        promotion, and Constexpr vs runtime values
- 02_struct           - @fx.struct aggregate value types and their memory layout
- 03_universal_ops    - the target-agnostic Universal* atoms, with a fully-universal
                        vector-add capstone validated against torch

Layout algebra (make_layout / logical_divide / tiled copy / MMA) is intentionally
deferred to a follow-up series. All cells were run-verified on an MI350X (gfx950)
and committed with outputs cleared. A short README indexes the series.

Refs ROCm#573.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rity)

Sharpen the series for agent consumers (fast ramp, fewer source lookups):

- README: add a flydsl.expr API cheat-sheet (kernel/jit/launch, scalars, structs,
  copy atoms + register tensors) and the three printf/wurlitzer/Constexpr gotchas,
  so the whole foundation is reachable in one place.
- 03_universal_ops:
  - explain the host->device tensor handoff (raw torch tensor vs from_dlpack +
    mark_layout_dynamic) and annotate the jit C param as fx.Tensor for consistency;
  - bridge nb01's '+' operator to nb03's register-tensor
    memref_load_vec / arith.addf / memref_store_vec compute;
  - stop calling UniversalFMA 'the MMA atom' (it has no real usage; MMA lowers via
    rocdl.MFMA) in the atom-family list and the Recap;
  - fix the pass name convert_fly_to_rocdl -> convert-fly-to-rocdl.

Re-ran all four notebooks on a current gfx950 build: 00/01/02/03 clean, vadd matches
torch, IR shows universal_copy<32>. Outputs committed cleared.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jhinpan jhinpan marked this pull request as ready for review May 31, 2026 04:29
Copilot AI review requested due to automatic review settings May 31, 2026 04:29
@jhinpan
Contributor Author

jhinpan commented May 31, 2026

@coderfeli ready for review when you have a moment. PR 1/n of the expr-foundation onboarding series (#573): four notebooks (hello / numeric types / struct / universal ops + a vadd capstone), all re-run clean on gfx950 with outputs committed cleared.

Since the series increasingly doubles as agent-onboarding material, this last push adds a one-stop flydsl.expr API cheat-sheet to the README and tightens nb03 in four places: it explains the host→device tensor handoff (from_dlpack/mark_layout_dynamic), bridges nb01's + operator to the memref_load_vec/arith.addf register-tensor compute, drops the UniversalFMA-as-MMA framing in favor of rocdl.MFMA, and fixes the convert-fly-to-rocdl pass name. Happy to adjust.

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a set of onboarding Jupyter notebooks and a companion README to introduce the flydsl.expr foundations and how to inspect generated IR.

Changes:

  • Add examples/notebooks/README.md index + API cheat-sheet + run instructions.
  • Add four onboarding notebooks covering kernel/jit basics, numeric types, @fx.struct, and Universal* atoms with a vector-add capstone.
  • Include IR-dumping walkthroughs via FLYDSL_DUMP_IR / FLYDSL_DUMP_DIR.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| examples/notebooks/README.md | Adds onboarding index, cheat-sheet, and execution instructions for the notebook series |
| examples/notebooks/00_hello_flydsl.ipynb | Introduces @flyc.kernel / @flyc.jit, GPU printf capture, and IR dumping |
| examples/notebooks/01_numeric_types.ipynb | Documents scalar type families, arithmetic, casts/promotions, and Constexpr |
| examples/notebooks/02_struct.ipynb | Introduces @fx.struct, layout queries, .replace(), and shared-memory struct use |
| examples/notebooks/03_universal_ops.ipynb | Explains universal atoms and builds a universal vector-add example + IR inspection |

Comment on lines +10 to +15
| # | Notebook | Topic |
|---|----------|-------|
| 00 | [`00_hello_flydsl.ipynb`](00_hello_flydsl.ipynb) | the `@flyc.kernel` / `@flyc.jit` model; reading dumped IR |
| 01 | [`01_numeric_types.ipynb`](01_numeric_types.ipynb) | scalar types: ints, floats, `bf16`/`fp8`, casts, `Constexpr` |
| 02 | [`02_struct.ipynb`](02_struct.ipynb) | `@fx.struct` aggregate value types and their memory layout |
| 03 | [`03_universal_ops.ipynb`](03_universal_ops.ipynb) | target-agnostic `Universal*` atoms + a vector-add capstone |
# Kernel + launch (00)
@flyc.kernel # device kernel; the body is traced to MLIR
@flyc.jit # host launch wrapper
kernel(args).launch(grid=(gx, 1, 1), block=[bx, 1, 1], stream=stream)
Comment on lines +156 to +160
"@flyc.jit\n",
"def vadd(A: fx.Tensor, B: fx.Tensor, C: fx.Tensor, n: fx.Int32, stream: fx.Stream = fx.Stream(None)):\n",
" block_dim = 64\n",
" grid_x = (n + block_dim - 1) // block_dim\n",
" vadd_kernel(A, B, C, block_dim).launch(grid=(grid_x, 1, 1), block=[block_dim, 1, 1], stream=stream)\n",
Comment on lines +210 to +212
"dump_dir = tempfile.mkdtemp(prefix=\"flydsl_ir_\")\n",
"os.environ[\"FLYDSL_DUMP_IR\"] = \"1\"\n",
"os.environ[\"FLYDSL_DUMP_DIR\"] = dump_dir\n",
" add_one(fx.Int32(41), stream=torch.cuda.Stream())\n",
" torch.cuda.synchronize()\n",
"\n",
"os.environ.pop(\"FLYDSL_DUMP_IR\", None) # stop dumping for the rest of the notebook\n",
Comment on lines +205 to +206
"origin = sorted(glob.glob(os.path.join(dump_dir, \"*\", \"00_origin.mlir\")))[0]\n",
"atom_lines = [ln.strip() for ln in open(origin).read().splitlines() if \"copy_atom\" in ln]\n",
Comment on lines +28 to +33
"execution": {
"iopub.execute_input": "2026-05-28T07:39:35.911107Z",
"iopub.status.busy": "2026-05-28T07:39:35.911005Z",
"iopub.status.idle": "2026-05-28T07:39:36.638621Z",
"shell.execute_reply": "2026-05-28T07:39:36.637986Z"
},
…c metadata)

Acted on the substantive Copilot comments:
- Close the IR-dump file reads with `with open(...)` in nb00 and nb03 (C6).
- Pop both FLYDSL_DUMP_IR and FLYDSL_DUMP_DIR after the dump cell so the env we set
  doesn't linger for later cells (C4/C5) -- without the suggested try/finally, which
  would be defensive noise for a teaching cell.
- Strip per-cell `metadata.execution` timestamps from all four notebooks so re-running
  doesn't churn the diff (C7); matches the outputs-cleared convention.
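
The first two bullets above can be sketched without a GPU. In the snippet below the kernel launch is replaced by writing a placeholder file, so the `fake_kernel` directory name and the file contents are hypothetical; only the environment-variable names, the `00_origin.mlir` glob, and the `copy_atom` filter come from the notebook cells quoted in the review.

```python
import glob
import os
import tempfile

dump_dir = tempfile.mkdtemp(prefix="flydsl_ir_")
os.environ["FLYDSL_DUMP_IR"] = "1"
os.environ["FLYDSL_DUMP_DIR"] = dump_dir

# Stand-in for the kernel launch: a real run would write the dump itself.
# The subdirectory name and file contents here are made up for the sketch.
kernel_dir = os.path.join(dump_dir, "fake_kernel")
os.makedirs(kernel_dir)
with open(os.path.join(kernel_dir, "00_origin.mlir"), "w") as f:
    f.write("// placeholder line mentioning copy_atom\n// unrelated line\n")

# Pop both variables so later cells neither dump nor see a stale directory.
os.environ.pop("FLYDSL_DUMP_IR", None)
os.environ.pop("FLYDSL_DUMP_DIR", None)

# Read the dump inside a `with` block so the file handle is closed promptly.
origin = sorted(glob.glob(os.path.join(dump_dir, "*", "00_origin.mlir")))[0]
with open(origin) as f:
    atom_lines = [ln.strip() for ln in f.read().splitlines() if "copy_atom" in ln]

print(atom_lines)
```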

Did not act on the false positives: the README table uses single-pipe rows (not
'||'); `block=[...]` and a dynamic `n: fx.Int32` launcher arg both match the
canonical examples/01-vectorAdd.py and run clean on gfx950.

Re-ran all four on a current build: 00/01/02/03 clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jhinpan
Contributor Author

jhinpan commented May 31, 2026

Thanks @copilot — went through all 7. Pushed c8754c70 with the substantive ones:

  • Close file handles (nb00, nb03 IR reads) → with open(...). ✅
  • Dump-dir env leak → pop both FLYDSL_DUMP_IR and FLYDSL_DUMP_DIR. ✅ (Skipped the try/finally — it's defensive noise for a teaching cell, and the leak was benign anyway since dumping is gated on DUMP_IR.)
  • Execution-timestamp metadata → stripped from all four notebooks so re-runs don't churn the diff. ✅ (Matches the outputs-cleared convention; a repo-wide nbstripout pre-commit hook would be the durable fix — happy to do that as a separate infra PR.)
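
As a rough picture of the timestamp stripping described in the last bullet, here is a standard-library sketch that removes the per-cell `metadata.execution` block from a notebook loaded as a dict. The helper name `strip_execution_metadata` and the sample notebook are hypothetical; this is not the script used for the commit, and it is not a substitute for nbstripout.

```python
import json


def strip_execution_metadata(nb):
    """Drop metadata['execution'] from every cell; return how many cells changed."""
    changed = 0
    for cell in nb.get("cells", []):
        if cell.get("metadata", {}).pop("execution", None) is not None:
            changed += 1
    return changed


# Hypothetical two-cell notebook; in practice this would come from json.load().
nb = {
    "cells": [
        {
            "cell_type": "code",
            "metadata": {
                "tags": ["keep-me"],
                "execution": {
                    "iopub.status.busy": "2026-05-28T07:39:35.911005Z",
                    "iopub.status.idle": "2026-05-28T07:39:36.638621Z",
                },
            },
            "source": ["print('hi')\n"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
    ]
}

changed = strip_execution_metadata(nb)
print(changed, json.dumps(nb["cells"][0]["metadata"]))
```

Other metadata keys are left alone, so only the timestamps stop showing up in diffs.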

The other three I'm leaving as-is — they don't reproduce:

  • README || table: the rows are single-pipe (| 00 | … |); renders fine.
  • block=[bx, 1, 1] / dynamic n: fx.Int32: both mirror examples/01-vectorAdd.py (grid=(…), block=[…], n: fx.Int32 then grid_x = (n + block_dim - 1) // block_dim) — .launch takes a list, and the dynamic-i32 launcher arg is the canonical idiom. All four notebooks re-run clean on gfx950.
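
For anyone checking the `grid_x` expression quoted above, this sketch runs the same ceiling-division arithmetic on plain Python integers. In the notebook `n` is a traced `fx.Int32` rather than a Python int, so this only illustrates the arithmetic; the helper name `ceil_div` is hypothetical.

```python
def ceil_div(n, block_dim):
    """Smallest number of blocks of size block_dim that covers n elements."""
    return (n + block_dim - 1) // block_dim


block_dim = 64
grid_sizes = {n: ceil_div(n, block_dim) for n in (1, 64, 65, 1000)}

for n, grid_x in grid_sizes.items():
    # The grid covers every element, and one block fewer would not.
    assert grid_x * block_dim >= n
    assert (grid_x - 1) * block_dim < n

print(grid_sizes)
```

Because the last block can be partially filled (for example 1000 elements need 16 blocks, or 1024 threads), a kernel launched this way normally needs a bounds check on the element index.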
