Effort: ~1 week · Expected gain: Removes 68 CPU draw calls, enables GPU-side cull · Status: deferred
The scene graph issues one CPU draw call per mesh (scene.render() loops
over every node and calls pass.draw_indexed(...)). On Sponza that's 68
draws per frame per pass — shadow pass does it 3× (once per cascade),
depth prepass (ticket 005) will do it once, main_hdr once. ~340 CPU draw
calls per frame.
The CPU-side wins that have already landed (uniform pool, frustum cull) cut
most of the per-draw overhead, but we still issue 340 set_bind_group calls
per frame. GPU-driven rendering collapses this to one
draw_indexed_indirect_count call per pass: the GPU does the culling and
dispatches its own draws.
- One shared vertex buffer + one shared index buffer for all scene
  geometry. On mesh upload, append vertices/indices into the shared buffers
  and record (vertex_offset, index_offset, index_count) per mesh.
- Per-draw descriptor buffer (storage buffer): one struct per mesh
  containing { transform, material_idx, aabb, index_offset, index_count, vertex_offset }.
  Updated from the scene graph in prepare().
- GPU cull compute pass: dispatch one thread per mesh. Each thread
  tests its mesh's AABB against the frustum (using the same
  extract_frustum_planes logic we use on the CPU today). Surviving draws
  append to an indirect-draw buffer via an atomic counter.
- Single draw_indexed_indirect_count call in the scene render pass. GPU
  reads the indirect buffer, dispatches each surviving mesh.
- Material data lives in a storage buffer indexed by material_idx,
  fetched per-draw in the vertex or fragment shader.
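The cull-and-compact step above can be sketched on the CPU as a reference implementation, which is also handy for validating the compute shader's output. This is a minimal sketch under stated assumptions, not the planned shader: the names `MeshDescriptor`, `DrawIndexedIndirectArgs`, and `cull_reference` are illustrative, the AABB is assumed to be already in world space (so `transform` is left out), planes are assumed to point inward, and `first_instance` is assumed to carry the descriptor index so the shaders can look up per-draw data. The argument struct follows the standard indexed-indirect layout (index_count, instance_count, first_index, base_vertex, first_instance).

```rust
/// Axis-aligned bounding box, assumed here to be in world space.
#[derive(Clone, Copy, Debug)]
struct Aabb {
    min: [f32; 3],
    max: [f32; 3],
}

/// Plane as (nx, ny, nz, d); a point p is inside when dot(n, p) + d >= 0.
type Plane = [f32; 4];

/// Subset of the per-draw descriptor from the list above (hypothetical layout).
struct MeshDescriptor {
    material_idx: u32,
    aabb: Aabb,
    index_offset: u32,
    index_count: u32,
    vertex_offset: u32,
}

/// One entry of the indirect-draw buffer (standard indexed-indirect layout).
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct DrawIndexedIndirectArgs {
    index_count: u32,
    instance_count: u32,
    first_index: u32,
    base_vertex: i32,
    first_instance: u32,
}

/// True when the whole box lies on the negative side of the plane.
fn aabb_outside_plane(aabb: &Aabb, p: &Plane) -> bool {
    // Test only the corner furthest along the plane normal.
    let x = if p[0] >= 0.0 { aabb.max[0] } else { aabb.min[0] };
    let y = if p[1] >= 0.0 { aabb.max[1] } else { aabb.min[1] };
    let z = if p[2] >= 0.0 { aabb.max[2] } else { aabb.min[2] };
    p[0] * x + p[1] * y + p[2] * z + p[3] < 0.0
}

/// CPU stand-in for the cull pass: each loop iteration plays the role of one
/// GPU thread, and the Vec push replaces the atomic-counter append.
fn cull_reference(meshes: &[MeshDescriptor], planes: &[Plane; 6]) -> Vec<DrawIndexedIndirectArgs> {
    let mut draws = Vec::new();
    for (i, m) in meshes.iter().enumerate() {
        if planes.iter().any(|p| aabb_outside_plane(&m.aabb, p)) {
            continue; // fully outside at least one plane: culled
        }
        draws.push(DrawIndexedIndirectArgs {
            index_count: m.index_count,
            instance_count: 1,
            first_index: m.index_offset,
            base_vertex: m.vertex_offset as i32,
            // Assumption: the descriptor index travels in first_instance.
            first_instance: i as u32,
        });
    }
    draws
}
```

On the GPU the output order depends on which thread wins the atomic increment, so a comparison against this reference should treat the surviving draws as a set rather than a sequence. The test is conservative: a box that straddles a plane is kept.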
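wgpu 24 supports draw_indexed_indirect in core. The count variant (named
multi_draw_indexed_indirect_count in wgpu's RenderPass API) is gated behind
Features::MULTI_DRAW_INDIRECT_COUNT, and a non-zero first_instance in
indirect arguments requires Features::INDIRECT_FIRST_INSTANCE. Check adapter
support for both at device creation.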
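A possible shape for that check, kept independent of wgpu so it can be unit-tested: the two booleans stand in for whatever the adapter's feature query reports for the two flags named above. `IndirectCaps`, `DrawPath`, and `select_draw_path` are hypothetical names, and requiring both flags is an assumption that holds if the descriptor index is passed through first_instance.

```rust
/// Hypothetical capability summary, filled in once at device creation from
/// the adapter's reported features.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct IndirectCaps {
    multi_draw_indirect_count: bool,
    indirect_first_instance: bool,
}

/// Which code path the scene render pass should record.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DrawPath {
    /// GPU cull pass plus one indirect-count call per pass.
    GpuDriven,
    /// Existing behaviour: CPU frustum cull and one draw_indexed per mesh.
    CpuLoop,
}

/// Pick the GPU-driven path only when every feature it relies on is present;
/// otherwise keep the current CPU loop as the fallback.
fn select_draw_path(caps: IndirectCaps) -> DrawPath {
    if caps.multi_draw_indirect_count && caps.indirect_first_instance {
        DrawPath::GpuDriven
    } else {
        DrawPath::CpuLoop
    }
}
```

Deciding once and storing the result keeps the per-frame code free of feature checks, and it leaves the existing CPU loop in place as the fallback path.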
- "GPU-Driven Rendering" (Haar & Aaltonen, SIGGRAPH 2015): the Ubisoft talk that kicked off the modern approach.
- UE5 Nanite's "cluster-based" variant: each cluster of triangles is a separate cull unit.
- NVIDIA GameWorks samples have a clean indirect-multi-draw demo.
- Sponza main_hdr pass CPU time drops from ~700 µs to < 100 µs (measured via profiler's main_hdr_pass CPU phase).
- Frustum culling ratio (surviving draws / total meshes) logged per frame and reasonable (e.g. 30-70% culled on typical Sponza camera poses).
- Correctness: SSIM ≥ 0.99 vs baseline.
- Doesn't break on meshes that use different materials (material index is part of the descriptor).
- Graceful fallback when the adapter doesn't support multi-draw-indirect (write a TODO to handle — M1 Metal supports it).
- This is a separate win from depth prepass (ticket 005), and the two compose well: prepass and main pass can share the same indirect draw buffer.
- Skinned meshes need their joint matrices fetched per-draw — extend the descriptor or keep skinning in a separate pass.
- Biggest risk: materials need to be bound globally (bindless textures) or the fragment shader still needs per-material bind group switches. On wgpu/Metal, bindless is limited — may need a texture-array trick.
- native/shared/src/renderer.rs: shared VB/IB, descriptor buffer, GPU cull compute shader, new render pass using draw_indexed_indirect_count.
- native/shared/src/scene.rs: reworking of per-node GPU resources.
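The shared VB/IB bookkeeping listed for renderer.rs could look roughly like the sketch below. It models the two GPU buffers as plain Vecs so the offset arithmetic is visible; `GeometryArena`, `MeshRange`, and `upload` are hypothetical names, and a bare position triple stands in for the real vertex format.

```rust
/// Where one mesh lives inside the shared buffers, in elements (not bytes).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct MeshRange {
    vertex_offset: u32,
    index_offset: u32,
    index_count: u32,
}

/// CPU-side model of the shared vertex and index buffers.
#[derive(Default)]
struct GeometryArena {
    vertices: Vec<[f32; 3]>, // stand-in for the real vertex layout
    indices: Vec<u32>,
}

impl GeometryArena {
    /// Append one mesh and return the range to store in its descriptor.
    fn upload(&mut self, vertices: &[[f32; 3]], indices: &[u32]) -> MeshRange {
        let range = MeshRange {
            vertex_offset: self.vertices.len() as u32,
            index_offset: self.indices.len() as u32,
            index_count: indices.len() as u32,
        };
        self.vertices.extend_from_slice(vertices);
        // Indices stay mesh-local; the draw's base_vertex (vertex_offset)
        // rebases them, so no per-index patching is needed on upload.
        self.indices.extend_from_slice(indices);
        range
    }
}
```

Keeping indices mesh-local means a mesh can be moved within the vertex buffer by updating only its recorded vertex_offset. A real implementation would also need a growth or reallocation policy for the GPU buffers, which this sketch leaves out.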
Pure CPU-side optimization: removes ~340 CPU draw calls/frame on Sponza. But the perf README's own rule of thumb applies — Sponza is GPU-bound, not CPU-bound. The prior CPU-side wins (uniform pool, frustum cull, matrix-inverse cache from commit 95da6af) already cut render-total CPU to ~4 ms against a 16.7 ms vsync budget. Shaving another ~600 µs of CPU via draw-call collapsing won't move FPS on the current benchmark — we'd be optimizing a resource we already have in surplus.
Reopen when:
- A CPU-bound scene arrives: 10 000+ mesh count, many small
  static props, or CPU-expensive per-frame state updates that push
  render_total CPU past the vsync budget.
- Ticket 008 (visibility buffer) starts. 008's shading pass needs a shared vertex/index buffer + per-mesh descriptor buffer, exactly what this ticket builds. If 008 reopens, this ticket is a hard prerequisite and should land first.
- Bindless texture support lands in wgpu. The current "one
  set_bind_group per draw" pattern is partly about per-material texture binds. With bindless, indirect multi-draw becomes a straightforward win without the material-binding workarounds the ticket's "Notes for the implementer" describes.
Estimated effort when reopening: ~1 week for the baseline
draw_indexed_indirect_count path with GPU frustum cull. Material
indirection still requires either bindless (not widely supported in
wgpu 29) or a texture-array trick — that's where the ticket's risk
sits, and why it's scoped at "week" not "days."