Releases: Devsh-Graphics-Programming/DirectXShaderCompiler
# Nabla Path Tracer runtime compare 2026-03-28

Nabla Path Tracer runtime compare from Nsight Graphics
This directory contains one paired Nsight Graphics GPU Trace probe for Nabla Path Tracer.
Protocol:

- 1 run per variant
- same capture point: frame 1000
- same effective render path:
  - geometry: sphere
  - effective method: solid angle
- runtime numbers below come directly from Nsight Graphics exports: `FRAME.xls`, `GPUTRACE_FRAME.xls`
- measurement machine:
## Variant matrix

Legend:

```mermaid
flowchart LR
    A["master_source_off\nNabla e11b118d\n2026-03-26"] --> B["devshfixes_upstream\nNabla c13c3366\n2026-03-28"] --> C["unroll_artifact\nNabla 262a8b72\n2026-03-26"] --> D["unroll_v2\nlocal -O1experimental refresh\n2026-03-28"]
```

This report compares four checkpoints:

- `master_source_off`: current master-side baseline
- `devshfixes_upstream`: the same line after refreshing `devshFixes` with newer DXC upstream state
- `unroll_artifact`: that refreshed line plus the `unroll` PR work packaged in the published CI artifact
- `unroll_v2`: an up-to-date local measurement with the new `-O1experimental` flag after the latest `unroll`-line changes
## Runtime Probe

| Variant | GPU frame ms | Dispatch count | Compute active | SM throughput | PCIe write GB/s |
|---|---|---|---|---|---|
| master_source_off | 21.4304 | 2 | 83.2501% | 35.5388% | 2.62710 |
| devshfixes_upstream | 19.6157 | 2 | 82.9923% | 38.2916% | 2.64694 |
| unroll_artifact | 21.5935 | 2 | 83.9945% | 34.3346% | 2.62212 |
| unroll_v2 | 19.2360 | 2 | 86.9712% | 38.7311% | 2.64514 |
## Runtime deltas

| Comparison | Delta ms | Delta % |
|---|---|---|
| devshfixes_upstream vs master_source_off | -1.8147 | -8.47% |
| unroll_artifact vs master_source_off | +0.1631 | +0.76% |
| unroll_artifact vs devshfixes_upstream | +1.9778 | +10.08% |
| unroll_v2 vs master_source_off | -2.1944 | -10.24% |
| unroll_v2 vs devshfixes_upstream | -0.3797 | -1.94% |
| unroll_v2 vs unroll_artifact | -2.3575 | -10.92% |
## Cold startup Vulkan API probe

Cold-startup `vkCreateComputePipelines` was measured on the same published runnable bundles, with a cleared pipeline/shader cache and Vulkan API tracing enabled for the process.

| Variant | vkCreateComputePipelines calls | vkCreateComputePipelines start->next sum ms |
|---|---|---|
| master_source_off | 13 | 3737.11 |
| devshfixes_upstream | 21 | 3332.55 |
| unroll_artifact | 21 | 1418.86 |
| unroll_v2 | 2 | 354.707 |

TODO: recheck the vkCreateComputePipelines numbers; these metrics look wrong.
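For reference, the "start->next sum" column is the total wall time between each `vkCreateComputePipelines` call's start and the start of the next traced API event. A minimal sketch of that aggregation over an exported trace (the `(name, start_ms)` tuple shape is illustrative, not the actual Nsight export schema):

```python
# Sketch: aggregate "start->next" time for vkCreateComputePipelines from an
# API trace represented as a list of (name, start_ms) events sorted by start.
# The event format here is an assumption for illustration only.
def create_pipelines_start_to_next_ms(events: list[tuple[str, float]]) -> tuple[int, float]:
    calls = 0
    total_ms = 0.0
    # Pair each event with its successor; a trailing call with no successor
    # contributes nothing, matching a start->next definition.
    for (name, start), (_, next_start) in zip(events, events[1:]):
        if name == "vkCreateComputePipelines":
            calls += 1
            total_ms += next_start - start
    return calls, total_ms

trace = [
    ("vkCreateComputePipelines", 0.0),
    ("vkQueueSubmit", 120.5),
    ("vkCreateComputePipelines", 121.0),
    ("vkQueuePresentKHR", 300.0),
]
print(create_pipelines_start_to_next_ms(trace))  # (2, 299.5)
```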
## Main conclusion

The latest upstream refresh baseline (devshfixes_upstream) measures faster than master_source_off in this probe (-8.47%). At the same time, unroll_artifact is effectively at parity with master_source_off here at only +0.76%; the remaining gap appears only against devshfixes_upstream (+10.08%).
The up-to-date unroll_v2 follow-up, measured with the new -O1experimental flag, goes further: in this probe it is now faster than master_source_off by 10.24% on steady-state GPU frame time (19.2360 ms vs 21.4304 ms).
Taken together, the measured runtime cost points at the unroll side of the experiment, not at the generic DXC/SPIRV-Tools upstream refresh. That tradeoff is also aligned with the intent of the experiment: reduce shader build time aggressively while accepting a small runtime cost.
In practice this is also a strong argument for the new explicit -O1experimental path. For the Nabla Path Tracer builds behind this comparison the shader-build wall time is about 10x worse without -O1experimental, while the newest unroll_v2 follow-up is already faster than the current master baseline on this measured path. On this workload -O1experimental delivers the intended development tradeoff directly: a major build-time win together with favorable measured runtime.
unroll_v2 is the current local follow-up checkpoint after those latest changes. It keeps the same high-level workload shape (dispatch_count = 2) and shows where the updated -O1experimental line lands relative to the published unroll_artifact and the current master baseline.
## Deeper Nsight signals from the same exports

Frame-level exports also show:

- `dispatch_count = 2` and `gr__ctas_launched_queue_sync.sum = 14401` in all three variants
- `unroll_artifact` has lower SM throughput than `devshfixes_upstream`
- `unroll_artifact` also shows higher total executed instructions and much higher L1/LSU/shared pressure than `devshfixes_upstream`
- `unroll_v2` raises SM throughput back to 38.7311% while keeping `dispatch_count = 2`
This points at a compute-side codegen / execution-mix difference with higher L1/LSU/shared pressure on the unroll side.
## Directory map

Runtime stats:

- `master_source_off/stats.json`
- `devshfixes_upstream/stats.json`
- `unroll_artifact/stats.json`
- `unroll_v2/stats.json`
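A small Python sketch for pulling the four per-variant `stats.json` files into one comparison; the `gpu_frame_ms` key is an assumption for illustration, not the actual schema of these files:

```python
# Sketch: load the per-variant stats.json files listed above into one dict.
# Field names inside stats.json (e.g. "gpu_frame_ms") are assumed here;
# adjust to whatever the actual files contain.
import json
from pathlib import Path

VARIANTS = ["master_source_off", "devshfixes_upstream", "unroll_artifact", "unroll_v2"]

def load_stats(root: Path) -> dict[str, dict]:
    """Map variant name -> parsed stats.json for that variant."""
    return {v: json.loads((root / v / "stats.json").read_text()) for v in VARIANTS}

# Usage (from the directory containing the four variant subdirectories):
#   stats = load_stats(Path("."))
#   for v in VARIANTS:
#       print(v, stats[v].get("gpu_frame_ms"))
```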
Machine spec

Executable locations:

- `master_source_off`: `runnable/master_source_off_minimal/31_hlslpathtracer.exe`
- `devshfixes_upstream`: `runnable/devshfixes_upstream_minimal/31_hlslpathtracer.exe`
- `unroll_artifact`: `runnable/unroll_artifact_minimal/31_hlslpathtracer.exe`
- `unroll_v2`: `runnable/unroll_v2_minimal/31_hlslpathtracer_rwdi.exe`
Capture files:

- `master_source_off/run01/master_source_off_frame1000_run01.ngfx-capture`
- `devshfixes_upstream/run01/devshfixes_upstream_frame1000_run01.ngfx-capture`
- `unroll_artifact/run01/unroll_artifact_frame1000_run01.ngfx-capture`
- `unroll_v2/run01/unroll_v2_frame1000_run01.ngfx-capture`
Raw Nsight exports:

- `master_source_off/run01/gpu-trace/BASE/FRAME.xls`
- `master_source_off/run01/gpu-trace/BASE/GPUTRACE_FRAME.xls`
- [`devshfixes_upstream/run01...