Problem
After implementing emissive geometry sampling (#41), render performance has degraded significantly:
| Resolution | Before Emissives | After Emissives |
|---|---|---|
| 4K @ 200spp | ~20s | ~80s (estimated) |
| 1080p @ 200spp | ~5s | ~20s |
Root cause: Each path bounce now shoots an additional shadow ray to a sampled emissive triangle, effectively doubling BVH traversal cost per bounce.
Proposed Optimizations
1. Progressive Rendering Loop (High Priority)
Current implementation:
// Single kernel launch: all 200 samples per pixel are computed in one go
render_kernel<<<grid, block>>>(..., /*samplesPerPixel=*/200, ...);
Problems:
- GPU timeout risk on long renders
- No progress feedback
- No preview capability
- curand_init() called per pixel (expensive, ~1000 cycles)
Proposed implementation:
// Multiple kernel launches: 1 sample per launch, accumulated into the framebuffer
for (int sample = 0; sample < totalSamples; sample++) {
    render_kernel<<<grid, block>>>(..., /*sampleIndex=*/sample);
    // Optional: progress callback, preview update.
    // Launches are asynchronous: synchronize first if progress must reflect completed work.
    if (sample % 10 == 0) {
        reportProgress(sample, totalSamples);
    }
}
// Finalize: divide accumulated buffer by sample count
finalize_kernel<<<grid, block>>>(framebuffer, imageSize, 1.0f / totalSamples);
Benefits:
- No GPU timeout (each launch is fast)
- Real-time progress feedback
- Can display progressive preview during render
- Enables future adaptive sampling
- Remove curand_init() entirely - use fungt::RNG seeded by sampleIndex
2. Shadow Ray Early Exit (High Priority)
Current: traceRayBVH() finds closest hit - unnecessary for shadow rays.
Proposed: Add traceShadowRayBVH() that returns on ANY hit:
fgt_device_gpu bool traceShadowRayBVH(
    const fungt::Ray& ray,
    const Triangle* tris,
    const BVHNode* bvhNodes,
    int numNodes,
    float maxDist)
{
    int stack[64];
    int stackPtr = 0;
    stack[stackPtr++] = 0;
    while (stackPtr > 0) {
        int nodeIdx = stack[--stackPtr];
        const BVHNode& node = bvhNodes[nodeIdx];
        if (!Intersection::intersectAABB(ray, node.m_boundingBox, 0.001f, maxDist))
            continue;
        if (node.isLeaf()) {
            for (int i = 0; i < node.triCount; i++) {
                int triIdx = node.firstTriIdx + i;
                HitData temp;
                if (Intersection::MollerTrumbore(ray, tris[triIdx], 0.001f, maxDist, temp)) {
                    return true; // EARLY EXIT
                }
            }
        } else {
            stack[stackPtr++] = node.leftChild;
            stack[stackPtr++] = node.rightChild;
        }
    }
    return false;
}
Expected impact: 30-50% faster shadow ray tests.
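The difference between the two query types can be illustrated without a BVH. The host-side C++ sketch below is only an illustration: `Candidate`, `QueryResult`, `closestHit`, and `anyHit` are hypothetical names, and each candidate simply carries a precomputed hit distance along the ray. A closest-hit query has to test every candidate, while an any-hit query stops at the first one closer than the light.

```cpp
#include <vector>

// Hypothetical stand-in for a primitive: the distance at which the ray hits it.
struct Candidate { float t; };

struct QueryResult {
    bool hit;           // was anything hit inside (tMin, maxDist)?
    float t;            // hit distance (closest, or first found)
    int testsPerformed; // number of primitive tests executed
};

// Closest-hit: every candidate must be tested to find the smallest t.
QueryResult closestHit(const std::vector<Candidate>& candidates, float maxDist) {
    QueryResult result{false, maxDist, 0};
    for (const Candidate& c : candidates) {
        ++result.testsPerformed;
        if (c.t > 0.001f && c.t < result.t) {
            result.hit = true;
            result.t = c.t;
        }
    }
    return result;
}

// Any-hit (shadow ray): return as soon as one occluder lies before maxDist.
QueryResult anyHit(const std::vector<Candidate>& candidates, float maxDist) {
    QueryResult result{false, maxDist, 0};
    for (const Candidate& c : candidates) {
        ++result.testsPerformed;
        if (c.t > 0.001f && c.t < maxDist) {
            result.hit = true;
            result.t = c.t;
            return result; // early exit: visibility is already decided
        }
    }
    return result;
}
```

Both queries always agree on whether the light is occluded; only the amount of work differs, and the saving disappears for unoccluded rays, which still visit every candidate.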
3. Limit NEE to First N Bounces (Medium Priority)
Emissive contribution after bounce 2-3 is negligible for most scenes.
// Only do NEE on first 3 bounces
if (numOfEmissiveTris > 0 && bounce < 3) {
    // NEE sampling code
}
Expected impact: ~40% reduction in shadow rays.
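How much this saves depends on how long paths actually are. The issue does not state the maximum path depth, so the arithmetic below is a sketch under the assumption of one shadow ray per NEE bounce; `shadowRaysPerPath` and `shadowRayReduction` are hypothetical helper names. A path of 5 bounces with NEE limited to the first 3 gives a reduction of 40%, which would match the estimate above.

```cpp
#include <algorithm>

// Shadow rays cast along one path when NEE runs only while bounce < neeBounceLimit.
// Assumes exactly one shadow ray per NEE bounce.
int shadowRaysPerPath(int pathBounces, int neeBounceLimit) {
    return std::min(pathBounces, neeBounceLimit);
}

// Fraction of shadow rays removed compared to doing NEE on every bounce.
double shadowRayReduction(int pathBounces, int neeBounceLimit) {
    if (pathBounces <= 0) {
        return 0.0; // no bounces, nothing to save
    }
    const int kept = shadowRaysPerPath(pathBounces, neeBounceLimit);
    return 1.0 - static_cast<double>(kept) / static_cast<double>(pathBounces);
}
```

Paths that terminate within the first three bounces see no saving at all, so the measured reduction will vary with the scene.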
4. Remove Unused curandState (Quick Win)
Current kernel initializes both RNGs but only uses one:
fungt::RNG rng(idx * 1337ULL + 123ULL); // Used
curandState randomState;
curand_init(seed + idx, 0, 0, &randomState); // NOT USED - 1000 cycles wasted
Remove curandState entirely and update pathTracer_CookTorrance signature.
5. Block Size Tuning (Quick Win)
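Path tracing has divergent branching, so warps within a block finish at very different times. Block size does not change divergence inside a 32-thread warp, but smaller blocks free their SM resources sooner instead of waiting on the slowest warp, which may improve load balancing. Benchmark both configurations: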
// Current
dim3 block(16, 16); // 256 threads
// Proposed
dim3 block(8, 8); // 64 threads - test performance
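Changing the block size also changes the grid needed to cover the image. A minimal host-side sketch of the ceiling-division calculation follows; `Dim2`, `ceilDiv`, and `gridFor` are hypothetical names standing in for whatever the renderer already uses with `dim3`.

```cpp
// Hypothetical 2D extent, mirroring the x/y fields of CUDA's dim3.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer division rounded up, so partial tiles at the image edge get a block.
unsigned ceilDiv(unsigned n, unsigned d) {
    return (n + d - 1) / d;
}

// Number of blocks per axis needed to cover a width x height image.
Dim2 gridFor(unsigned width, unsigned height, Dim2 block) {
    return Dim2{ceilDiv(width, block.x), ceilDiv(height, block.y)};
}
```

At 1080p, a 16x16 block does not divide the height evenly, so the kernel still needs its bounds check on the pixel coordinates for the last row of blocks.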
6. Shared Memory for Light Data (Low Priority)
Cache emissive triangle indices in shared memory:
__shared__ int sharedEmissive[64];
int tid = threadIdx.y * blockDim.x + threadIdx.x;
if (tid < numOfEmissiveTris && tid < 64) {
    sharedEmissive[tid] = emissiveTris[tid];
}
__syncthreads();
Expected impact: Minimal (5-10%) - emissive list is small and likely L2 cached.
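Tasks
- Progressive rendering loop in RenderScene()
- finalize_kernel() for GPU-side division
- traceShadowRayBVH() with early exit
- Remove curandState from kernel and path tracer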
Acceptance Criteria
Technical Notes
Progressive Accumulation Math
Each kernel launch adds one sample:
framebuffer[idx] += sample_contribution
Final normalization:
framebuffer[idx] /= totalSamples
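The two steps above can be sketched on the host in plain C++. This is only an illustration of the accumulation math, not the kernel code: `accumulatePass`, `finalize`, and `renderProgressive` are hypothetical names, and `samplePixel` stands in for whatever one launch of `render_kernel` computes per pixel. Note that the buffer must start at zero for the final division to yield the mean.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for one render_kernel launch: adds one sample to every pixel.
template <typename SampleFn>
void accumulatePass(std::vector<float>& framebuffer, int sampleIndex, SampleFn samplePixel) {
    for (std::size_t idx = 0; idx < framebuffer.size(); ++idx) {
        framebuffer[idx] += samplePixel(idx, sampleIndex);
    }
}

// Stand-in for finalize_kernel: multiply by 1 / totalSamples.
void finalize(std::vector<float>& framebuffer, int totalSamples) {
    const float invSamples = 1.0f / static_cast<float>(totalSamples);
    for (float& value : framebuffer) {
        value *= invSamples;
    }
}

// Progressive loop: zero-initialized buffer, one pass per sample, then normalize.
template <typename SampleFn>
std::vector<float> renderProgressive(std::size_t pixelCount, int totalSamples, SampleFn samplePixel) {
    std::vector<float> framebuffer(pixelCount, 0.0f);
    for (int sample = 0; sample < totalSamples; ++sample) {
        accumulatePass(framebuffer, sample, samplePixel);
    }
    finalize(framebuffer, totalSamples);
    return framebuffer;
}
```

A preview taken after k completed passes would divide a copy of the buffer by k rather than by totalSamples, leaving the accumulated sums untouched.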
RNG Seeding for Progressive Rendering
Each sample needs unique randomness:
fungt::RNG rng(pixelIndex * 1337ULL + sampleIndex * 7919ULL);
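Distinct multipliers decorrelate the pixel and sample dimensions, but they do not guarantee unique seeds, and 1337 = 7 × 191 is not prime. For example, pixel 7919 at sample 0 and pixel 0 at sample 1337 produce the same value, so the seed should be scrambled or built from a unique index if fungt::RNG does not already do so.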
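One way to get a distinct, well-scrambled seed per (pixel, sample) pair is to build a unique linear index and pass it through a bijective mixer. The host-side sketch below uses the SplitMix64 finalizer for that. `mix64` and `makeSeed` are hypothetical names, `linearSeed` reproduces the formula above for comparison, and it is assumed that `fungt::RNG` accepts a 64-bit seed as in the snippet above. In the kernel these functions would additionally need the project's device qualifier.

```cpp
#include <cstdint>

// SplitMix64 step: add the golden-ratio increment, then apply the finalizer.
// Every operation is invertible, so distinct inputs give distinct outputs.
std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// The seed formula from the issue, kept for comparison.
std::uint64_t linearSeed(std::uint64_t pixelIndex, std::uint64_t sampleIndex) {
    return pixelIndex * 1337ULL + sampleIndex * 7919ULL;
}

// Unique per (pixel, sample) as long as sampleIndex < totalSamples and the
// product does not overflow 64 bits; the mixer then scrambles neighbouring indices.
std::uint64_t makeSeed(std::uint64_t pixelIndex, std::uint64_t sampleIndex,
                       std::uint64_t totalSamples) {
    return mix64(pixelIndex * totalSamples + sampleIndex);
}

// Assumed usage inside the kernel:
//   fungt::RNG rng(makeSeed(pixelIndex, sampleIndex, totalSamples));
```

Whether the extra mixing matters in practice depends on how `fungt::RNG` treats its seed internally, which the issue does not describe.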
Related
Labels
enhancement performance path-tracer cuda