perf(arena3): inline size_class_index_for in allocation hot path #40

mrhapile wants to merge 2 commits into boa-dev:main
Conversation
Root cause analysis showed that try_alloc() performed a function call to size_class_index_for on every allocation. Inlining the function allows LLVM to constant-fold the size class for monomorphized types, eliminating the call overhead.

Criterion benchmarks:

- Pure bump allocation: before ~4.33 µs, after ~3.34 µs (~6–10% improvement)
- Bulk allocation: no statistically significant change
- Free-list reuse: no statistically significant change

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
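To illustrate why inlining enables constant folding here, below is a hedged sketch of a size-class helper. The body, the power-of-two class layout, and the 16-byte minimum are assumptions for illustration, not the actual arena3 code; only the name `size_class_index_for`, the `#[inline(always)]` attribute, and the `try_alloc` caller come from this PR.

```rust
// Hypothetical sketch of a size-class lookup. The real arena3
// implementation may use a different class layout entirely.
#[inline(always)]
fn size_class_index_for(size: usize) -> usize {
    // Round up to the next power of two (minimum 16 bytes),
    // then map 16 -> 0, 32 -> 1, 64 -> 2, ...
    let clamped = size.max(16);
    (clamped.next_power_of_two().trailing_zeros() - 4) as usize
}

fn main() {
    assert_eq!(size_class_index_for(8), 0);
    assert_eq!(size_class_index_for(16), 0);
    assert_eq!(size_class_index_for(17), 1);
    assert_eq!(size_class_index_for(64), 2);
}
```

Because a monomorphized `try_alloc::<T>` passes the compile-time constant `size_of::<T>()`, forcing the helper inline lets LLVM fold the whole index computation down to a constant at each call site, which is the effect the PR description claims.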
Pull request overview
Optimizes arena3 allocation hot path by forcing inlining of the size-class lookup helper, and adds a Criterion benchmark to quantify allocation scenarios.
Changes:
- Add `#[inline(always)]` to `size_class_index_for` to reduce per-allocation call overhead and enable better constant folding.
- Add a new Criterion benchmark (`arena3_alloc`) covering bump-style allocation, bulk allocation, and free-list reuse scenarios.
- Register the new benchmark target in `oscars/Cargo.toml`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `oscars/src/alloc/arena3/mod.rs` | Forces inlining of size-class index computation used by `try_alloc`. |
| `oscars/benches/arena3_alloc.rs` | Introduces new Criterion benchmark scenarios for arena3 allocation behavior. |
| `oscars/Cargo.toml` | Adds a `[[bench]]` entry to enable running `arena3_alloc` with Criterion. |
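The exact manifest entry is not shown in this view, but a typical Criterion `[[bench]]` registration (target name taken from the PR description) would look like:

```toml
# Assumed shape of the new entry in oscars/Cargo.toml.
# harness = false is required so Criterion can supply its own main.
[[bench]]
name = "arena3_alloc"
harness = false
```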
```rust
let mut alloc = ArenaAllocator::default().with_arena_size(1024 * 1024 * 10);
b.iter(|| {
    for i in 0..1000 {
        let ptr = alloc.try_alloc(bb(i)).unwrap();
        bb(ptr);
    }
});
```
In Scenario B, the allocator is created outside b.iter, so each iteration keeps allocating additional objects into the same ArenaAllocator. Over many Criterion iterations this can continuously grow memory usage (and may eventually OOM / change behavior once new pools are allocated), which will skew results and make the benchmark unstable. Consider using iter_batched/iter_batched_ref to construct a fresh allocator per measured iteration, or otherwise reset/free between iterations so each sample measures the same fixed workload.
Suggested change:

```diff
-let mut alloc = ArenaAllocator::default().with_arena_size(1024 * 1024 * 10);
-b.iter(|| {
-    for i in 0..1000 {
-        let ptr = alloc.try_alloc(bb(i)).unwrap();
-        bb(ptr);
-    }
-});
+b.iter_batched(
+    || ArenaAllocator::default().with_arena_size(1024 * 1024 * 10),
+    |mut alloc| {
+        for i in 0..1000 {
+            let ptr = alloc.try_alloc(bb(i)).unwrap();
+            bb(ptr);
+        }
+    },
+    criterion::BatchSize::SmallInput,
+);
```
oscars/benches/arena3_alloc.rs
Outdated
```rust
use criterion::{criterion_group, criterion_main, Criterion};
use oscars::alloc::arena3::ArenaAllocator;
use std::hint::black_box as bb;
```
This benchmark uses std::hint::black_box (aliased as bb), but the other Criterion benches in this repo import and use criterion::black_box (e.g. oscars/benches/arena2_vs_arena3.rs:1 and oscars/benches/oscars_vs_boa_gc.rs:1). To stay consistent across benches, prefer using Criterion’s black_box here too.
Suggested change:

```diff
-use criterion::{criterion_group, criterion_main, Criterion};
-use oscars::alloc::arena3::ArenaAllocator;
-use std::hint::black_box as bb;
+use criterion::{black_box as bb, criterion_group, criterion_main, Criterion};
+use oscars::alloc::arena3::ArenaAllocator;
```
Signed-off-by: mrhapile <allinonegaming3456@gmail.com>