- Test model families at different sizes (7B → 70B+).
- Plot success rate and speedup vs. model parameters / compute.
- Include at least one reasoning model (o1-style) alongside non-reasoning models for comparison.
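
The aggregation behind such a plot could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a prescribed harness: the record fields (`model`, `params_b`, `reasoning`, `success`, `speedup`), the helper names `summarize` and `plot_scaling`, and the choice of a geometric mean over successful tasks are all hypothetical. The records at the bottom are synthetic placeholders that only exercise the code, not measured results.

```python
from collections import defaultdict
from statistics import geometric_mean


def summarize(records):
    """Aggregate per-task records into one row per model.

    Each record is assumed (hypothetically) to look like:
      {"model": str, "params_b": float, "reasoning": bool,
       "success": bool, "speedup": float}
    where speedup is measured against whatever baseline the study defines.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["model"], r["params_b"], r["reasoning"])].append(r)

    rows = []
    for (model, params_b, reasoning), tasks in groups.items():
        wins = [t for t in tasks if t["success"]]
        rows.append({
            "model": model,
            "params_b": params_b,
            "reasoning": reasoning,
            "success_rate": len(wins) / len(tasks),
            # Assumption: speedup is averaged (geometric mean) over successes only.
            "speedup": geometric_mean([t["speedup"] for t in wins]) if wins else None,
        })
    # Non-reasoning models first, then by parameter count.
    return sorted(rows, key=lambda row: (row["reasoning"], row["params_b"]))


def plot_scaling(rows, path="scaling.png"):
    """Plot success rate and speedup against parameter count (log x-axis)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, (ax_rate, ax_speed) = plt.subplots(1, 2, figsize=(9, 3.5))
    for reasoning, marker, label in ((False, "o", "non-reasoning"),
                                     (True, "s", "reasoning")):
        sel = [r for r in rows if r["reasoning"] == reasoning]
        ax_rate.plot([r["params_b"] for r in sel],
                     [r["success_rate"] for r in sel],
                     marker=marker, label=label)
        timed = [r for r in sel if r["speedup"] is not None]
        ax_speed.plot([r["params_b"] for r in timed],
                      [r["speedup"] for r in timed],
                      marker=marker, label=label)
    for ax, ylabel in ((ax_rate, "success rate"), (ax_speed, "speedup")):
        ax.set_xscale("log")
        ax.set_xlabel("parameters (billions)")
        ax.set_ylabel(ylabel)
        ax.legend()
    fig.tight_layout()
    fig.savefig(path)


# Synthetic placeholder records, for illustration only (not real results).
SYNTHETIC_RECORDS = [
    {"model": "family-a-7b", "params_b": 7, "reasoning": False, "success": True, "speedup": 2.0},
    {"model": "family-a-7b", "params_b": 7, "reasoning": False, "success": False, "speedup": 0.0},
    {"model": "family-a-70b", "params_b": 70, "reasoning": False, "success": True, "speedup": 2.0},
    {"model": "family-a-70b", "params_b": 70, "reasoning": False, "success": True, "speedup": 8.0},
]

if __name__ == "__main__":
    for row in summarize(SYNTHETIC_RECORDS):
        print(row)
```

Calling `plot_scaling(summarize(records))` writes a two-panel figure with one line per model class. If the parameter count of a reasoning model is not public, it would need a separate x-axis (for example, compute) or a categorical marker, since this sketch assumes `params_b` is always a number.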