
Conversation

@anyuzoey
Collaborator

Overview

This PR adds faketooltest.py, a script designed to evaluate how well LlamaStack handles an increasing number of tools by measuring tool selection accuracy, execution success, and latency.

This was a custom experiment to explore LlamaStack's scalability. We are now closing out this approach and shifting focus to more systematic evaluations using established benchmarks for a more comprehensive analysis.

Experiment Setup

  • 5 Real Tools: Weather info, word count, string reversal, uppercase conversion, insurance scoring.
  • Fake Tools: Dynamically generated tools with random outputs (up to 40 additional tools).
  • 5 Fixed Queries: Each mapped to a ground truth tool.
  • Scaling: Start with 5 tools, increase by 5 up to 45.
  • Metrics Logged:
    • Exception Rate (how many exceptions occur across the 5 queries)
    • Tool Execution Success Rate (how many times a tool is actually executed across the 5 queries)
    • Correct Tool Selection Rate (how many times the correct tool is selected across the 5 queries)
    • Average Latency (average time taken to respond to the 5 queries)
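
A minimal sketch of this setup is shown below. It is illustrative only: the names make_fake_tool, REAL_TOOLS, QUERIES, and run_query are hypothetical placeholders, not the actual code in faketooltest.py.

import random
import time

def make_fake_tool(i: int):
    """Build a no-argument fake tool that returns a random string."""
    def fake_tool() -> str:
        return f"fake-output-{random.randint(0, 9999)}"
    fake_tool.__name__ = f"fake_tool_{i}"
    fake_tool.__doc__ = f"Fake tool number {i}; returns a random string."
    return fake_tool

REAL_TOOLS = []   # the 5 real tools (weather, word count, reversal, uppercase, insurance)
QUERIES = []      # 5 fixed (query, ground_truth_tool_name) pairs

for total in range(5, 50, 5):  # 5, 10, ..., 45 tools
    tools = REAL_TOOLS + [make_fake_tool(i) for i in range(total - len(REAL_TOOLS))]
    exceptions = executed = correct = 0
    start = time.time()
    for query, expected in QUERIES:
        try:
            selected = run_query(query, tools)  # hypothetical: one agent turn, returning the name of the tool it called
            executed += 1
            correct += int(selected == expected)
        except Exception:
            exceptions += 1
    avg_latency = (time.time() - start) / max(len(QUERIES), 1)
    print(total, exceptions, executed, correct, avg_latency)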

Run the Experiment

python faketooltest.py

Results are saved in experiment_results.csv for analysis.
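
For quick analysis, the CSV can be loaded with pandas. The column names below (num_tools, correct_tool_rate) are assumptions about the file layout, not confirmed from the script.

import pandas as pd

df = pd.read_csv("experiment_results.csv")
print(df.head())

# Example: plot tool-selection accuracy against the number of tools
# (requires matplotlib; column names are assumed).
ax = df.plot(x="num_tools", y="correct_tool_rate", marker="o")
ax.set_xlabel("number of tools")
ax.set_ylabel("correct tool selection rate")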

Conclusion

The current test concludes that the Llama 3B model can handle at most 25 tools, with an average latency of 0.178 s per query (served locally).

Limitations

  • Fake tools are highly similar to one another and take no parameters, making them easy to distinguish from the real tools.
  • Only 5 queries, limiting diversity in tool usage.
  • Model may perform better here than in real-world scenarios with more diverse tools.

Next Steps

  • Move to a proper benchmark with a broader toolset.
  • Incorporate realistic tool diversity to stress test selection accuracy.
  • Compare results across different model sizes to assess generalization.

@anyuzoey anyuzoey requested a review from suppathak March 12, 2025 12:08
@MichaelClifford MichaelClifford self-requested a review March 17, 2025 18:01

@anyuzoey
Collaborator Author

💡 Key Insights So Far

  • (26 Mar) Ruled out temperature as a factor: even with the temperature set to 0.001, we observed maxima of 11, 16, and 15 tools across three runs. Temporarily shifting focus to MCP tasks; wrapping up the latest updates here and will revisit later.

  • (24 Mar) Last week the 3B model consistently handled 24 tools, but this week, with v0.1.8, it handled 11, 18, and 23 tools in three different runs. We suspect temperature-related parameters were changed for the 3B model. The 8B model will be tested to see whether it follows the same pattern. A draft token-count script, count_token.ipynb, has been created.

    • Findings: v0.1.8 currently exposes token metrics, but only for the client.inference.chat_completion function. That covers just the first of the three steps performed by response = agent.create_turn( when tool calls are involved.
    • Still working out how to properly count the tokens used by tool sets the LlamaStack way (a rough tokenizer-based estimate is sketched after this list).
  • (20 Mar)

    • Improved the maxtool test script with a diverse fake tool generation method.
    • Refined the script for later scale experiments with logs and switched to an IPython notebook for better visualization.
    • Findings:
      • LLaMA-8B can handle around 21 tools (the 3B about 24) before misidentifying the correct one.
      • Extending tool descriptions reduced that number to 18, suggesting performance is bounded by docstring length.
      • Extending tool names reduced it further to 17, suggesting performance is also bounded by tool-name length.
      • Extending the tool return message had no effect.
      • (suspect) Models may either:
        • Prioritize later tools in prompt context (due to recency bias).
        • Or, after exceeding a threshold, fail to abstract and match any tools, even among the first few.
      • Even when inference still returns a response, the selected tool may be incorrect or invalid.
      • This led us to investigate the token size of tool definitions.
      • Local and cluster-hosted models (e.g., on NERC) behave differently, even for identical 3B models, likely due to variations in runtime or configuration (e.g., the token limit in vLLM's run.yaml).
  • (by 11 Mar) Developed the initial max-tool test script. However, the fake tools lacked diversity, resulting in overly optimistic max tool counts. Also spent time reading and surveying existing benchmark literature.
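
As an interim workaround for the token-counting gap noted above (24 Mar), the prompt cost of a tool set can be estimated offline with a Hugging Face tokenizer. This is only a rough sketch under assumptions: the tokenizer name and the JSON serialization of tool definitions are illustrative, and the count will not exactly match what LlamaStack actually sends to the model.

import json
from transformers import AutoTokenizer

# Tokenizer choice is an assumption; the Llama 3.2 repo is gated and may require access approval.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def tool_tokens(tool_definitions: list[dict]) -> int:
    """Estimate the tokens consumed by JSON-serialized tool definitions."""
    return len(tokenizer.encode(json.dumps(tool_definitions)))

# Hypothetical tool definition, just to illustrate the call.
example_tools = [
    {
        "name": "get_weather",
        "description": "Return weather info for a city",
        "parameters": {"city": {"type": "string"}},
    }
]
print(tool_tokens(example_tools))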

@anyuzoey anyuzoey self-assigned this Mar 26, 2025