
Conversation

@anyuzoey
Collaborator

Overview

This PR adds faketooltest.py, a script designed to evaluate how well LlamaStack handles an increasing number of tools by measuring tool selection accuracy, execution success, and latency.

This was a custom experiment to explore LlamaStack's scalability. We are now closing out this approach and shifting focus to more systematic evaluations using established benchmarks for a more comprehensive analysis.

Experiment Setup

  • 5 Real Tools: Weather info, word count, string reversal, uppercase conversion, insurance scoring.
  • Fake Tools: Dynamically generated tools with random outputs (up to 40 additional tools).
  • 5 Fixed Queries: Each mapped to a ground truth tool.
  • Scaling: Start with 5 tools, increase by 5 up to 45.
  • Metrics Logged:
    • Exception Rate (how many exceptions occur across the 5 queries)
    • Tool Execution Success Rate (how many times a tool is actually executed across the 5 queries)
    • Correct Tool Selection Rate (how many times the correct tool is selected across the 5 queries)
    • Average Latency (average time taken to respond to the 5 queries)
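
A minimal sketch of this setup is shown below. It is illustrative only: the names make_fake_tool, REAL_TOOLS, QUERIES, and run_query are hypothetical placeholders, not the actual code in faketooltest.py.

import random
import time

def make_fake_tool(i: int):
    """Build a no-argument fake tool that returns a random string."""
    def fake_tool() -> str:
        return f"fake-output-{random.randint(0, 9999)}"
    fake_tool.__name__ = f"fake_tool_{i}"
    fake_tool.__doc__ = f"Fake tool number {i}; returns a random string."
    return fake_tool

REAL_TOOLS = []   # the 5 real tools (weather, word count, reversal, uppercase, insurance)
QUERIES = []      # 5 fixed (query, ground_truth_tool_name) pairs

for total in range(5, 50, 5):  # 5, 10, ..., 45 tools
    tools = REAL_TOOLS + [make_fake_tool(i) for i in range(total - len(REAL_TOOLS))]
    exceptions = executed = correct = 0
    start = time.time()
    for query, expected in QUERIES:
        try:
            selected = run_query(query, tools)  # hypothetical: one agent turn, returning the name of the tool it called
            executed += 1
            correct += int(selected == expected)
        except Exception:
            exceptions += 1
    avg_latency = (time.time() - start) / max(len(QUERIES), 1)
    print(total, exceptions, executed, correct, avg_latency)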

Run the Experiment

python faketooltest.py

Results are saved in experiment_results.csv for analysis.
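
For quick analysis, the CSV can be loaded with pandas. The column names below (num_tools, correct_tool_rate) are assumptions about the file layout, not confirmed from the script.

import pandas as pd

df = pd.read_csv("experiment_results.csv")
print(df.head())

# Example: plot tool-selection accuracy against the number of tools
# (requires matplotlib; column names are assumed).
ax = df.plot(x="num_tools", y="correct_tool_rate", marker="o")
ax.set_xlabel("number of tools")
ax.set_ylabel("correct tool selection rate")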

Conclusion

The current test concludes that the Llama 3B model can handle at most 25 tools, with an average latency of 0.178 s per query (served locally).

Limitations

  • Fake tools are highly similar to one another and take no parameters, making them easy to distinguish from the real tools.
  • Only 5 queries, limiting diversity in tool usage.
  • Model may perform better here than in real-world scenarios with more diverse tools.

Next Steps

  • Move to a proper benchmark with a broader toolset.
  • Incorporate realistic tool diversity to stress test selection accuracy.
  • Compare results across different model sizes to assess generalization.

@anyuzoey anyuzoey requested a review from suppathak March 12, 2025 12:08
@MichaelClifford MichaelClifford self-requested a review March 17, 2025 18:01

@anyuzoey
Collaborator Author

💡 Key Insights So Far

  • (26 Mar) Ruled out temperature as a factor: even with the temperature set to 0.001, we observed maxima of 11, 16, and 15 tools across three runs. Temporarily shifting focus to MCP tasks; wrapping up the latest updates here and will revisit later.

  • (24 Mar) Last week the 3B model consistently handled 24 tools, but this week, with v0.1.8, it handled 11, 18, and 23 tools in three different runs. We suspect temperature-related parameters were changed for the 3B model. The 8B model will be tested to see whether it follows the same pattern. A draft token-count script, count_token.ipynb, has been created.

    • Findings: v0.1.8 currently exposes token metrics, but only for the client.inference.chat_completion function. That covers just the first of the three steps performed by response = agent.create_turn( when tool calls are involved.
    • Still working out how to properly count the tokens used by tool sets the LlamaStack way (a rough tokenizer-based estimate is sketched after this list).
  • (20 Mar)

    • Improved the maxtool test script with a diverse fake tool generation method.
    • Refined the script for later scale experiments with logs and switched to an IPython notebook for better visualization.
    • Findings:
      • LLaMA-8B can handle around 21 tools (the 3B about 24) before misidentifying the correct one.
      • Extending tool descriptions reduced that number to 18, suggesting performance is bounded by docstring length.
      • Extending tool names reduced it further to 17, suggesting performance is also bounded by tool-name length.
      • Extending the tool return message had no effect.
      • (suspect) Models may either:
        • Prioritize later tools in prompt context (due to recency bias).
        • Or, after exceeding a threshold, fail to abstract and match any tools, even among the first few.
      • Even when inference still returns a response, the selected tool may be incorrect or invalid.
      • This led us to investigate the token size of tool definitions.
      • Local and cluster-hosted models (e.g., on NERC) behave differently, even for identical 3B models, likely due to variations in runtime or configuration (e.g., the token limit in vLLM's run.yaml).
  • (by 11 Mar) Developed the initial max-tool test script. However, the fake tools lacked diversity, resulting in overly optimistic max tool counts. Also spent time reading and surveying existing benchmark literature.
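
As an interim workaround for the token-counting gap noted above (24 Mar), the prompt cost of a tool set can be estimated offline with a Hugging Face tokenizer. This is only a rough sketch under assumptions: the tokenizer name and the JSON serialization of tool definitions are illustrative, and the count will not exactly match what LlamaStack actually sends to the model.

import json
from transformers import AutoTokenizer

# Tokenizer choice is an assumption; the Llama 3.2 repo is gated and may require access approval.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def tool_tokens(tool_definitions: list[dict]) -> int:
    """Estimate the tokens consumed by JSON-serialized tool definitions."""
    return len(tokenizer.encode(json.dumps(tool_definitions)))

# Hypothetical tool definition, just to illustrate the call.
example_tools = [
    {
        "name": "get_weather",
        "description": "Return weather info for a city",
        "parameters": {"city": {"type": "string"}},
    }
]
print(tool_tokens(example_tools))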

@anyuzoey anyuzoey self-assigned this Mar 26, 2025