
HDDS-14816. Add Recon AI Assistant backend foundation with Google Gemini integration#9915

Draft
ArafatKhan2198 wants to merge 6 commits into apache:master from ArafatKhan2198:HDDS-14816

Conversation


@ArafatKhan2198 ArafatKhan2198 commented Mar 13, 2026

What changes were proposed in this pull request?

What the Recon AI Chatbot Does (For Context)

The Recon AI Assistant is an intelligent query interface that bridges the gap between natural language and Recon's complex REST APIs. Instead of requiring administrators to memorize API endpoints and manually parse JSON payloads, the chatbot acts as an autonomous agent.

Example Scenario:
User Question: "How many datanodes are unhealthy?"

The Traditional Workflow:

  1. Administrator identifies the correct API endpoint (/containers/unhealthy or /datanodes).
  2. Administrator executes a curl command.
  3. Administrator manually parses the returned JSON payload.
  4. Administrator aggregates the data to find the answer.

The New Chatbot Workflow:

  1. The user asks the question in plain English via the Recon UI or Chatbot API.
  2. The Chatbot Agent leverages the LLMDispatcher to consult the AI model.
  3. The AI autonomously selects the correct internal Recon API based on the system guide and executes it.
  4. The backend synthesizes the resulting JSON and returns a concise, human-readable summary back to the user.

Recon AI Chatbot: Backend Architecture Guide - https://docs.google.com/document/d/1jyYZz0llKwN7lexJzVRjxfzl1Iwdbx8dABwLXGzAajQ/edit?tab=t.0

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14816

How was this patch tested?


This patch was tested by configuring all three supported AI providers (Google Gemini, OpenAI, and Anthropic) via Docker configuration, and programmatically looping through 9 different AI models to verify that the LLMDispatcher correctly routes natural language queries to the appropriate provider without requiring the user to specify it.

1. Configuration Setup

To test this locally, reviewers can configure their API keys by adding the following properties to their docker-config file:

# Set the desired default provider (optional, defaults to gemini)
OZONE-SITE.XML_ozone.recon.chatbot.provider=gemini

# Provider API Keys
OZONE-SITE.XML_ozone.recon.chatbot.gemini.api.key=YOUR_GEMINI_API_KEY
OZONE-SITE.XML_ozone.recon.chatbot.openai.api.key=YOUR_OPENAI_API_KEY
OZONE-SITE.XML_ozone.recon.chatbot.anthropic.api.key=YOUR_CLAUDE_API_KEY

2. API Verification Script

A shell script was used to hit the Chatbot REST API (/api/v1/chatbot/chat), iterating through a predefined list of models to ensure proxying worked seamlessly across all client implementations.

for model in "gemini-2.5-pro" "gemini-2.5-flash" "gemini-3-flash-preview" "gemini-3.1-pro-preview" "gpt-4.1" "gpt-4.1-mini" "gpt-4.1-nano" "claude-opus-4-6" "claude-sonnet-4-6"; do
  echo "=================================================="
  echo "🧪 Testing Model: $model"
  echo "=================================================="

  curl -s -X POST http://localhost:9888/api/v1/chatbot/chat \
    -H "Content-Type: application/json" \
    -d '{
      "query": "Give me all the information there is about the current state of the cluster",      
      "model": "'"$model"'"
    }' | jq

  echo -e "\n"
done

Attached are the test results from evaluating each provider model by asking a question and recording its response: testingChatBot.txt

3. Test Results

  • Routing Validation: The LLMDispatcher successfully identified the intended provider based solely on the requested model name string (e.g., routing gpt-4.1 to the OpenAIClient, and gemini-2.5-flash to the GeminiClient).
  • Tool Execution Validation: All tested models successfully reasoned over the updated recon-api-guide.md, selected the correct Recon backend API endpoint (/clusterState), executed the API call internally, and summarized the JSON payload back into natural English.
  • Error Handling Validation: The dispatcher correctly identified and gracefully handled missing configuration errors when attempting to route to Anthropic models (claude-opus-4-6, claude-sonnet-4-6) without an API key present, returning expected JSON errors: {"error": "No API key configured for provider 'anthropic'."}

@ArafatKhan2198 ArafatKhan2198 changed the title Integrate an AI Assistant to Recon. HDDS-14816. Add Recon AI Assistant backend foundation with Google Gemini integration Mar 13, 2026
@devmadhuu devmadhuu self-requested a review March 17, 2026 07:12
@devmadhuu devmadhuu left a comment

@ArafatKhan2198 thanks for the patch. A few high-level comments from an initial review, please check.

Also one observation: per the output you attached in the PR description, comparing the various models' answers to the same question ("Give me all the information there is about the current state of the cluster"), I can see some wrong answers. E.g. gemini-3-flash-preview says "All 12 internal Recon tasks are functioning correctly". It is not clear where the model got 12 tasks from; it is surely hallucinating.

/**
* Set per-request by processQuery; used to inject provider hint.
*/
private volatile String currentProvider;

If this is a per-request provider, then this volatile field can create a race condition between two concurrent requests. ChatbotAgent is a singleton, so if this field is mutated per request in processQuery(), two threads will race on it. volatile prevents visibility issues but not atomicity across a request's full lifecycle.
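One way to avoid the shared mutable state is to carry the provider in an immutable per-request context that is passed down the call chain. A minimal sketch of the idea (class and method names here are hypothetical stand-ins, not the actual ChatbotAgent API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ProviderScopeSketch {
  // Immutable per-request context; safe to pass across helper methods.
  static final class RequestContext {
    final String provider;
    RequestContext(String provider) { this.provider = provider; }
  }

  // Stand-in for processQuery: the provider is threaded through
  // explicitly, so no singleton field is ever mutated.
  static String processQuery(String query, RequestContext ctx) {
    return "[" + ctx.provider + "] answer to: " + query;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    // Two concurrent requests with different providers cannot interfere.
    Future<String> a = pool.submit(() -> processQuery("q1", new RequestContext("gemini")));
    Future<String> b = pool.submit(() -> processQuery("q2", new RequestContext("openai")));
    System.out.println(a.get()); // [gemini] answer to: q1
    System.out.println(b.get()); // [openai] answer to: q2
    pool.shutdown();
  }
}
```

Passing the context as a parameter keeps the singleton stateless, which is simpler to reason about than ThreadLocal and survives hand-offs to worker threads.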

public ToolExecutor(OzoneConfiguration configuration) {
// Get Recon base URL from configuration
// Default to localhost for local development
this.reconBaseUrl = "http://localhost:9888";

This should be taken from OzoneConfiguration with the proper URL and port.
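A hedged sketch of what resolving the base URL from configuration could look like, using a plain Map as a stand-in for OzoneConfiguration and assuming the standard ozone.recon.http-address key (default 0.0.0.0:9888):

```java
import java.util.Map;

public class ReconUrlSketch {
  // Assumed key/default, mirroring Recon's HTTP address configuration.
  static final String KEY = "ozone.recon.http-address";
  static final String DEFAULT_ADDRESS = "0.0.0.0:9888";

  static String resolveBaseUrl(Map<String, String> conf) {
    String address = conf.getOrDefault(KEY, DEFAULT_ADDRESS);
    int sep = address.lastIndexOf(':');
    String host = address.substring(0, sep);
    String port = address.substring(sep + 1);
    // A bind-all address is not routable from a client; fall back to loopback.
    if ("0.0.0.0".equals(host)) {
      host = "localhost";
    }
    return "http://" + host + ":" + port;
  }

  public static void main(String[] args) {
    System.out.println(resolveBaseUrl(Map.of())); // http://localhost:9888
    System.out.println(resolveBaseUrl(Map.of(KEY, "recon.example.com:9888")));
  }
}
```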

HttpURLConnection conn = null;
try {
// Connect to the Recon URL
conn = (HttpURLConnection) new URL(url).openConnection();

I think when Kerberos/SPNEGO is enabled on Recon, every call from ToolExecutor will get a 401 Unauthorized, so we should use the existing auth mechanism.


private static final ObjectMapper MAPPER = new ObjectMapper();
private static final String ANTHROPIC_VERSION = "2023-06-01";
private static final String ANTHROPIC_BETA_CONTEXT = "context-1m-2025-08-07";

As per my understanding, this is a beta header provided by Anthropic that controls the 1-million-token context window. The header references August 2025, which (today being March 2026) is a point in the past, suggesting it was already published, but beta features can be gated or withdrawn. Shouldn't this be configurable or dynamically determined rather than hardcoded?


@Override
public List<String> getSupportedModels() {
return Arrays.asList("gemini-2.5-pro", "gemini-2.5-flash", "gemini-3-flash-preview", "gemini-3.1-pro-preview");

I feel these should not be hardcoded, as discussed earlier. Two problems: first, we don't know whether these models will be discontinued or renamed in future. Second, incorrect model names will silently route to the correct provider but then fail at the provider with a model-not-found error. Model lists should either be fetched from the provider's API at startup or be configurable.

private static final Logger LOG = LoggerFactory.getLogger(ChatbotAgent.class);

private static final ObjectMapper MAPPER = new ObjectMapper();
private static final Pattern JSON_PATTERN = Pattern.compile("\\{.*\\}", Pattern.DOTALL);

As per my understanding, this regex is used to extract the JSON of the first LLM tool call, but {.*} with DOTALL will match from the first { to the last } in the response. If the LLM returns reasoning text containing any JSON-like fragment before or after the tool call, this regex can return garbage or the wrong object. I would suggest a non-greedy match \{.*?\} scoped to the outermost brace level.
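The over-matching is easy to demonstrate, and a small balanced-brace scanner is one way to scope extraction to a single top-level object. An illustrative sketch (simplified: braces inside JSON string literals are not handled, and a real caller would still parse and validate each candidate before treating it as the tool call):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonExtractSketch {
  static final Pattern GREEDY = Pattern.compile("\\{.*\\}", Pattern.DOTALL);

  // Returns the first balanced top-level {...} block, or null if none.
  static String firstBalancedObject(String text) {
    int start = text.indexOf('{');
    if (start < 0) {
      return null;
    }
    int depth = 0;
    for (int i = start; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '{') {
        depth++;
      } else if (c == '}' && --depth == 0) {
        return text.substring(start, i + 1);
      }
    }
    return null; // unbalanced input
  }

  public static void main(String[] args) {
    String llm = "Thinking {aloud}... {\"tool\": \"clusterState\"} Done {bye}";
    Matcher m = GREEDY.matcher(llm);
    if (m.find()) {
      // Greedy match runs from the first '{' to the last '}': garbage.
      System.out.println(m.group());
    }
    // The scanner stops at the first complete object instead.
    System.out.println(firstBalancedObject(llm)); // {aloud}
  }
}
```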

* Creates the master rules (System Prompt) we send to the LLM during Step 1.
* Notice how we teach the LLM exactly what JSON to output!
*/
private String buildToolSelectionPrompt() {

Better to define these prompts in a resource file and load them, perhaps at the same location as the Recon API guide.

# Enable Recon Chatbot for testing
OZONE-SITE.XML_ozone.recon.chatbot.enabled=true
OZONE-SITE.XML_ozone.recon.chatbot.provider=gemini
OZONE-SITE.XML_ozone.recon.chatbot.gemini.api.key=YOUR_API_KEY_HERE

This is just a sample user key for testing locally. Better to add a comment that this key must not be committed and pushed to git, since anyone could use it if it were committed by mistake. Also, comment out this config by default in the docker config file here.


// Pass the user's question to the Brain (ChatbotAgent) to do all the hard work.
// This step takes a few seconds because it talks to Gemini and the Recon APIs.
String response = chatbotAgent.processQuery(

This whole end-to-end call is synchronous. Worst case: the first LLM tool-selection call takes the full 120 sec timeout, the second LLM summarization call takes another 120 secs, and between them there can be up to 5 Recon API calls at 30 secs each (30 x 5 = 150 secs), so processQuery can run for 120 + 150 + 120 ~ 6.5 mins. Not a usual scenario, but it can happen, and the thread for this HTTP request is blocked for the entire duration. Jetty (which Recon uses) has a fixed thread pool: if 10 users simultaneously send slow queries, 10 Jetty threads are blocked, and once the pool is exhausted, the Recon APIs backing all other pages cannot serve requests. I would suggest using JAX-RS's @Suspended AsyncResponse and a dedicated thread pool executor for submitting chatbot requests.
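The dedicated-pool idea can be sketched with stdlib primitives. In the real resource this would pair with JAX-RS @Suspended AsyncResponse resuming the response from a callback, but the offloading pattern itself looks roughly like this (pool size and the 300 sec cap are illustrative choices, not values from the patch):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChatbotAsyncSketch {
  // Small dedicated pool: slow chatbot queries queue here instead of
  // occupying Jetty's request threads.
  static final ExecutorService CHATBOT_POOL = Executors.newFixedThreadPool(4);

  static CompletableFuture<String> submitQuery(String query) {
    return CompletableFuture
        .supplyAsync(() -> slowProcessQuery(query), CHATBOT_POOL)
        // Hard end-to-end cap regardless of per-stage LLM/Recon timeouts.
        .orTimeout(300, TimeUnit.SECONDS);
  }

  // Stand-in for the two LLM calls plus the intermediate Recon API calls.
  static String slowProcessQuery(String query) {
    return "answer: " + query;
  }

  public static void main(String[] args) {
    // In the JAX-RS resource, a whenComplete callback would resume the
    // suspended AsyncResponse here instead of blocking on join().
    System.out.println(submitQuery("cluster state?").join());
    CHATBOT_POOL.shutdown();
  }
}
```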

// Hardcoded security timeouts. If Recon takes longer than 30 seconds to connect
// or return data, kill the request so we don't freeze the chatbot.
private static final int CONNECT_TIMEOUT_MS = 30_000;
private static final int READ_TIMEOUT_MS = 30_000;

These are not configurable? In large clusters where some Recon APIs are slow, e.g. with millions of unhealthy containers, 30 seconds may not be enough for Recon to respond, causing a timeout that looks like a chatbot failure when the cluster is actually healthy but just slow. These should be added as proper config keys in ChatbotConfigKeys, consistent with how the LLM timeout is handled.
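A minimal sketch of config-driven timeouts with defaults, using Properties as a stand-in for OzoneConfiguration (the key names are hypothetical suggestions, not existing ChatbotConfigKeys entries):

```java
import java.util.Properties;

public class TimeoutConfigSketch {
  // Hypothetical key names; the real ones would live in ChatbotConfigKeys.
  static final String CONNECT_KEY = "ozone.recon.chatbot.tool.connect.timeout.ms";
  static final String READ_KEY = "ozone.recon.chatbot.tool.read.timeout.ms";
  static final int DEFAULT_TIMEOUT_MS = 30_000;

  // Reads a timeout from config, falling back to the 30 sec default.
  static int timeoutMs(Properties conf, String key) {
    return Integer.parseInt(conf.getProperty(key, String.valueOf(DEFAULT_TIMEOUT_MS)));
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Operator raises the read timeout for a large, slow cluster.
    conf.setProperty(READ_KEY, "120000");
    System.out.println(timeoutMs(conf, CONNECT_KEY)); // 30000 (default)
    System.out.println(timeoutMs(conf, READ_KEY));    // 120000
  }
}
```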
