
HDDS-14816. Add Recon AI Assistant backend foundation with Google Gemini integration#9915

Draft
ArafatKhan2198 wants to merge 6 commits into apache:master from ArafatKhan2198:HDDS-14816

Conversation


@ArafatKhan2198 ArafatKhan2198 commented Mar 13, 2026

What changes were proposed in this pull request?

What the Recon AI Chatbot Does (For Context)

The Recon AI Assistant is an intelligent query interface that bridges the gap between natural language and Recon's complex REST APIs. Instead of requiring administrators to memorize API endpoints and manually parse JSON payloads, the chatbot acts as an autonomous agent.

Example Scenario:
User Question: "How many datanodes are unhealthy?"

The Traditional Workflow:

  1. Administrator identifies the correct API endpoint (/containers/unhealthy or /datanodes).
  2. Administrator executes a curl command.
  3. Administrator manually parses the returned JSON payload.
  4. Administrator aggregates the data to find the answer.

The New Chatbot Workflow:

  1. The user asks the question in plain English via the Recon UI or Chatbot API.
  2. The Chatbot Agent leverages the LLMDispatcher to consult the AI model.
  3. The AI autonomously selects the correct internal Recon API based on the system guide and executes it.
  4. The backend synthesizes the resulting JSON and returns a concise, human-readable summary back to the user.

Recon AI Chatbot: Backend Architecture Guide - https://docs.google.com/document/d/1jyYZz0llKwN7lexJzVRjxfzl1Iwdbx8dABwLXGzAajQ/edit?tab=t.0

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14816

How was this patch tested?


This patch was tested by configuring all three supported AI providers (Google Gemini, OpenAI, and Anthropic) via Docker configuration, and programmatically looping through 9 different AI models to verify that the LLMDispatcher correctly routes natural language queries to the appropriate provider without requiring the user to specify it.

1. Configuration Setup

To test this locally, reviewers can configure their API keys by adding the following properties to their docker-config file:

# Set the desired default provider (optional, defaults to gemini)
OZONE-SITE.XML_ozone.recon.chatbot.provider=gemini

# Provider API Keys
OZONE-SITE.XML_ozone.recon.chatbot.gemini.api.key=YOUR_GEMINI_API_KEY
OZONE-SITE.XML_ozone.recon.chatbot.openai.api.key=YOUR_OPENAI_API_KEY
OZONE-SITE.XML_ozone.recon.chatbot.anthropic.api.key=YOUR_CLAUDE_API_KEY

2. API Verification Script

A shell script was used to hit the Chatbot REST API (/api/v1/chatbot/chat), iterating through a predefined list of models to ensure proxying worked seamlessly across all client implementations.

for model in "gemini-2.5-pro" "gemini-2.5-flash" "gemini-3-flash-preview" "gemini-3.1-pro-preview" "gpt-4.1" "gpt-4.1-mini" "gpt-4.1-nano" "claude-opus-4-6" "claude-sonnet-4-6"; do
  echo "=================================================="
  echo "🧪 Testing Model: $model"
  echo "=================================================="

  curl -s -X POST http://localhost:9888/api/v1/chatbot/chat \
    -H "Content-Type: application/json" \
    -d '{
      "query": "Give me all the information there is about the current state of the cluster",      
      "model": "'"$model"'"
    }' | jq

  echo -e "\n"
done

Attached are the test results from evaluating each provider model by asking a question and recording its response: testingChatBot.txt

3. Test Results

  • Routing Validation: The LLMDispatcher successfully identified the intended provider based solely on the requested model name string (e.g., routing gpt-4.1 to the OpenAIClient, and gemini-2.5-flash to the GeminiClient).
  • Tool Execution Validation: All tested models successfully reasoned over the updated recon-api-guide.md, selected the correct Recon backend API endpoint (/clusterState), executed the API call internally, and summarized the JSON payload back into natural English.
  • Error Handling Validation: The dispatcher correctly identified and gracefully handled missing configuration errors when attempting to route to Anthropic models (claude-opus-4-6, claude-sonnet-4-6) without an API key present, returning expected JSON errors: {"error": "No API key configured for provider 'anthropic'."}

@ArafatKhan2198 ArafatKhan2198 changed the title Integrate an AI Assistant to Recon. HDDS-14816. Add Recon AI Assistant backend foundation with Google Gemini integration Mar 13, 2026
@devmadhuu devmadhuu self-requested a review March 17, 2026 07:12
@devmadhuu devmadhuu left a comment

@ArafatKhan2198 thanks for the patch. A few high-level comments from an initial review, please check.

Also one observation: per the output you attached in the PR description, comparing the various models' answers to the same question ("Give me all the information there is about the current state of the cluster"), I can see some wrong answers. E.g. gemini-3-flash-preview says "All 12 internal Recon tasks are functioning correctly". It is not clear where the model got 12 tasks from; it is surely hallucinating.

/**
* Set per-request by processQuery; used to inject provider hint.
*/
private volatile String currentProvider;

If this is a per-request provider, then this volatile field can create a race condition between two concurrent requests. ChatbotAgent is a singleton, so if this field is mutated per request in processQuery(), two threads will race on it. volatile prevents visibility issues but not atomicity across a request's full lifecycle.
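One way to avoid the shared mutable state is to carry the provider in an immutable per-request context that is passed down the call chain. A minimal sketch of the idea (class and method names here are hypothetical stand-ins, not the actual ChatbotAgent API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ProviderScopeSketch {
  // Immutable per-request context; safe to pass across helper methods.
  static final class RequestContext {
    final String provider;
    RequestContext(String provider) { this.provider = provider; }
  }

  // Stand-in for processQuery: the provider is threaded through
  // explicitly, so no singleton field is ever mutated.
  static String processQuery(String query, RequestContext ctx) {
    return "[" + ctx.provider + "] answer to: " + query;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    // Two concurrent requests with different providers cannot interfere.
    Future<String> a = pool.submit(() -> processQuery("q1", new RequestContext("gemini")));
    Future<String> b = pool.submit(() -> processQuery("q2", new RequestContext("openai")));
    System.out.println(a.get()); // [gemini] answer to: q1
    System.out.println(b.get()); // [openai] answer to: q2
    pool.shutdown();
  }
}
```

Passing the context as a parameter keeps the singleton stateless, which is simpler to reason about than ThreadLocal and survives hand-offs to worker threads.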

public ToolExecutor(OzoneConfiguration configuration) {
// Get Recon base URL from configuration
// Default to localhost for local development
this.reconBaseUrl = "http://localhost:9888";

This should be taken from OzoneConfiguration with the proper URL and port.
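A hedged sketch of what resolving the base URL from configuration could look like, using a plain Map as a stand-in for OzoneConfiguration and assuming the standard ozone.recon.http-address key (default 0.0.0.0:9888):

```java
import java.util.Map;

public class ReconUrlSketch {
  // Assumed key/default, mirroring Recon's HTTP address configuration.
  static final String KEY = "ozone.recon.http-address";
  static final String DEFAULT_ADDRESS = "0.0.0.0:9888";

  static String resolveBaseUrl(Map<String, String> conf) {
    String address = conf.getOrDefault(KEY, DEFAULT_ADDRESS);
    int sep = address.lastIndexOf(':');
    String host = address.substring(0, sep);
    String port = address.substring(sep + 1);
    // A bind-all address is not routable from a client; fall back to loopback.
    if ("0.0.0.0".equals(host)) {
      host = "localhost";
    }
    return "http://" + host + ":" + port;
  }

  public static void main(String[] args) {
    System.out.println(resolveBaseUrl(Map.of())); // http://localhost:9888
    System.out.println(resolveBaseUrl(Map.of(KEY, "recon.example.com:9888")));
  }
}
```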

HttpURLConnection conn = null;
try {
// Connect to the Recon URL
conn = (HttpURLConnection) new URL(url).openConnection();

I think when Kerberos/SPNEGO is enabled on Recon, every call from ToolExecutor will get a 401 Unauthorized, so we should use the existing auth mechanism.


private static final ObjectMapper MAPPER = new ObjectMapper();
private static final String ANTHROPIC_VERSION = "2023-06-01";
private static final String ANTHROPIC_BETA_CONTEXT = "context-1m-2025-08-07";

As per my understanding, this is a beta header provided by Anthropic that controls the 1-million-token context window. The header references August 2025, which (today being March 2026) is a point in the past, suggesting it was already published, but beta features can be gated or withdrawn. Shouldn't this be configurable or dynamically determined rather than hardcoded?


@Override
public List<String> getSupportedModels() {
return Arrays.asList("gemini-2.5-pro", "gemini-2.5-flash", "gemini-3-flash-preview", "gemini-3.1-pro-preview");

I feel these should not be hardcoded, as discussed earlier. Two problems: first, we don't know whether these models will be discontinued or renamed in future. Second, incorrect model names will silently route to the correct provider but then fail at the provider with a model-not-found error. Model lists should either be fetched from the provider's API at startup or be configurable.

private static final Logger LOG = LoggerFactory.getLogger(ChatbotAgent.class);

private static final ObjectMapper MAPPER = new ObjectMapper();
private static final Pattern JSON_PATTERN = Pattern.compile("\\{.*\\}", Pattern.DOTALL);

As per my understanding, this regex is used to extract the JSON of the first LLM tool call, but {.*} with DOTALL will match from the first { to the last } in the response. If the LLM returns reasoning text containing any JSON-like fragment before or after the tool call, this regex can return garbage or the wrong object. I would suggest a non-greedy match \{.*?\} scoped to the outermost brace level.
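The over-matching is easy to demonstrate, and a small balanced-brace scanner is one way to scope extraction to a single top-level object. An illustrative sketch (simplified: braces inside JSON string literals are not handled, and a real caller would still parse and validate each candidate before treating it as the tool call):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonExtractSketch {
  static final Pattern GREEDY = Pattern.compile("\\{.*\\}", Pattern.DOTALL);

  // Returns the first balanced top-level {...} block, or null if none.
  static String firstBalancedObject(String text) {
    int start = text.indexOf('{');
    if (start < 0) {
      return null;
    }
    int depth = 0;
    for (int i = start; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '{') {
        depth++;
      } else if (c == '}' && --depth == 0) {
        return text.substring(start, i + 1);
      }
    }
    return null; // unbalanced input
  }

  public static void main(String[] args) {
    String llm = "Thinking {aloud}... {\"tool\": \"clusterState\"} Done {bye}";
    Matcher m = GREEDY.matcher(llm);
    if (m.find()) {
      // Greedy match runs from the first '{' to the last '}': garbage.
      System.out.println(m.group());
    }
    // The scanner stops at the first complete object instead.
    System.out.println(firstBalancedObject(llm)); // {aloud}
  }
}
```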

* Creates the master rules (System Prompt) we send to the LLM during Step 1.
* Notice how we teach the LLM exactly what JSON to output!
*/
private String buildToolSelectionPrompt() {

Better to define these prompts in a resource file and load them, perhaps at the same location as the Recon API guide.

# Enable Recon Chatbot for testing
OZONE-SITE.XML_ozone.recon.chatbot.enabled=true
OZONE-SITE.XML_ozone.recon.chatbot.provider=gemini
OZONE-SITE.XML_ozone.recon.chatbot.gemini.api.key=YOUR_API_KEY_HERE

This is just a sample user key for testing locally. Better to add a comment that this key must not be committed and pushed to git, since anyone could use it if it were committed by mistake. Also, comment out this config by default in the docker config file here.


// Pass the user's question to the Brain (ChatbotAgent) to do all the hard work.
// This step takes a few seconds because it talks to Gemini and the Recon APIs.
String response = chatbotAgent.processQuery(

This whole end-to-end call is synchronous. Worst case: the first LLM tool-selection call takes the full 120 sec timeout, the second LLM summarization call takes another 120 secs, and between them there can be up to 5 Recon API calls at 30 secs each (30 x 5 = 150 secs), so processQuery can run for 120 + 150 + 120 ~ 6.5 mins. Not a usual scenario, but it can happen, and the thread for this HTTP request is blocked for the entire duration. Jetty (which Recon uses) has a fixed thread pool: if 10 users simultaneously send slow queries, 10 Jetty threads are blocked, and once the pool is exhausted, the Recon APIs backing all other pages cannot serve requests. I would suggest using JAX-RS's @Suspended AsyncResponse and a dedicated thread pool executor for submitting chatbot requests.
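The dedicated-pool idea can be sketched with stdlib primitives. In the real resource this would pair with JAX-RS @Suspended AsyncResponse resuming the response from a callback, but the offloading pattern itself looks roughly like this (pool size and the 300 sec cap are illustrative choices, not values from the patch):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChatbotAsyncSketch {
  // Small dedicated pool: slow chatbot queries queue here instead of
  // occupying Jetty's request threads.
  static final ExecutorService CHATBOT_POOL = Executors.newFixedThreadPool(4);

  static CompletableFuture<String> submitQuery(String query) {
    return CompletableFuture
        .supplyAsync(() -> slowProcessQuery(query), CHATBOT_POOL)
        // Hard end-to-end cap regardless of per-stage LLM/Recon timeouts.
        .orTimeout(300, TimeUnit.SECONDS);
  }

  // Stand-in for the two LLM calls plus the intermediate Recon API calls.
  static String slowProcessQuery(String query) {
    return "answer: " + query;
  }

  public static void main(String[] args) {
    // In the JAX-RS resource, a whenComplete callback would resume the
    // suspended AsyncResponse here instead of blocking on join().
    System.out.println(submitQuery("cluster state?").join());
    CHATBOT_POOL.shutdown();
  }
}
```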

// Hardcoded security timeouts. If Recon takes longer than 30 seconds to connect
// or return data, kill the request so we don't freeze the chatbot.
private static final int CONNECT_TIMEOUT_MS = 30_000;
private static final int READ_TIMEOUT_MS = 30_000;

These are not configurable? In large clusters where some Recon APIs are slow, e.g. with millions of unhealthy containers, 30 seconds may not be enough for Recon to respond, causing a timeout that looks like a chatbot failure when the cluster is actually healthy but just slow. These should be added as proper config keys in ChatbotConfigKeys, consistent with how the LLM timeout is handled.
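A minimal sketch of config-driven timeouts with defaults, using Properties as a stand-in for OzoneConfiguration (the key names are hypothetical suggestions, not existing ChatbotConfigKeys entries):

```java
import java.util.Properties;

public class TimeoutConfigSketch {
  // Hypothetical key names; the real ones would live in ChatbotConfigKeys.
  static final String CONNECT_KEY = "ozone.recon.chatbot.tool.connect.timeout.ms";
  static final String READ_KEY = "ozone.recon.chatbot.tool.read.timeout.ms";
  static final int DEFAULT_TIMEOUT_MS = 30_000;

  // Reads a timeout from config, falling back to the 30 sec default.
  static int timeoutMs(Properties conf, String key) {
    return Integer.parseInt(conf.getProperty(key, String.valueOf(DEFAULT_TIMEOUT_MS)));
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Operator raises the read timeout for a large, slow cluster.
    conf.setProperty(READ_KEY, "120000");
    System.out.println(timeoutMs(conf, CONNECT_KEY)); // 30000 (default)
    System.out.println(timeoutMs(conf, READ_KEY));    // 120000
  }
}
```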
