
Add image description for grounding#53

Open
aubrypaul wants to merge 5 commits into main from vision-description-for-grounding

Conversation

@aubrypaul
Contributor

No description provided.

@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Gemini Vision support: images are analyzed to produce text summaries before retrieval (RAG).
    • Public API to customize the vision prompt used for image descriptions.
  • Bug Fixes

    • Improved image MIME type detection and clearer errors for unsupported formats.
    • Ensured image-to-text conversion runs before retrieval and standardized image payload handling.

Walkthrough

Adds Gemini Vision scaffolding and an image-to-text conversion flow in src/code.gs: MIME type inference for images, a rename of the inline_data payload field to inlineData, a new _convertImagesToText(currentContents) helper (invoked before RAG when vector stores exist), top-level vision constants, and a public setPromptForVision(prompt) API.
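For orientation, here is a minimal sketch of what that conversion step might look like, assuming the part shapes (inlineData/fileData), a UrlFetchApp-based call, and an API-key-authenticated Generative Language endpoint. The helper name matches the walkthrough, but the body is illustrative rather than the PR's actual code; the review comments below discuss how the real endpoint and auth handling should branch.

// Illustrative sketch only; modelForVision, promptForVision, and geminiKey
// are the module-level values named in this walkthrough.
function _convertImagesToText(currentContents) {
  // Collect every image part (inline bytes or file references).
  const imageParts = [];
  currentContents.forEach(function (c) {
    const parts = Array.isArray(c.parts) ? c.parts : (c.parts ? [c.parts] : []);
    parts.forEach(function (p) {
      if (p && (p.inlineData || p.fileData)) imageParts.push(p);
    });
  });
  if (imageParts.length === 0) return currentContents;

  // Ask the vision model to describe all images in one request.
  const payload = {
    contents: [{ role: 'user', parts: imageParts.concat([{ text: promptForVision }]) }]
  };
  let description = 'Image analysis returned no text.';
  try {
    const response = UrlFetchApp.fetch(
      'https://generativelanguage.googleapis.com/v1beta/models/' +
        modelForVision + ':generateContent?key=' + geminiKey,
      { method: 'post', contentType: 'application/json', payload: JSON.stringify(payload) }
    );
    const result = JSON.parse(response.getContentText());
    description = result.candidates[0].content.parts[0].text || description;
  } catch (error) {
    console.warn('[GenAIApp] - Image analysis failed: ' + error);
  }

  // Drop the image parts and append the description as a user message.
  currentContents.forEach(function (c) {
    const parts = Array.isArray(c.parts) ? c.parts : (c.parts ? [c.parts] : []);
    c.parts = parts.filter(function (p) { return !(p && (p.inlineData || p.fileData)); });
  });
  currentContents.push({ role: 'user', parts: [{ text: '[Image description] ' + description }] });
  return currentContents;
}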

Changes

Image Handling & Conversion (src/code.gs):
Added MIME type inference in image handling, replaced inline_data with inlineData, implemented _convertImagesToText(currentContents) to detect image parts (inlineData/fileData), call Gemini Vision, replace image parts with text messages, and integrate this step into the run flow when vector stores are present. Duplicate _convertImagesToText declarations noted.

Vision Model & Prompt Constants (src/code.gs):
Introduced modelForVision (gemini-3-pro-preview) and promptForVision (default prompt) constants and a public setPromptForVision(prompt) method to customize the vision prompt.

Payload / Gemini Integration (src/code.gs):
Switched Gemini payload field names from inline_data to inlineData, added image MIME resolution and errors for unsupported formats, and adjusted payload constructions for image/file parts.
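For reference, the constants and the public setter described above could look like the sketch below; the default prompt text and the input validation are assumptions (the validation echoes a review suggestion further down), not necessarily what the PR ships.

// Module-level vision defaults named in the walkthrough; the prompt
// wording here is a placeholder, not the PR's actual default.
var modelForVision = 'gemini-3-pro-preview';
var promptForVision = 'Describe the images, transcribe any visible text, and summarize the visual context.';

/**
 * Overrides the prompt sent to the vision model when images are
 * converted to text before retrieval (RAG).
 * @param {string} prompt - A non-empty prompt string.
 */
function setPromptForVision(prompt) {
  if (typeof prompt === 'string' && prompt.trim().length > 0) {
    promptForVision = prompt.trim();
  } else {
    console.warn('[GenAIApp] - setPromptForVision ignored: prompt must be a non-empty string.');
  }
}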

Sequence Diagram

sequenceDiagram
    participant User as User/Client
    participant App as GenAIApp / Chat
    participant Vision as Gemini Vision
    participant RAG as RAG System
    participant API as Gemini/OpenAI API

    User->>App: run() with messages containing images and RAG enabled
    App->>App: Detect image parts (inlineData/fileData)
    App->>Vision: Send image bytes + vision prompt to modelForVision
    Vision-->>App: Return textual image analysis
    App->>App: Remove image parts and append analysis as new user message
    App->>RAG: Send text-only contents for retrieval/augmentation
    RAG->>API: Query external model/store for context
    API-->>RAG: Return retrieved context
    RAG-->>App: Provide augmented context
    App-->>User: Return final response
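As a usage illustration of the flow in the diagram (setGeminiAPIKey, addImage, and run are named elsewhere in this PR; newChat and addMessage are assumptions about the library's public surface):

// Hypothetical end-to-end usage of the preprocessing flow.
GenAIApp.setGeminiAPIKey('YOUR_GEMINI_API_KEY');

const chat = GenAIApp.newChat(); // assumed factory name
chat.addMessage('What does this screenshot show about the error?'); // assumed method
chat.addImage('https://example.com/screenshot.png');

// With a vector store attached, run() first converts the image to a text
// description via Gemini Vision, then performs retrieval on text only.
const answer = chat.run();
Logger.log(answer);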

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Description check: ❓ Inconclusive. No pull request description was provided by the author, making it impossible to assess whether the description relates to the changeset. Resolution: add a pull request description explaining the purpose of the image description feature and how it improves grounding in the GenAI application.

✅ Passed checks (2 passed)

Title check: ✅ Passed. The pull request title accurately reflects the main change: adding image description functionality for grounding, which aligns with the image-to-text conversion flow and Gemini Vision integration.

Docstring Coverage: ✅ Passed. No functions were found in the changed files to evaluate, so the docstring coverage check was skipped.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/code.gs (1)

130-140: ⚠️ Potential issue | 🔴 Critical

Inconsistent property name: inlineData here vs inline_data in addFile() (line 213).

addImage() uses inlineData (camelCase), but addFile() uses inline_data (snake_case). This causes _convertImagesToText() (line 782) to silently skip images added via addFile(), since it only checks p.inlineData || p.fileData.

Both methods push to the same contents array sent to the Gemini API. Align both to use inlineData, or update _convertImagesToText() to also check inline_data and file_data.
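One way to make that detection tolerant of both spellings, as a sketch (the helper name is hypothetical):

// Treats both camelCase (addImage) and snake_case (addFile) part shapes
// as images, so neither method's output is silently skipped.
function _isImagePart(p) {
  return !!p && !!(p.inlineData || p.fileData || p.inline_data || p.file_data);
}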

🤖 Fix all issues with AI agents
In `@src/code.gs`:
- Around line 798-810: The hardcoded user prompt inside the descriptionPayload
object biases image analysis to "technical support request"; make this prompt
configurable or neutral by replacing the fixed text in
descriptionPayload.contents[0].parts (where imageParts are spread) with a
parameter or a default general-purpose string (e.g., request-specific prompt
passed into the calling function or a neutral prompt like "Describe the images,
transcribe any visible text, and summarize the visual context.") so callers can
supply domain-specific prompts; preserve the existing generationConfig and
ensure the code still merges imageParts before appending the configurable
prompt.
- Around line 440-445: The code calls this._convertImagesToText(...) whenever
model.includes("gemini") and ragCorpusIds exist, but _convertImagesToText
currently always builds a Vertex AI URL using gcpProjectId which can be empty
when Gemini is configured via setGeminiAPIKey(); update the guard here to only
call _convertImagesToText when either gcpProjectId is non-empty (Vertex AI) or
geminiKey is set (API-key path), or else adjust _convertImagesToText to detect
geminiKey and construct the appropriate Generative Language API endpoint
similarly to the logic used around the model/key handling at lines 462–477;
reference functions/vars: _convertImagesToText, model.includes("gemini"),
gcpProjectId, geminiKey, setGeminiAPIKey (a sketch of this endpoint branching follows this list).
- Around line 787-789: The check "typeof verbose !== 'undefined' && verbose" is
redundant because verbose is always defined in the GenAIApp IIFE scope; replace
that condition with the simpler "if (verbose)" in the block that logs the
image-to-text message (the console.log inside the image detection branch) to
match other uses of the verbose variable.
- Around line 838-841: The current filter inside newContents.forEach (which
iterates c.parts) only removes parts with camelCase properties
inlineData/fileData and misses snake_case inline_data/file_data used by
addFile(), so update the predicate in the c.parts = parts.filter(...) call to
exclude parts that have any of inlineData, fileData, inline_data or file_data;
also audit addFile()/addImage() usages and prefer unifying on one property name
(e.g., inlineData/fileData) to avoid future mismatches.
- Around line 116-129: The extension matching fails when imageInput is a URL
with query params or fragments; update the MIME-type-detection block (where
mimeType, imageInput, and lower are used) to parse imageInput as a URL first
(using new URL(imageInput) in a try/catch), use url.pathname (or fallback to
imageInput) and run the endsWith checks against that pathname
(png/jpg/jpeg/webp/gif) before falling back to throwing the Error; keep existing
behavior for non-URL inputs and ensure the URL parse errors are handled
gracefully so local filenames still work (see the MIME sketch after this list).
- Around line 825-835: The block that calls UrlFetchApp.fetch and JSON.parse
inside run() can throw and should be wrapped in a try/catch so failures don’t
crash run(); surround the fetch, JSON.parse and the result->description
extraction (references: UrlFetchApp.fetch, JSON.parse, result, description) with
a try/catch, on success keep the existing candidate/parts logic, and on any
error set description to the existing fallback ("Image analysis returned no
text.") and log the error (e.g., Logger.log or console.error) for debugging;
ensure the catch does not rethrow so run() continues gracefully.
- Around line 822-823: The code hardcodes modelForVision
("gemini-3-pro-preview") and uses a Vertex AI-only endpoint string for
generateContent; make the model name configurable (or promote modelForVision to
a module-level named constant) and change endpoint construction to support both
Vertex AI and Generative Language API paths depending on auth: if geminiKey is
present use the Generative Language API endpoint and include the API key in
options.headers, otherwise use the Vertex AI endpoint with OAuth; mirror the
auth branching logic used in _callGenAIApi to set options.headers appropriately
(refer to modelForVision, endpoint, options.headers, geminiKey, and
_callGenAIApi to locate and implement the changes).
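Taken together, the endpoint, auth, and error-handling items above could be addressed along the lines of this sketch; the helper name is hypothetical, the Vertex AI region is a placeholder, and the response parsing assumes the standard generateContent candidate shape.

// Sketch: pick the endpoint and auth based on how Gemini was configured,
// and never let a failed vision call crash run().
function _callVisionApi(payload) {
  let endpoint;
  const options = {
    method: 'post',
    contentType: 'application/json',
    payload: JSON.stringify(payload),
    muteHttpExceptions: true
  };
  if (geminiKey) {
    // Generative Language API path (configured via setGeminiAPIKey).
    endpoint = 'https://generativelanguage.googleapis.com/v1beta/models/' +
        modelForVision + ':generateContent?key=' + geminiKey;
  } else {
    // Vertex AI path (requires gcpProjectId); region is a placeholder.
    endpoint = 'https://us-central1-aiplatform.googleapis.com/v1/projects/' +
        gcpProjectId + '/locations/us-central1/publishers/google/models/' +
        modelForVision + ':generateContent';
    options.headers = { Authorization: 'Bearer ' + ScriptApp.getOAuthToken() };
  }
  try {
    const result = JSON.parse(UrlFetchApp.fetch(endpoint, options).getContentText());
    return result.candidates[0].content.parts[0].text || 'Image analysis returned no text.';
  } catch (error) {
    console.warn('[GenAIApp] - Image analysis failed during Gemini Vision preprocessing: ' + error);
    return 'Image analysis returned no text.';
  }
}

For the MIME-detection item, note that Apps Script's V8 runtime does not provide the browser URL class, so stripping the query string and fragment by hand is a safer equivalent of the suggested new URL() parse; the helper below is a hypothetical sketch.

// Sketch: infer an image MIME type from a filename or URL, ignoring any
// query string or fragment so signed URLs still match by extension.
function _inferImageMimeType(imageInput) {
  const path = String(imageInput).split(/[?#]/)[0].toLowerCase();
  if (path.endsWith('.png')) return 'image/png';
  if (path.endsWith('.jpg') || path.endsWith('.jpeg')) return 'image/jpeg';
  if (path.endsWith('.webp')) return 'image/webp';
  if (path.endsWith('.gif')) return 'image/gif';
  throw new Error('[GenAIApp] - Unsupported image format: ' + imageInput);
}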

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/code.gs (1)

138-148: ⚠️ Potential issue | 🟠 Major

Change mime_type to mimeType in both addImage() and addFile() methods.

The Gemini API REST endpoint expects mimeType (camelCase) inside inlineData, not mime_type. While the wrapper property was correctly updated to inlineData, the field name must also use camelCase to match the official Gemini REST API specification.

🔧 Proposed fix
              inlineData: {
-                 mime_type: mimeType,
+                 mimeType: mimeType,
                  data: base64Image
              }

Line 222 in addFile():

            inlineData: {
-             mime_type: fileInfo.mimeType,
+             mimeType: fileInfo.mimeType,
              data: blobToBase64
            }
🤖 Fix all issues with AI agents
In `@src/code.gs`:
- Around line 2395-2397: The setter setPromptForVision currently assigns prompt
directly without validation; ensure prompt is a non-empty string before
assigning to promptForVision (used later in the Gemini API payload around line
811). Add a guard in setPromptForVision that checks typeof prompt === "string"
and prompt.trim().length > 0; if valid, assign promptForVision = prompt.trim(),
otherwise either throw a clear error or ignore the assignment and log a warning
so invalid values (null/undefined/non-strings) are never sent to the Gemini API.
- Around line 788-804: Redundant unreachable guard: remove the imageParts.length
=== 0 check because hasImages already guaranteed images; update the block around
the hasImages and imageParts computations (references: hasImages, imageParts,
currentContents, verbose) by deleting the final conditional that returns
currentContents when imageParts.length === 0, leaving the early return on
!hasImages and continuing with imageParts processing; ensure no other logic
depended on that second guard.
- Around line 848-857: The code handling message parts is inconsistent: in the
newContents.forEach block you use "const parts = Array.isArray(c.parts) ?
c.parts : [c.parts];" which can yield [null] or [undefined], while later you
safely use "c.parts ? [c.parts] : []". Update the forEach in the newContents
transformation (the block that assigns c.parts = parts.filter(...)) to use the
same null-guard pattern — i.e., replace the fallback [c.parts] with c.parts ?
[c.parts] : [] — so both places consistently treat null/undefined parts and
avoid creating arrays containing null/undefined before filtering (a combined sketch follows this list).
- Around line 830-846: There is a duplicate, unprotected API call: the initial
UrlFetchApp.fetch + JSON.parse for variables response/result should be removed
so only the fetch inside the try/catch runs; keep the endpoint and options
usage, parse the response inside the try block (using the existing result
variable), and ensure description is assigned from result.candidates/... or
result.parts/... as currently written; also remove the redundant const
declarations outside the try and avoid shadowing response/result so the
Logger.log in the catch will handle failures.
- Around line 844-846: The catch block that currently calls Logger.log in the
Gemini Vision preprocessing code should be changed to use console.warn to match
the project's logging conventions; locate the catch handling for "Image analysis
failed during Gemini Vision preprocessing" (where Logger.log is called) and
replace the Logger.log call with console.warn(`[GenAIApp] - Image analysis
failed during Gemini Vision preprocessing: ${error}`) so warnings use
console.warn consistently with other parts of the codebase.
- Around line 2382-2397: The object literal is missing a comma after the closing
brace of setPrivateInstanceBaseUrl which breaks parsing; add a trailing comma
immediately after the brace that ends setPrivateInstanceBaseUrl to separate it
from the next property (setPromptForVision) and ensure the object’s properties
are properly comma-separated.
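The two parts-handling items (the null-safe normalization above and the duplicate-spelling filter from the earlier list) reduce to one small pattern, sketched here against the newContents loop the comments reference:

// Sketch: normalize c.parts safely, then drop every image-bearing part
// regardless of camelCase or snake_case spelling.
newContents.forEach(function (c) {
  const parts = Array.isArray(c.parts) ? c.parts : (c.parts ? [c.parts] : []);
  c.parts = parts.filter(function (p) {
    return !(p && (p.inlineData || p.fileData || p.inline_data || p.file_data));
  });
});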

